Update - A further fix to the routing configuration within the eRI's high-performance network was implemented by our L2 network support at approx 1440hrs. Connectivity between the eRI compute and filesystem (GPFS) has since been stable and performant.
All tests have completed successfully, and Nix is now running as expected.

We will continue to monitor for any further degradation.

Sep 12, 2025 - 15:44 NZST
Update - We continue to suffer packet loss on the network causing slow response and disruption to Nix. Our L3 support partner has been engaged again and we hope for further progress tonight.
Nix clients have just been restarted on all compute nodes and login-0, and Nix is working for the moment.

Sep 12, 2025 - 14:00 NZST
Update - Network packet loss occurred again at 0930, causing further disruptions and slow response. We are investigating.
Sep 12, 2025 - 10:10 NZST
Monitoring - A fix to the routing configuration within the eRI's high-performance network was implemented by our technology partner at approx 1800hrs. Connectivity between the eRI compute and filesystem (GPFS) has since been stable and performant.

We recognise the impact to users from this issue was major and have upgraded the incident here as a result (rest assured we were treating it as such regardless). If you observe any further issues following on from the maintenance work this week, please reach out to support.

Sep 11, 2025 - 22:25 NZST
Update - We are still experiencing network issues but we have our L3 support engaged and expect some progress overnight.
Apologies for the ongoing frustrations.

Sep 11, 2025 - 17:27 NZST
Update - We have now identified an underlying network issue causing packet loss and retransmissions between the compute and storage clusters. This will result in periodic slow responses, and at worst, a login or compute node being expelled from the cluster, which results in a longer period of recovery (15 - 30 mins). We are working hard to identify the exact fault, and are engaging our third-party network experts.
Sep 11, 2025 - 11:56 NZST
Identified - We have identified a GPFS issue that occurred at the time the slowness was reported. This has recovered and for now Nix test times are back to normal. Investigation continues.
Sep 11, 2025 - 10:42 NZST
Investigating - We are currently investigating this issue.
Sep 11, 2025 - 10:08 NZST
Monitoring - Kia ora all, here is a belated update on this incident.

From approx 2200hrs on Thurs 4th September, tenant network and associated floating IP connectivity to all FlexiHPC VM instances started going offline. A subset of instances also went into the SHUTOFF state. This was the result of a config and automation regression in our OpenStack infrastructure. A config fix was rolled out approx 0400hrs on Friday the 5th which resolved network connectivity issues for impacted instances.

However, a subset of instances that were SHUTOFF by the original issue were additionally impacted by a serious corner case in the OpenStack deployment tooling that we use. This resulted in duplicate VM instances being launched, which in turn meant that some instance root drives and attached volumes were inadvertently multi-attached, which could lead to instance availability and potential data corruption issues. Our team have since worked tirelessly with impacted instance owners to address the follow-on issues and recover services. If you are still experiencing any issues, please contact support. We apologise for the disruption to service and are making adjustments to our processes to minimise the possibility of similar problems in future.

Sep 11, 2025 - 22:09 NZST
Identified - Peaks Online has been restored but is currently awaiting the installation of a license. All other production services are available.

There are numerous OpenStack instances (VMs) that may be impacted by having duplicate virtual machine processes. These require operator intervention to attempt restoration of service. If you have an active instance that you can no longer log onto, or an instance that is shutoff and will not start up, please open a support ticket.

We appreciate your patience over this difficult problem and apologise for any inconvenience it has caused.

Sep 05, 2025 - 15:42 NZST
Investigating - We had some serious issues from about 10pm last night that interrupted numerous OpenStack instances. If you have VMs that were shut down, please try restarting them; if this is not successful, please log a support ticket. We are aware that Peaks Online is down.
compute-0 is drained and we are working on it, but otherwise the compute cluster appears ok. Slurm is available and jobs are running.

Sep 05, 2025 - 08:51 NZST
Update - The filesystems are now available. The Slurm reservation has been removed and jobs are running. Nix has been mounted and verified.
Sep 10, 2025 - 14:41 NZST
Verifying - Verification is currently underway for the maintenance items.
Sep 10, 2025 - 14:09 NZST
Update - HPE continues to troubleshoot the storage system. We are still working to bring the filesystem (projects/datasets) back online. A further update will be posted around 1400 hrs; maintenance window is extended to 1500 hrs.
Sep 10, 2025 - 13:23 NZST
Update - The maintenance window has been extended until 1400 hrs. A further update will be posted around 1300 hrs.
Reason: network upgrade is complete, but we are experiencing issues bringing the filesystem (projects/datasets) back online. Urgent vendor support has been engaged.

Sep 10, 2025 - 11:20 NZST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 09, 2025 - 16:00 NZST
Scheduled - The eRI cluster and associated services will be shut down and unavailable whilst network upgrades are undertaken. Slurm jobs will not run, and VMs will not be available during the maintenance. This is scheduled to start at 1600hrs on Tuesday Sep 9th and be completed by 1200hrs Wed Sep 10th.
Sep 9, 2025 16:00 - Sep 10, 2025 15:00 NZST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 01, 2025 - 12:15 NZST
Scheduled - A Slurm reservation is now in place for next week's maintenance window, starting on Sep 9th at 4pm. Any jobs that won't complete by this time will not be started.
Sep 1, 2025 12:15 - Sep 10, 2025 12:15 NZST

About This Site

AgResearch eRI status

Identity Broker Service: Operational, 100.0 % uptime over the past 90 days
Managed Storage Service: Operational, 100.0 % uptime over the past 90 days
General Flexi HPC Platform: Operational, 98.56 % uptime over the past 90 days
Network connectivity: Operational, 100.0 % uptime over the past 90 days
Compute cluster: Operational, 99.63 % uptime over the past 90 days
Login nodes: Operational, 99.66 % uptime over the past 90 days
Sep 14, 2025

No incidents reported today.

Sep 13, 2025

No incidents reported.

Sep 12, 2025

Unresolved incident: login nodes and Nix running slow.

Sep 11, 2025

Unresolved incident: Overnight problems.

Sep 10, 2025

Unresolved incident: eRI Full outage for network maintenance on Sep 9th and 10th.

Sep 9, 2025
Sep 8, 2025

No incidents reported.

Sep 7, 2025

No incidents reported.

Sep 6, 2025

No incidents reported.

Sep 5, 2025
Sep 4, 2025

No incidents reported.

Sep 3, 2025

No incidents reported.

Sep 2, 2025

No incidents reported.

Sep 1, 2025

Unresolved incident: Slurm reservation in place for Sep 9/10 maintenance.

Aug 31, 2025

No incidents reported.