AgResearch eRI Status

All Systems Operational

Uptime over the past 90 days.

Identity Broker Service: Operational, 100.0% uptime

Managed Storage Service: Operational, 100.0% uptime

General Flexi HPC Platform: Operational, 98.56% uptime

Network connectivity: Operational, 100.0% uptime

Compute cluster: Operational, 98.88% uptime

Unnamed component: 99.66% uptime

Past Incidents

Sep 19, 2025

gpu-0 compute node drained

Resolved - This incident has been resolved.
Sep 19, 13:25 NZST

Monitoring - We have made some network configuration changes on gpu-0 which have reduced the MLAG failover frequency. We will monitor for any regression. gpu-0 is now available again in Slurm.
Sep 17, 16:21 NZST

Investigating - During last week's network issues we discovered that gpu-0 had its own set of unrelated problems, so it has been drained. The node is suffering from frequent MLAG failover on its bonded interface. We will be reseating and testing cables early this week.
Sep 15, 10:10 NZST

Sep 18, 2025

No incidents reported.

Sep 17, 2025

Overnight problems

Resolved - This incident has been resolved.
Sep 17, 10:00 NZST

Monitoring - Kia ora all, here is a belated update on this incident.

From approx 2200hrs on Thurs 4th September, tenant network and associated floating IP connectivity to all FlexiHPC VM instances started going offline. A subset of instances also went into the SHUTOFF state. This was the result of a config and automation regression in our OpenStack infrastructure. A config fix was rolled out approx 0400hrs on Friday the 5th which resolved network connectivity issues for impacted instances.

However, a subset of instances that were SHUTOFF by the original issue were additionally impacted by a serious corner case in the OpenStack deployment tooling that we use. This resulted in duplicate VM instances being launched, which in turn meant that some instance root drives and attached volumes were inadvertently multi-attached, which could lead to instance availability and potential data corruption issues. Our team have since worked tirelessly with impacted instance owners to address the follow-on issues and recover services. If you are still experiencing any issues, please contact support. We apologise for the disruption to service and are making adjustments to our processes to minimise the possibility of similar problems in future.
Sep 11, 22:09 NZST

Identified - Peaks Online has been restored but is currently awaiting the installation of a license. All other production services are available.

There are numerous OpenStack instances (VMs) that may be impacted by having duplicate virtual machine processes. These require operator intervention to attempt restoration of service. If you have an active instance that you can no longer log onto, or an instance that is shut off and will not start up, please open a support ticket.

We appreciate your patience over this difficult problem and apologise for any inconvenience it has caused.
Sep 5, 15:42 NZST

Investigating - We had some serious issues from about 10pm last night that interrupted numerous OpenStack instances. If you have VMs that were shut down, please try restarting them; if this is not successful, please log a support ticket. We are aware that Peaks Online is down.
compute-0 is drained and we are working on it, but otherwise the compute cluster appears OK. Slurm is available and jobs are running.
Sep 5, 08:51 NZST

Sep 16, 2025

No incidents reported.

Sep 15, 2025

Slurm reservation in place for Sep 9/10 maintenance

Completed - The scheduled maintenance has been completed.
Sep 15, 16:56 NZST

In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 1, 12:15 NZST

Scheduled - A Slurm reservation is now in place for next week's maintenance window, starting on Sep 9th at 4pm. Any jobs that won't complete by this time will not be started.
Sep 1, 12:02 NZST

eRI Full outage for network maintenance on Sep 9th and 10th

Completed - The scheduled maintenance has been completed.
Sep 15, 16:55 NZST

Update - The filesystems are now available. The Slurm reservation has been removed and jobs are running. Nix has been mounted and verified.
Sep 10, 14:41 NZST

Verifying - Verification is currently underway for the maintenance items.
Sep 10, 14:09 NZST

Update - HPE continues to troubleshoot the storage system. We are still working to bring the filesystem (projects/datasets) back online. A further update will be posted around 1400 hrs; maintenance window is extended to 1500 hrs.
Sep 10, 13:23 NZST

Update - The maintenance window has been extended until 1400 hrs. A further update will be posted around 1300 hrs.
Reason: the network upgrade is complete, but we are experiencing issues bringing the filesystem (projects/datasets) back online. Urgent vendor support has been engaged.
Sep 10, 11:20 NZST

In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 9, 16:00 NZST

Scheduled - The eRI cluster and associated services will be shut down and unavailable whilst network upgrades are undertaken. Slurm jobs will not run, and VMs will not be available during the maintenance. This is scheduled to start at 1600hrs on Tuesday Sep 9th and be completed by 1200hrs Wed Sep 10th.
Aug 11, 15:32 NZST

Resolved - This incident has been resolved.
Sep 15, 16:55 NZST

Update - A further fix to the routing configuration within the eRI's high-performance network was implemented by our L2 network support at approx 1440hrs. Connectivity between the eRI compute and filesystem (GPFS) has since been stable and performant.
All tests have completed successfully and Nix is now running as expected.

We will continue to monitor for any further degradation.
Sep 12, 15:44 NZST

Update - We continue to suffer packet loss on the network causing slow response and disruption to Nix. Our L3 support partner has been engaged again and we hope for further progress tonight.
Nix clients have just been restarted on all compute nodes and login-0, and Nix is working for the moment.
Sep 12, 14:00 NZST

Update - Network packet loss has occurred again at 0930 causing further disruptions and slow response. We are investigating.
Sep 12, 10:10 NZST

Monitoring - A fix to the routing configuration within the eRI's high-performance network was implemented by our technology partner at approx 1800hrs. Connectivity between the eRI compute and filesystem (GPFS) has since been stable and performant.

We recognise the impact to users from this issue was major and have upgraded the incident here as a result (rest assured we were treating it as such regardless). If you observe any further issues following on from the maintenance work this week, please reach out to support.
Sep 11, 22:25 NZST

Update - We are still experiencing network issues but we have our L3 support engaged and expect some progress overnight.
Apologies for the ongoing frustrations.
Sep 11, 17:27 NZST

Update - We have now identified an underlying network issue causing packet loss and retransmissions between the compute and storage clusters. This will result in periodic slow responses, and at worst, a login or compute node being expelled from the cluster, which results in a longer period of recovery (15 - 30 mins). We are working hard to identify the exact fault, and are engaging our third-party network experts.
Sep 11, 11:56 NZST

Identified - We have identified a GPFS issue that occurred at the time the slowness was reported. This has recovered and for now Nix test times are back to normal. Investigation continues.
Sep 11, 10:42 NZST

Investigating - We are currently investigating this issue.
Sep 11, 10:08 NZST

Sep 14, 2025

No incidents reported.

Sep 13, 2025

No incidents reported.

Sep 12, 2025

Sep 11, 2025

Sep 10, 2025

Sep 9, 2025

Sep 8, 2025

No incidents reported.

Sep 7, 2025

No incidents reported.

Sep 6, 2025

No incidents reported.

Sep 5, 2025
