Update - We have been working with HPE to narrow down the scope and impact of this problem. The issue appears to be limited to the GPFS filesystems and to only a handful of files. As a workaround, we can restore impacted files from DMF to other locations so that they can be accessed.

HPE have just started an online filesystem scan to check for any metadata issues. This may affect I/O performance while the scan is running.

Dec 17, 2025 - 14:44 NZDT
Investigating - We are experiencing some issues accessing offline files from DMF. This can manifest as files that can't be read or even deleted. Affected files can possibly be identified with the "du" command, which shows them with a size of 0 bytes.
We have escalated this issue with our storage support vendor.

Dec 04, 2025 - 12:03 NZDT
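As a rough illustration of the "du" check mentioned in the DMF notice above, the Python sketch below walks a directory tree and lists regular files whose allocated disk usage is zero, which is the figure "du" reports by default. The function names are hypothetical. A zero reading only marks a file as a candidate: the notice does not say how to tell affected files from other files with no allocated blocks, so treat the output as a starting point for a support request rather than a diagnosis.

```python
import os
import stat
from typing import Iterator, Tuple


def is_zero_usage(allocated_blocks: int) -> bool:
    """Return True when a file has no allocated blocks, i.e. "du" would show 0."""
    return allocated_blocks == 0


def find_zero_usage_files(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, apparent_size_in_bytes) for each regular file under root
    whose allocated disk usage is zero."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat reads metadata only; the file itself is never opened.
                st = os.lstat(path)
            except OSError:
                # Skip entries whose metadata cannot be read.
                continue
            if stat.S_ISREG(st.st_mode) and is_zero_usage(st.st_blocks):
                yield path, st.st_size


# Example use (the directory is whatever tree you want to check):
#   for path, size in find_zero_usage_files("."):
#       print(size, path, sep="\t")
```

The sketch prints nothing on its own; the commented loop at the end shows one way to list each candidate with its apparent size.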
Investigating - We have identified a couple of issues with differing group memberships between login-0 and login-1. The issue does not seem to be widespread but we are investigating regardless.
Nov 24, 2025 - 10:53 NZDT
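One possible way to check whether an account is affected by the group membership issue above is to run "id -Gn" on each login node and compare the two outputs. The Python sketch below does the comparison only. The function names and the sample group names are made up for illustration, and it assumes group names contain no spaces.

```python
from typing import List, Set, Tuple


def parse_groups(id_output: str) -> Set[str]:
    """Split the whitespace-separated output of "id -Gn" into a set of names."""
    return set(id_output.split())


def diff_groups(output_a: str, output_b: str) -> Tuple[List[str], List[str]]:
    """Return (groups only in A, groups only in B), each sorted by name."""
    a, b = parse_groups(output_a), parse_groups(output_b)
    return sorted(a - b), sorted(b - a)


# Made-up sample outputs standing in for "id -Gn" run on each login node.
login0_output = "users proj_a proj_b"
login1_output = "users proj_b proj_c"

only_login0, only_login1 = diff_groups(login0_output, login1_output)
print("only on login-0:", only_login0)
print("only on login-1:", only_login1)
```

Two empty lists mean the memberships match for that account; anything else is worth including in a report to the support team.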

About This Site

AgResearch eRI status

Identity Broker Service: Partial Outage (89.47 % uptime over the past 90 days)
Managed Storage Service: Degraded Performance (100.0 % uptime over the past 90 days)
General Flexi HPC Platform: Operational (99.99 % uptime over the past 90 days)
Network connectivity: Operational (100.0 % uptime over the past 90 days)
Compute cluster: Operational (99.39 % uptime over the past 90 days)
Login nodes: Operational (99.97 % uptime over the past 90 days)

Scheduled Maintenance

Scratch autocleaner suspended for the holidays - Announcement Dec 22, 2025 13:00 - Jan 4, 2026 01:00 NZDT

The automated cleaning of the /mnt/gpfs/scratch filesystem will be suspended over the Xmas break. It will be re-enabled sometime in mid-January, with the exact date to be confirmed.
Posted on Dec 22, 2025 - 12:32 NZDT
Dec 25, 2025

No incidents reported today.

Dec 24, 2025

No incidents reported.

Dec 23, 2025

No incidents reported.

Dec 22, 2025

No incidents reported.

Dec 21, 2025

No incidents reported.

Dec 20, 2025

No incidents reported.

Dec 19, 2025

No incidents reported.

Dec 18, 2025

No incidents reported.

Dec 17, 2025
Resolved - This incident has been resolved.
Dec 17, 22:26 NZDT
Investigating - We've become aware that compute nodes are no longer able to access external resources. We suspect this is a leftover issue from the recent network upgrade work and expect to coordinate with our support partners to resolve it this evening.
Dec 17, 12:19 NZDT
Completed - The upgrade is complete. We are still working through some remaining issues, which are covered by separate incident notices.
Dec 17, 14:47 NZDT
Update - Initial controlled shutdown of the first switch went smoothly thanks to configuration fixes made after the last aborted attempt. The network OS update was also applied successfully.
However, bring-up of redundant services on the first upgraded and reconfigured switch is causing some unforeseen issues. This has resulted in several short periods of external and internal connectivity loss that may have adversely affected some services. eRI cluster login nodes remain available and Slurm is healthy.
Vendor support engineers are working to investigate and resolve these issues. A further update will be given before 9am.

Dec 16, 02:46 NZDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Dec 15, 21:00 NZDT
Scheduled - The border switch upgrade has been rescheduled to Mon Dec 15th from 9pm. SSH and other external connections to the cluster and storage may be dropped during this maintenance. Slurm jobs will be unaffected. The login nodes may experience some short-term disruption.
Dec 12, 12:39 NZDT
Resolved - This incident has been resolved.
Dec 17, 13:20 NZDT
Monitoring - A fix has been implemented and we are monitoring the results.
Dec 17, 09:59 NZDT
Investigating - We are experiencing issues where the Research Developer Cloud Dashboard and APIs are not responding correctly or at all.

Running instances should be unaffected, and connections into those instances should still be possible.

This is currently being investigated. We apologize for the inconvenience caused at this time.

Dec 17, 09:52 NZDT
Resolved - This incident has been resolved.
Dec 17, 12:16 NZDT
Monitoring - A fix has been implemented and we are monitoring the results.
Dec 16, 18:44 NZDT
Update - The workaround is in place, with most nodes back online and processing Slurm jobs again. We are lowering the impact level as we bring more nodes back online.
Dec 16, 18:07 NZDT
Update - We've uncovered the underlying problem and are now attempting to implement a workaround until it can be fully resolved.
Dec 16, 15:28 NZDT
Update - More compute nodes have dropped off the network now so we are upgrading this to a major outage for the Slurm cluster. We're narrowing down the cause but may not be able to restore service until overseas L3 support engineers come online this evening. Apologies for the disruption!
Dec 16, 13:13 NZDT
Update - Three compute nodes and both huge memory nodes are now down, exhibiting the same network issue. We are still working to determine the cause.
Dec 16, 10:25 NZDT
Identified - It appears that a network event occurred around 6:45am this morning and disconnected a handful of compute nodes from the cluster. This does not seem to be linked to the overnight border networking maintenance, though we are still attempting to restore connectivity and will focus on RCA later.
Dec 16, 09:32 NZDT
Investigating - We are currently investigating this issue.
Dec 16, 08:44 NZDT
Dec 16, 2025
Dec 15, 2025
Dec 14, 2025

No incidents reported.

Dec 13, 2025

No incidents reported.

Dec 12, 2025

No incidents reported.

Dec 11, 2025
Completed - The scheduled maintenance has been completed.
Dec 11, 15:44 NZDT
Verifying - The critical migrations are now completed. Slurm and Ondemand are now available.
Dec 11, 14:12 NZDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Dec 11, 13:34 NZDT
Scheduled - We are having some issues with the VM migrations but are actively working on them. Apologies for the delay.
Dec 11, 13:33 NZDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Dec 11, 13:00 NZDT
Update - This maintenance has been confirmed for 1pm today.
Dec 11, 10:30 NZDT
Scheduled - We are tentatively scheduling a maintenance window for Thursday Dec 11th from 1pm. There will be a short (20 mins) period where new Slurm job submissions won't work, Ondemand apps will be unavailable, and Peaks will be unavailable. Running Slurm jobs will not be affected.
The other affected VMs will be live migrated and should be unaffected other than a brief pause.

Dec 10, 14:08 NZDT