Several compute nodes down

Incident Report for AgResearch eRI

Resolved

This incident has been resolved.
Posted Dec 17, 2025 - 12:16 NZDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Dec 16, 2025 - 18:44 NZDT

Update

The workaround is in place, with most nodes back online and processing Slurm jobs again. We are now downgrading the impact as we bring more nodes back online.
Posted Dec 16, 2025 - 18:07 NZDT

Update

We've uncovered the underlying problem and are now attempting to implement a workaround until it can be fully resolved.
Posted Dec 16, 2025 - 15:28 NZDT

Update

More compute nodes have now dropped off the network, so we are upgrading this to a major outage for the Slurm cluster. We're narrowing down the cause but may not be able to restore service until overseas L3 support engineers come online this evening. Apologies for the disruption!
Posted Dec 16, 2025 - 13:13 NZDT

Update

Three compute nodes and both huge memory nodes are now down, exhibiting the same network issue. We are still working to determine the cause.
Posted Dec 16, 2025 - 10:25 NZDT

Identified

It appears that a network event occurred around 6:45am this morning, disconnecting a handful of compute nodes from the cluster. This does not seem to be linked to the overnight border networking maintenance. We are still attempting to restore connectivity and will focus on root cause analysis (RCA) later.
Posted Dec 16, 2025 - 09:32 NZDT

Investigating

We are currently investigating this issue.
Posted Dec 16, 2025 - 08:44 NZDT
This incident affected: Compute cluster.