compute-[1-4] nodes affected by a GPFS long waiter

Incident Report for AgResearch eRI

Resolved

The compute-1 GPFS restart has now been completed and the associated waiter has been cleared. All nodes are now available to Slurm.
Posted Jan 19, 2026 - 09:05 NZDT

Monitoring

Compute-4 has now been restarted and the storage-side deadlock has been cleared. Compute-1 has a different waiter problem, so it is still draining until we can restart GPFS there. We will continue to manage and communicate its status via this status page. All other compute nodes are now available.
Posted Jan 15, 2026 - 08:53 NZDT

Update

The deadlock on compute-3 has now been cleared, and the node is available in Slurm.
Posted Jan 13, 2026 - 14:27 NZDT

Update

Compute-3 is now stuck in a completing state, so we are going to attempt a restart of GPFS there. Any jobs still running there will unfortunately be killed.
Posted Jan 13, 2026 - 14:18 NZDT

Identified

Compute-[1-4] are all being affected by a long GPFS waiter on the storage cluster. However, Slurm jobs continue to run there, so we are attempting to resolve the issue without killing all the jobs. We need to restart GPFS on those nodes, so we are currently draining compute-1 and compute-4 as a first step. If the situation deteriorates further, we may be forced to kill all jobs on those nodes so we can restart GPFS on all four nodes.
Posted Jan 13, 2026 - 11:01 NZDT
This incident affected: Managed Storage Service and Compute cluster.