gpu-0 compute node drained

Incident Report for AgResearch eRI

Resolved

This incident has been resolved.
Posted Sep 19, 2025 - 13:25 NZST

Monitoring

We have made some network configuration changes on gpu-0 which have reduced the MLAG failover frequency. We will monitor for any regression. gpu-0 is now available again in Slurm.
Posted Sep 17, 2025 - 16:21 NZST

Investigating

During last week's network issues, we discovered that gpu-0 had its own set of unrelated problems, so it has been drained. The node is suffering from frequent MLAG failovers on its bonded interface. We will be reseating and testing cables early this week.
Posted Sep 15, 2025 - 10:10 NZST
This incident affected: Compute cluster.