We have made some network configuration changes on gpu-0 which have reduced the mlag failover frequency. We will monitor for any regression. gpu-0 is now available again in Slurm.
Posted Sep 17, 2025 - 16:21 NZST
Investigating
During last week's network issues, we discovered that gpu-0 had its own set of unrelated problems, so it has been drained. The node is suffering from frequent mlag failovers on its bonded interface. We will be reseating and testing cables early this week.