Resolved -
All bare metal compute nodes have now had the network config change applied. Packet loss is no longer occurring. In addition, we have identified a workload that was causing slow I/O response from the filesystem. That workload has been removed whilst we work to improve it.
Oct 29, 15:32 NZDT
Update -
The network config change has now been applied to compute-2 and -4 successfully. Nix is now running better on these two nodes. We will continue to apply the same change to all the compute nodes as they become available.
Oct 23, 10:21 NZDT
Update -
We are continuing to monitor for any further issues.
Oct 23, 10:19 NZDT
Monitoring -
The cluster and login nodes appear to be stable and performant now, although Nix may be slow on some compute nodes (2 and 4). We will be implementing a network config change on each compute node in a rolling fashion, to minimise the impact. This requires draining each node in the Slurm cluster, one at a time.
Oct 22, 12:30 NZDT
Investigating -
We are continuing to see slow responses from Slurm and Nix, but the problem appears to be intermittent. Investigation continues.
Oct 21, 10:07 NZDT
Monitoring -
We made a network configuration change to a single node last night. The cluster has been stable overnight, with some load on it. We'll continue monitoring today as the load increases. We will make the same change to the other bare metal compute nodes, in a rolling fashion, as they become available.
Oct 21, 08:45 NZDT
Update -
We have found evidence of network packet loss again and are continuing to investigate.
Oct 20, 16:23 NZDT
Investigating -
We are currently investigating this issue.
Oct 20, 15:22 NZDT