Overnight problems

Incident Report for AgResearch eRI

Monitoring

Kia ora all, here is a belated update on this incident.

From approximately 2200hrs on Thursday 4th September, tenant network connectivity and the associated floating IP connectivity to all FlexiHPC VM instances started going offline. A subset of instances also went into the SHUTOFF state. This was the result of a config and automation regression in our OpenStack infrastructure. A config fix rolled out at approximately 0400hrs on Friday 5th September resolved the network connectivity issues for impacted instances.

However, a subset of the instances that were put into SHUTOFF by the original issue were additionally impacted by a serious corner case in the OpenStack deployment tooling that we use. This resulted in duplicate VM instances being launched, which in turn meant that some instance root drives and attached volumes were inadvertently multi-attached. Multi-attachment can cause instance availability problems and potentially data corruption. Our team have since worked tirelessly with impacted instance owners to address the follow-on issues and recover services. If you are still experiencing any issues, please contact support. We apologise for the disruption to service and are making adjustments to our processes to minimise the possibility of similar problems in future.
Posted Sep 11, 2025 - 22:09 NZST

Identified

Peaks Online has been restored but is currently awaiting the installation of a license. All other production services are available.

There are numerous OpenStack instances (VMs) that may be impacted by having duplicate virtual machine processes. These require operator intervention to attempt restoration of service. If you have an active instance that you can no longer log on to, or an instance that is shut off and will not start up, please open a support ticket.

We appreciate your patience with this difficult problem and apologise for any inconvenience it has caused.
Posted Sep 05, 2025 - 15:42 NZST

Investigating

We had some serious issues from about 10pm last night that interrupted numerous OpenStack instances. If you have VMs that were shut down, please try restarting them. If this is not successful, please log a support ticket. We are aware that Peaks Online is down.
compute-0 is drained and we are working on it, but otherwise the compute cluster appears OK. Slurm is available and jobs are running.
Posted Sep 05, 2025 - 08:51 NZST
This incident affects: General Flexi HPC Platform.