On Wednesday, 15 August 2018, at approximately 1pm, the primary VMware cluster at SDSC experienced a complete outage. During the outage, all VM guests were unable to access disk resources and eventually had to be powered off to allow cluster recovery. At 10pm the cluster was stable, at which point all VM guests were brought back online.
VMware hypervisors were being reconfigured to increase network throughput when the distributed storage system detected inconsistencies. The VM environment automatically isolated the hypervisors to prevent data loss and, at the same time, removed disk write access from the VM guests. We immediately contacted VMware critical support technicians, who assisted in recovering from this outage and provided additional information on avoiding similar incidents in the future.
Following this incident, we are implementing additional notification processes to communicate outages to users promptly, and we will perform system infrastructure changes during announced maintenance periods.