SDSC Outage Notification – Core Networking – 16 May 2016, 23:30 – 17 May 2016, 23:30

(23:30) The remaining switch chassis was brought online. One 1GbE port was deactivated pending discussion with the customer. Please contact SDSC operations if you are still experiencing network connectivity issues.

(22:51) The majority of switchgroup N77 is online; web/db and ITSS service has been restored. One chassis, serving one project, remains offline. The project has been contacted directly.

(22:20) Services such as Oracle, hosted websites, MySQL, and PostgreSQL, along with a limited number of colocated systems, are still offline while the switchgroup is investigated.

(22:10) SDSC engineers rebooted each switchgroup individually, isolating the one group that appears to be causing the problem. All groups except the N77 group have been returned to service.

(20:07) SDSC engineers plan to reboot each of the core distribution network switches, one at a time. Systems connected directly to the core switches may experience outages, while systems connected to top-of-rack switchgroups should fail over between uplinks and not experience an outage.

(19:02) Links that were disabled for testing have been restored. Service access is being restored.

(18:59) A wider network outage has just occurred. Services like core NFS, project storage, and managed systems are unavailable.

(18:31) SDSC engineers are checking each physical interconnect link and collecting flow data. The scope of affected systems has not changed since the beginning of the outage.

(15:14) The vendor has advised investigating physical connections and cabling between the routers and switches. Vendor technical support continues to investigate with SDSC engineering staff.

(12:17) Engineers have started to move paths to the secondary router. The issue appears to be that the VRRP gateway IP (typically .1 in the subnet) is frequently disappearing for some hosts. The router static gateways (.2 and .3) continue to be available during this time. While the .1 address is available, affected systems are routed normally and can reach the wider networks.
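
For administrators who want to check this symptom from an affected host, the following is a minimal sketch, not an SDSC tool. It assumes the gateway addresses are the first three host addresses of the subnet (.1 virtual, .2 and .3 static), uses the documentation range 192.0.2.0/24 as a placeholder subnet, and leaves the actual reachability test (for example, a ping) to the caller. The function and key names are hypothetical.

```python
import ipaddress


def gateway_addresses(cidr):
    """Return the assumed gateway addresses for a subnet.

    Assumption: the VRRP virtual gateway is the first host address (.1)
    and the two routers' static addresses are the next two (.2, .3).
    """
    base = ipaddress.ip_network(cidr).network_address
    return {
        "vrrp_virtual": str(base + 1),
        "router_static_a": str(base + 2),
        "router_static_b": str(base + 3),
    }


def classify(reachable):
    """Summarize reachability results gathered by the caller.

    `reachable` maps the role names above to True or False.
    """
    if reachable["vrrp_virtual"]:
        return "gateway available"
    if reachable["router_static_a"] or reachable["router_static_b"]:
        # Matches the symptom described: .1 missing, .2/.3 still answering.
        return "virtual gateway missing, routers still reachable"
    return "no gateway reachable"


if __name__ == "__main__":
    # 192.0.2.0/24 is a placeholder; substitute the host's own subnet.
    print(gateway_addresses("192.0.2.0/24"))
```

A result of "virtual gateway missing, routers still reachable" would correspond to the behavior described in this update.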

(10:07) The problem has been escalated to the hardware vendor. The secondary internal router has been disabled to attempt to fail all paths to the primary unit.

(08:08) Network engineers continue to investigate.

(06:35) The headline timestamp has been modified to reflect a more accurate original outage time.

(06:23, 17 May 2016) SDSC central networking has been experiencing a sporadic loss of connectivity to assorted hosts and networks since around 00:30 on 17 May 2016. Affected hosts appear to be scattered across VLANs and subnets, in both offices and the datacenter.

More will be posted as information becomes available.
