bertran
For the geekiest
Monday, October 18th 2010, 3:23 PM
I'm posting what the company that manages the servers has told its direct customers. Basically it explains that a router failed, and they shift the blame by demanding explanations from the manufacturer of the router that supposedly failed.
Date: 18/10/2010
Time: 09:20
Duration: <270 mins
Affected Service(s): IS-01366, IS-01611, IS-01612
The issue affected servers in the North side of our Maidenhead Data Centre, Spectrum House (RSH-North). Approximately 50% of servers were affected, so generally speaking 50% of the servers listed above would have been affected. It is not possible to say in retrospect which ones these were, because our monitoring servers were also affected and hence recorded an outage for all servers, including those at other data centres that were not affected at all. We have cleared this misleading monitoring data.
At ~08:55, we became aware of a network issue affecting some servers in the North side of our Maidenhead Data Centre (RSH-North). Approximately half of these servers were experiencing connectivity problems ranging from packet loss to total loss of connectivity; other servers were unaffected and responding as normal. Our network monitoring server was amongst those fully affected by this problem and therefore reported a total outage, including for servers hosted at other data centres that were not affected at all. We are in the process of clearing this misleading monitoring data.
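[Side note, not part of RapidSwitch's report: the monitoring data was misleading because the monitoring server sat inside the affected segment, so every host looked down from its vantage point. Below is a minimal Python sketch of the idea of cross-checking against a second vantage point before declaring an outage; the hostnames and the ping-based check are hypothetical and assume Linux-style ping flags.]

# Hypothetical sketch, not RapidSwitch tooling: only declare a host down if a
# reference host outside the local segment is still reachable; otherwise the
# monitoring node itself may be the one that is cut off.
import subprocess

def reachable(host: str, label: str) -> bool:
    """Send a single ping with a 2-second timeout (Linux-style flags)."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host], capture_output=True)
    ok = result.returncode == 0
    print(f"{label}: {host} {'reachable' if ok else 'unreachable'}")
    return ok

def check_with_second_opinion(target: str) -> str:
    local_ok = reachable(target, "local monitor")
    reference_ok = reachable("reference.example.net", "reference host")  # hypothetical host
    if local_ok:
        return "up"
    if not reference_ok:
        return "monitoring vantage point impaired; result inconclusive"
    return "down"

if __name__ == "__main__":
    print(check_with_second_opinion("server1.example.net"))  # hypothetical target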
The issue we detected was affecting both the primary and secondary Cisco 6500 network systems, which are configured as a VSS-1440 redundant cluster. We ran through our emergency procedures to identify the problem, but all tests were responding within normal parameters.
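[Side note, not part of RapidSwitch's report: a VSS-1440 joins two Catalyst 6500 chassis into one logical switch, with one supervisor active and the other in standby. Checks of the kind described above can be scripted; the sketch below uses the third-party netmiko library and standard IOS show commands, with a placeholder address and credentials.]

# Hypothetical sketch of polling a Cisco 6500 VSS pair's redundancy state.
from netmiko import ConnectHandler

DEVICE = {
    "device_type": "cisco_ios",
    "host": "vss-core.example.net",  # placeholder management address
    "username": "netops",            # placeholder credentials
    "password": "changeme",
}

# Standard IOS commands for inspecting VSS and supervisor redundancy.
CHECKS = [
    "show switch virtual",             # VSS domain and active/standby chassis roles
    "show switch virtual redundancy",  # per-chassis redundancy details
    "show redundancy",                 # supervisor redundancy state
]

def run_checks() -> None:
    conn = ConnectHandler(**DEVICE)
    try:
        for cmd in CHECKS:
            print(f"=== {cmd} ===")
            print(conn.send_command(cmd))
    finally:
        conn.disconnect()

if __name__ == "__main__":
    run_checks()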
After finishing our emergency procedures without identifying a specific problem, we raised a case with Cisco TAC at ~10:10. A Cisco engineer then logged into our routers to try to identify the problem. After 3 hours, the Cisco engineer was unable to provide a resolution; we understood the problem was either a software bug within the routers or a hardware fault.
We took matters into our own hands at ~13:20 and decided to reboot both routers. This affected all servers on the RSH-North data floor, as the routers take about 15-20 minutes to reload. During the reload, the primary router failed to boot up normally. The secondary router booted normally, and our monitoring showed that service was restored as a result.
Our conclusion is that the failure of the primary Cisco 6500 to boot indicates a hardware problem. We take full responsibility for all the infrastructure required to provide you with a reliable service, and therefore we asked Cisco to provide an answer to these questions:
1) Why were Cisco unable to diagnose a hardware fault within a 3-hour time frame?
2) Why did traffic not automatically fail over to the secondary 6500, as designed?
Cisco commented that they do not know for sure if this is a hardware problem, and so were unable to provide a specific response to these two questions. Clearly these are very important questions that need to be answered, and we will continue to work with Cisco to provide a full and adequate response to them.
Regards,
The RapidSwitch Team