Emergent Downtime
We had some downtime after a failure at our hosting facility.
We would like to address the power loss which occurred in our Virginia
Datacenter on Wednesday, June 13th. We are still investigating the
root cause, but in the interest of full disclosure, here are the facts
as we know them today. A more complete post-mortem will be sent to you
as soon as possible.
Mmm, full disclosure and analysis. What a neat idea.
That sounds like server beach…
Yup!
– A massive wind and hail storm struck the mid-Atlantic region of the
United States including Northern Virginia yesterday afternoon.
– Our internal monitoring system alerted us that the local power grid
dropped at approximately 4:01 pm EDT.
– We have three generators on site:
– – The first generator failed to start up correctly. We are not sure
why and our engineers are investigating further.
– – The second generator started as expected.
– – The third generator started initially but soon failed for an
unknown reason. We are investigating root cause here as well.
– Because we had only partial generator power, not all the Power
Distribution Units (PDUs) were receiving power.
– Power was restored to approximately 50% of our servers in a matter
of minutes. The remaining ~50% required further attention from our
engineers.
– All power was restored at approximately 6:15 pm EDT although there
were isolated power fluctuations over the next two hours.
All available technicians were on-site throughout this incident and
have remained in the datacenter over the last 24 hours to ensure all
servers came back online.
This business about generators not starting seems surprisingly common. I wonder why? It’s not like firing up an internal combustion engine is new tech. There must be something more to it that I just don’t get.
Lack of testing. When was the last time they tested all three generators in a simulated power-outage situation? It’s all too rare. At my last position, we insisted on quarterly tests.
And honestly, getting a generator to run is not the problem area. Getting it to run and provide power to where it’s needed is the problem area. If you don’t do proper testing during installation and follow it up with drills, these things never run the way you expect them to in a real emergency (also, make sure the fuel pump is not run by electrical power off the grid 🙂 )