Learning Lessons from Incidents
After the February 2017 S3 incident, Amazon posted this:
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. ("Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region")
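Read operationally, the safeguard Amazon describes is a rate limit on removal plus a floor check against minimum required capacity. Here is a minimal sketch of that pattern; the class, parameters, and numbers are my own illustration, not Amazon's tooling.

```python
class CapacityPool:
    """Toy model of the safeguard pattern: capacity removal is rate-limited
    and refused when it would drop a subsystem below its required floor."""

    def __init__(self, capacity: int, minimum: int, max_removal_per_step: int):
        self.capacity = capacity
        self.minimum = minimum                              # minimum required capacity level
        self.max_removal_per_step = max_removal_per_step    # remove capacity slowly, in bounded steps

    def remove_capacity(self, requested: int) -> int:
        """Remove at most one bounded slice, never going below the floor.
        Returns how much capacity was actually removed this step."""
        step = min(requested, self.max_removal_per_step)
        if self.capacity - step < self.minimum:
            step = max(0, self.capacity - self.minimum)
        self.capacity -= step
        return step


# A fat-fingered request for far too much removal is clamped to a safe step.
pool = CapacityPool(capacity=100, minimum=80, max_removal_per_step=5)
removed = pool.remove_capacity(90)
print(removed, pool.capacity)   # 5 95
```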
How often do you see public lessons like this in security?
"We have modified our email clients to not display URLs which have friendly text that differs meaningfully from the underlying anchor. Additionally, we re-write URLs, and route them through our gateway unless they meet certain criteria..."
Relatedly, see Etsy's Debriefing Facilitation Guide. Also, many people are describing this as "human error," which reminds me of Don Norman's "Proper Understanding of 'The Human Factor'":
...if a valve failed 75% of the time, would you get angry with the valve and simply continue to replace it? No, you might reconsider the design specs. You would try to figure out why the valve failed and solve the root cause of the problem. Maybe it is underspecified, maybe there shouldn't be a valve there, maybe some change needs to be made in the systems that feed into the valve. Whatever the cause, you would find it and fix it. The same philosophy must apply to people.
(Thanks to Steve Bellovin for reminding me of the Norman essay recently.)