Comparing Retrospectives
We can learn a lot from comparing retrospectives
Last week, I read two retrospectives. One, by Microsoft, was Results of Major Technical Investigations for Storm-0558 Key Acquisition, while the other was Thornton Tomasetti’s Arecibo Telescope Collapse Forensic Investigation.
The contrast between the reports is remarkable, and so I’m gonna remark. Reports like these serve several purposes. The most important is to create and preserve an authoritative report of what happened, and ideally to extract lessons from it so the organization and perhaps others can learn from experience. Another is to reassure stakeholders that the organization has done so. In this post, I’m going to look at the reports as a mismatched pair, and ask what we can learn about reports by comparing and contrasting them. Smart retrospectives are important not only to our ability to anticipate and manage within our organization; sharing them helps us all get smarter faster, and get a higher return on our threat modeling work.
Looking at these reports, the first thing that jumps out is the length. Microsoft’s post is 748 words. I believe the executive summary of the Arecibo report is longer. Certainly, the full report, at 362 pages (with appendices), is longer: nearly a page for every two words in the Microsoft post. The Arecibo report is authored by three named and credentialed authors. The Microsoft report isn’t even attributed beyond “by MSRC.” Having written reports like these, I have little doubt that’s because the list of lawyers and marketing people is far longer than the list of engineers who wrote the draft and checked what emerged.
I did want to comment on one specific point made by Microsoft: “Due to log retention policies, we don’t have logs with specific evidence of this exfiltration by this actor, but this was the most probable mechanism by which the actor acquired the key.” Those log retention policies were last revisited when disk cost dollars per megabyte, and after another breach at Microsoft, even less well documented, I advocated for longer retention, using free disk space to hold arbitrary amounts of old logs. I tweeted about that here (where Karl Baron points out a GDPR question), and also remembered some of the complexities. To be clear: log retention policies are a security design choice, and logs seem to be deleted by design more than by attackers.
The log retention issue, unlike several other issues, is not followed by a parenthetical “(this issue has been corrected).” It’s hard to interpret that omission.
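To make the "use free disk space" idea concrete, here is a minimal sketch of a retention rule that deletes old logs only when space runs short, rather than because a calendar says so. Everything in it is hypothetical: the `LogFile` record, the `select_for_deletion` function, the sizes and the threshold are mine for illustration, and none of it describes how Microsoft actually manages logs. It also ignores legal limits on retention, such as the GDPR question mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFile:
    # Hypothetical record describing one archived log file.
    name: str
    size_bytes: int
    age_days: int


def select_for_deletion(logs, free_bytes, min_free_bytes):
    """Pick logs to delete, oldest first, only until free space recovers."""
    to_delete = []
    # Age decides the order of deletion, never whether deletion happens.
    for log in sorted(logs, key=lambda entry: entry.age_days, reverse=True):
        if free_bytes >= min_free_bytes:
            break  # enough space again, keep everything that is left
        to_delete.append(log.name)
        free_bytes += log.size_bytes
    return to_delete


logs = [
    LogFile("auth-2021.log", 400, 800),
    LogFile("auth-2022.log", 300, 450),
    LogFile("auth-2023.log", 200, 90),
]

# With plenty of free space, nothing is deleted, however old it is.
plenty = select_for_deletion(logs, free_bytes=10_000, min_free_bytes=1_000)

# Under pressure, the oldest logs go first and deletion stops early.
tight = select_for_deletion(logs, free_bytes=500, min_free_bytes=1_000)
```

The design choice is that the trigger for deletion is a shortage of space, not age, so evidence survives for as long as the disks allow.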
There are more ways to use these reports to reflect on how we learn from experience than simply comparing their length. They include:
- Private vs public. Microsoft is a private firm, not a public agency. We reasonably expect more transparency from government (“our tax dollars at work”) than we do from companies. On the other hand, in this instance, Microsoft was serving at least two government departments, and this incident has gotten attention from the Senate, and so there’s an argument for more transparency than if serving private companies. (I’ll return to this.)
- Physical vs software. The failures of the Arecibo were failures of physical systems, which can be easier to observe or analyze.
- Records. The engineering diagrams and plans of record for the telescope go back 53 years, and specifically record as designed, as built, and modifications. They are specific and clear, and the report includes, for example, scanned in cable tension tables from the structural drawings. We rarely invest in similarly clear records for software, and don’t require an engineer of record to sign off on the designs or changes.
- Standards. The Arecibo report cites (for example) the American Society of Civil Engineers ASCE 19-16 Structural Applications of Steel Cables for Buildings and the American Association of State Highway and Transportation Officials, Bridge Design Specifications (page 13). There are fewer standards for software construction, and those that we have are far less specific.
- Adversaries vs nature. The Microsoft breach was the result of action by “Storm-0558”, who are presumed to be the Chinese government. People are able to adapt and adjust their attacks, and there is a faction that argues that Microsoft should not “tip its hand” as to what it knows about the intrusion, in case there are things that Microsoft knows that Storm-0558 doesn’t know that Microsoft noticed or knows. There is no need for this hall of mirrors in dealing with a failed zinc-filled spelter socket assembly.
- Specific lessons. The telescope failure provides lessons that can clearly be applied elsewhere, including choices of safety factors in similar systems (especially those exposed to hurricanes or earthquakes), and the inadvisability of ignoring cable slip. It is less clear if the issues in the Microsoft report are generally applicable, but this is a defect of the Microsoft report. What, precisely, was the race condition that allowed key material in a crash dump? That is, what process or thread was racing with what? Are the issues “The key material’s presence in the crash dump was not detected by our systems”, and “Our credential scanning methods did not detect its presence” discussing the same issue or separate issues? Both are commented as “this issue has been corrected,” which indicates that perhaps they’re separate issues? The “post incident review” lists them separately as items 2 and 3.
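To show why the level of detail matters for "credential scanning," here is a minimal sketch of one thing such a scan could do: search the bytes of a dump for PEM private-key headers. This is an assumption for illustration only. Microsoft's post does not say how its scanning works, and the names `PEM_PRIVATE_KEY`, `find_key_markers` and the sample data are hypothetical.

```python
import re

# Matches PEM headers such as "-----BEGIN PRIVATE KEY-----" or
# "-----BEGIN RSA PRIVATE KEY-----". Assumption: the key is stored as PEM text.
PEM_PRIVATE_KEY = re.compile(rb"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def find_key_markers(dump: bytes) -> list:
    """Return the byte offsets at which a PEM private-key header starts."""
    return [match.start() for match in PEM_PRIVATE_KEY.finditer(dump)]


# Made-up stand-ins for crash dump contents.
sample_dump = b"\x00\x01heap..." + b"-----BEGIN RSA PRIVATE KEY-----\nMIIB" + b"\x00" * 8
clean_dump = b"\x00" * 64 + b"-----BEGIN CERTIFICATE-----"

offsets = find_key_markers(sample_dump)    # one hit, at the start of the header
no_hits = find_key_markers(clean_dump)     # certificates are not private keys
```

A scan like this would miss a key held in memory as raw binary rather than PEM text. Whether Microsoft's gap was of that kind, or something else entirely, is exactly the sort of specific the report doesn't give us.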
I said that I’d return to the issue of transparency. There are several goals which transparency can support. They include:
- Recording specifics for learning and sharing engineering lessons
- Informing investors
- Reassuring customers
As I discuss in the “specific lessons” point above, the blog post falls short on the first goal: it records too few specifics for others to learn engineering lessons from it.
With regard to informing investors, in recent guidance, the SEC has encouraged firms to disclose material cybersecurity incidents to allow investors to incorporate that into their investment decisions. Microsoft has apparently decided that this issue is not material, and materiality is a legal term. I’ll simply argue that how the flaws crept into Microsoft’s operational systems and were not discovered is information that an investor might want to understand.
Microsoft states that “As part of our commitment to transparency and trust, we are releasing our investigation findings.” I would like to encourage them to demonstrate that commitment to transparency with a release of the full internal report, possibly with small redactions.
[Update Sept 25: Fixed link to Arecibo report.]
Image by midjourney, “a bright watercolor of storm clouds passing over a corporate campus, filled with low square concrete buildings. All are surrounded by well manicured lawns. In the background is the iconic arecebo radio telescope --ar 8:3”