Safety-Critical Systems

The root causes of the Therac-25 accidents were an overreliance on software to enforce physical safety functions and buggy, untested software that responded poorly to real-time conditions. I believe these issues could have been largely averted by slowing the workflow down: clarifying the procedure to be performed before it began and requiring explicit confirmation from the operator at each step. Such inconvenience is a small price to pay for the lives of patients.
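One way to picture the "confirm before proceeding" idea is a check that refuses to start a procedure until the operator independently re-enters the parameters and they match the plan. This is a hypothetical sketch, not the Therac-25's actual interface; the types, field names, and values are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreatmentPlan:
    """Illustrative treatment parameters (not real Therac-25 fields)."""
    mode: str       # e.g. "electron" or "x-ray"
    dose_mrad: int  # prescribed dose


def confirm_plan(plan: TreatmentPlan, operator_entry: TreatmentPlan) -> bool:
    """Proceed only if the operator's re-entered parameters match the plan.

    Any mismatch blocks the procedure instead of silently continuing,
    trading a little operator inconvenience for a hard safety gate.
    """
    return plan == operator_entry


plan = TreatmentPlan(mode="electron", dose_mrad=180)
assert confirm_plan(plan, TreatmentPlan(mode="electron", dose_mrad=180))
assert not confirm_plan(plan, TreatmentPlan(mode="x-ray", dose_mrad=180))
```

The design choice here is that the default outcome of any discrepancy is "do nothing," which is the safe state for a radiation machine.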

One of the most fascinating parts about working for Google this summer was learning how they test their products. Specifically, a feature often has far more lines of test code than lines of code implementing the feature itself, particularly for features with a direct revenue stream. One of my friends worked in Google Ads, which accounts for a significant portion of Google’s revenue. At one meal, he explained to me the many safeguards in place to make sure Ads never went down. If even a portion of the service were down for 10 minutes, it could cost Google millions in revenue! Pushing a commit to production requires several layers of approval and can take weeks (as it did for my team) to be fully approved. Google had such a strong testing culture that it posted weekly “Testing on the Toilet” memos with tips on making test cases specific and comprehensive. Though it was frustrating to write exhaustive test cases, it was worth doing for the integrity of the product and future versions.
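The lopsided test-to-code ratio described above is easy to illustrate. Below is a hypothetical sketch (not Google code): a four-line function guarded by roughly three times as many lines of tests covering normal inputs, sign edge cases, and the failure path.

```python
def safe_divide(a: float, b: float) -> float:
    """Return a / b, raising a clear error instead of crashing on b == 0."""
    if b == 0:
        raise ValueError("division by zero")
    return a / b


# Test cases outnumber the production lines: the happy path alone
# is not enough, so negatives, zero, and the error path get coverage too.
assert safe_divide(6, 3) == 2
assert safe_divide(-6, 3) == -2
assert safe_divide(0, 5) == 0
try:
    safe_divide(1, 0)
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```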

Such safety measures to ensure steady revenue should also be taken to ensure the safety of consumers and their data. The recent Equifax breach highlights gross irresponsibility in protecting sensitive consumer data. Last month it was announced that 146 million Americans’ personal information was compromised after Equifax failed to apply a security patch for the Apache Struts web framework for two months after it was released. Subsequent investigations found that typing “admin” as both the username and password for a particular Equifax login granted access to hundreds of thousands of records. These incidents could have been prevented with simple security procedures in place, such as prompt vulnerability patching and basic account-management guidelines.
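The account-management failure in particular is the kind of thing an automated check catches trivially. As a hedged sketch (the policy values are my own assumptions, not Equifax's or any standard's), a system could simply refuse credentials that match the username or a known-weak default:

```python
# Illustrative weak-credential check; the word list and length threshold
# are assumptions chosen for the example, not an official policy.
WEAK_DEFAULTS = {"admin", "password", "root", "guest", "123456"}


def credential_is_acceptable(username: str, password: str) -> bool:
    """Reject passwords equal to the username, on a default list, or too short."""
    if password.lower() == username.lower():
        return False
    if password.lower() in WEAK_DEFAULTS:
        return False
    return len(password) >= 12


# The reported "admin"/"admin" login fails two of the three checks.
assert not credential_is_acceptable("admin", "admin")
assert credential_is_acceptable("admin", "correct-horse-battery")
```

Running a check like this over existing accounts is a one-afternoon audit, which is what makes the breach feel like negligence rather than bad luck.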

Even more important than consumer data is human safety, which often intersects with technology in transportation and medicine, as we saw with the Therac-25 incident. One of the most fascinating parts about attending talks at Google was learning about the various ways they employ redundancy to safeguard against malfunctions. A relevant safety-critical system in production there is Waymo’s autonomous vehicles. These self-driving cars can be seen zipping about Mountain View day and night. One of their most important features is having two computers, one serving as a fallback in case the first malfunctions. Further, the cars are designed such that control “waterfalls”: if power is cut, each component down to the sensors has redundancy to safely stop the vehicle.
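The fallback pattern described in that talk can be sketched in a few lines. This is a toy model of my own, not Waymo's architecture: the point is only that control cascades from primary, to secondary, to a fail-safe default of stopping.

```python
def drive_command(primary_ok: bool, secondary_ok: bool) -> str:
    """Cascade control: primary computer, then fallback, then fail-safe stop.

    Toy illustration of redundancy 'waterfalling'; real autonomous-vehicle
    stacks are vastly more complex.
    """
    if primary_ok:
        return "primary drives"
    if secondary_ok:
        return "secondary pulls over safely"
    return "cut power and brake"  # the default when everything fails is the safe state


assert drive_command(True, True) == "primary drives"
assert drive_command(False, True) == "secondary pulls over safely"
assert drive_command(False, False) == "cut power and brake"
```

Notice that, as with the Therac-25 confirmation idea, the unhandled case resolves to the safest action rather than to continued operation.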

Software engineers must take these systems seriously and be held responsible for inadequate testing. While it may be impossible to anticipate every edge case in code, developers can certainly follow best practices and consult with others to ensure their code covers a wide range of scenarios.