Reading 05 – Therac-25

There were many things that went wrong with the Therac-25. First, major portions of the code were borrowed from an older machine that had little in common with the new one. The developers made multiple assumptions about this old code, claiming that years of use of the old machines amounted to sufficient testing. At the same time, they felt comfortable handing many of the machine's safety checks over to that software, checks that had previously been enforced by hardware interlocks. These assumptions were the primary root of the problem. This confidence in the old code is what led to a very thin software development process carried out by a single programmer. The code was never rigorously evaluated, and most testing was done either on a simulator or in tightly controlled environments.

This overconfidence in the code became even more apparent later on: after receiving reports of patients dying following their treatments, the company was still reluctant to admit that there were errors in the system. When they eventually went in to check, they thought they had found the bug, and after making a small fix they boasted that “analysis of the hazard rate of the new solution indicates an improvement over the old system by at least five orders of magnitude.” Yet after this bold claim, more deaths were reported.

Even though the bugs were obscure ones (one a race condition that appeared only when the operator edited the prescription within a narrow window of time, the other an overflow of a one-byte counter that occurred once every 256 increments), the company had multiple chances to discover them. It should not have taken more than the first death for them to launch a rigorous investigation of the error. True, accidents happen, but everything possible should be done to prevent them. Today there are better software development practices in place that at the very least reduce these dangers, including extensive testing on the physical hardware, review by external parties, and prompt investigation and patching of reported errors.
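To make the second failure mode concrete, here is a minimal sketch of how a one-byte counter that is incremented rather than set can silently defeat a safety check. This is illustrative C, not the machine's actual PDP-11 assembly; the variable name is borrowed from the "Class3" flag described in Leveson and Turner's investigation of the accidents, and everything else is a hypothetical simplification.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only: a simplified model of the kind of one-byte
 * counter overflow reported in the Therac-25. Names and structure are
 * hypothetical, not the actual machine code. */

static uint8_t class3 = 0;    /* shared one-byte flag, incremented each setup pass */

/* Stand-in for the hardware position check; assume a fault is present. */
static int turntable_misaligned(void) { return 1; }

static int setup_pass(void)
{
    class3++;                  /* bug: increment instead of setting a fixed nonzero value */
    if (class3 != 0) {
        if (turntable_misaligned())
            return 0;          /* safety check catches the fault */
    }
    /* on every 256th pass class3 wraps to 0 and the check is silently skipped */
    return 1;                  /* setup would be allowed to proceed */
}

int main(void)
{
    for (int pass = 1; pass <= 512; pass++) {
        if (setup_pass())
            printf("pass %d: safety check skipped (class3 wrapped to 0)\n", pass);
    }
    return 0;
}
```

In the real machine, the consequence of the skipped check was that treatment could proceed with the equipment in an unsafe state; the eventual fix, as reported, was essentially as small as the bug, setting the flag to a fixed nonzero value instead of incrementing it.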

Given the nature of this error, AECL is 100% liable for these incidents. In other situations, however, the line may be harder to draw. To help with this, strict legislation should be passed regarding the process by which safety-critical systems are developed. For example, one possible rule is requiring that the code for such systems be reviewed by at least two external parties; other rules would be added as the application domain requires. In general, the goal would be to ensure that unless a company makes an honest effort to address all dangers, both preventively and after an accident, it will be held liable for all damage caused.

What we can learn from the Therac-25 incident is that programmers, and the people in charge of programmers, should exhibit great skepticism about the correctness of their code. There is no perfect programmer. Bugs happen and are extremely common. I hope we never hear of another incident like the Therac-25, but the only way to make that happen is to prepare future programmers with the lessons learned from this one.

In that regard, I’m disappointed that in my four years at Notre Dame I have seen very little focus on error checking and writing robust code. Error checking is usually seen as cumbersome and unnecessary for our domain, but it is an important skill for us to learn and master. Many of us will oversee safety-critical systems, yet few of us will be trained for that by our degrees. My wish is that this changes in the future, to the point that a course targeted specifically at writing robust code becomes part of the core curriculum. I’m not sure how prevalent such courses are at other schools, but adding them would be a great step for all computer science programs to take. Until then, we can at least hope that enough programmers have heard of the Therac-25 that they can help prevent this from happening again.