World’s Worst Software Bugs: The Therac-25 Disaster
How Poor Testing Practices Blasted Patients With Radiation
In June 1985, a woman named Katy Yarbrough visited an oncology center in Georgia.
She went into the treatment room, lay down on the table, and the operator set up the machine that would shoot focused beams of radiation at her breast cancer. It was all fairly routine.
But when the machine turned on, Yarbrough suddenly felt what she later described as a “red-hot sensation,” like she’d been burned. It turned out the machine had hit her with far more radiation than was safe, and she eventually lost the use of her shoulder and arm.
Within two years, five other patients had similar stories, and three of them died.
The machine in question was called the Therac-25, and today, it’s notorious for containing some of the worst computer bugs of all time.
The Therac-25 could treat cancer in two main ways. In electron mode, it would deliver a more spread out beam of radiation to target shallow cancers like skin cancer. In photon mode, the radiation was more powerful and could penetrate deeper into the body, but it was also much narrower, to minimize the possible damage to healthy tissues.
To set up a treatment session, the operator would choose one of these modes and select the appropriate dose of radiation. Then the machine would automatically move certain parts into position to focus the radiation beam based on the settings. Sometimes minor error messages would pop up, but all you had to do was press a button to proceed with the treatment anyway. The operators knew that for major errors, the software was designed so you couldn’t proceed with the treatment.
Except, of course, when the software missed the problem.
When the FDA finally shut down all Therac-25 machines pending a full investigation, the manufacturer, AECL, discovered two main bugs that were causing the overdoses.
One involved the routine that moved the machine’s components into position based on what type of beam you wanted. It took about eight seconds to set things up, and if you changed the setting within the first second, the machine would reset the positioning accordingly. But if you switched the setting after the first second and before the components had finished moving, the software just kept doing what it was already doing. It also wouldn’t detect that anything was wrong.
That’s how, in the first five accidents, the machine delivered the much higher dose of radiation it would normally use for the narrower, focused photon mode — even though its magnets were still set up for the more spread-out beam of electron mode.
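To make that timing window concrete, here is a minimal Python sketch of the behavior as this article describes it. It is not AECL's code: the class, its method names, and the simulated clock are all hypothetical, and only the one-second and eight-second figures come from the description above.

```python
from dataclasses import dataclass

SETUP_SECONDS = 8.0        # approximate positioning time, per the article
EDIT_WINDOW_SECONDS = 1.0  # edits inside this window restart positioning

@dataclass
class SetupSimulation:
    """Hypothetical model of the setup sequence, using a simulated clock."""
    requested_mode: str        # what the operator's screen says
    positioned_mode: str = ""  # what the hardware is actually moving toward
    started_at: float = 0.0

    def start(self, now: float) -> None:
        # Begin positioning the hardware for whatever is requested right now.
        self.positioned_mode = self.requested_mode
        self.started_at = now

    def edit_mode(self, new_mode: str, now: float) -> None:
        self.requested_mode = new_mode
        if now - self.started_at <= EDIT_WINDOW_SECONDS:
            # Early edits are noticed: positioning restarts for the new mode.
            self.start(now)
        # BUG (as described): a later edit changes the request, but the
        # positioning routine never looks again and nothing flags the mismatch.

    def is_consistent(self) -> bool:
        # The check the described software was missing before firing the beam.
        return self.requested_mode == self.positioned_mode

# An edit half a second in is picked up.
early = SetupSimulation(requested_mode="electron")
early.start(now=0.0)
early.edit_mode("photon", now=0.5)

# An edit three seconds in is silently ignored by the positioning step.
late = SetupSimulation(requested_mode="electron")
late.start(now=0.0)
late.edit_mode("photon", now=3.0)

print(early.is_consistent(), late.is_consistent())
```

In this sketch, the second case leaves the screen saying one thing while the hardware is set up for another. Comparing the two right before the beam fires, as `is_consistent` does, is one simple way to catch the mismatch.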
AECL actually rolled out a fix for that bug before the FDA forced a full shutdown, telling hospitals to remove a few keys from the keyboard so you couldn’t change the settings while the parts were moving.
But then the machine overdosed and killed the sixth patient, and it turned out that was an entirely separate problem. That bug came from the limitations of computer memory.
See, the software had a safety mechanism in place to keep the radiation from turning on while it was still in its testing mode.
That safety mechanism relied on one variable, a number that would increment over time in test mode. Out of test mode, the variable would be set to 0, which would allow the radiation beam to turn on.
Except that variable was only stored in one byte of memory, and in the binary number system of computers, one byte can only count from 0 to 255.
So, once the variable hit 255, the next increment would roll it over to 0, and it would continue counting upward from there. The operator setting up the machine during the final accident just happened to hit the “set” button during the fraction of a second when the variable was equal to 0, and the radiation beam turned on.
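The rollover is easy to reproduce. The Python sketch below imitates a one-byte counter by masking the value to 8 bits (Python integers don't overflow on their own). The function names are made up for illustration, and the logic follows this article's description rather than the real Therac-25 source.

```python
def increment_one_byte(counter: int) -> int:
    """Add 1 to a value held in a single unsigned byte; 255 wraps to 0."""
    return (counter + 1) & 0xFF

def beam_allowed(flag: int) -> bool:
    # Per the article: 0 means "not in test mode", so the beam may turn on.
    return flag == 0

# Start in test mode with the flag at 1 and keep incrementing it.
flag = 1
ticks_until_unsafe = 0
while not beam_allowed(flag):
    flag = increment_one_byte(flag)
    ticks_until_unsafe += 1

print(ticks_until_unsafe)  # 255 increments later, the flag reads 0 again

def mark_test_mode() -> int:
    # One safer alternative: assign a fixed nonzero value instead of counting,
    # so the flag can never wrap around to the "safe" value by accident.
    return 1
```

The point is that a counter doubling as a yes/no flag will eventually pass through every value it can hold, including the one that means "go ahead."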
A machine with these life-threatening bugs should never have been allowed on the market. But thanks to a combination of oversights, it happened anyway.
AECL never released any information about the programmer who worked on the Therac-25. But we do know it was one guy who basically took the software from earlier machines, the Therac-6 and Therac-20, and spruced it up for the new model.
AECL did do some safety and reliability testing, but those analyses were based on whether the hardware components would wear out, not whether there were problems with the software itself. After all, the software ran fine on the earlier machines, right?
Well, it turns out these bugs did exist in the software for those machines. But the earlier machines were designed differently, with safety mechanisms set up directly in the hardware to make sure that no matter what, they couldn’t deliver an unsafe dose of radiation.
The thing is, those earlier machines also had a manual mode where the operator could set up the radiation beam without using the computer. For the Therac-25, AECL removed the manual mode, figuring that computers were faster and better at the setup than humans.
At the same time, since computers were obviously so reliable, they removed the safety interlocks in the hardware and relied on the software to prevent any overdoses.
As software engineers, we do our best to learn from the mistakes of the past. Today there are much stricter requirements before a potentially lethal device like the Therac-25 is allowed on the market, partly because of this story.
Even now, more than three decades later, what happened with the Therac-25 still serves as an important reminder of the dangers of complacency.
Bugs happen. It’s a fact of life. But if AECL had accepted that and conducted better safety testing, they might have caught the bugs and saved these patients’ lives.