Software automated tests versus medical tests
Medical tests can teach us something about how we interpret automated software test results.
Ever have blood work ordered by your doctor? Say you’re having some stomach pain and it’s worse than the usual upset stomach. If you see your doctor he might order some “liver tests” with names like ALT (alanine aminotransferase), bilirubin, and alkaline phosphatase. Each test result will be printed next to a “normal range”, like bilirubin = 1.0 (0.3 to 1.9) mg/dL. Even if one or two of the tests are out of range, your doctor may say they are “normal” and tell you your liver is working OK. Doctors have been doing this sort of laboratory test analysis for a hundred years. Can it teach us anything about unit tests, integration tests, and the kinds of results we encounter in software?
Software is probably the most complex thing humans have ever made. If each individual variable were the equivalent of a mechanical “moving part,” then even a medium-sized program would be more complicated than the Space Shuttle. But the human body is orders of magnitude more complex than even a complex software system. How do blood tests probe that complexity in a meaningful way?
The body is packed with highly evolved hacks. Evolution is a master of reuse: accidentally clone a gene, pick up a few point mutations or rearrangements and boom, new protein that does something different than the original version. Repeat over millions of years and you get organs like the liver, which has all of these functions:
- Filtering blood immediately after it absorbs nutrients from food in the intestine
- Processing the breakdown products of red blood cells
- Manufacturing glucose
- Manufacturing bile to help digest food
Doctors deal with this complexity by carefully considering whether or not a test result actually predicts a disease. ALT is a protein that exists primarily inside of liver cells. Injure the liver somehow (punch it, poison it with alcohol, plug it up with a gallstone) and the cells rupture, leaking ALT into the blood where it can be detected by a test. However, there’s also some normal turnover of liver cells, so even in people who aren’t sick there is a little bit of ALT in the blood.
Doctors (and scientists in general) think about tests as having a “predictive value,” usually discussed in terms of a false-positive rate and a false-negative rate. A false positive means the test was elevated but the patient’s liver wasn’t injured. A false negative means something was wrong with the patient’s liver, but the value was normal anyway. Note that these rates must be discussed relative to some outcome you can measure -- ALT has a false-positive and false-negative rate for the specific diagnosis of viral hepatitis A, not for “being sick” in general.
So what does this have to do with unit tests, integration tests, and the like? Engineers write automated tests for several reasons: to catch errors in their current code, to catch errors in the future (perhaps introduced by someone else), to force themselves to think harder about the problem they are solving, and to enforce a certain structure on the code they write. In my opinion the first two reasons are the most valuable, so I’ll focus on those.
I like to think of automated test runs as the equivalent of ordering a battery of medical tests. Each test has a normal value (or sometimes a normal range) and tests that are out of range are reported as failing. Based on the types of tests that are failing I might conclude that there really is an error, or I might conclude that there is no error and the test is “flaky”.
“Flaky” tests can be thought of as false positives. Maybe there isn’t really a problem in the code under test -- some other test running in parallel grabbed keyboard focus and caused this test to falsely fail. Most engineers I know have a good grasp of this sort of issue.
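To make that concrete, here’s a minimal sketch of the pattern, assuming GoogleTest (the framework whose EXPECT_TRUE/EXPECT_FALSE macros come up below); ComputeChecksum is a hypothetical function under test:

```cpp
// A sketch of a flaky test, assuming GoogleTest. ComputeChecksum is a
// hypothetical function under test.
#include <chrono>
#include <gtest/gtest.h>

int ComputeChecksum(int seed) { return seed * 31 + 7; }

TEST(ChecksumTest, IsCorrectAndFast) {
  auto start = std::chrono::steady_clock::now();
  int result = ComputeChecksum(12);
  auto elapsed = std::chrono::steady_clock::now() - start;

  // Deterministic assertion: same answer every run.
  EXPECT_EQ(result, 12 * 31 + 7);

  // Flaky assertion: on a busy CI machine the deadline can be missed
  // even though ComputeChecksum is correct -- a false positive.
  EXPECT_LT(elapsed, std::chrono::milliseconds(5));
}
```

One common fix is to split the test: keep the deterministic assertion in the unit test and move the timing check into a benchmark, where noise is expected.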
But what about tests that “pass”? Most engineers don’t think about those very much. But false negatives live here. Maybe the test code isn’t actually executing. Maybe the test runs but it doesn’t actually run the underlying code you expect. Maybe you inverted the sense of EXPECT_TRUE versus EXPECT_FALSE -- a defect in the test itself.
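Here’s a sketch of what those false negatives can look like, again assuming GoogleTest, with a hypothetical IsValidEmail as the code under test (given a deliberate bug so the defective tests visibly “pass”):

```cpp
// A sketch of false negatives, assuming GoogleTest. IsValidEmail is a
// hypothetical function under test, with a deliberate bug (it accepts
// the empty string) so the defective tests below still "pass".
#include <string>
#include <gtest/gtest.h>

bool IsValidEmail(const std::string& s) {
  return s.empty() || s.find('@') != std::string::npos;  // Bug!
}

// Inverted sense: the author meant EXPECT_FALSE. This passes *because*
// of the bug, hiding it.
TEST(EmailTest, RejectsEmptyString) {
  EXPECT_TRUE(IsValidEmail(""));
}

// Never executes: GoogleTest skips any test whose name starts with
// DISABLED_, and the note it prints is easy to overlook in a big run.
TEST(EmailTest, DISABLED_RejectsMissingAtSign) {
  EXPECT_FALSE(IsValidEmail("no-at-sign"));
}

// Runs, but never touches the code under test -- it only re-checks its
// own setup.
TEST(EmailTest, AcceptsPlainAddress) {
  std::string input = "user@example.com";
  EXPECT_EQ(input, "user@example.com");
}
```

None of these tests fail, so nothing ever shows up out of range -- which is exactly why false negatives are harder to spot than flakes.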
When I see test failures, I think of them as medical tests that have come back with values outside the normal range. They might mean something, or they might not. The type of test matters a lot to me. A failing unit test usually means something is broken. Unit tests tend to be small and highly deterministic. It’s like looking at a test for HIV viral load -- if you have the virus in your system, you have an HIV infection. In other words, the false-positive rate is low; a doctor would call the test “specific”. A failing unit test usually points at an error in the module under test and not something else. Note that false negatives can still happen -- it’s easy to write a unit test that doesn’t run, or doesn’t exercise the code you think it does.
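For contrast, here’s a sketch of the kind of test that earns that specificity, with a hypothetical ParseDecimal as the module under test:

```cpp
// A sketch of a "specific" unit test, assuming GoogleTest. ParseDecimal
// is a hypothetical function under test: it parses a non-negative
// decimal integer (overflow handling omitted for brevity).
#include <optional>
#include <string>
#include <gtest/gtest.h>

std::optional<int> ParseDecimal(const std::string& s) {
  if (s.empty()) return std::nullopt;
  int value = 0;
  for (char c : s) {
    if (c < '0' || c > '9') return std::nullopt;
    value = value * 10 + (c - '0');
  }
  return value;
}

// Fixed inputs, no I/O, no clocks, no shared state: the same result on
// every run. If this fails, the problem is in ParseDecimal.
TEST(ParseDecimalTest, ParsesDigitsAndRejectsJunk) {
  EXPECT_EQ(ParseDecimal("123"), 123);
  EXPECT_EQ(ParseDecimal("0"), 0);
  EXPECT_EQ(ParseDecimal(""), std::nullopt);
  EXPECT_EQ(ParseDecimal("12a"), std::nullopt);
}
```

Nothing in it depends on timing, ordering, or another process, so when it goes red the search space is a single function.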
Higher level tests like integration test suites are notorious for being flaky and hard to interpret. They’re a lot like most medical tests. “Sign-up flow failed” could be due to a low-level networking problem or a high-level UI layout problem. Your blood ALT level could be elevated due to a hepatitis infection... or being hit in the liver... or it could just be normal for you. Interpreting the result takes some careful thought.
In my next installment I’ll talk a bit about how I think about interpreting high level test suite failures.