Medical tests and software tests, part 2
Medical test interpretation may have lessons for the interpretation of complex software integration test suites. In Chrome, these integration tests are often implemented as "browser tests" or "pyauto tests" that involve spawning a running browser in a separate process and controlling it from the test process. These tests are sufficiently complex that they have a significant false-positive failure rate, which makes a report of a test failure difficult to interpret.
This is similar to common medical test suites. For example, a "urinalysis" is a set of several tests of urine: concentration, protein level, signs of white blood cell activity, etc. In general, medical tests are not ordered unless a specific potential illness is being investigated, but sometimes they get ordered routinely and come back with unexpected abnormalities. How do doctors deal with this?
A common abnormal finding on a urinalysis is a trace amount of blood. Blood can be a sign of irritation of the bladder, the kidneys, or the tubes (ureters) connecting them, for example from a small kidney stone or a minor infection. That's pretty common and mostly harmless. But blood could also be a sign of serious kidney disease, or even kidney cancer. That's rare, but really bad.
The first thing the doctor usually does is repeat the test. Maybe there was some contamination in the sample, or the lab made a mistake, or the typical run-to-run result variation just happened to fall outside the normal range of values. If it's still abnormal, the doctor will often order a more invasive, more expensive, but more accurate test. For example, he or she might order an intravenous pyelogram, where a radiologist injects the patient with dye and takes x-rays showing its flow through the kidneys, ureters and bladder.
Does this help us with software tests? Maybe. Large projects like Chrome often run integration tests in an environment that's very different from the one used by the developer who wrote the test initially. The developer probably wrote the test on a very powerful workstation and ran the test alone, or after a few others. The production test environment might run on slower machines (perhaps in a cluster, so more powerful in aggregate but individually slower). The test might also run in parallel with other tests, or after another test that polluted the machine environment with files, database entries, etc.
Combining the medical approaches of "repeat the test" and "do a more accurate test", we could run the software test again in an environment more similar to the one where it was developed. For example, the test could be run on a freshly restarted machine, or in a fresh VM instance, with no other tests running. The original failure could be compiled into a "nag report", with a true failure reported only if the second run also fails. This would cut the false-positive rate, at the expense of possibly missing some non-deterministic true failures.
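To make the "repeat the test somewhere cleaner" policy concrete, here is a minimal sketch in Python. It is only an illustration: `triage`, `shared_run`, and `clean_run` are hypothetical names, and each runner is assumed to be a callable that returns True when the test passes. How the clean environment actually gets provisioned (reboot, fresh VM) is left out.

```python
from typing import Callable

# Assumption: a test run is any zero-argument callable returning True on pass.
TestRun = Callable[[], bool]


def triage(shared_run: TestRun, clean_run: TestRun) -> str:
    """Classify one test as 'pass', 'nag', or 'fail'.

    shared_run: runs the test in the normal (shared, possibly polluted) environment.
    clean_run:  runs the same test in a fresh environment with nothing else running.
    """
    if shared_run():
        return "pass"
    # First run failed: repeat in the clean environment before raising an alarm.
    if clean_run():
        # Failed once, passed when isolated: record it, but don't report a true failure.
        return "nag"
    # Failed in both environments: report a true failure.
    return "fail"


if __name__ == "__main__":
    # A test that fails on the shared machine but passes in isolation.
    print(triage(lambda: False, lambda: True))  # prints "nag"
```

The clean run is only attempted after a failure, so the extra cost is paid only for tests that are already suspect.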
The idea of "do a more invasive test" also leads to the idea of running failing tests under additional reporting tools. For example, the test could be run under a memory checker like Valgrind. It could also be run with tools to help a programmer investigate the failure, like automatically running at a more verbose logging level or under a profiler.
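As a rough sketch of what an automatic "more invasive" rerun might look like, the Python below wraps a failing test's command line in Valgrind and appends a verbosity flag. The helper names are hypothetical, and the `--v=1` default is an assumption about the test binary's logging flag, not something every test harness accepts. The Valgrind options used (`--leak-check=full`, `--error-exitcode`) are standard ones.

```python
import subprocess
from typing import List, Sequence


def diagnostic_command(test_cmd: Sequence[str], verbose_flag: str = "--v=1") -> List[str]:
    """Build a command that reruns test_cmd under Valgrind with verbose logging.

    verbose_flag is an assumed flag understood by the test binary; adjust as needed.
    """
    return [
        "valgrind",
        "--leak-check=full",    # report each leaked block in detail
        "--error-exitcode=1",   # exit non-zero if Valgrind itself finds errors
        *test_cmd,
        verbose_flag,
    ]


def rerun_with_diagnostics(test_cmd: Sequence[str], log_path: str) -> int:
    """Rerun a failed test under the diagnostic command, saving all output to log_path."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            diagnostic_command(test_cmd),
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

The saved log could then be attached to the failure report, so whoever investigates starts with more than a bare "test failed".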
The ultimate in "do a more detailed test" is how Chrome handles it now: have a human being investigate the test by hand. But perhaps the above approaches would either decrease the false-positive rate or make human investigation more efficient.
 
It would make sense to run tests in a virtualized environment. When a test fails, the state of the VM can be captured, allowing the assigned developer to reproduce the test accurately.