Exposome Perspectives Blog

Discovery Research vs Hypothesis Testing: Sherlock Holmes, Colonel Mustard, and “How Exposomics Learned the Trick” (Part II)

By taking what we’ve learned about discovery research and hypothesis testing, Dr. Robert Wright explains how we can grapple with the millions of factors that make up our environment and the different ways they affect our health.


Exposome Perspectives Blog by Robert O. Wright, MD, MPH

Catch up on Part I: the differences between exposomics and traditional environmental health research – and the importance of integrating them

Making sense of big data

In exposomic research, there are far more than six possible factors that may contribute to health outcomes; in fact, there are millions. Let’s expand our detective analogy from the board game to a true-crime mystery novel, where far more than six suspects could be the killer. For the vast majority of diseases, we know very little about what causes them. What if there are very few clues, no known motives, and more than a million possible killers? There are about 9 million people living in London; perhaps that is all we know at the onset of the crime. How can we narrow that number down to something manageable? Exposomics works in just this way to narrow down our environmental suspects, since there are millions of environmental factors that might affect health. Our goal is to first narrow the list down to the important ones, and then test some hypotheses. Going back to Mr. Boddy’s murder, perhaps Sherlock Holmes is informed by a witness that the murderer used a gun with a white handle. When Sherlock checks city records for gun registrations, he finds that 10,000 people in London own a white-handled gun. While still daunting, that’s a good start compared with 9 million.

About 9 million people are living in London. How can we narrow that number down?

What if he had one other clue that he could use to narrow it down further? The bullets were made of pure silver, which only a very wealthy person could likely afford. Sherlock decides he can save time again by eliminating anyone he is less than 95% sure (in research we call this a 95% confidence level) could afford silver bullets. He starts by searching tax records to find the incomes of the 10,000 people who own that kind of gun. He needs to pick a cut-point for income (the wealthiest person isn’t necessarily the killer, after all), and he decides that a person in the top 5% of income would be most likely to buy these bullets. But this only narrows it down to 500 people. That’s still a pretty big list. Luckily, he has another clue: the depth of a shoe print left in mud indicates that the killer is heavyset. He has Scotland Yard round up all 500 suspects and weigh them, narrowing the pool to the top 5% by body weight: 25 suspects who are heavy enough, rich enough, and who own a white-handled gun. Each of these steps is a form of validation research that narrows down the possible suspects. The danger at this stage of discovery is that Sherlock takes the rank order of the suspects’ weights and declares the heaviest of the 25 to be the killer. Too often, discovery researchers do just that: they rank order thousands of results and declare the top-ranked result an important new finding without formally testing it as a hypothesis. In this example, the unfortunate person at the top of the list merely happened to fit some criteria; he or she is one of 25 possibilities, and the most likely outcome is that he or she is innocent, despite topping the list.
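To make the arithmetic concrete, here is a minimal Python sketch of the two successive top-5% cuts, using simulated income and weight values (the distributions are hypothetical, chosen only to match the story’s 10,000 → 500 → 25 narrowing):

```python
import numpy as np

rng = np.random.default_rng(1887)

n_suspects = 10_000  # owners of white-handled guns

# Simulated stand-ins for the story's data (hypothetical distributions)
income = rng.lognormal(mean=10.0, sigma=1.0, size=n_suspects)
weight = rng.normal(loc=80.0, scale=12.0, size=n_suspects)  # kg

# Cut 1: keep the top 5% by income (only the wealthy buy silver bullets)
pool = np.flatnonzero(income >= np.quantile(income, 0.95))    # ~500 remain

# Cut 2: keep the top 5% of that pool by weight (the deep shoe print)
pool = pool[weight[pool] >= np.quantile(weight[pool], 0.95)]  # ~25 remain

print(len(pool))  # ~25 suspects: rich, heavy, and owners of the gun

# The discovery-research mistake would be to stop here and accuse the
# heaviest of the 25: with one true killer in the pool, the top-ranked
# suspect is innocent with probability ~24/25, absent further evidence.
```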

Now Sherlock needs to do some hypothesis testing. He looks for additional data connecting these 25 people to the victim, and he ends up with a single prime suspect: Mr. Green. Sherlock focuses now on Mr. Green, who lives near Mr. Boddy, and learns that Green owes Mr. Boddy substantial sums of money; upon further investigation, a neighbor reports having seen him run home around the time of the murder and then bury something in his backyard. Note that Sherlock didn’t assume Mr. Green is the killer based only on his record searches and Mr. Green’s body weight. He hypothesized it first and then looked for additional proof in the form of a relationship to the victim, motive, alibis, and other evidence. Instead of gathering all of this evidence on 10,000 people, he was able to reach a testable hypothesis after just four rounds of data collection. His approach was sequential, moving from discovery to hypothesis testing.

While Sherlock is doing this, Detective Inspector Lestrade of Scotland Yard has started the process of eliminating each of the 10,000 possible killers. As an old-school detective, he instructs his team to interview each of the 10,000 owners of white-handled guns about their whereabouts on the night of the murder, and he has gotten through about 25 in the time it took Sherlock to solve the crime. Mr. Green is #4,564 on Lestrade’s list, so he had a long way to go before he would reach the killer; it may have taken years and enormous amounts of manpower. He is quite relieved when he hears from Sherlock that the case is solved!

Like the Clue example, the true-crime mystery illustrates how the exposomic and traditional environmental health research approaches differ but benefit from integration. In the analogy, Mr. Boddy is the disease; the killer is the cause of disease; each suspect is a unique risk factor; income and weight are common biomarkers that are associated with, but not unique to, the killer (the cause); and Sherlock Holmes is the scientist trying to solve the case. Inspector Lestrade is practicing 20th-century crime-solving methods, which, while rigorous, are extremely slow. Sherlock is taking a modern, “omic” approach that sifts through large amounts of data in a multi-stage design to find the killer. Sherlock, being Sherlock, also understands the role of chance and doesn’t jump to conclusions too early.

Balancing speed with rigor

Solving the crime outlined above involved a rigorous amount of detective work. Let’s imagine that Sherlock’s older brother Mycroft is not as rigorous. Say Mycroft has invented a 19th-century computer that runs on a steam engine and can scan the records of all 10,000 owners of that kind of gun and predict their body weights. He wants to screen them quickly and make an announcement to the newspapers. He rank orders their probability of being the killer, and decides to focus on the top-ranked suspect, who happens to be Colonel Mustard, as he is the heaviest. Mycroft arrests the Colonel and announces that he was at the top of the computer’s suspect list!

But we know that Colonel Mustard is not the killer; Mr. Green is. So what happened?

This is where probability comes into play. Research studies typically accept a five percent probability of a false positive result. When testing a single hypothesis, we accept that five percent of studies conducted the same way might, by chance, find an effect that isn’t really there, but 95% of the time we will get a true and accurate result. However, if you are testing multiple theories (suspects), the probability that at least one of your findings will be incorrect (e.g., you accuse the wrong suspect) is much greater than five percent. When we test many hypotheses all at once, some will come up positive simply by chance; these are false positives. The more hypotheses you test at once, the greater the likelihood of finding a false positive.
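A quick simulation makes the point concrete (a sketch, assuming p-values are uniform when no effect exists): test 10,000 suspects who are all innocent at a 5% significance level, and roughly 500 of them will “test positive” anyway.

```python
import numpy as np

rng = np.random.default_rng(221)  # 221B, naturally

n_tests = 10_000  # one test per suspect; assume ALL are truly innocent
alpha = 0.05      # the conventional 5% false-positive rate per test

# When there is no true effect, p-values are uniform on [0, 1].
p_values = rng.uniform(0.0, 1.0, size=n_tests)
print(np.sum(p_values < alpha))    # ~500 "positives" by chance alone

# Chance of at least one false positive somewhere among the tests:
print(1 - (1 - alpha) ** n_tests)  # ~1.0, i.e., essentially certain
```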

If Mycroft’s statistical model is sound, it should be more likely to place Mr. Green in the top 500 than anyone else. But in all likelihood, Mr. Green won’t be at the very top of that list. Of the 500 people chosen by the model, 499 were innocent; only one was the killer. Mycroft forgot about the probability of being wrong, what we call “positive predictive value.” If there is only one killer and a test is 95% accurate, then by testing 10,000 people, five percent will test positive by chance, yet only one of the 10,000 is actually the killer. The one true killer will be selected, but he or she is surrounded by innocent false positives. Despite the test’s 95% “accuracy,” when applied to 10,000 people it will be wrong roughly 499 times for every one time it is correct.

Mycroft forgot about the probability of being wrong—what we call “positive predictive value.” If there is only one killer and a test is 95% accurate, by testing 10,000 people, five percent will test positive by chance
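The positive-predictive-value arithmetic can be written out in a few lines. The 95% accuracy figure and the one-killer-in-10,000 prevalence come from the story; the assumption that the test also catches the true killer 95% of the time is ours.

```python
# Positive predictive value (PPV): among those who test positive,
# what fraction are truly guilty? Numbers follow the story:
# one killer among 10,000, and a test that is "95% accurate."
n = 10_000
n_killers = 1
sensitivity = 0.95           # assume the test flags the true killer 95% of the time
false_positive_rate = 0.05   # 5% of innocents are flagged anyway

true_positives = sensitivity * n_killers                 # ~1
false_positives = false_positive_rate * (n - n_killers)  # ~500

ppv = true_positives / (true_positives + false_positives)
print(f"PPV = {ppv:.2%}")  # ~0.19%: a flagged suspect is almost surely innocent
```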

Because of probability and the presence of only one killer among the 10,000, the majority of those selected by the model are actually innocent. Sherlock understood that this happens when you don’t carry inductive reasoning to its natural conclusion, which is to deduce a hypothesis that you can then test. Sherlock didn’t fall into this trap: rather than accuse Mr. Green when he was merely among the final 25, he tested his hypothesis by gathering additional evidence.

Exposomic screens work this way. After the first few screens, we develop a list of candidates, and we then form hypotheses about them that must be formally tested. All “omic” sciences discover information that must then be tested as formal hypotheses before we can be confident the findings are true positives. Sometimes we say discovery research is about hypothesis generation, not hypothesis testing. Once we have generated a hypothesis, we start fresh and set about testing it without relying on the data that generated it.
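One common way to honor that rule is a split-sample design: half the data generates the candidate list, and the held-out half formally tests only those candidates. Here is a minimal sketch using simulated data (all names, sizes, and thresholds are illustrative, not from any real exposome study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical exposome data: 1,000 samples x 2,000 exposures, one outcome.
n_samples, n_exposures = 1_000, 2_000
X = rng.normal(size=(n_samples, n_exposures))
y = rng.normal(size=n_samples)

# Split ONCE, up front: one half generates hypotheses (discovery),
# the other half formally tests them (replication).
idx = rng.permutation(n_samples)
disc, repl = idx[:500], idx[500:]

# Discovery: screen every exposure; keep the nominal hits at p < 0.05.
disc_p = np.array([stats.pearsonr(X[disc, j], y[disc])[1]
                   for j in range(n_exposures)])
hits = np.flatnonzero(disc_p < 0.05)  # ~100 expected by chance alone

# Replication: re-test ONLY the hits on held-out data, correcting for
# the number of hypotheses actually carried forward (Bonferroni).
repl_alpha = 0.05 / max(len(hits), 1)
replicated = [j for j in hits
              if stats.pearsonr(X[repl, j], y[repl])[1] < repl_alpha]
print(len(replicated))  # ~0 here, since no exposure truly affects y
```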

In Sir Arthur Conan Doyle’s “How Watson Learned the Trick,” Sherlock’s companion Dr. Watson attempts to demonstrate that he has learned Sherlock’s “trick” of logical deduction by summarizing Holmes’ current mood and plans for the day ahead. In the end, Sherlock shows that every one of Watson’s deductions was incorrect. Perhaps in the age of exposomics, we can learn to avoid both the mistake of touting false positive results and that of rejecting methods just because they are new and different. If we can learn that trick (let’s call it the “exposomic trick”), we will solve the mystery of health and disease.

Read other entries in the Exposome Perspectives Blog