GHO5T is an energetic and scholarly follower of, and commenter on, JunkScience.com.
I am going to borrow from his comments yesterday to expand a little on p-values and statistical tools in epidemiology.
GHO5T commented on Stan Young never sleeps
So the NATURE commentary from Joe Bast got Stan Young going on one of his favorite topics. …
GHO5T: The problem with P values is not in the methodology, but in human nature. Metrics create reality. If journal referees decide on an arbitrary threshold to reject the null hypothesis, they no longer have to delve deep into the woods to accept or reject the research. Once the threshold becomes public, researchers now have a target to aim for. When unscrupulous researchers know what end number they have to achieve, the rest is just algebra.
Dunn: Can’t add much to that. I always looked at p-values as just a measure of safety from randomness errors. I agree strongly with the point that a number becomes a magical thing, and the magical threshold of P = 0.05 is used to deceive. They say it’s statistically significant, with an emphasis on significant, and parade around the 95% confidence and the statistically significant result as though that’s really important.
Dunn: My point is that a data point that is highly unlikely to be random error (a 1 in 20 chance, or 0.05) is still just a number. “Statistically significant” is deceitfully misused to mean dispositive or reliable proof. Not so. It may be a data point that is just plain wrong, or certainly not reliable as evidence, because of other factors like bad methods, bad design, or the nature of the study. An observational study is always unreliable, as distinguished from a randomized and controlled study with proper blinding of subjects and researchers. That’s why randomized clinical trials are considered the gold standard and why even small results from them are considered reliable.
Dunn: In observational, ecological, uncontrolled studies the size of the effect or endpoint being measured has to be robust to get out of the noise. That’s why an RR (relative risk) of 2 is considered a threshold for reliable proof of causation. One more time with gusto.
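For readers who have not worked with relative risk, here is a minimal sketch of how the number is computed. The function name and the counts are hypothetical, invented only for illustration; they do not come from any study discussed here.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical counts (assumptions, not real data): 1,000 people per group.
weak = relative_risk(12, 1000, 10, 1000)    # 12 cases vs 10 cases
strong = relative_risk(20, 1000, 10, 1000)  # 20 cases vs 10 cases

print(f"weak association:   RR = {weak:.2f}")
print(f"strong association: RR = {strong:.2f}")
```

In the first comparison the exposed group has only two extra cases per thousand, the kind of small effect that can easily be lost in the noise of an uncontrolled study. The second comparison is the doubling of risk that the RR of 2 rule of thumb refers to.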
GHO5T: Unfortunately, too many lay people have no understanding of what these numbers mean or how we arrive at them (I’ll save the public school criticism for later). All they know is that “95% confidence” sounds really convincing and if the P-Value is below the magic number, we get to use the word “cause”. To the uninitiated the word “cause” means when A happens, B happens as a direct result, not when A happens B happens slightly more often than it would in the absence of A.
Dunn: Again, a good point about people not understanding that association is not causation.
GHO5T: It’s important to remember that rules, regulations, and standards don’t exist for honest people. The concept of P-values is sound when considered one tool amongst many used by honest researchers for checks and balances against unintentional bias. However, taken as a gold standard by themselves, they have the potential to lend undue intellectual and even legal credence to spurious, barely correlated results.
Dunn: See why I think this commenter is worth reading, even when he’s givin’ me a hard time about something?
GHO5T: It is important to educate the general public that a P-value means one thing and one thing only, the probability of getting the results you did (or more extreme results) given that the null hypothesis is true. This may mean that it is extremely unlikely that the null hypothesis is true, but someone wins the lottery every once in a while. How much money are you willing to bet that the numbers 1-5 won’t come up on a 100-sided die? How many times in a row would you be willing to make that bet? For me, repeatability is the most important aspect of the scientific method. Randomization and the double-blind protocol go a long way toward preventing bias, but only replication, and lots of it, can truly elevate a claim to the status of “proven” for my money.
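The die bet can be put in numbers. This is a rough sketch that assumes a fair 100-sided die and independent rolls; the function name is made up for the example.

```python
def chance_of_at_least_one_hit(rolls, p_hit=5 / 100):
    """Probability that a number from 1 to 5 shows up at least once.

    Assumes a fair 100-sided die and independent rolls, so each roll
    misses with probability 1 - p_hit.
    """
    return 1 - (1 - p_hit) ** rolls


for rolls in (1, 5, 13, 14, 20, 50):
    chance = chance_of_at_least_one_hit(rolls)
    print(f"{rolls:3d} rolls: {chance:.1%} chance of losing the bet at least once")
```

A single roll loses the bet 5% of the time, which is the familiar 0.05. Under these assumptions, by the fourteenth roll in a row the odds of having lost at least once are better than even.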
Dunn: Right after I met Stan Young (PhD, statistics and genetics) over the phone, he taught me some of what he knew about epidemiology studies and about his big effort to put the problem of unreliable medical research, as revealed by Ioannidis’s research, on the stage at the annual AAAS meeting. He then sent me a pair of 10-sided dice and asked me to go through an exercise so he could make a point about the problem of multiple inquiries and how they increase the risk of false positive associations. Then he ’splained the Bonferroni adjustment for multiple inquiry. Stan has a thing about multiple inquiries, since so much bad science and so many unreproducible studies come from uncontrolled multiple-inquiry and multiple-endpoint research. It’s like the ultimate fishing expedition.
Randomness is one thing that can produce errors, but methods and computer programs that cheat are another. As Joel Schwartz, my friend and mentor, pointed out to me, if you slice and dice the data enough you can torture it to give you an answer you like. I would add that big computers are the perfect tool for cargo cult scientists. They put a fine point on what may just be a big lie.
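Here is a small sketch of why multiple inquiries matter and what the Bonferroni adjustment does about it. It is not a reconstruction of Stan's dice exercise. It assumes the tests are independent and that every null hypothesis is true, and the function names are invented for the example.

```python
def familywise_error_rate(num_tests, alpha):
    """Chance of at least one false positive across num_tests tests.

    Assumes independent tests and that every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test cutoff under the Bonferroni adjustment: alpha / num_tests."""
    return alpha / num_tests


ALPHA = 0.05
for num_tests in (1, 5, 20, 100):
    per_test = bonferroni_threshold(num_tests, ALPHA)
    unadjusted = familywise_error_rate(num_tests, ALPHA)
    adjusted = familywise_error_rate(num_tests, per_test)
    print(
        f"{num_tests:3d} tests: unadjusted {unadjusted:.1%}, "
        f"Bonferroni cutoff {per_test:.4f}, adjusted {adjusted:.1%}"
    )
```

With 20 questions asked of the same data at P = 0.05 each, the chance of at least one false positive is roughly 64% under these assumptions. Dividing the threshold by the number of questions pulls that overall chance back under 5%, at the price of a much stricter cutoff for each individual result.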
Lying for justice–a concept that is the ice pick in the heart of public science projects, the public science that is poisoned by political agendas.