Is the result really the truth?

Misuse of statistics: The fatal power of the p-value

Do I have to give up red meat? How harmful are car exhaust fumes really? Science promises answers to such questions. But what was considered certain yesterday is already outdated today. The reason usually given for this is the steady progress of research, which continually corrects old knowledge.

But that is not the whole truth, says John Ioannidis. The physician and professor of statistics at Stanford University School of Medicine, currently a visiting scientist at the Berlin Institute of Health (BIH), takes issue with the quality of scientific work, for example in nutrition research. Lay discussions of nutrition, he says, generally consist of about "95 percent bias". But things look hardly any better in scientific nutrition research, he told a Swiss conference of nutrition scientists last year.

Ioannidis sees problems everywhere

He accused those present of poor study designs, large measurement errors, statistical problems and financial entanglements with the food industry. The scientists pushed back and accused the statistician of "seriously distorting" their field. One participant, voice trembling, complained that Ioannidis' "insane claims" would, as it were, "throw away" "ten years of honest work".

Ioannidis smiled beneath his mustache and countered that he had meanwhile lost confidence even in one of the most extensive nutrition studies of all, an investigation of the Mediterranean diet with almost 7,500 participants. This "Predimed" study had to be retracted last summer after a panel of experts found serious errors in the statistics and the trial protocols. Despite corrections, those errors now call the whole project into question.

Nutrition science is not alone in its troubles. Other scientific disciplines are also in serious crisis: many spectacular research results in biomedicine, the social sciences and psychology cannot be reproduced. Ioannidis dissects these failures in papers with titles such as "Why Most Published Research Findings Are False" and "Why Most Clinical Research Is Not Useful". These writings have made him one of the most widely read and cited scientific authors.

All-purpose weapon: the p-value

Ioannidis is one of a handful of rebellious statisticians who have been fighting for more robust studies for decades. In March, the world's largest association of the profession, the American Statistical Association (ASA), issued a statement warning of the rampant misuse of statistics in science. This affects not only research itself, it said, but also "public order, journalism and legislation".

One thing in particular has long been a thorn in statisticians' side: easy-to-use significance tests, which are extremely widespread and frequently misused. They are meant to help decide between two formal assumptions: for example, whether two series of measurements are essentially the same (the "null hypothesis") or whether there is a meaningful difference between them (the "alternative hypothesis"). Put more simply: whether a scientific result is reasonably certain to be "real".

This is useful, for example, for finding out whether blood pressure in a group of test subjects has actually changed after taking a drug. If a researcher compares series of measurements from untreated and treated blood pressure patients, the test spits out a probability value, the so-called p-value. If it falls below a certain limit - traditionally the essentially arbitrary 0.05 - the result is considered "statistically significant", and thus real and meaningful.
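What such a test looks like in practice can be sketched in a few lines of Python. The following example uses SciPy's standard two-sample t-test; the blood pressure readings are invented for illustration and come from no real study.

```python
# Minimal sketch of a two-sample significance test with SciPy.
# The blood pressure readings (in mmHg) are invented for illustration.
from scipy import stats

untreated = [152, 148, 155, 160, 149, 151, 158, 154, 150, 156]
treated = [141, 138, 150, 139, 144, 147, 136, 142, 145, 140]

# Null hypothesis: both groups have the same mean blood pressure.
# Welch's variant (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(untreated, treated, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# The conventional - and essentially arbitrary - decision rule:
if p_value < 0.05:
    print("difference counts as 'statistically significant'")
```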

Above all, experimental care matters

However, the p-value does not reveal how large the effect of the blood pressure drug really is. It only indicates how probable the observed result - or an even more extreme one - would be if chance alone, i.e. the null hypothesis, were at work. It can therefore take values from zero to one: the smaller the value, the more the data speak against the null hypothesis and for the alternative, i.e. for a real difference between the two series of measurements. Moreover, the p-value is only valid within the narrow framework set by the mathematical formulas of the test and depends, for example, on the number of data points in the series. "A p-value without context or other evidence provides limited information," the ASA writes.
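What this definition means can be made tangible with a short simulation - a sketch with made-up numbers, not part of the article. A permutation test estimates the p-value directly as the share of purely random reshufflings of the data that produce a difference at least as large as the one observed:

```python
# Permutation sketch of what a p-value is: the probability, if chance
# alone (the null hypothesis) were at work, of a result at least as
# extreme as the observed one. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

def mean_diff(a, b):
    return abs(a.mean() - b.mean())

# Hypothetical measurements: two groups of 20 readings each.
group_a = rng.normal(150, 10, 20)
group_b = rng.normal(144, 10, 20)
observed = mean_diff(group_a, group_b)

# Simulate the null hypothesis by pooling and reshuffling the data,
# then count how often chance alone matches the observed difference.
pooled = np.concatenate([group_a, group_b])
n_sim = 10_000
hits = 0
for _ in range(n_sim):
    shuffled = rng.permutation(pooled)
    if mean_diff(shuffled[:20], shuffled[20:]) >= observed:
        hits += 1

print(f"permutation p-value: about {hits / n_sim:.4f}")
```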

The p-value is a nice, simple number and is often presented as a measure of the size of a measured effect. But it cannot serve as one: even tiny differences between series of measurements can be highly significant, especially when there are many data points. Nor does it state the probability that a study result came about by chance - contrary to what many in science believe. A p-value just below or above the 0.05 limit does not mean that the result is correspondingly more likely to be a fluke than one from a test series with a particularly low p-value. The p-value is not a unit of measurement, only a pointer.
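A small numerical sketch (again with invented data) shows why significance and effect size must not be confused: the same negligible half-point difference between two groups is statistically invisible in a small sample, yet "highly significant" in a very large one.

```python
# Same tiny effect, three sample sizes: the p-value shrinks with n
# even though the difference in means stays equally unimportant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

for n in (20, 200, 20_000):
    a = rng.normal(150.0, 10.0, n)  # mean 150 mmHg
    b = rng.normal(149.5, 10.0, n)  # mean 149.5 mmHg - a negligible shift
    t, p = stats.ttest_ind(a, b)
    print(f"n = {n:6d}: mean difference = {a.mean() - b.mean():5.2f}, p = {p:.4f}")
```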

It sounds banal, but it is not a given: anyone who wants to know whether a blood pressure drug works needs, above all, not statistical expertise but experimental care. The more controlled the test conditions, the better. The more subjects take part, the more robust the study. The more thoroughly the groups of participants are mixed - for example with regard to age - the more likely it is that a difference in averages really has something to do with the drug (a sketch of such random assignment follows below). And the less financial self-interest the makers of a study have in its "success", the more likely it is to reflect reality. Good studies are extremely time-consuming and, not without reason, devour millions of euros.
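As a sketch of what "thoroughly mixed" means in practice, the following hypothetical example assigns participants to treatment and control purely at random, so that a confounding factor such as age tends to balance out across the two groups:

```python
# Randomized assignment in miniature: with random allocation, the
# average age of the two groups tends to even out. The participant
# list is hypothetical.
import random

random.seed(0)
ages = [34, 71, 45, 29, 66, 52, 38, 60, 41, 57]
participants = [(f"P{i:02d}", age) for i, age in enumerate(ages)]

random.shuffle(participants)
treatment, control = participants[:5], participants[5:]

def mean_age(group):
    return sum(age for _, age in group) / len(group)

print("treatment mean age:", mean_age(treatment))
print("control mean age:  ", mean_age(control))
```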

The p-value as a shortcut to fame

But the p-value sometimes offers a welcome, career-promoting shortcut that conceals a study's hard-to-quantify deficits. A particularly perfidious practice is "p-hacking": if an experiment does not yield the desired value, it is sometimes simply repeated until the p-value is "right" - that is, until it happens to fall below the threshold and the test counts as statistically significant. A result obtained this way then makes it into the specialist literature and perhaps into the media, while the rest of the measured values disappear into a drawer. In fact, such results are scientifically worthless.
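How little it takes is easy to demonstrate in a simulation (an illustrative sketch, not a real experiment): both "groups" are drawn from the identical distribution, so there is no effect whatsoever, and the experiment is simply repeated until chance delivers p < 0.05.

```python
# p-hacking in miniature: repeat a null experiment (no real effect)
# until a "significant" p-value appears by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

attempt = 0
while True:
    attempt += 1
    a = rng.normal(0, 1, 30)  # both samples come from the
    b = rng.normal(0, 1, 30)  # identical distribution
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        break

print(f"'significant' result (p = {p:.3f}) after {attempt} attempts")
# With a 5 percent false-positive rate, about 20 attempts suffice on
# average - and the result is still scientifically worthless.
```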

"The misuse of p-values ​​is so simple and automated that it is addicting researchers," says Ioannidis. Because significant results can be published, and published results can be used to justify new research projects. "We should view the researchers who produce these flash floods of p-values ​​as drug addicts in need of withdrawal and rehabilitation."

Yet statistics are not always misused for base motives; many researchers simply do not know any better. "Most of those who use statistics incorrectly are simply barely trained," says the statistician. Some journals, such as "Basic and Applied Social Psychology", have pulled the emergency brake and banned p-values outright. Instead, they call for "strong descriptive statistics" such as graphics and diagrams - an approach sketched below.
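What such a report might look like is sketched here with invented numbers; Cohen's d is used as one common descriptive effect-size measure, though the journal itself does not prescribe any particular statistic.

```python
# "Strong descriptive statistics" instead of a significance verdict:
# report the groups' means, spreads and a standardized effect size.
# The blood pressure data (in mmHg) are invented.
import numpy as np

rng = np.random.default_rng(7)
a = rng.normal(150, 10, 50)  # untreated group
b = rng.normal(144, 10, 50)  # treated group

print(f"untreated: mean {a.mean():.1f}, sd {a.std(ddof=1):.1f}, n {len(a)}")
print(f"treated:   mean {b.mean():.1f}, sd {b.std(ddof=1):.1f}, n {len(b)}")

# Cohen's d: difference of means in units of the pooled spread.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"effect size (Cohen's d): {(a.mean() - b.mean()) / pooled_sd:.2f}")
```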

Statisticians rarely speak with one voice

In recent years there have been a few partial victories in the fight against the misuse of statistics. According to Ioannidis, there is now at least "an awareness of the extent of the challenge we are facing". More transparency and openness also help, because errors are noticed more quickly. That major successes are still lacking is partly because even professional statisticians do not speak with one voice. "The experts disagree on how to solve the problem," says Ioannidis.

In a comment in the journal "Nature", 800 of his colleagues recently signed an appeal against significance: "We call for the entire concept of statistical significance to be abandoned," it says. Instead of yes-or-no verdicts, statistical parameters such as p-values should be analyzed and discussed in detail. Ioannidis does not think that is a good idea; he even calls it naive. "Statistical significance is like an inefficient and corrupt peacekeeping force in a wild country," he says. "It does not bring complete peace and prosperity, but without it there is war."

Good advice? Often not on offer

The consumer of research news is generally unaware of such epic battles between statisticians and the users of statistics. They just want to know what they should best eat, or whether, given cities polluted with fine particulate matter, they should move to the countryside.

But this good advice is often not only expensive but not on offer at all. "People are in a precarious situation, and it is not easy for them to see through it all," says Ioannidis. "You should ask: Is it a large study? Is it randomized? Have other studies confirmed the result? Are there conflicts of interest?" Questions like these can help bring you closer to the truth - or at least to the insight that some research results are an interesting clue, but rarely an ultimate truth.
