ONE OF THE biggest concerns in science is bias—that scientists themselves, consciously or unconsciously, may put their thumbs on the scales and influence the outcomes of experiments. Boffins have come up with all sorts of tactics to try to eliminate it, from having their colleagues repeat their work to the “double blinding” common in clinical trials, when even the experimenters do not know which patients are receiving an experimental drug and which are getting a sugar-pill placebo.
But gathering the data and running an experiment is not the only part of the process that can go awry. The methods chosen to analyse the data can also influence results. The point was dramatically demonstrated by two recent papers published in a journal called Surgery. Despite being based on the same dataset, they drew opposite conclusions about whether using a particular piece of kit during appendix-removal surgery reduced or increased the chances of infection.
A new paper, from a large team of researchers headed by Martin Schweinsberg, a psychologist at the European School of Management and Technology, in Berlin, helps shed some light on why. Dr Schweinsberg gathered 49 different researchers by advertising his project on social media. Each was handed a copy of a dataset consisting of 3.9m words of text from nearly 8,000 comments made on Edge.org, an online forum for chatty intellectuals.
Dr Schweinsberg asked his guinea pigs to explore two seemingly straightforward hypotheses. The first was that a woman’s tendency to participate would rise as the number of other women in a conversation increased. The second was that high-status participants would talk more than their low-status counterparts. Crucially, the researchers were asked to describe their analysis in detail by posting their methods and workflows to a website called DataExplained. That allowed Dr Schweinsberg to see exactly what they were up to.
In the end, 37 analyses were deemed sufficiently detailed to include. As it turned out, no two analysts employed exactly the same methods, and no two got the same results. Some 29% of analysts reported that high-status participants were more likely to contribute. But 21% reported the opposite. (The remainder found no significant difference.) Things were less finely balanced with the first hypothesis: 64% reported that women do indeed participate more if plenty of other women are present, while 21% concluded that the opposite was true.
The problem was not that any of the analyses were “wrong” in any objective sense. The differences arose because researchers chose different definitions of what they were studying, and applied different techniques. When it came to defining how much women spoke, for instance, some analysts plumped for the number of words in each woman’s comment. Others chose the number of characters. Still others defined it by the number of conversations that a woman participated in, irrespective of how much she actually said.
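To see how much the choice of definition can matter, consider a minimal sketch in Python. The three commenters and their comments are invented for illustration and have nothing to do with the Edge.org dataset; the function names are likewise hypothetical. The code scores each commenter by the three definitions described above: words written, characters written and conversations joined.

```python
from collections import defaultdict

# Hypothetical toy data: (conversation_id, author, comment_text).
comments = [
    (1, "Ana", "I agree."),
    (2, "Ana", "Yes."),
    (3, "Ana", "No."),
    (1, "Bea", "Counterintuitively, interdisciplinarity matters."),
    (1, "Cleo", "I do not see it so."),
    (2, "Cleo", "Or is it?"),
]


def participation(rows):
    """Score each author under three different definitions of 'participation'."""
    words = defaultdict(int)
    chars = defaultdict(int)
    conversations = defaultdict(set)
    for conv_id, author, text in rows:
        words[author] += len(text.split())   # whitespace-separated words
        chars[author] += len(text)           # raw character count
        conversations[author].add(conv_id)   # distinct conversations joined
    return {
        "words": dict(words),
        "characters": dict(chars),
        "conversations": {a: len(c) for a, c in conversations.items()},
    }


def most_active(scores):
    """Return the author with the highest score."""
    return max(scores, key=scores.get)


metrics = participation(comments)
for name, scores in metrics.items():
    print(f"{name}: most active is {most_active(scores)}")
```

On this toy data each definition crowns a different commenter as the most active, even though every count is computed correctly.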
Academic status, meanwhile, was defined variously by job title, the number of citations a researcher had accrued, or their “h-index”, a number beloved by university managers which attempts to capture in a single figure both how many papers a researcher has published and how often those papers are cited. The statistical techniques chosen also had an impact, though less than the choice of definitions. Some researchers chose linear-regression analysis; others went for logistic regression or a Kendall correlation.
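The choice of technique can, in principle, flip the sign of a result on its own. The sketch below uses made-up numbers, not the study's data, and hand-written helper functions rather than any analyst's actual code. It compares the slope of an ordinary least-squares line with a simple Kendall correlation (the "tau-a" variant, which makes no correction for ties) for five hypothetical forum members, the highest-status of whom barely comments.

```python
from itertools import combinations

# Hypothetical data: a status score and a comment count for five people.
status = [1, 2, 3, 4, 5]
comment_counts = [50, 52, 54, 56, 2]


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def kendall_tau(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


slope = ols_slope(status, comment_counts)
tau = kendall_tau(status, comment_counts)
print(f"regression slope: {slope:.2f}, Kendall tau: {tau:.2f}")
```

Here the regression slope is negative, because a single near-silent grandee drags the fitted line down, whereas the rank-based Kendall measure is positive, because most pairs of people still show comments rising with status. Neither answer is a miscalculation; the two methods simply weigh the outlier differently.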
Truth, in other words, can be a slippery customer, even for simple-sounding questions. What to do? One conclusion is that experimental design is critically important. Dr Schweinsberg hopes that platforms such as DataExplained can help solve the problem as well as reveal it, by letting scientists specify exactly how they chose to perform their analysis so that those decisions can be reviewed by others. It is probably not practical, he concedes, to check and re-check every result. But if many different analytical approaches point in the same direction, then scientists can be confident that their conclusion is the right one. ■
This article appeared in the Science & technology section of the print edition under the headline “Methods and madness”