"Reproducible" is not synonymous with "true": a comment on the NAS report

    London, 19 April 2017

    The timing was almost perfect, and that’s not a coincidence. Only a few weeks after the publication of my opinion in PNAS, which warned against making unsupported claims that science is in crisis, the National Association of Scholars (NAS) issued a report on the “Irreproducibility Crisis of Modern Science”.
    I commend the authors of the report, David Randall and Christopher Welser, for inviting me to contribute an opinion on the matter. Such an invitation epitomizes what I still believe to be the only real antidote to bad science and misguided policies: an open and transparent scholarly debate.

    The timing of this report is not coincidental because, as I illustrated in the PNAS article, the narrative that science is in crisis is spreading as we speak. Like other similar documents, the NAS report aims to make potentially constructive and interesting proposals to improve research practices, but justifies them on the basis of an empirically unsupported and strategically counterproductive claim that the scientific system is falling apart.

    Before commenting on some of the 40 recommendations made in the executive summary of the report, I will very briefly restate that I see no evidence in the literature of an “irreproducibility crisis of modern science”. There is no evidence that most of the literature is hopelessly biased or irreproducible, no evidence that the validity of research findings has declined in recent decades, and no evidence that such problems are rising in the USA or other Western countries due to pressures to publish. I used to believe differently, but recent, better studies have changed my mind.

    Make no mistake, the research and publication practices of many fields have plenty of room for improvement. However, as summarized in the PNAS article (an extended version of which is in preparation), the most up-to-date research suggests that problems with transparency, reproducibility, bias and misconduct are highly irregularly distributed, across and within individual disciplines, and have equally diversified and complex causes. This makes me extremely skeptical, indeed wary, of any recommendation to adopt “one size fits all” solutions. This is the main criticism that I have of some of the recommendations made in the report.

    Many of the recommendations made by the report, I strongly support.

    I emphatically agree, for example, with all recommendations to improve the statistical literacy of scientists, journalists, policymakers and indeed the general public (recommendations n. 8-12, 28, 33, 34, 35, 39, 40). If all of us had been trained in statistical thinking to the same extent that we were taught algebra and geometry, many ill-advised debates, within science as well as in society, would dissolve, and the world would be a much better place.

    I also endorse any recommendation to pay greater attention to the methodological solidity of results, to focus on the substantive (and not merely statistical) significance of results, and to staff government agencies, the judicial system and the media with adequately trained methodologists and statisticians (n. 1, 2, 21, 23, 36).
    I also generally endorse any recommendation to “experiment” with innovative research and publication practices (e.g. n. 5, 15, 17). The emphasis here, however, has to be on “experimenting” with new standards rather than on “imposing” them.

    This is why I disagree, to varying degrees, with most of the other recommendations made.

    Most of the other recommendations seem, explicitly or implicitly, aimed at imposing general standards of practice, in the name of reproducibility, across all research fields. Such recommendations presuppose that reproducibility is a clear-cut concept that can be defined and assessed universally, and that it is substantially equivalent to the truthfulness, validity and generalizability of results. Unfortunately, this is not the case.

    Far from conclusively measuring how reproducible psychology or cancer biology are, the recent reproducibility studies cited by the report (and others not cited, whose results are less well known and more optimistic) have sparked a fascinating debate over how the reproducibility of a study can be measured, assessed and interpreted empirically. This literature is gradually unveiling how complex, multifaceted and subtle these questions really are.
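
    To give a concrete flavour of why “did it replicate?” has no single answer, consider the following minimal sketch (in Python, using made-up effect sizes and standard errors rather than data from any real study). It applies three criteria commonly discussed in this literature to the same pair of results: whether the replication is itself statistically significant, whether its estimate falls inside the original study’s 95% confidence interval, and whether the original estimate falls inside the replication’s interval.

        # Hypothetical summary statistics (illustrative only, not from any real study):
        # effect estimates and standard errors for an "original" and a "replication".
        original = {"effect": 0.45, "se": 0.18}
        replication = {"effect": 0.20, "se": 0.09}

        def ci95(study):
            """95% confidence interval, assuming approximate normality."""
            return (study["effect"] - 1.96 * study["se"],
                    study["effect"] + 1.96 * study["se"])

        def significant(study, z_crit=1.96):
            """Two-sided test of effect != 0 at the 5% level."""
            return abs(study["effect"] / study["se"]) > z_crit

        # Three different "replication success" criteria applied to the same data.
        rep_significant = significant(replication)
        lo_o, hi_o = ci95(original)
        rep_within_original_ci = lo_o <= replication["effect"] <= hi_o
        lo_r, hi_r = ci95(replication)
        original_within_rep_ci = lo_r <= original["effect"] <= hi_r

        print("Replication significant at p < 0.05:  ", rep_significant)
        print("Replication effect in original 95% CI:", rep_within_original_ci)
        print("Original effect in replication 95% CI:", original_within_rep_ci)

    On these invented numbers the first two criteria declare the replication a success while the third does not, which is precisely the kind of ambiguity that the current methodological debate is trying to resolve.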

    It is well understood, but too hastily forgotten, that research results may not replicate for reasons that have nothing to do with flaws in their methodology or with the low scientific integrity of their authors. Reality can be messy and complex, and studies that try to tackle complex phenomena (which is to say most social and biological studies) are bound to yield evidence that is incomplete, erratic, sometimes contradictory and endlessly open to revision and refinement.

    Therefore, whilst recommendations such as requiring that “all new regulations requiring scientific justification rely solely on research that meets strict reproducibility standards” (n. 29) or “to prevent government agencies from making regulations based on irreproducible research” (n. 31) may be agreeable in principle, in practice they are unlikely to work as hoped. At the very least, the standards and criteria mentioned ought to be established on a case-by-case basis.

    Methods can certainly be made more “reproducible”, in the sense of being communicated with greater completeness and transparency. Recommendations to improve these components of the research process are unobjectionable, as is the suggestion to experiment with practices that add statistical credibility to results, such as pre-registering a study. However, no amount of pre-registration, transparency, and sharing of data and code can turn a badly conceived and badly designed study into a good one. Even worse, by superficially complying with bureaucratic reproducibility standards, a flawed study might acquire undeserved legitimacy.

    Unfortunately, “reproducible” is not synonymous with “true”. If there were a simple methodological recipe to determine whether a research finding is valid, we would have found it by now. Ironically, the root cause of many of the problems discussed in the report is precisely the illusion that such a recipe exists, and that it comes in the form of Null Hypothesis Significance Testing. Behind the recommendation to lower the significance threshold to P < 0.01 (n. 20) I see the risk of perpetuating such a myth. A risk that I do not see, conversely, in recommending the use of Confidence Intervals (n. 3) and Bayesian thinking (n. 9).
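
    As a rough numerical illustration of why no significance threshold, however strict, can certify truth, here is a back-of-the-envelope sketch (the prior and power values are assumptions chosen for illustration, not estimates of any real field). When only a small fraction of the hypotheses being tested are actually true, a sizeable share of the findings that clear P < 0.05, and still a non-trivial share of those that clear P < 0.01, are false positives.

        # Back-of-the-envelope false-discovery calculation under assumed numbers.
        # prior: fraction of tested hypotheses that are actually true
        # alpha: significance threshold; power: chance of detecting a true effect
        def false_positive_share(prior, alpha, power):
            true_positives = prior * power
            false_positives = (1 - prior) * alpha
            return false_positives / (true_positives + false_positives)

        prior = 0.10  # assume only 10% of tested hypotheses are true (illustrative)
        for alpha, power in [(0.05, 0.8), (0.01, 0.6)]:
            share = false_positive_share(prior, alpha, power)
            print(f"alpha = {alpha}: roughly {share:.0%} of 'significant' findings are false")

    The exact percentages depend entirely on the assumed prior and power, which is the point: statistical significance, at whatever threshold, is not a guarantee of validity.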

    For similar reasons, I am conflicted about calls to fund replication research (e.g. n. 19) or reward the most significant negative results (n. 22). These ideas are excellent in principle, but presuppose that we have universal methodological criteria to establish what counts as a valid replication.
    If an original study was badly conceived, the best way to show its flaws is not to replicate it exactly, but rather to design a different, better study. Or, sometimes, it might be best to just critique it and move on.




    ***


Maturing Meta-Science Was On Show In Washington DC

    Stanford, 7 April 2017

    Meta-research is coming of age. This is the energizing insight that I brought home from Washington DC, where I attended the recent Sackler Colloquium held at the National Academy of Sciences. Organized by David B. Allison, Richard Shiffrin and Victoria Stodden, and generously supported by the Laura and John Arnold Foundation and others, the colloquium brought together experts from all over the academic and geographic world to discuss “Reproducibility of Research: Issues and Proposed Remedies”.

    The title was great but, let’s be honest, it didn’t promise anything exceedingly new. By now, small and large events on this or similar themes take place regularly in all countries. They absolutely need to, because even though we seem to understand relatively well which biases and issues affect science the most (as we showed in a recent paper), we are still far from having an accurate picture of the problems at hand, let alone from devising adequate solutions. Needless to say, it was an absolute honor and a real pleasure for me to take part as a panelist, with the task of closing the day dedicated to “remedies”.

    Never judge a conference by its title. Something new was in the DC air, or at least that’s what I felt. That certain sense of déjà entendu, that inevitable ennui of the converted being preached to, were not there. In their place was the electrifying impression that debates were surging, that meta-science was truly in the making.

    Every topic was up for debate and no assumption seemed safe from scrutiny. Talks, questions and discussions felt mature, prudent and pragmatic, and yet they expressed an exciting diversity of experiences, opinions, ideas, visions and concerns.

    Much praise, therefore, goes to the organizers. The lineup of speakers cleverly combined meta-research household names, like Brian Nosek and David Moher (our former visiting scholar, whom we sorely miss), with voices that are less commonly heard in the meta-research arena: Lehana Thabane, for example, who discussed reporting practices, and Emery Brown, whose appeal to teach statistics in primary school ought to be broadcast the world over.

    Most importantly, however, the program included reputable counter-voices. For example, that of Susan Fiske, who has been under fire for her “methodological terrorism” remarks and is now studying scientific discourse in social media. Or that of Kathleen Hall Jamieson, who warned about the damage to the public image of science caused by an exceedingly negative narrative. Videos of all talks are available on the Sackler YouTube channel.

    As I argued in my session, whilst we should definitely strive to improve reproducibility and reduce bias wherever we see it, we have no empirical basis to claim that “science is broken” as a whole, or indeed that science was more reliable in the past than it is today. We simply do not know whether that is the case. The difficulties in defining and measuring reproducibility were well illustrated by Joachim Vandekerckhove, Giovanni Parmigiani and other speakers. Indeed, the very meaning of reproducibility may differ across fields, as our opinion piece led by Steven Goodman argued last year.

    Despite, or perhaps because of, these difficulties, the best evidence at the moment seems to me to suggest that biased and false results are very irregularly distributed across research fields. The scientific enterprise badly needs interventions in specific areas, but as a whole it is still relatively healthy. This is also what my past studies on positive study conclusions, retractions, corrections and scientific productivity, as well as our most recent meta-assessment of bias, suggest.

    Moreover, we do not need to believe that science is totally broken to endorse initiatives to improve reproducibility. Alternative narratives were offered, implicitly, by some of the speakers. These include Victoria Stodden, who on the first day showed how computational methods (developed in the field where the concept of “reproducible research” was invented) are pervading all the sciences, bringing new standards of reproducibility with them. A narrative of industrialization of the research process was suggested by Yoav Benjamini, and one of democratization of knowledge by Hilda Bastian.

    My assessment of the condition of modern science could be wrong, of course, and my remarks were met with several skeptical comments from the audience. These were naturally welcome, because only diversity and debate allow a research field to make progress and mature. Meta-science, this latest event proved to me, is certainly doing so.