We will offer a 2-week immersive seminar on Philosophy of Statistics (PhilStat) at Virginia Tech for faculty and post-docs in philosophy who wish to acquire or strengthen their background in PhilStat as it relates to philosophical problems of evidence and inference, to today’s statistical crisis of replication, and to associated evidence-policy debates. We also invite social scientists and methodology researchers interested in strengthening their philosophical scholarship in this arena. A total of 12-15 applicants will be selected. Given our goals, we anticipate approximately 2/3 will be philosophers, but we are not applying any rigid rules. We will consider up to 2 advanced Ph.D. students working on a dissertation that is directly in this area. Update (3/3/19): Please see important update in FAQ (5).
All accepted participants will receive private housing with kitchen facilities (Marriott Residence, Blacksburg) and a stipend of $1,000 (in 2 installments). See STIPENDS.
Our primary goal is to strengthen Phil Stat research and teaching in Philosophy programs (by incorporating PhilStat). However, we wish also to enable statistical practitioners and researchers on methods to gain a greater understanding of the philosophical dimensions of statistical debates, as well as a facility in the conceptual and critical skills included under the umbrella of Phil Stat. Today’s debates are intertwined with philosophical terms that often remain vague, such as evidence, validity, inference, realism, reliability, rationality, explanation, induction, confirmation, and falsification. This hampers communication among various stakeholders, making it difficult to even see where they agree. Thus we also encourage interested social scientists and methodology researchers to apply.
Philosophy of Statistics (Phil Stat), broadly understood. Phil Stat includes:
(a) philosophy of statistics: the philosophical and conceptual foundations of statistical inference and the analysis of the uses of probability in collecting, modeling and learning from data;
(b) problems of general philosophy of science insofar as aspects of error-prone and uncertain reasoning are involved: problems of inductive and deductive inference, evidence, testing, falsification, science/pseudoscience, underdetermination, realism, modeling, data analysis and experiment;
(c) the philosophy and history of science, and evidence policy, insofar as these involve error-prone reasoning, whether formal or informal.
The focus of the Summer Seminar in Phil Stat. Our focus will revolve around current controversies about statistical methodology. The field of philosophy of statistics has long been marked by philosophical disagreement and controversy. Although the availability of high-powered methods has led to eclecticism and pragmatism, long-standing battles still simmer below the surface in questions about scientific trustworthiness and in the evidence policy reforms put forward by journals and professional societies. Rival conceptions of the nature of statistical inference show up unannounced in today’s problems of scientific integrity, irreproducibility, and questionable research practices. When it comes to how to restore scientific integrity, the experts do not agree. We will peel back the layers as to why.
For example, today’s statistical crisis of replication grows out of the difficulty in obtaining statistically significant results when an independent group seeks to replicate a finding using more stringent protocols and preregistration of hypotheses and data generation. However, statistical significance tests are themselves steeped in the inductive-statistical controversies played out in the philosophy and history of statistics. Key notions–P-values, Type I and II error probabilities, power, confidence levels, posterior probabilities–are often confused and are interpreted in myriad ways. A jungle of thorny debates about probabilistic ontology, the nature of scientific progress, the role of probability–not to mention matters of politics–are interwoven. These issues form an important part of what we are calling PhilStat. Questions about values in science and ethics of research are also intimately entwined. Work in AI, machine learning, and causal modeling is also a component of the broad area we have in mind, although we will not have time to study those topics in our summer seminar. However, participants may have projects that relate to these.
Justification. The current statistical crisis in science cries out for philosophical illumination, while offering a superb avenue for philosophers to genuinely impact scientific practice and policy. The assumptions behind today’s competing evidence reforms are largely hidden from those outside the loop of Phil Stat. This is a crucial obstacle to scrutinizing the consequences for science policy, clinical trials, personalized medicine, and across a wide landscape of Big Data modeling. Our program is intended as a first step to overcoming this obstacle. Despite technical sophistication, basic concepts of statistical testing and inference are more unsettled than ever. We aim to strengthen the role of philosophy in statistical debates as well as to enhance statistical inference in philosophical debates.
Many of the key problems in today’s evidence-policy disputes are conceptual. Philosophers of science should be involved. Likewise, by bringing philosophical acumen to the debates, practitioners and metascience researchers can cut through the unclarity we often see regarding the nature and roles of probability in inference. The deepest problems underlying the replication crisis go beyond formal statistics–into measurement, experimental design, communication of uncertainty. In addition, the importance of ethics in research and in data science is becoming increasingly recognized. Without an understanding of the basic statistics, informed by philosophical assumptions about the nature and goals of using probability in learning, it’s impossible to see where the formal and informal, value-laden issues intersect.
Participants Can Expect to:
(1) Gain the necessary background in statistical methodology and philosophical foundations of statistics in order to integrate PhilStat into their research, teaching and professional engagements.
(2) Study current debates between frequentist, Bayesian, likelihood and other methods, as they interrelate to (i) philosophical problems of evidence and inference, and to (ii) today’s debates over data methods and statistical policy.
They will also have a role in shaping an emerging interdisciplinary field cutting across philosophy, statistics, and science.
A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims. We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance. A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar updated the 2016 ASA Statement on P-Values and Statistical Significance (ASA I). In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here…. ‘statistically significant’–don’t say it and don’t use it”. To herald the ASA II, and the special issue “Moving to a world beyond ‘p < 0.05’”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane, “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”!
(Note: By “ASA II” I allude only to the authors’ general recommendations, not their summaries of the 43 papers in the issue.)
Hardwicke and Ioannidis (2019) worry that recruiting signatories on such a paper politicizes the process of evaluating a stance on scientific method, and fallaciously appeals to popularity (argumentum ad populum) “because it conflates justification of a belief with the acceptance of a belief by a given group of people”. Opposing viewpoints are not given a similar forum. Fortunately, John Ioannidis (2019) was able to come out with a note in JAMA challenging ASA II and AGM, but the vast majority of stakeholders in the debate go unheard. Appealing to popularity gives a prudential reason to go along: it is risky to stand in opposition to the hundreds who signed, not to mention the thought leaders at the ASA. There is also an appeal to fear, with the result that many will fear using statistical significance tests altogether. Why risk using a method that is persecuted with such zeal and fanfare?
Ioannidis (2019) points out what may not be obvious at first: it is not just a word ban but a gatekeeper ban:
Many fields of investigation … have major gaps in the ways they conduct, analyze, and report studies and lack protection from bias. Instead of trying to fix what is lacking and set better and clearer rules, one reaction is to overturn the tables and abolish any gatekeeping rules (such as removing the term statistical significance). However, potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science.
Among the top-cited signatories who respond to their questionnaire, Hardwicke and Ioannidis find a heavy representation of fields with prevalent concerns about low reproducibility. Yet “abandoning the concept of statistical significance would make claims of ‘irreproducibility’ difficult if not impossible to make. In our opinion this approach may give bias a free pass”.
I agree, and will show why.
I continue with (excerpts of a preprint of) my article; references are formatted in the usual way. You can read the “free access” version here.
It might be assumed I would agree to “retire significance” since I often claim “the crude dichotomy of ‘pass/fail’ or ‘significant or not’ will scarcely do” and because I reformulate tests so as to “determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones” (Mayo 2018). [Genuine effects, as Fisher insisted, require not isolated small P-values, but a reliable method to successfully generate them.] We should not confuse prespecifying minimal thresholds in each test, which I would uphold, with fixing a value to habitually use (which I would not). N-P tests called for the practitioner to balance error probabilities according to context, not rigidly fix a value like .05. Nor does having a minimal P-value threshold mean we do not report the attained P-value: we should, and N-P agreed!
The “no threshold” view is not merely to never use the S word and report continuous P-values
These two rules alone would not lead Hardwicke and Ioannidis to charge, correctly, in my judgment, that “this approach may give bias a free pass”. ASA II and AGM decry using any prespecified P-value threshold as the basis for categorizing data in some way, such as inferring that results are, or are not, evidence of a genuine effect.
“Decisions to interpret or to publish results will not be based on statistical thresholds” (AGM).
“Whether a p-value passes any arbitrary threshold should not be considered at all” in interpreting data (ASA II).
Consider how far reaching the “no threshold” view is for interpreting data. For example, according to ASA II, in order for the U.S. Food and Drug Administration (FDA) to comply with its “no threshold” position, it does not suffice that they report continuous P-values and confidence intervals. The FDA would have to end its “long established drug review procedures that involve comparing p-values to significance thresholds for Phase III drug trials”.
The New England Journal of Medicine (NEJM) responds (2019) to the ASA call to revise their guidelines, but insists that a central premise on which their revisions are based is “the use of statistical thresholds for claiming an effect or association should be limited to analyses for which the analysis plan outlined a method for controlling type I error”. In the article accompanying the revised guidelines:
“A well-designed randomized or observational study will have a primary hypothesis and a prespecified method of analysis, and the significance level from that analysis is a reliable indicator of the extent to which the observed data contradict a null hypothesis of no association between an intervention or an exposure and a response. Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments [for multiple trials] have a role in those decisions”.
Specifying “thresholds that have a strong theoretical and empirical justification” escapes the ASA II ruling: “Don’t conclude anything about scientific …importance based on statistical significance”.
Although less well advertised, the “no thresholds” view also torpedoes common uses of confidence intervals and Bayes Factor standards.
[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups. Similarly, we need to stop using confidence intervals [CIs] as another means of dichotomizing. (ASA II)
AGM’s “compatibility intervals” are redolent of the consonance intervals of Kempthorne and Folks (1971), except that the latter use many thresholds, one for each of several consonance levels. Even these would seem to violate the rule that results should not be “categorized into any number of groups”.
…Nor could Bayes factor thresholds be used, as they often are, to test a null against an alternative. It is not clear how any statistical tests survive. A claim has not passed a genuine test, if none of the results are allowed to count against it. We are not told what happens to the use of significance tests to check if statistical model assumptions hold approximately, or not–essential across methodologies. As George Box, a Bayesian, remarks, “diagnostic checks and tests of fit … require frequentist theory significance tests for their formal justification”(1983, p. 57).
What arguments are given to accept the no threshold view?
Getting past the appeals to popularity and fear, the reasons ASA II and AGM give are that thresholds can lead to well-known fallacies, and even to some howlers more extreme than those long lampooned. Of course it’s true:
a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment …). Nor do statistically significant results ‘prove’ some other hypothesis. (AGM)
It is easy to be swept up in their outrage, but the argument “significance thresholds can be used very badly, therefore remove significance thresholds” is a very bad argument. Moreover, it would remove the very standards we need to call out the fallacies. A rule that went from any non-significant result to inferring that ‘no effect’ is proved, or, to take something less extreme, that it is well warranted or the like, would have extremely high Type II error probabilities. Such rules typically deal with a point null hypothesis, which makes it even worse.
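To see how badly such a rule would fare, here is a minimal sketch (the test and numbers are my own illustration, not from ASA II or AGM): in a one-sided Normal test with a 0.025 cutoff, non-significant results remain highly probable even when a modest effect is present, so “non-significant, therefore no effect” would usually be wrong.

```python
# Minimal sketch (illustrative numbers of my own): one-sided test of
# H0: mu <= 0 vs H1: mu > 0 with known sigma, rejecting when the sample
# mean exceeds the alpha = 0.025 cutoff. The criticized rule infers
# "no effect" whenever the test fails to reject.
from scipy import stats

sigma, n, alpha = 1.0, 25, 0.025
se = sigma / n**0.5
cutoff = stats.norm.ppf(1 - alpha) * se          # reject if sample mean > cutoff

for true_mu in [0.1, 0.2, 0.3]:                  # modest true effects
    type_II = stats.norm.cdf((cutoff - true_mu) / se)
    print(f"true mu = {true_mu}: P(non-significant result) = {type_II:.2f}")
# At mu = 0.1 the test fails to reject over 90% of the time, so inferring
# "no effect" from a non-significant result is usually an error here.
```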
The danger of removing thresholds on grounds they could be badly used is that they are not there when you need them. Ioannidis zeroes in on the problem:
The proposal to entirely remove the barrier does not mean that scientists will not often still wish to interpret their results as showing important signals and fit preconceived notions and biases. With the gatekeeper of statistical significance, eager investigators whose analyses yield, for example, P = .09 have to either manipulate their statistics to get to P < .05 or add spin to their interpretation to suggest that results point to an important signal through an observed “trend.” When that gate keeper is removed, any result may be directly claimed to reflect an important signal or fit to a preexisting narrative.
As against Ioannidis’ anything-goes charge, it might be said that even in a world without thresholds a largish P-value could not be taken as evidence of a genuine effect. For to do so would be to say something nonsensical. It would be to say: Even though larger differences would frequently be expected by chance variability alone (i.e., even though the P-value is largish), I maintain the data provide evidence they are not due to chance variability.
But such a response turns on appealing to a threshold to block it, minimally requiring that the P-value be rather small (e.g., < .1). (It also shows why P-values are apt measures for the job of distinguishing random error.) Thus, our eager investigators, facing a non-small P-value, are still incentivized to manipulate their statistics. Say they ransack the data until finding a non-prespecified subgroup that provides a nominally small enough P-value. In a world without thresholds, we would be hamstrung from highlighting, critically, P-values that breach (as opposed to uphold) preset thresholds.
“Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight” (my emphasis, ASA II).
More important than keeping a specific word is keeping a filter for error control. The 2016 ASA I warned in Principle 4: “Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting”. …An unanswered question is how Principle 4 is to operate in a world with ASA II.
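A back-of-the-envelope simulation (my own, not anything in ASA I or II) shows what Principle 4 is guarding against: if an investigator can search across many subgroups and report only the best-looking one, a nominally small P-value is easy to come by even when there is no effect anywhere.

```python
# Simulation sketch (my own illustration): 20 subgroup analyses under a
# global null, each a one-sided z-test with known sigma = 1 and a nominal
# .05 threshold; we ask how often at least one subgroup looks "significant".
import numpy as np

rng = np.random.default_rng(0)
k, n, reps = 20, 30, 10_000          # 20 subgroups, 30 observations each
crit = 1.645                         # one-sided nominal .05 cutoff for z
hits = 0
for _ in range(reps):
    data = rng.normal(0.0, 1.0, size=(k, n))       # no real effects anywhere
    z = data.mean(axis=1) * np.sqrt(n)             # z-statistic per subgroup
    hits += (z.max() > crit)                       # any nominal "significance"?
print(f"P(at least one nominally significant subgroup) ≈ {hits / reps:.2f}")
# Roughly 1 - 0.95**20 ≈ 0.64: the nominal .05 no longer reflects the
# actual error probability of the search-and-select procedure.
```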
The NEJM’s revised guidelines, far from agreeing to use P-values without error probability thresholds, will now be stricter in their use. When no method to adjust for multiplicity of inferences or controlling the Type I error probability is prespecified, the report of secondary endpoints
should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.
Confidence intervals severed from their dualities with tests, from which they were initially developed, lose their error probability guarantees.
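The duality is easy to exhibit. Here is a minimal sketch with an illustrative Normal-mean example of my own choosing: the 95% confidence interval collects exactly those null values that a two-sided α = .05 test would not reject, which is where its error probability guarantee comes from.

```python
# Sketch of the test/CI duality (illustrative numbers of my own): with a
# Normal mean and known sigma, the 95% CI contains mu0 exactly when the
# two-sided p-value for testing mu = mu0 exceeds .05.
from scipy import stats

xbar, sigma, n, alpha = 151.0, 10.0, 100, 0.05
se = sigma / n**0.5
z = stats.norm.ppf(1 - alpha / 2)
ci = (xbar - z * se, xbar + z * se)                  # 95% CI for mu

def two_sided_p(mu0):
    return 2 * stats.norm.sf(abs(xbar - mu0) / se)

for mu0 in [148.5, 149.5, 151.0, 152.5, 153.5]:
    inside = ci[0] <= mu0 <= ci[1]
    print(f"mu0 = {mu0}: p = {two_sided_p(mu0):.3f}, "
          f"{'inside' if inside else 'outside'} the 95% CI")
# mu0 lies inside the 95% CI exactly when its two-sided p-value exceeds .05,
# so the CI inherits its 95% guarantee from the test's 5% error control.
```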
The ASA P-value project is lately careering into recommendations on which there has been little balanced discussion and much disagreement. Hardwicke and Ioannidis find that more than half of the respondents deny significance should be excluded from all science, and the 43 papers in the special issue “Moving to a world beyond ‘p < 0.05’” offer a cacophony of competing reforms.
It is hard to resist the missionary zeal of masterful calls: Do you want bad science to thrive, or do you want to ban significance? (a false dilemma). A question to raise before jumping on the bandwagon: are they asking the most unbiased questions about the consequences of removing thresholds currently ensconced in hundreds of legal statutes and best practice manuals? This needs to be carefully considered if the reforms intended to improve the credibility of statistics are not to backfire, as they may already be doing.
ASA II is part of a large undertaking; it contains plenty of sagacious advice. Notably the M in ATOM: Modesty.
Be modest by recognizing that different readers may have very different stakes on the results of your analysis, which means you should try to take the role of a neutral judge rather than an advocate for any hypothesis.
ASA II regards its positions as “open to debate”. An open debate is very much needed.
To use the Richard Morey Severity app to get the entries in Table 3.3 (SIST, p. 145), you have to click on sampling distributions and make the replacements shown below (as the app was created for an example we used with different numbers). His way of wording Severity, however, is enlightening. These entries are all for sample mean M = 151, and thus lead to “non-reject” (for the given 0.025 significance level cut-off). Hence, he computes SEV for mu ≤ various values. In each case, SEV is the area to the right of M = 151, computed under the alternative (153, 152, 151, 150.5, 150).
*Note: I pasted these in the opposite order from that shown in Table 3.3, from largest to smallest.
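For anyone who wants to check the entries directly, here is a sketch of the same computation in Python rather than the app. The standard error of 1 (sigma = 10, n = 100) comes from the SIST example behind Table 3.3; it is supplied by me here since it isn’t restated in this post.

```python
# Severity computation described above (standard error of 1 assumed from
# the SIST example): with observed mean M = 151 and a non-rejection,
# SEV(mu <= mu1) is the area to the right of 151 under mu = mu1.
from scipy import stats

M, se = 151.0, 1.0                       # observed mean; SE = 10/sqrt(100)
for mu1 in [153, 152, 151, 150.5, 150]:  # same order as pasted above
    sev = stats.norm.sf((M - mu1) / se)  # P(sample mean >= 151; mu = mu1)
    print(f"SEV(mu <= {mu1}) = {sev:.3f}")
# Prints 0.977, 0.841, 0.500, 0.309, 0.159, listed from largest to smallest
# as noted above.
```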