Good Data Dredging

Abstract
"Data dredging"--searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. In real scientific cases in which the number of variables is large in comparison to the sample size, principled search algorithms can be indispensable. I illustrate the first claim with a simple linear model, and the second claim with an application of the oldest correct graphical model search, the PC algorithm, to genomic data followed by experimental tests of the search results. The latter example, due to Steckhoven et al. ("Causal Stability Ranking," Bioinformatics, 28 (21), 2819-2823) involves identification of (some of the) genes responsible for bolting in A. thaliana from among more than 19,000 coding genes using as data the gene expressions and time to bolting from only 47 plants. I will also discuss Fast Causal Inference (FCI) which gives asymptotically correct results even in the presence of confounders. These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
Abstract ID: PSA2022100
Submission Type: symposiast
Affiliation: Carnegie Mellon

Abstracts With Same Type

Abstract ID | Abstract Topic | Submission Type | Primary Author
PSA2022227 | Philosophy of Climate Science | Symposium | Prof. Michael Weisberg
PSA2022211 | Philosophy of Physics - space and time | Symposium | Helen Meskhidze
PSA2022165 | Philosophy of Physics - general / other | Symposium | Prof. Jill North
PSA2022218 | Philosophy of Social Science | Symposium | Dr. Mikio Akagi
PSA2022263 | Values in Science | Symposium | Dr. Kevin Elliott
PSA202234 | Philosophy of Biology - general / other | Symposium | Mr. Charles Beasley
PSA20226 | Philosophy of Psychology | Symposium | Ms. Sophia Crüwell
PSA2022216 | Measurement | Symposium | Zee Perry