Abstract
Astonishingly large datasets are now relatively easy to come by in many scientific fields. The availability of open datasets means that it is possible to acquire data on a problem without formulating any hypothesis whatsoever. The idea of an exploratory data analysis (EDA) predates this situation, but many researchers find themselves appealing to EDA as an explanation of what they are doing with these new resources. Yet there has been relatively little explicit work on what EDA is or why it might be important. I canvass several positions in the literature, find them wanting, and suggest an alternative: exploratory data analysis, when done well, shows the expected value of experimentation for a particular hypothesis.

There are three main positions on EDA in the literature. The first identifies EDA with a set of techniques that can be applied to data in order to suggest hypotheses. This view goes back to Tukey (1969, 1977, 1993), who emphasized the "procedure-oriented" nature of exploratory analysis and the extent to which its techniques are "things that can be tried, rather than things that 'must' be done" (1993, 7). Hartwig and Dearing (2011, 10) similarly speak of EDA as a "state of mind" or a "certain perspective" that one brings to the data. Yet this view does not supply success conditions for EDA, either in particular cases or for new techniques in general, and therefore offers little guidance on EDA as such. Second, EDA is sometimes treated as simply confirmatory data analysis done sloppily, with looser parameters and more freedom. Authors who suggest this view do so primarily to denigrate EDA (Wagenmakers et al., 2012). This is too pessimistic: charity demands that we prefer a model on which authors who appeal to EDA are not simply covering up their sins as researchers. Third, EDA is sometimes linked to so-called exploratory experiments (Steinle, 1997; Franklin, 2005; Feest and Steinle, 2016). Exploratory experimentation is no doubt important, and the techniques of EDA can shed light on particular kinds of it. Yet EDA also finds use in mature fields where phenomena have been stabilized and the basic theoretical menu is complete, suggesting that EDA is related to, but distinct from, exploratory experimentation.

I suggest instead that EDA is primarily concerned with finding hypotheses that would be easy to confirm or disconfirm if a proper experiment were done. The techniques associated with EDA are geared towards showing unexpected or striking effects. Whether these effects actually hold cannot be determined from the dataset: EDA also picks up artifacts of undirected data collection. Nevertheless, proper confirmatory experiments are often costly and time-consuming, and a good EDA shows where those costs are best spent. Importantly, EDA tells us whether a hypothesis is worth testing without telling us whether it is likely to be true: rather, it tells us that we are likely to get an answer at a suitably low cost. I link this idea to related work on information costs in political economy (Stigler, 1961) and on Bayesian search theory (Stone, 1976). The resulting position shows why the previous positions have the plausibility they do, while providing a principled framework for developing and evaluating EDA techniques.