Nov 10, 2022, 01:30 PM - 04:15 PM (America/New_York)
Generation and Exploration in Data Science
Abstract: Many scientific fields now benefit from 'Big Data.' Yet along with large datasets come an abundance of computational and statistical techniques to analyze them. Many of these techniques have not been subject to sustained philosophical scrutiny. This is in part because the scant literature on philosophy of data science often focuses on hypothesis confirmation as the primary end of data analysis. Yet there are many scientific contexts in which generation (of hypotheses, of categories, of methods) is at least as important an aim. This symposium will contribute to debates about realism, natural kinds, exploratory data analysis, and the value-ladenness of science through the lens of philosophy of data science, opening critical discussion about the nature of data and the emerging methods and practices used to foster scientific knowledge.
Data, Capta and Constructa: Exploring, confirming, or manufacturing data (Contributed Papers), 01:30 PM - 04:15 PM (America/New_York)
The technology of Machine Learning (ML) is arguably one of the most significant general-purpose technologies of our age. The appealing promise of machine learning is that it can take a given large corpus of "raw" data packaged up into a "dataset", learn and discover various patterns, and derive putatively objective and reliable conclusions about the world according to this data-based learning. This technology, and its implicit mindset, is increasingly becoming the subject of attention in the sciences [Hey 2020], which in turn suggests value in viewing this mindset in terms of philosophy of science. Although there is now work arguing that data is indeed not given (as the etymology of "data" suggests) but should be viewed as capta, i.e., taken by deliberate choice [Kitchin 2014], we go further and argue that we should view it as constructa: something constructed as part of the entire rhetorical chain of building reliable knowledge. We start by arguing that the ML-centric conception of the raw dataset as a given is a large part of the problem in using ML in scientific endeavors. We build on the philosophical literature on values in science [Douglas 2000, 2016; Longino 1996] to show that two essential and ever-present aspects, context and values, are intrinsic to the manufacturing of datasets. Ironically, by obscuring and ignoring these aspects, the very ostensible goal of using data-driven inferences for rational, reliable, and sound knowledge and action in the world is thwarted. We argue that rather than conceiving the goal of data-science reasoning as merely providing warrants for the calculations made upon the given data (construed as an accurate representation of the world), we are better served if we conceive of the entire process, including the acquisition (construction) of the data itself, and seek to legitimate that entire process. Our analysis allows us to identify and attack three pervasive but flawed assumptions that underpin the default conception of data in ML: (1) data is a thing, not a process; (2) data is raw and aimlessly given; (3) data is reliable. These assumptions not only result in epistemic harms, but more consequentially can lead to social and moral harms as well. We argue that value-ladenness and theory-ladenness coincide for data, and are readily understood by conceiving of data-based claims as rhetorical claims, where "facts" and "values" are equated in the sense that they are taken as incontrovertible assumptions. The justification for data-based inferences then amounts to a rhetorical warrant for the whole process. Hence, instead of presuming that your data was given (or taken) and that your job (as a scientist) is to explore it or confirm hypotheses tied to it, it is better to construe data as constructa: something manufactured, just like scientific knowledge, whose solidity and reliability is within the control of the scientist, rather than being an intrinsic property of the world.
Exploratory analysis and the expected value of experimentation (Contributed Papers), 01:30 PM - 04:15 PM (America/New_York)
Astonishingly large datasets are now relatively easy to come by in many scientific fields. The availability of open datasets means that it is possible to acquire data on a problem without formulating any hypothesis whatsoever. The idea of an exploratory data analysis (EDA) predates this situation, but many researchers find themselves appealing to EDA as an explanation of what they are doing with these new resources. Yet there has been relatively little explicit work on what EDA is or why it might be important. I canvass several positions in the literature, find them wanting, and suggest an alternative: exploratory data analysis, when done well, shows the expected value of experimentation for a particular hypothesis. There are three main positions on EDA in the literature. The first identifies EDA with a set of techniques that can be applied to data in order to suggest hypotheses. This view goes back to Tukey (1969, 1977, 1993), who emphasized the "procedure-oriented" nature of exploratory analysis and the extent to which these techniques were "things that can be tried, rather than things that 'must' be done" (1993, 7). Hartwig and Dearing (2011, 10) similarly speak of EDA as a "state of mind" or a "certain perspective" that one brings to the data. Yet this does not suggest any sort of success conditions for EDA—either in particular cases or for new techniques in general—and therefore offers little guidance on EDA as such. Second, EDA is sometimes treated as simply confirmatory data analysis done sloppily, with looser parameters and more freedom. Authors who suggest this view do so primarily to denigrate EDA (Wagenmakers et al., 2012). This is too pessimistic: charity demands that we prefer a model where authors who appeal to EDA are not simply covering up their sins as researchers. Third, EDA is sometimes linked to so-called exploratory experiments (Steinle, 1997; Franklin, 2005; Feest and Steinle, 2016). Exploratory experimentation is no doubt important, and the techniques of EDA can shed light on particular kinds of exploratory experimentation. Yet EDA also finds use in mature fields where phenomena have been stabilized and the basic theoretical menu is complete, suggesting EDA is related to but distinct from exploratory experimentation. I suggest instead that EDA is primarily concerned with finding hypotheses that would be easy to confirm or disconfirm if a proper experiment were to be done. The techniques associated with EDA are geared towards showing unexpected or striking effects. Whether these effects actually hold cannot be determined from the dataset: EDA also picks up artifacts of undirected data collection. Nevertheless, proper confirmatory experiments are often costly and time-consuming, and a good EDA shows where those costs should best be spent. Importantly, EDA tells us whether a hypothesis is worth testing without telling us whether it is likely to be true: rather, it tells us that we are likely to get an answer for a suitably low cost. I link this idea to related work on trade-offs between information costs in political economy (Stigler, 1961) and Bayesian search theory (Stone, 1976). The resulting position shows why previous positions have the plausibility they do, while providing a principled framework for developing and evaluating EDA techniques.
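As a rough illustration of the "worth testing at a suitably low cost" idea (added here for concreteness, not taken from the paper), the Python sketch below computes the expected information gain from a single binary follow-up experiment and weighs it against the experiment's cost. The prior, error rates, experiment cost, and the value assigned to a bit of information are all hypothetical assumptions.

```python
# Minimal value-of-information sketch (illustrative assumptions throughout):
# a hypothesis is worth testing when the expected information gained by a
# follow-up experiment outweighs the experiment's cost.

from math import log2

def entropy(p):
    """Shannon entropy (bits) of a binary hypothesis with P(H) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def expected_info_gain(prior, sensitivity, specificity):
    """Expected reduction in uncertainty about H from one binary experiment.

    sensitivity = P(positive result | H); specificity = P(negative result | not H).
    """
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    p_neg = 1 - p_pos
    post_pos = sensitivity * prior / p_pos if p_pos > 0 else prior
    post_neg = (1 - sensitivity) * prior / p_neg if p_neg > 0 else prior
    return entropy(prior) - (p_pos * entropy(post_pos) + p_neg * entropy(post_neg))

# A striking effect surfaced by EDA: genuinely uncertain prior, sharp experiment available.
gain = expected_info_gain(prior=0.5, sensitivity=0.9, specificity=0.9)
value_per_bit = 100.0    # hypothetical value (e.g., dollars) of one bit of information about H
experiment_cost = 40.0   # hypothetical cost of running the confirmatory experiment
worth_testing = gain * value_per_bit > experiment_cost
print(f"expected info gain: {gain:.3f} bits; worth testing: {worth_testing}")
```

On these made-up numbers the experiment yields about half a bit of information, so it clears the cost threshold; a hypothesis the data could not discriminate (low sensitivity and specificity) would not, which is the sense in which EDA points to where experimental costs are best spent.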
Exploratory analysis: Between discovery and justification (Contributed Papers), 01:30 PM - 04:15 PM (America/New_York)
With the advent of 'Big Data' came an abundance of computational and statistical techniques to analyze it, somewhat vaguely grouped under the label of 'Data Science'. This invites philosophical reflection and systematization. In this paper we focus on exploratory data analysis (EDA), a widespread practice in which researchers summarize the data and formulate hypotheses about it after briefly exploring it. Using Reichenbach's (1938) distinction between the context of discovery and the context of justification, EDA seems to sit in between: exploring in order to discover new hypotheses, and exploiting the data to justify doing confirmatory work. In this paper we present different conceptualizations of EDA, shed light on its importance, and suggest success conditions for it to function well. The distinction between context of discovery and context of justification is well known and heavily discussed in the literature, although different authors provide different interpretations. We take the distinction to be one between two aspects or features of scientific practice: the process of arriving at hypotheses, and the defense or validation of those hypotheses, the assessment of their evidential support (confirmatory work). One playful way of conceptualizing the difference, and the role that EDA plays in between these two contexts, is to model it as a trade-off between exploration and exploitation. Exploration allows for the discovery of new hypotheses; exploitation allows for assessing the evidential support of hypotheses, obtaining a reward in justification. This yields a test for when EDA was done successfully, by balancing the trade-offs. Yet it might be objected that EDA has no place in confirmatory work, as Wagenmakers et al. (2012) emphasize: in a nutshell, it would amount to using the data both to formulate and to test hypotheses. We sympathize with this take, but it assumes a deflationary notion of justification. In the literature on epistemic justification, there are two broad tribes. On the one hand, foundationalist theories hold that the chain of justification terminates in propositions that are self-evident (Descartes), axiomatic (Aristotle, Euclid), known by acquaintance (Russell), or given as (sense) data (empiricism); in a nutshell, that there is a set of propositions that do not require others to be justified. On the other hand, coherentist theories defend the holistic idea that propositions can be justified by how coherent they are with one another (see Bovens and Hartmann (2003) for a Bayesian formulation of this notion). If justification is understood in a foundationalist vein, then Wagenmakers et al. are correct in arguing that EDA is flawed methodology. But if justification is understood in a coherentist way, or a foundherentist one (a mixture of the two developed by Haack (1995)), then there is some role that EDA can play in the context of justification.
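To make the exploration-exploitation analogy concrete (this is an illustration added here, not the authors' formal model), the toy Python sketch below runs an epsilon-greedy multi-armed bandit: a higher epsilon means more exploration (trying options to discover their value) at the expense of exploiting the currently best-known option. The payoffs, noise level, and epsilon values are arbitrary assumptions.

```python
# Toy epsilon-greedy bandit illustrating the exploration-exploitation trade-off
# (illustrative only; all parameters are made up).

import random

def epsilon_greedy(true_payoffs, epsilon=0.1, rounds=1000, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(true_payoffs)   # current estimate of each option's value
    counts = [0] * len(true_payoffs)        # how often each option has been tried
    total_reward = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:                       # explore: pick an option at random
            arm = rng.randrange(len(true_payoffs))
        else:                                            # exploit: pick the best-estimated option
            arm = max(range(len(true_payoffs)), key=lambda i: estimates[i])
        reward = rng.gauss(true_payoffs[arm], 1.0)       # noisy payoff from the chosen option
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean update
        total_reward += reward
    return total_reward

# No exploration can get stuck on a mediocre option; too much exploration
# forgoes the reward of exploiting what has already been learned.
for eps in (0.0, 0.1, 0.5):
    print(f"epsilon={eps}: total reward ~ {epsilon_greedy([0.2, 0.5, 0.9], epsilon=eps):.1f}")
```

The analogy in the abstract is looser than this model, but the sketch shows why "balancing the trade-offs" can serve as a success condition: both pure discovery and pure exploitation leave value on the table.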
On Clustering Algorithms and Natural Kinds (Contributed Papers), 01:30 PM - 04:15 PM (America/New_York)
How and to what end can we "derive" natural kinds and categories from large data sets? This is a question of interest to philosophers of science, natural scientists, and data scientists, who each offer rich but disciplinarily siloed insights. Cluster analysis refers to a variety of algorithmic processes aimed at identifying "clusters" in data sets: subsets of data points that are relevantly more "similar" to one another than to the larger data set. These algorithms are concrete, explicit artefacts that encode and apply a range of theories and intuitions about the purpose and nature of classification. Their computational specifications and theoretical justifications mirror the rich philosophical literature on classification and natural kinds and the roles they play in scientific understanding. Yet the synergies between these two literatures have been largely unexplored (excepting some insightful theoretical work by data scientists, including von Luxburg et al., 2012 and Hennig, 2015). This paper aims to bridge this gap (especially on the philosophical side) by providing a comparative bird's-eye view of both disciplinary conceptions of clustering and classification, drawing out areas of particular promise for future interdisciplinary research. I begin with a brief summary of the roles of classification in science and existing philosophical discussions of the nature, promise, and limitations of classificatory practices. I discuss the general role of classification in inductive inference, and mention some specific considerations that arise from the roles of classification in specific disciplines. I then survey the most common types of clustering algorithms employed by data scientists (largely drawing on Xu and Wunsch 2005). I tease out their core theoretical assumptions and connect them to the conception of classification in philosophy of science. I proceed to consider the contexts in which such algorithms are implemented, where scientists' discretion and contextual peculiarities provide a richer picture of how these clustering algorithms are understood and used by scientists. I pay particular attention to the philosophy of biology, where the role of data analysis has been discussed by philosophers (Leonelli 2016), and where scientists already engage with philosophical work on natural kinds (Boyd 1999). I conclude with a discussion of the ways in which data scientists and philosophers of science can both benefit from the lessons the other has to offer on the nature and purpose of classification.
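As a concrete example of how a clustering algorithm encodes substantive assumptions about classification (an illustration added here; k-means is only one of the families surveyed by Xu and Wunsch 2005), the short Python sketch below implements plain k-means. It presupposes that "similarity" means Euclidean proximity, that the number of kinds k is fixed in advance, and that clusters are roughly convex, compact groups; the data and parameters are made up for illustration.

```python
# Minimal k-means sketch (illustrative only). The algorithmic choices themselves
# carry classificatory assumptions: Euclidean similarity, a pre-chosen k,
# and mean-centered (roughly spherical) clusters.

import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # assumption: k kinds, seeded from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # recompute each center as the mean of its assigned points
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centers, clusters

data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.1, 0.9), (1.0, 1.1)]
centers, clusters = kmeans(data, k=2)
print("centers:", centers)
print("clusters:", clusters)
```

A density-based or hierarchical method applied to the same data would encode different intuitions about what a "kind" is, which is precisely the kind of theoretical choice the paper connects to the philosophical literature on natural kinds.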
Presenter: Sarita Rosenstock, Postdoctoral Research Fellow, Australian National University