Abstract
Machine Learning (ML) is arguably one of the most significant general-purpose technologies of our age. Its appealing promise is that it can take a large corpus of “raw” data packaged as a “dataset”, discover patterns in it, and derive putatively objective and reliable conclusions about the world from this data-based learning. This technology, and its implicit mindset, is increasingly the subject of attention in the sciences [Hey 2020], which in turn suggests value in examining this mindset through the lens of the philosophy of science. Although there is now work arguing that data is not given (as the etymology of “data” suggests) but should be viewed as capta (i.e. taken by deliberate choice) [Kitchin 2014], we go further and argue that it should be viewed as constructa: something constructed as part of the entire rhetorical chain of building reliable knowledge. We start by arguing that the ML-centric conception of the raw dataset as a given is a large part of the problem in using ML in scientific endeavors. We build on the philosophical literature on values in science [Douglas 2000, 2016; Longino 1996] to show that two essential and ever-present aspects, context and values, are intrinsic to the manufacturing of datasets. Ironically, obscuring and ignoring these aspects thwarts the very ostensible goal of using data-driven inferences for rational, reliable, and sound knowledge and action in the world. We argue that, rather than conceiving of the goal of data-science reasoning as merely providing warrants for the calculations made upon the given data (construed as an accurate representation of the world), we are better served if we consider the entire process, including the acquisition (construction) of the data itself, and seek to legitimate that entire process.
Our analysis allows us to identify and attack three pervasive but flawed assumptions that underpin the default conception of data in ML: (1) data is a thing, not a process; (2) data is raw and aimlessly given; (3) data is reliable. These assumptions not only result in epistemic harms but, more consequentially, can lead to social and moral harms as well. We argue that value-ladenness and theory-ladenness coincide for data, and are readily understood by conceiving of data-based claims as rhetorical claims, where “facts” and “values” are equated in the sense that both are taken as incontrovertible assumptions. The justification for data-based inferences then amounts to a rhetorical warrant for the whole process. Hence, instead of presuming that data is given (or taken) and that the scientist’s job is to explore it or confirm hypotheses tied to it, it is better to construe data as constructa: something manufactured, just like scientific knowledge, whose solidity and reliability are within the control of the scientist rather than being an intrinsic property of the world.