Behind the Interface: The Loaded Lingo of Data Pre-Processing


Behind the interface examines the technocultural dimensions of working with (text) data, with the understanding that computing infrastructure and practices are not neutral but emerge from complicated historical lineages that often remain hidden to the user. By peering behind the interface at the circumstances, biases and assumptions surrounding the layers of decision-making involved in developing technologies, we encourage you to consider how structures of inequality become hard-coded into the tools and conventions of data science and how we can work towards opening up new sites of resistance and critique.

On “data cleaning”

Correcting errors in the dataset is often referred to as “data cleaning,” suggesting that data - as received from its source - is disorderly or chaotic. OpenRefine, for example, advertises itself as a “powerful tool for cleaning messy data.”

In associating source data with messiness, a value-laden binary is created between data before pre-processing takes place and after. But, as we have already observed in the “Pre-processing Digitized Texts” lesson, our pre-processing actions do not necessarily leave the dataset improved: we may accidentally introduce new errors or make decisions to omit certain features from our analysis. As D’Ignazio and Klein ask in their book, Data Feminism:

[W]hat might be lost in the process of dominating and disciplining data? Whose perspectives might be lost in that process? And, conversely, whose perspectives might be additionally imposed? (131) [1]

Making data amenable to our analyses, then, can involve reducing intrinsic complexity and accepting other trade-offs; data pre-processing is therefore not the unequivocal good that the term “cleaning” might suggest.

D’Ignazio and Klein, citing the work of postcolonial science studies scholar Banu Subramaniam, go on to remind the reader that “certain core principles – like a generalized belief in the benefit of control and cleanliness” are inextricable from historical scientific discourses influenced by eugenics (132). Although they emphatically clarify that describing data pre-processing tasks as “cleaning” is not tantamount to perpetuating eugenics, they illustrate how certain ideas are “tidied up” in ways that efface their troubling origins.

The fraught binaries in the language of data pre-processing - disorder / order, messy / clean, unruly / wrangled - manifest in contemporary contexts as well. In “Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files,” Elvia Arroyo-Ramirez recounts her experience of trying to image floppy disks containing the records of Argentinian poet and human rights activist Juan Gelman. When filenames containing Spanish-language diacritics raised an “invalid encoding” error in the disk imaging software, she reached out to the digital preservation community for advice on how to approach the challenge. Many of the responses, she notes, used the language of needing to “clean” or “scrub” the diacritic glyphs and - in at least one instance - even referred to the diacritics as “illegal characters.”

Implicitly aligning diacritics with messiness or invalidity creates an “invisible default” that privileges English-language texts, reflecting the biases of the disk imaging software’s creators. The reference to “illegal characters” in particular, though likely unintentional on the part of the respondent, cannot help but invoke current debates around immigration in the United States - who belongs and who does not. Arroyo-Ramirez reveals her solution - which did not involve removing diacritics - and urges her readers to think critically about how these invisible defaults shape notions of what is possible:

Perceiving diacritics as a compromise on an ideal born-digital processing scenario, or thinking it is an acceptable practice to purge, sanitize, or cleanse file and folder names of them is representative of the amount of work the [archival] profession needs to do. [2]
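To see the technical stakes concretely, consider how diacritics are handled in Python. The sketch below is illustrative only - it is not Arroyo-Ramirez’s actual workflow, and the filename is invented - but it contrasts the destructive “scrubbing” approach some respondents suggested with a preservation-minded alternative: Unicode normalization, which makes differently-encoded forms of the same accented character consistent without deleting anything.

```python
import unicodedata

def strip_diacritics(filename):
    # The destructive "clean" / "scrub" approach: decompose each accented
    # character into a base letter plus combining marks (NFKD), then drop
    # everything that is not ASCII. Information is permanently lost.
    decomposed = unicodedata.normalize("NFKD", filename)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

def normalize_filename(filename):
    # A preservation-minded alternative: convert to a single canonical
    # Unicode form (NFC), so that an accented letter stored as one code
    # point and the same letter stored as base + combining accent compare
    # and sort identically - with no characters removed.
    return unicodedata.normalize("NFC", filename)

# Hypothetical filename for illustration:
name = "canción_inédita.txt"
print(strip_diacritics(name))    # "cancion_inedita.txt" - accents destroyed
print(normalize_filename(name))  # "canción_inédita.txt" - accents preserved
```

The design point mirrors the argument above: the “invalid encoding” error is a limitation of particular software defaults, not of the data itself, and the remedy need not involve purging the characters that triggered it.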

Data is the new oil

The powerful metaphors at work in the language of data science are not limited to the binary of messy / clean: you may also notice metaphorical language that aligns data with unprocessed natural resources like oil or ore - consider “data mining,” “raw data,” or the description of a data processing workflow as a “pipeline.”

In fact, data are never “raw” – they are always already determined by the choices we make in collecting and selecting them, which are in turn influenced by the social, political and historical forces that shape our work. Much like computational tools and processes themselves, data are situated and partial, one of many rather than authoritative.

While we may use the language of the discipline in order to communicate with other practitioners, it is important to consider how language is used to frame thinking about data in particular ways that are grounded in a colonial paradigm of extraction.

References

[1] D’Ignazio, C., & Klein, L. F. (2020). Data Feminism. The MIT Press. https://doi.org/10.7551/mitpress/11805.001.0001.

[2] Arroyo-Ramirez, E. (2016, October 30). Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files. Medium. https://medium.com/on-archivy/invisible-defaults-and-perceived-limitations-processing-the-juan-gelman-files-4187fdd36759.