
Congratulations

Congratulations! You’ve just finished this workshop.

You should now be able to:

  • Perform initial data analysis on OCR text output
  • Explain the importance of data provenance
  • Apply computational techniques to correct common OCR errors
  • Identify an appropriate data pre-processing approach

Additional Resources

To learn more about any particular topic, take a look at the links below.

OpenRefine

As we are using OpenRefine for our pre-processing tasks, a better grasp of OpenRefine will assist your error correction efforts. There are numerous tutorials available; the Library Carpentry workshop on OpenRefine will reinforce your learning and go into greater depth on some topics (though in a more general context, i.e. not specific to OCR error correction).

You may also wish to refer to the documentation for OpenRefine to really dive into what’s possible with the tool.

Regular Expressions

Likewise, since one of the major OCR error correction strategies discussed involves using regular expressions (RegEx), a strong grasp of RegEx will help you make the most of OpenRefine. In addition to the resources listed on the “Correcting OCR Errors with OpenRefine: Strategies” page, you can test your RegEx patterns interactively with Regular Expressions 101 or RegExr.
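If you would rather test a pattern in a script than in a browser, the following is a minimal sketch in Python (not OpenRefine's GREL, which this workshop uses). The pattern, the function name `rejoin_hyphenated`, and the sample text are all illustrative assumptions rather than material from the workshop; the example targets one common OCR artefact, words split by a hyphen at a line break.

```python
import re

# Hypothetical pattern: a word split by a hyphen at a line break,
# e.g. "correc-\ntion". Both halves are captured so they can be rejoined.
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def rejoin_hyphenated(text: str) -> str:
    """Join word halves that were split across a line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


# Made-up sample text for trying the pattern out.
sample = "OCR error correc-\ntion is itera-\n tive."
print(rejoin_hyphenated(sample))
```

Note that a pattern like this will also merge genuinely hyphenated compounds that happen to fall at a line break, so it is worth checking the matches against your own data before applying it in bulk.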

Critical Data Studies

If “Behind the Interface” piqued your curiosity about how language frames our understanding of data, you may be interested in the following texts:

  • Benjamin, Ruha. Race after technology: Abolitionist tools for the new Jim code. Polity, 2019.
  • Boyd, Danah, and Kate Crawford. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.” Information, communication & society 15.5 (2012): 662-679.
  • Fordyce, Robbie, and Suneel Jethani. “Critical data provenance as a methodology for studying how language conceals data ethics.” Continuum 35.5 (2021): 775-787.
  • Gitelman, Lisa, ed. “Raw data” is an oxymoron. MIT press, 2013.