
Welcome to Pre-Processing Digitized Texts

Pre-Processing Digitized Texts belongs to a series of workshops on computational text analysis.

As humans, we tend to underestimate how easily we make sense of orthographic errors and alternative spellings like thcn or shew. Machines are far less capable of making these inferences, so OCR text output must often be corrected in the pre-processing stage of the text analysis pipeline to render it legible to computational methods.
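
To make this concrete, a first pass over OCR output might replace known misreadings using a lookup table. The sketch below is a simplified illustration rather than part of the workshop materials; the correction table and function name are assumptions, and it reuses the thcn and shew examples from above.

```python
# A minimal sketch of one correction approach: replacing known OCR
# confusions with a hypothetical hand-built lookup table.
import re

# Illustrative examples of a common OCR misreading and a period spelling.
OCR_CORRECTIONS = {
    "thcn": "then",   # 'e' misread as 'c'
    "shew": "show",   # archaic spelling; normalize only if your project calls for it
}

def correct_tokens(text: str) -> str:
    """Replace whole-word matches of known OCR errors in a passage of text."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        return OCR_CORRECTIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", fix, text)

print(correct_tokens("He wished to shew what happened thcn."))
# -> "He wished to show what happened then."
```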

In this workshop, we’ll use several approaches to correcting errors in OCR text output and discuss when to use each. We’ll also introduce the concepts of initial data analysis (IDA) and data provenance, and explore how some of the techniques used to correct OCR errors in digitized texts can be extended to pre-processing born-digital texts.
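
As a small taste of what initial data analysis can look like here, the sketch below (again, not from the workshop materials) estimates how noisy a passage of OCR output is by counting tokens that do not appear in a reference wordlist. The function name and the toy wordlist are illustrative assumptions; a real run would load a full dictionary.

```python
# A rough illustration of initial data analysis on OCR output: estimating
# error density by counting tokens absent from a reference wordlist.
import re
from collections import Counter

def token_report(text: str, wordlist: set[str]) -> dict:
    """Summarize how many tokens look like plausible words vs. likely OCR noise."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    unknown = [t for t in tokens if t not in wordlist]
    return {
        "total_tokens": len(tokens),
        "unknown_tokens": len(unknown),
        "unknown_share": len(unknown) / len(tokens) if tokens else 0.0,
        "most_common_unknown": Counter(unknown).most_common(10),
    }

# Example usage with a toy wordlist.
sample = "He wished to shew what happened thcn."
wordlist = {"he", "wished", "to", "show", "what", "happened", "then"}
print(token_report(sample, wordlist))
```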

Learning outcomes

By the end of the workshop, you will be able to:

  • Perform initial data analysis on OCR text output
  • Explain the importance of data provenance
  • Apply computational techniques to correct common OCR errors
  • Identify an appropriate data pre-processing approach

Workshop duration

Working through the workshop from start to finish (though you need not do so!) will take approximately 2 to 3 hours, depending on your familiarity with OpenRefine and/or Python and whether you are working with your own dataset alongside the sample corpus.


Next → Preparation