Computational Approaches to Text Preparation and Analysis
Are you interested in textual analysis but unsure about where to start? Join us for an interactive “no experience required” introduction to the fundamental concepts, processes, and methods for preparing and analyzing text computationally. Following a general introduction to the topic, participants will be guided through prepared exercises that demonstrate how different software packages (OpenRefine, Python) can be used to prepare for and perform textual analysis.
Presented by Jay Brodeur (Associate Director, Digital Scholarship Infrastructure & Services and Administrative Director of the Sherman Centre for Digital Scholarship) and Devon Mordell (Educational Developer, The MacPherson Institute for Teaching and Learning).
Preparation
For this workshop, you will need OpenRefine and a web browser. Follow the instructions provided by Library Carpentry to install OpenRefine on your system (Windows, Mac, or Linux).
- NOTE: When opening OpenRefine for the first time on a Mac, you may need to open your security preferences and permit OpenRefine to run. See this article from Apple Support about opening a Mac app from an unidentified developer.
Contents
Segment | Time Allotted | Key Topics / Activities |
---|---|---|
Introductory remarks | 20 minutes | Introduction to text preparation and analysis; overview of concepts and methods; key considerations for different source materials and analyses |
OpenRefine | 40 minutes | Introduction to OpenRefine; manual cleanup (e.g. find and replace); faceting |
Getting Programmatic with Python | 20 minutes | Overview of programmatic approaches; the ‘what’ and ‘when’ to program; using Python for text preparation (Link to notebook) |
Break | 10 minutes | Break |
Sampling of text analysis methods | 75 minutes | Named entity recognition (Link to notebook); topic modeling (Link to notebook); sentiment analysis (Link to notebook) |
Q & A; Final Thoughts | 10 minutes | Questions and wrap-up; where to learn more |
Workshop notebooks
Most of our work will be done using Jupyter notebooks hosted in Google Colab.
- Introduction to programmatic text prep
- Named entity recognition (NER)
- Topic modeling
- Sentiment analysis
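For a rough sense of what programmatic text prep can look like before opening the notebooks, here is a minimal sketch using only Python's standard library. It is not taken from the workshop notebooks; the function name `prepare_text` and the sample sentence are hypothetical, and the steps shown (lowercasing, replacing ASCII punctuation, collapsing whitespace, splitting into tokens) are just one common starting point.

```python
import re
import string


def prepare_text(raw: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, collapse whitespace, and tokenize."""
    text = raw.lower()
    # Replace every ASCII punctuation character with a space
    punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(punct_to_space)
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Split into word tokens; an empty string yields an empty list
    return text.split(" ") if text else []


sample = "The  quick, brown fox -- it JUMPED!\n"  # hypothetical sample text
print(prepare_text(sample))
# ['the', 'quick', 'brown', 'fox', 'it', 'jumped']
```

Whether each of these steps is appropriate depends on the source material and the analysis: named entity recognition, for example, usually benefits from keeping capitalization and punctuation intact.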
Workshop recording
View the original here.
Workshop slides
Learn more
Here is a variety of helpful resources for further exploration.
OpenRefine
- Library Carpentry lesson on OpenRefine
- University of Toronto Libraries OpenRefine tutorials
- OpenRefine Manual on Regular Expressions
- Using regular expressions in OpenRefine: Tutorial by Peter Green, includes non-Latin script.
- Regular expression testers
- https://www.regular-expressions.info/
- https://regex101.com/
- Regexr: Interactive regular expression (regex) coder and explainer
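Patterns worked out in a regex tester can usually be carried over to Python's built-in `re` module, though the exact syntax differs slightly between regex flavours. As a small, hedged illustration (the function name `rejoin_hyphenated` and the sample string are hypothetical, not from the workshop materials), the sketch below rejoins words that were split by a hyphen at a line break, a common artifact in OCR'd text.

```python
import re


def rejoin_hyphenated(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line."""
    # (\w+)   the first half of the word
    # -\s*\n  a hyphen, optional trailing whitespace, then a line break
    # \s*     optional indentation on the next line
    # (\w+)   the second half of the word
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)


ocr_text = "The digiti-\nzation of news-\n papers"  # hypothetical OCR output
print(rejoin_hyphenated(ocr_text))
# The digitization of newspapers
```

Note that this pattern also joins genuine hyphenated compounds that happen to break across lines (for example, "well-" followed by "known" becomes "wellknown"), so results should be spot-checked against the source.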
Python & NLP
Python Integrated Development Environments
- There are many, many different Python IDEs. Find which one is best for you. Jay is partial to Pyzo.
Python packages for text prep and Natural Language Processing
- PyTesseract: Simple Python Optical Character Recognition
- spaCy NLP library and documentation
- NLTK NLP library and documentation
- natas: Library for processing historical English corpora, especially for studying neologisms
- Python phonetics package, which includes methods for matching and clustering words by phonetic similarity
- pyspellchecker: A simple Python-based spell checking algorithm
- BookNLP: A natural language processing pipeline that scales to books and other long documents (in English).
Other tutorials and resources
- Check out Devon Mordell’s two excellent text prep and analysis modules shared through the SCDS: Pre-Processing Digitized Texts and Named Entity Recognition.
- Constellate: A comprehensive set of resources for building your text and data mining skills.
- How to Clean Text for Machine Learning with Python. An excellent step-by-step walkthrough of the fundamentals of text prep with Python.
- Python Regex (Regular Expressions) for Data Scientists
- Cleaning OCR’d text with Regular Expressions by Laura Turner O’Hara for The Programming Historian.
- Natural Language Processing With Python’s NLTK Package: An excellent end-to-end tutorial using the nltk package
- Natural Language Processing with Python: Introduction. This is an excellent step-by-step introduction to basic pre-processing steps (though no clustering or error find/replace)
- Using Binder to connect GitHub repositories to Jupyter Notebooks