Background photo image courtesy Ed Robertson via unsplash

Computational Text Analyses Bootcamp

Do you have a textual analysis project that you are trying to get off the ground? Or are you simply interested in learning more about how to analyze texts with computers? Join us for an intensive - but fun! - bootcamp at the Sherman Centre with opportunities to work on your own documents or a sample corpus if you just want to practice the techniques.

Through hands-on exercises, we will introduce you to the fundamental concepts, processes, and methodological approaches for preparing and analyzing text computationally. We’ll show you how to use tools like OpenRefine and Python for text preparation and introduce analytic techniques including named entity recognition (NER), topic modeling, sentiment analysis, and stylometry. Participants are not expected to have any prior knowledge of text preparation and analysis, but experience with Python is an asset. Participants will be given an opportunity to complete exercises in advance of the workshop to build basic competency.

This is an in-person event and open to all who are able to travel to the Sherman Centre, which is accessibly located on the first floor of Mills Library at McMaster University.

Workshop Preparation

In this workshop, we will use the following tools and platforms:

  • Google Colab, which requires a Google account. If this poses a challenge, please reach out to the Sherman Centre for alternative arrangements.
  • OpenRefine: Download and install prior to the first session.
  • (optional) Constellate.org, which is available to all McMaster members, as well as members of other institutions. If you do not have access, please contact the Sherman Centre for alternative arrangements.

Facilitator Bios

Devon Mordell is an Educational Developer at The MacPherson Institute for Teaching and Learning. Devon draws on her experience in media art, hobbyist programming and instructional design to teach workshops for the Sherman Centre. Her areas of interest in digital scholarship include data visualization, computational analyses of texts, sonification and critical digital humanities. Her research practice explores the algorithmic culture industry and platform psychogeography.

Jay Brodeur (he/him) is the Director of Digital Scholarship Infrastructure & Services and the Administrative Director of the Sherman Centre for Digital Scholarship. Jay has years of experience working with data in a wide variety of formats and interdisciplinary contexts. A scientist by training with a PhD in Earth and Environmental Sciences, he’s comfortable working and advising on all kinds of data-related activities, ranging from data wrangling and integration to analysis and mapping to research data management. Jay’s also keenly interested in the application of digital approaches to support experiential learning opportunities within and outside of the classroom.

Subhanya Sivajothy (she/her) brings a background of research in data justice, science and technology studies, and environmental humanities. She is currently thinking through participatory data design that allows for visualizations that empower the end user. She also has experience in Research Data Management, particularly data cleaning and curation. Do not hesitate to reach out to her if you would like to talk more about data analysis and visualization as they evolve throughout the research process. Contact Subhanya at sivajos@mcmaster.ca.

Workshop Materials and Preparation

All files for the bootcamp are available in this shared Google Drive folder (u.mcmaster.ca/cta-bootcamp). Download the contents to your local computer and unzip them, AND copy the contents into your own Google Drive before beginning the exercises.
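
Once the files are copied into your own Google Drive, a notebook needs to be able to find them. The sketch below is one possible way to do that, not the bootcamp's official setup: it mounts Google Drive when run inside Google Colab and falls back to the current working directory elsewhere. The folder name `cta-bootcamp` and the helper `find_bootcamp_folder` are hypothetical; substitute whatever you named your copy.

```python
from pathlib import Path

# Hypothetical folder name: use whatever you called your copy of the files.
FOLDER_NAME = "cta-bootcamp"


def find_bootcamp_folder(folder_name: str = FOLDER_NAME) -> Path:
    """Return the expected location of the bootcamp files.

    Inside Google Colab, mount Google Drive and look under "MyDrive".
    Anywhere else, look in the current working directory.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return Path.cwd() / folder_name
    drive.mount("/content/drive")  # asks for authorization in Colab
    return Path("/content/drive/MyDrive") / folder_name


folder = find_bootcamp_folder()
print(folder, "exists" if folder.exists() else "not found")
```

If the folder is reported as not found, check that the copy sits at the top level of your Drive (or adjust the path to match where you placed it).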

Contents

Day 1

Time: 0930 - 1600
The Jupyter notebook name for each exercise is indicated below.
View/download slides

Segment | Time Allotted | Key Topics / Activities
Introductory remarks | 20 minutes | Introduction to text preparation and analysis; Overview of concepts and methods
Text preparation | 120 minutes | Text prep with OpenRefine; Building workflows with Python (CTA-Bootcamp-2024-python-prep.ipynb)
Lunch (1200 - 1300) | 60 minutes | Lunch
Text Analysis | 180 minutes | Named Entity Recognition [45 mins] (CTA-Bootcamp-2024-NER.ipynb); Sentiment Analysis [45 mins] (CTA-Bootcamp-2024-SA.ipynb); Topic Modeling [45 mins] (CTA-Bootcamp-2024-TM.ipynb); Stylometry [45 mins] (CTA_Bootcamp_2024_stylometry.ipynb)
Wrap up | 10 minutes | Recap & thinking about day 2 projects

Day 2

Time: 0930 - 1600
View/download slides

Segment | Time Allotted | Key Topics / Activities
Corpora Selection | 30 minutes | Sources and types; Key considerations for different source materials and analyses; Case studies
Visualization for Dissemination | 75 minutes | Core concepts; Visualization types; Hands-on exercises
Working Period | 75 minutes | Work on your own data or a pre-selected project
Lunch (1230 - 1330) | 60 minutes | Lunch
Working Period | 120 minutes | Continue project work
Share Back, Closing Comments | 30 minutes | Share your work; Questions and wrap-up

Here are a variety of helpful resources to explore and learn more.

Natural Language Processing Training and Resources

Constellate (NLP training and analysis)

Constellate is a platform for learning and performing text analysis, supported by JSTOR Labs and ITHAKA. McMaster members can access tutorials, digitized materials, and an integrated Python notebook environment by registering with their McMaster email address.

Constellate provides:

  • A comprehensive set of interactive Jupyter Notebook-based tutorials for text analysis, shared via GitHub under a CC-BY license.
  • Analytical access to content from 35+ million articles, books, and newspapers from JSTOR, Portico, Chronicling America, etc.
  • A computational platform to develop notebooks and collect, create, analyze, and store data (available to members of McMaster and other subscribing institutions).
  • Access to advanced support (available to members of McMaster and other subscribing institutions).

To access the features of the pedagogy package (McMaster members):

  1. Sign up for an account using your McMaster email.
  2. Follow the instructions to verify your account.
  3. Log in via https://constellate.org/login.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) enables computational analysis of works in the HathiTrust Digital Library (HTDL) to facilitate non-profit research and educational uses of the collection. McMaster members have access to the HTRC services, materials, and training through the institution’s membership in the HathiTrust.

To access the HTRC:

  1. Sign in at https://analytics.hathitrust.org/.
  2. Select McMaster University as your institution and choose Continue.
  3. Follow through McMaster’s Single Sign On process, if prompted.
  4. When prompted, check your email for a confirmation link and confirm your account.

Other tutorials and resources

OpenRefine

Regular Expressions

Python

Python Integrated Development Environments

  • There are many, many different Python IDEs. Find which one is best for you. Jay is partial to Pyzo.

Python packages for text prep and Natural Language Processing

  • PyTesseract: Simple Python Optical Character Recognition
  • spaCy NLP library and documentation
  • NLTK NLP library and documentation
  • natas: Library for processing historical English corpora, especially for studying neologisms
  • Python phonetics package, which includes methods for matching and clustering words by phonetic similarity
  • pyspellchecker: A simple Python-based spell checking algorithm
  • BookNLP: A natural language processing pipeline that scales to books and other long documents (in English).
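
Before reaching for the packages above, it can help to see what a basic text preparation step looks like with the Python standard library alone. The following is a minimal sketch, not taken from the bootcamp notebooks: it lowercases a string, extracts word tokens with a regular expression, drops a few common words, and counts what remains. The `prepare` function, the tiny stopword list, and the sample sentence are all illustrative assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real projects would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def prepare(text: str) -> list[str]:
    """Lowercase the text, extract word tokens, and drop stopwords."""
    # Match runs of letters, optionally followed by an apostrophe suffix
    # (so "captain's" stays a single token).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sample = "The whale, the WHALE! It is the captain's whale."
tokens = prepare(sample)
print(tokens)                          # ['whale', 'whale', "captain's", 'whale']
print(Counter(tokens).most_common(1))  # [('whale', 3)]
```

Libraries such as spaCy and NLTK provide more robust tokenization and stopword handling; this sketch only shows the shape of the workflow.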