Preparing La Gaceta for Text Mining


This update comes to us from Paige Morgan, Digital Humanities Librarian at University of Miami Libraries, working with Head of Digitization Laura Capell and Digital Initiatives Metadata Librarian Elliot Williams.


The University of Miami Libraries recently completed its digitization of almost 50 years of the nineteenth-century Cuban newspaper La Gaceta de La Habana. The page images are available in our digital collections, and a small team of library faculty is now preparing the OCR’d text for researchers who want to explore the collection through text mining.

The OCR, though imperfect, is good enough to make distant reading possible. With nearly 50,000 page images, manually correcting OCR errors one at a time isn’t feasible, so we’re experimenting with AntConc to see whether we can identify recurring OCR errors that could be fixed across the whole collection. We’re also determining the best way to package the text files and their metadata so that users can download them in bulk, whether they want the whole fifty years or just a few months.
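For readers curious what a collection-wide fix might look like once recurring errors have been identified, here is a minimal Python sketch. It is not the team's actual workflow: the correction pairs in `CORRECTIONS` are hypothetical examples rather than errors drawn from La Gaceta, and the function names and the one-text-file-per-page layout are assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical correction table: recurring OCR misreading -> intended word.
# These pairs are illustrative only, not taken from the collection.
CORRECTIONS = {
    "Habaua": "Habana",
    "Gohierno": "Gobierno",
}


def correct_text(text, corrections=CORRECTIONS):
    """Replace whole-word occurrences of known misreadings in one string."""
    if not corrections:
        return text
    # Try longer misreadings first so they win over shorter alternatives.
    words = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    )
    return pattern.sub(lambda m: corrections[m.group(0)], text)


def correct_directory(src_dir, dst_dir, corrections=CORRECTIONS):
    """Write corrected copies of every .txt file under src_dir into dst_dir.

    The originals are left untouched. Returns the number of files whose
    text changed.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    changed = 0
    for path in sorted(src.rglob("*.txt")):
        original = path.read_text(encoding="utf-8")
        fixed = correct_text(original, corrections)
        out = dst / path.relative_to(src)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(fixed, encoding="utf-8")
        if fixed != original:
            changed += 1
    return changed
```

Writing corrected copies to a separate directory keeps the raw OCR output available for comparison, which matters when a pattern that looks like an error turns out to be a legitimate period spelling.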

We’re excited about making this dataset available to the research community, and working with them to refine it further!


