Preparing La Gaceta for Text Mining
This update comes to us from Paige Morgan, Digital Humanities Librarian at the University of Miami Libraries, working with Head of Digitization Laura Capell and Digital Initiatives Metadata Librarian Elliot Williams.
The University of Miami Libraries recently completed the digitization of almost 50 years of the nineteenth-century Cuban newspaper La Gaceta de La Habana. The images are available in our digital collections, and a small team of library faculty is working to prepare the OCR’d text for researchers who might be interested in using text mining to explore the collection.
The OCR, though imperfect, is good enough to make distant reading possible. With nearly 50,000 page images, manually correcting OCR errors one at a time isn’t feasible, but we’re currently experimenting with AntConc to see whether we can identify recurring OCR errors that could be fixed throughout the whole collection. We’re also determining the best way to package the text files and their metadata so that users can download them in bulk, whether they want the whole fifty years or just a few months.
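For readers who prefer scripting to a concordance tool, the idea of finding recurring OCR errors and fixing them collection-wide can be sketched in Python. This is only an illustration under stated assumptions, not the workflow used at UM Libraries: the function names, the two sample pages, and the "Gaccta" correction are all hypothetical, and a real correction table would be built by inspecting the frequency list by hand.

```python
import re
from collections import Counter

# Matches runs of letters, including Spanish accented characters.
WORD_RE = re.compile(r"[^\W\d_]+")

def word_frequencies(texts):
    """Count lowercased word forms across an iterable of page texts."""
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts

def apply_corrections(text, corrections):
    """Replace whole-word matches using a hand-built {wrong: right} table.

    Matching is case-sensitive, so each spelling variant needs its own entry.
    """
    if not corrections:
        return text
    alternatives = "|".join(re.escape(wrong) for wrong in corrections)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: corrections[match.group(1)], text)

# Hypothetical sample data standing in for OCR'd page text.
pages = ["La Gaccta de la Habana", "Gaccta oficial del gobierno"]
freqs = word_frequencies(pages)

# Hypothetical correction table; real entries would come from reviewing
# suspicious forms in the frequency list.
fixed = [apply_corrections(page, {"Gaccta": "Gaceta"}) for page in pages]
```

Sorting `freqs` by count and scanning for frequent forms that are not real words is one way to spot candidates; whole-word matching keeps a correction from altering longer words that happen to contain the same letters.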
We’re excited about making this dataset available to the research community, and working with them to refine it further!
April 2017 Update: The La Gaceta dataset is now available at the UM Libraries Collections as Data repository.