Notes from HathiTrust UnCamp 2015
It’s a funny conundrum. Many of us work at dynamic institutions of higher education, where on any given day there are dozens of relevant and provocative programs by faculty and students doing cutting-edge work. And yet, it’s somehow easy to pass up these opportunities when they happen on your own campus because we’ve got work to do back in the office. The HathiTrust Research Center UnCamp 2015, held on my own campus, had been on my calendar for weeks, yet I was still tempted to head to the office last Monday morning to “get work done.” I’m so glad I ignored my better impulse.
The HTRC UnCamp (#HTRC15) was held in Ann Arbor on March 30 and 31, 2015, hosted by the University of Michigan, University of Illinois, and Indiana University. The program featured a diverse range of perspectives, ideas, demos, and poster sessions all featuring non-consumptive uses of HathiTrust. Some context: the HTRC is a collaboration between Indiana University and University of Illinois Urbana-Champaign. It was founded in 2011 to facilitate computational research, scholarship, education, and invention across the body of material in HathiTrust.
Mike Furlough, Executive Director for HathiTrust, kicked off the program with opening remarks, commenting on the deeply member-driven, collaborative nature of HathiTrust and the HTRC. As an example, he noted the Copyright Review Management System. CRMS is an effort to systematically make copyright determinations for works in HathiTrust, involving over 50 individual reviewers at partner institutions and supported by a series of IMLS National Leadership grants. (Full disclosure: I am the PI for that grant, so I was particularly pleased that copyright work featured in Mike’s remarks.) This collective attitude is key to ensuring access to knowledge.
The HTRC enables new research, and thus new knowledge. Mike discussed how it is fundamental to HathiTrust to be actively engaged with researchers, helping them discover new forms of knowledge in new disciplines in ways not possible 10 years ago. One example: the HathiTrust Bookworm, a very easy-to-use tool for searching for trends over thousands of texts, which is being developed with funding from the National Endowment for the Humanities. The keynote talk was evidence of this kind of newly possible exploration. In her keynote, Michelle Alexopoulos, Professor in Economics, University of Toronto, discussed her use of the HTRC to examine new measures of technical change and what library collections tell us about measuring and predicting innovation. She looked at patents and research and development to see if there were identifiable patterns in knowledge production that could inform strategies for achieving better returns from investment. She reasoned that if R&D is viewed as an input into knowledge and patents are viewed as an output, then patterns in knowledge production might correspond to patterns in journal publication.
However, analysis of the data showed that patent applications were not reliable predictors of when innovations will be commercialized. She found only a weak relationship, indicating that journal articles were not predictive of the relationship between investment and patents. The use of patents also meant this approach could not be used to identify managerial and process innovations. However, she did find that books and book metadata could be used to identify patterns of rapid innovation in multiple fields and, in turn, promising areas for investment. In the context of the UnCamp, Michelle’s work demonstrates the value of HathiTrust for research that tracks the diffusion of innovation across groups and disciplines, and the significance of metadata like copyright and publication dates.
Beth Plale, co-director of the HTRC, gave an update on the HTRC, introducing the Secure HathiTrust Analytical Research Commons, or SHARC (“like the predator”). She also covered the HTRC Portal, which facilitates research over public domain volumes, and the HTRC Sandbox, which allows for experimentation like the HathiTrust Bookworm. There was a brief discussion of the need for secured access to copyrighted content for non-consumptive research, featuring the HTRC Data Capsule, produced with support from the Sloan Foundation.
Paul Conway, Associate Professor of Information, University of Michigan, introduced a panel that showcased examples of research and instruction using HathiTrust or inspiring ideas for possible uses. Paul noted the value of the HASTAC conference in 2011 in building a case for digital humanities work at the University of Michigan and laying important groundwork for exploring connections for the library with faculty. I’ll just mention two of the panelists here. Liz Rodrigues, PhD Candidate in English, University of Michigan, discussed her use of HathiTrust to examine the notion of a “typical immigrant narrative” using a data-driven approach. She was able to use the HathiTrust Portal to search early 20th century autobiography for patterns, then do closer reads of these handily public domain works. She found that the typical immigrant narrative is actually diverse and not at all typical. Liz also discussed ways she examined language of sentiment as a proxy for plot.
Steve Abney, Associate Professor of Linguistics, University of Michigan, shared thoughts on the idea of HathiTrust as a comprehensive multilingual data set and how it could be a “Human Genome project for language.” For this kind of research, there is a desire for plain text in large amounts (ideally a billion or trillion words) along with an interest in a diversity of languages beyond the top 20 commercial languages. This kind of resource would build toward artificial intelligence. Steve noted that there are in the range of 6800 languages, and we don’t have plain text in most of them. He also noted that where plain text exists, OCR is a challenge, and that crowdsourcing was of limited use for the more obscure languages because of the smaller pool of people available to help. (He also mentioned the Linguistic Data Consortium as a source for digital libraries, but they emphasize languages with commercial potential: https://www.ldc.upenn.edu/)
You can lead a horse to water…Instructors Session
One theme that came up in formal and informal conversations was how to promote the use of the HTRC and library-based digital humanities resources and expertise generally. Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning and Policy and Associate Dean of Libraries, Professor of Library Administration, University of Illinois at Urbana-Champaign, discussed UIUC’s Savvy Research Series, which integrates guidance around concerns like copyright and establishing a scholarly identity. These sessions are held in the library in their Scholarly Commons. She noted that among the things HathiTrust is funding is support for the development of services offered through partner libraries for scholars, looking at ways of accessing HTRC services. Beth also complimented a weekend digital humanities program held by Harriet Green, English and Digital Humanities Librarian, and Sayan Bhattacharyya, Postdoctoral Researcher at the University of Illinois. The weekend program built a high level of engagement that led to specialized opportunities to learn and use the training offered in their library, like their Introduction to the Hathi Trust Research Center Portal for Text Mining Research. “Grab them intellectually first.”
When you can’t be everywhere at once…
The great thing about an UnCamp is the diversity of conversation. The problem is that you can’t be everywhere at the same time, though we’ve got that covered with Twitter (#HTRC15). So I got some additional insights for this post from colleagues at the University of Michigan Library: Alix Keener, Digital Scholarship Librarian, and Rick Adler, Project Manager, CRMS.
Alix Keener: Thoughts on Day 2
The second day of the UnCamp was the more “un-conference-y” of the two. We planned breakout sessions based on three themes: communication and instruction, bibliographic data enhancement/extraction, and improving knowledge about the corpus. All sessions carried the theme of improved access to HTRC for newcomers, but because the “communication/instruction” group centered around just that—and also consisted of mainly librarians—the focus was on supporting researchers.
Rick said that the “managing metadata session” was great, considering what it would take to make HathiTrust more searchable. There was some discussion of expanding the metadata provided for HT works and adding things like word count and page count into the HathiTrust index so users can use them to search for works in HT. Participants also discussed strategies for updating and correcting bibliographic metadata in the HathiTrust catalog, highlighting that the responsibility for HathiTrust records is widely distributed across the contributing members.
The group agreed that there is a dearth of how-to tutorials. As researchers ourselves, we know that blogging is essential to those engaged with digital scholarship; it helps to read about others’ experiences with digital tools, to solicit feedback from researchers, and to share syllabi. Thus, the idea of an HTRC “cookbook” was born. Similar to Programming Historian, a resource with step-by-step tutorials on using HTRC tools would be useful both for research and for exercises in workshops and instruction.
Rick on the closing keynote
Erez Lieberman Aiden, Assistant Professor in the Department of Genetics at the Baylor College of Medicine, gave the closing keynote, titled “The Once & Future Past: Where Historical Record Digitization Has Been, & Where It Is Going.” Aiden, a partner on the HTRC Bookworm grant, asked what one can do in a library: you can either read a few books closely or read all the books not very closely. Working with his team and five million Google scans, Aiden used computational methods to get quantitative insights into human culture, examining how culture and language change. Because they were unable to release the full text of their datasets, Aiden came to the idea of n-grams, which can show cultural trends and which led to the creation of the Google Ngram Viewer. Similar to Alexopoulos’ opening keynote, Aiden demonstrated the use of book data in HathiTrust to track collective memory, such as the speed of technology adoption via the appearance of words in a corpus. He also spoke of how text analysis isn’t just for historical research, but can also promote free discourse today; it could potentially automate censorship detection or even predict future trends.
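For readers who haven’t worked with n-grams, here is a rough sketch of the basic idea in Python: count how often a word or phrase appears in each year’s texts, relative to everything published that year. This is only an illustration, not the method behind the Google Ngram Viewer or any HTRC tool. The function names and the tiny corpus are invented for this post, and real work would need proper tokenization and far more text.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Relative frequency of `phrase` per year.

    corpus: iterable of (year, text) pairs. Here it is a toy stand-in
    for a digitized collection, not real HathiTrust or Google data.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = defaultdict(int)    # occurrences of the phrase, by year
    totals = defaultdict(int)  # all n-grams of the same length, by year
    for year, text in corpus:
        # Naive whitespace tokenization; punctuation is not handled.
        grams = ngrams(text.lower().split(), n)
        hits[year] += Counter(grams)[target]
        totals[year] += len(grams)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Invented sentences, purely for illustration.
TOY_CORPUS = [
    (1900, "the telegraph office sent the message"),
    (1900, "a letter came by post"),
    (1950, "the telephone rang and the telephone rang again"),
    (2000, "she sent an email and then used the telephone"),
]

if __name__ == "__main__":
    for year, freq in yearly_frequency(TOY_CORPUS, "telephone").items():
        print(year, round(freq, 3))
```

Dividing by the yearly total matters because the number of books published changes over time; raw counts alone would mostly track the size of the collection rather than the rise or fall of a word.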
All in all
The UnCamp was an amazing opportunity to learn more about the HTRC and think about ways to collaborate with a truly interdisciplinary crowd. HathiTrust embodies incredible promise for preservation and discovery. I’d happily go off-campus for next year’s gathering.