Collection Assessment in a Collaborative Environment: BHL and DPLA

Session Type: Working Session

Session Description
The Biodiversity Heritage Library (BHL) and the Digital Public Library of America (DPLA) work collaboratively with their partners to make scientific and culturally significant resources openly available to the world. The BHL (with a core consortium of fifteen libraries and hundreds of contributors) and the DPLA (with seventeen partners, including the BHL, who themselves represent over 600 libraries, archives, museums, and historical societies) continue to grow within the US and internationally.

Challenges:
The BHL collection consists of more than 60,000 scanned, full-text, biodiversity-related book and journal titles with attached OCR, and the DPLA aggregates nearly 3 million records from diverse catalogs, local databases, and multiple metadata formats. Because of the variety of metadata, deep analysis of these data sets has proved to be a significant challenge. Simple questions like “how many entomology or illustration records are there?” are difficult to answer, as not all records include the exact term in the subject headings, the genre descriptions, or, in fact, anywhere in the record.

Especially challenging for the BHL is that Library of Congress Subject Headings (LCSH) seldom connect directly with the biological taxonomies behind the literature, so an analysis of the collection’s strengths and gaps based on those headings does not match the needs of its core user group of biodiversity scholars. Likewise, the DPLA brings together subject headings, genre terms, and format descriptions from a variety of specialized thesauri and controlled vocabularies, both widely accepted standards and local constructs, which complicates true subject analysis and any use beyond keyword search.

Needs:
Both the BHL and the DPLA are interested in developing a detailed view of each collection’s subject coverage in order to locate gaps, identify additional materials for digitization, and connect with specific audiences and new partners that can fill these content voids. In addition, the DPLA is interested in automatically identifying the controlled vocabularies or thesauri behind subject terms (or genres, formats, etc.) that arrive from its partners without a URI, pointer, or other indicator.

The two organizations are interested in discussing possible applications that might address some of these challenges:

  • Visualization tools to drill down from broad terms (e.g., Trees–North America) to more specific terms (e.g., Pinus banksiana) using thesauri specific to biodiversity and connecting them to the LCSH hierarchies.
  • A vocabulary identification tool that might “lob” subject headings at open controlled vocabularies to associate terms and grab URIs.
  • Other ideas proposed by session attendees
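
As a starting point for the hackathon, the vocabulary identification idea might be sketched roughly as follows. This is only an illustration under stated assumptions: the two sample vocabularies, their names, and the example.org URIs are hypothetical placeholders rather than real LCSH or taxonomic identifiers, and a working tool would load published vocabulary data and probably need fuzzier matching than the exact lookup shown here.

```python
import re
import unicodedata

# Hypothetical in-memory vocabularies: normalized label -> placeholder URI.
# The names and example.org URIs are illustrative assumptions only.
VOCABULARIES = {
    "lcsh-sample": {
        "entomology": "http://example.org/lcsh/entomology",
        "trees--north america": "http://example.org/lcsh/trees-north-america",
    },
    "taxa-sample": {
        "pinus banksiana": "http://example.org/taxa/pinus-banksiana",
    },
}


def normalize(heading):
    """Reduce a subject heading to a comparable form."""
    text = unicodedata.normalize("NFKC", heading).casefold()
    # Treat a double hyphen, en dash, or em dash as the subdivision separator.
    text = re.sub(r"\s*(?:--|\u2013|\u2014)\s*", "--", text)
    # Collapse runs of whitespace, then trim spaces and trailing periods.
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .")


def identify(heading, vocabularies=VOCABULARIES):
    """Return (vocabulary name, URI) pairs whose labels match the heading exactly."""
    key = normalize(heading)
    return [
        (name, terms[key])
        for name, terms in vocabularies.items()
        if key in terms
    ]


if __name__ == "__main__":
    for heading in ["Trees–North America", "  Pinus  banksiana. ", "Beetles"]:
        print(repr(heading), identify(heading))
```

A heading that matches no vocabulary returns an empty list, which is itself useful for assessment: the unmatched headings are the ones that would need manual review or a broader set of vocabularies.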

The working session will start with brief presentations from BHL and DPLA representatives, followed by a discussion of common challenges. The last two hours will provide time for a mini “hackathon” to experiment with subject heading datasets and conceptualize prototypes for potential tools to satisfy collection assessment challenges highlighted by the discussion.

Session Leaders
Constance Rinaldo, Harvard University
Mark Phillips, University of North Texas

View the community reporting Google doc for this session.