A Survey of Digital Library Aggregation Services
Martha L. Brogan
Digital Library Federation Council on Library and Information Resources
Washington, DC.
2003
About the Author
Martha Brogan is an independent library consultant with two decades of experience in academic libraries. Ms. Brogan has served
as Associate Dean of Libraries and Director of Collection Development at Indiana University-Bloomington; as a social sciences
librarian at Yale University; and as a western European library specialist and assistant to the Provost and Vice President
of Academic Affairs at the University of Minnesota. In 2001, Ms. Brogan was a Fellow in the Frye Leadership Institute sponsored
by the Council on Library & Information Resources, EDUCAUSE and Emory University.
Based on a survey of Digital Library Aggregation Services conducted in summer 2003.
Contents
- 1.0 Executive Summary
- 2.0 Charge
- 3.0 Methodology
- 4.0 Survey Overview
- 4.1 Organizational Affiliation
- 4.2 Subject Coverage
- 4.3 Function
- 4.4 Audience
- 4.5 Status
- 4.6 Size
- Table 1: Overview of Sites Surveyed (see Appendix 2)
- 5.0 Identifying Clusters by Function
- 5.1 Challenges to categorization
- 5.2 Categories
- 5.2.1 Open Access E-Print Archives and Servers
- 5.2.2 Cross Archive Search Services and OAI Aggregators
- 5.2.3 From Digital Collections to Digital Library Environments
- 5.2.4 From Peer-Reviewed Referratories to Portal Services
- 5.2.5 Specialized Search Engines
- Table 2: Overview of Core Functions and Services
- 6.0 Comparative Review by Function
- 6.1 Open Access E-Print Archives and Servers
- 6.1.1 Physics: ArXiv
- 6.1.2 Technical Reports: NASA Technical Reports Server
- 6.1.3 Voluntary Publisher-Based Journal Archive: PubMed Central
- 6.1.4 Summary of Issues
- Table 2: NTRS Contents
- 6.2 Cross-Archive Search Services and Aggregators
- 6.2.1 General OAI Service Providers: Arc, OAIster, Cyclades
- 6.2.2 Community-Based Aggregators
- Theses & Dissertations: NDLTD Union Catalogs
- Languages: OLAC
- Sheet Music: Sheet Music Consortium
- 6.2.3 Subject-Based Aggregators
- Cultural Heritage: UIUC Digital Gateway to Cultural Heritage Materials
- Sciences: Grainger Engineering Library at UIUC, Citebase, Archon
- 6.2.4 Summary of Issues
- 6.3 From Digital Collections to Digital Library Environments
- 6.3.1 Cultural Heritage: American Memory, Heritage Colorado
- 6.3.2 Humanities: The Perseus Digital Library
- 6.3.3 Sciences: National Science Digital Library
- Federation: SMETE Digital Library
- K-12 Teacher Support: ENC Online
- Biology Node: BEN
- Geosciences Node: DLESE
- 6.3.4 Summary of Issues
- 6.4 From Peer-Reviewed Referratories to Portal Services
- 6.4.1 Peer-Reviewed Learning Resources: MERLOT
- 6.4.2 Expert & Machine-Gathered Internet Resources:
- All Disciplines: INFOMINE
- Disciplinary Hubs: UK's Subject Portals
- 6.4.3 Scholar-Designed Portal: AmericanSouth
- 6.4.4 Research Library Portals
- U.S.: ARL Scholars Portal
- Australia: AARLIN Scholars Portal
- 6.4.5 Summary of Issues
- 6.5 Specialized Search Engines
- 6.5.1 Sciences
- LANL's Federated Search Engine: Flashpoint
- Computer Science Web Crawler: CiteSeer
- Elsevier's Web Crawler: Scirus
- 6.5.2 Summary of Issues
- 7.0 Conclusions
- 7.1 Current Practice
- 7.2 Future Directions
- 7.2.1 More Attention to Users and Uses
- 7.2.2 Finding Solutions to Digital Rights Management and Digital Content Preservation
- 7.2.3 Building Personal Libraries and Collaborative Work Spaces
- 7.2.4 Putting "Digital Libraries in the Classroom" and Digital Objects in the Curriculum
- 7.2.5 Promoting Excellence
- 8.0 Major Web Sites Cited
- 9.0 Bibliography of Cited Works and Further Reading
- Appendix 1: Scope Notes
- Appendix 2: Table 1
Acknowledgements
The author wishes to acknowledge the exchange of ideas and invaluable feedback that she received from Jian Liu, Reference
Librarian, Indiana University, in the early formulation of this study.
1.0 Executive Summary
This report provides an overview of a diverse set of more than thirty digital library aggregation services, organizes them
into functional clusters, and then evaluates them more fully from the perspective of an informed user. Most of the services
under review rely wholly or partially on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), although some of them predate its inception and a few use predominantly Z39.50 protocols. In the opening section
of this report, each service is annotated with its organizational affiliation, subject coverage, function, audience, status,
and size. Critical issues surrounding each of these elements are presented in order to provide the reader with an appreciation
of the nuances inherent in seemingly straightforward factual information, such as "audience" or "size." Each service is then
grouped into one of five functional clusters:
- open access e-print archives and servers;
- cross-archive search services and aggregators;
- from digital collections to digital library environments;
- from peer-reviewed "referratories" to portal services;
- specialized search engines.
After a brief discussion of difficulties in attempts at categorization, each cluster is discussed at greater length through
a closer examination of the purpose and functionality of individual services. A summary of overarching issues is provided
for each cluster along with observations about disciplinary or national differences. The report concludes with observations
about current practices and future directions. A list of major Web sites cited, a bibliography of cited works and further
reading, and an appendix with scope notes round out the report.
The services under review are evolving and improving quickly (many are experimental or under development), so any attempt to
describe or evaluate them must be undertaken with caution. The report is best viewed as a snapshot at a particular point in
time, seen through the lens of an informed user looking at a moving target.
Most of the published literature is project-specific and authored by those involved in developing and implementing the service.
The 2003 special issue of Library Hi Tech focusing on the Open Archives Initiative merits special attention as an excellent state-of-the-art review of significant
successes and challenges in creating OAI aggregators written by principal investigators [Cole 2003a]. The European Union's
Open Archives Forum survey is exceptional in its effort to review broadly the organizational and technical characteristics of its members' archives
[Dobratz and Matthaei 2003]. Meanwhile, the papers from the June 2003 "Wave of the Future: NSF Post-Digital Library Futures
Workshop" give a fascinating picture of the challenges ahead [NSF 2003].
This report offers preliminary observations and points to future comparative studies, both broad-based and focused, that are
necessary to sharpen and deepen our understanding of digital library aggregation services. Overall, it finds reason for optimism
about open archives initiatives, especially given the relative youth of the OAI-PMH. However, it also points to the lack of
information that users have about these services and their lack of knowledge about how they fit into the larger landscape
of information seeking, resource discovery, and scholarly collaboration.
Many of the services are still in their first stage of development (collection and constituency building), where a primary concern
is to increase the size of their holdings to achieve a critical mass, while continuing to assure quality control. As a second
stage, some are beginning to provide coherent pathways through vast quantities of information by offering personalization
and customization services. Most still have a long way to go in building extended services such as systems of annotation and
collaboration. There is growing attention to a third phase of development, which is based on more flexible approaches to re-purposing
resources for varied audiences and uses. Protocols for digital rights management and reliable digital preservation solutions
will help to assure that these services reach their full potential.
2.0 Charge
This report, commissioned by the Digital Library Federation (DLF), reviews digital library aggregation services typified by
Open Archives Initiative sites such as the National Science Digital Library (NSDL) or OAIster. The survey relied on a core list of 28 online digital libraries, federations, and OAI services provided to me by the DLF.
The original annotated list of Web sites was arranged into four broad categories:
- Science, Technology and Medicine;
- Cross-Discipline;
- Humanities;
- Open Archive Initiative services-General.
As outlined in the section on "Scope Notes" (Appendix 1), I refined this list by removing some services and adding others.
More specifically, I was charged to evaluate these services based on their type, size, and function, while addressing the
following questions:
- Do they cluster into sub-groups by function as well as by discipline?
- What broadly characterizes their scope and operation?
- What range of audiences do they purport to serve? How successful are they, in your opinion and in the opinion of any prior
published assessments?
- What characterizes the experience of using these sites?
- Are there distinct differences in approach according to the discipline or nation that has produced the service?
3.0 Methodology
I conducted the review during July and August 2003, relying primarily on perusal of the Web sites, sample searches, and follow-up
e-mail correspondence with many of the service providers. In addition, phone interviews were conducted with Los Alamos National
Laboratory's (LANL) librarian regarding Flashpoint, the National Science Digital Library's (NSDL) communication director and the director of collection development, and ARL's
Scholars Portal project manager. Site visits were made to OCLC in Dublin, Ohio and to the ENC Online (Eisenhower National
Clearinghouse), headquartered at Ohio State University.
Due to the constraints of time and the diversity of services represented, a formal survey or questionnaire was not devised,
although the review was informed by selected literature about digitized collections, subject gateways, portals, digital libraries,
and open archives, especially in the broad areas of selection criteria and best practices; evaluation schemes; and, most problematic
of all, conceptual or organizational frameworks.
The European Union's Open Archives Forum's survey of "Open Archives Activities and Experiences in Europe" [Dobratz and Matthaei
2003] provides an excellent overview of a wide range of services in Europe. Although there is a growing body of literature
about such digital library services worldwide, there are few examples of evaluations that compare resources or usability across
multiple services. Ultimately, much of the literature is derived from the Web sites of these services themselves, most of
which contain useful reports and studies.
I approached this review from the perspective of an "informed user" whose interest in technical issues is largely circumscribed
by a desire to understand, in general terms, how technical decisions or restrictions affect the "scope and operation" of any
given service, especially in terms of the "collections" covered or "items" retrieved. Given the recent literature on holistic
approaches to digital library
evaluation, which take into account the expectations of diverse users, individually and collectively, with diverse needs, I
acknowledge that my experience reflects a single stakeholder only.
4.0 Survey Overview
Although I suggested that the review be limited to those digital library aggregation services that rely solely on the Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), ultimately a broader range of services was considered for
several reasons:
- There are numerous exemplary hybrid services which include a mix of OAI-compliant and other resources.
- OAI compliance is evolving rapidly: some services, while not presently OAI-compliant, soon will be.
- Non-OAI-compliant services can provide useful comparisons (especially in terms of purpose and functionality, both inferior
and superior) to many fully compliant sites.
Table 1 provides an overview of all sites included in this review. [1] Each site is annotated by: organizational affiliation, subject coverage, function, audience, status, and size. A summary
of findings and major issues in each of these categories follows.
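Because OAI-PMH underlies most of the services reviewed here, a brief illustration of what a harvest involves may help the non-technical reader. The sketch below, in Python, is an illustration only and is not drawn from any surveyed service: it builds a ListRecords request URL and reads identifiers, titles, and the resumption token from a response. The base URL, the sample record, and the function names are hypothetical; the verb, the oai_dc metadata prefix, and the XML namespaces are those defined by the protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a (hypothetical) repository base URL."""
    if resumption_token:
        # A resumptionToken is exclusive: it replaces every argument except the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        title_element = next(record.iter(DC_NS + "title"), None)
        title = title_element.text if title_element is not None else None
        records.append((identifier, title))
    token = root.findtext(OAI_NS + "ListRecords/" + OAI_NS + "resumptionToken")
    # An empty or missing token means the list is complete.
    return records, (token or None)


# Invented response, shaped like a one-record page of a larger result set.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2003-08-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical technical report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A harvester would repeat the request with each returned resumption token until none is returned. Production services add error handling, incremental (date-based) harvesting, and local indexing, none of which is shown here.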
4.1 Organizational Affiliation
This category identifies the host institution, agency, or consortium, along with selected funding highlights.
Critical issues:
Concerns about organizational affiliation are closely tied to issues of quality assurance, economic viability, and long-term
sustainability. Virtually all of the sites under review are sponsored by institutions of higher education or by governmental
agencies. Many are promoted by a handful of key individuals; few are fully integrated into a broad-based organizational structure.
Many address "R&D" issues related to digital libraries and have not yet transitioned to full production. Almost none of the
services make readily known their business plan, although some rely on community-based input and collaboration with varying
degrees of formal governance structures. Most were developed with external funding support. Governmental agencies supporting
the Digital Library Initiatives Phase 2 include:
In Partnership with:
The Andrew W. Mellon Foundation stands out among private foundations in its support for digital library initiatives. Waters
[2001] describes Mellon's support for seven metadata harvesting projects. Cross-disciplinary services and R&D projects have
also been supported by the Coalition for Networked Information and the Digital Library Federation.
Published literature on organizational issues, including business models, is scarce but growing:
- As noted above, Dobratz and Matthaei [2003] survey the landscape in Europe for the Open Archives Forum. The OAF published
an "Interim Review of Organizational Issues" in November 2002 and will release its final report in early September 2003. It
considers two taxonomies of business models [Rappa 2001 and Timmer 1999], commenting on their applicability to the Open Archives
Initiative in its European context.
- Greenstein and Thorin [2002] consider three stages of digital library development within the context of research libraries:
from aspiration to "skunk works;" rolling projects into programs; and from integration to interdependency.
- Chien [2002] in "Whither Digital Libraries? The Case of a 'Billion-Dollar' Business" considers the changing vision of a digital
library to render them "sustainable (technologically, socially and economically) at the Internet scale." In particular, he
draws on examples from digital government to turn it into "a business partner and research investor . . . making e-contents
accessible, useful and profitable," with references to the European Commission's Information Society eEurope (http://europa.eu.int/information_society/eeurope/index_en.htm) and Japan's e-Japan Priority Policy Program (http://www.kantei.go.jp/foreign/it/network/priority/) .
- In "Business Models of News Web Sites: A Survey of Empirical Trends and Expert Opinion," Schiff [2003] examines differences
among eight business models and summarizes them "in terms of three cross-cutting characteristics: (A) features that differentiate
the online medium from print, broadcast and cable media; (B) key variables or components that affect business operations;
and, (C) the maximizing or optimizing behavior that guides management strategy and measures their performance." The eight
models include: Advertising revenue; Online traffic; Infant industry profits and stock values; Digital content delivery; Continuous
breaking news; Information retrieval and storage; Portal conduit; and, Interactive networking.
- In "Ghosts in the Machine: People and Organization Level Issues in Distributed Libraries," Nicholson [2003] argues that: "Size
matters where cooperation and collaboration are concerned. Even in a country as small as Scotland, a loose, nationally coordinated
hierarchy of relatively small sectoral, regional, and special interest groups is the key to success. Where interoperability
between people is concerned, small is beautiful."
- Zorich's [2003] "A Survey of Digital Cultural Heritage Initiatives and Their Sustainability Concerns" summarizes the organizational
types, governance structures, business models, and sustainability concerns of thirty-three organizations or projects and five
funding agencies or foundations.
- At the NSF's June 2003 "Wave of the Future: Post Digital Library Futures Workshop," Waters [2003b], in a paper entitled, "Beyond
Digital Libraries: The Organizational Design of a New Cyberinfrastructure," recommends a new program of research on "organizations
and organizational design," arguing that Advanced Cyberinfrastructure Program Centers "would need to be informed by current
expert understandings and additional targeted research regarding organizational factors such as mission, leadership, governance,
organizational structure, legal arrangements for intellectual property and financing . . ." He further recommends "an apparatus
for incubating and supporting new organizations" that are responsive to disciplinary contexts but also "economize on the costly
duplication of services" by creating a "family (or families) of efficiently run organizations" that "take responsibility for
providing a set of common services, such as accounting, human resources, board governance and legal advice."
- Also at the NSF 2003 workshop, Van de Sompel [2003] laments the "lack of impact of the DL field . . . at the level of defining
essential building blocks for the evolution of the Web infrastructure" and proposes the creation of "centers of excellence"
as a partial answer. He also envisions a new digital library "ecology based
on distributed service provision: nodes specialize in specific tasks and exchange their services for those of nodes with other specializations" rather than the pre-digital era library model wherein "a library
is an island [peninsula] that provides each and every service."
- Finally, Lynch [2003] highlights "the entire area of the stewardship, preservation and curation of information, discourse,
knowledge, data, and culture. There are tremendous technical, economic, legal and political problems here; much progress has
been made in mapping these problems, but much less in developing solutions." He suggests that these also need to become public
policy goals and be examined in relation to "national security, or the protection of a nation's cultural heritage."
4.2 Subject Coverage
Services are broadly categorized by subject as: cross-disciplinary; cultural heritage; science; humanities; and language resources.
Critical issues:
Given the funding streams, it is not surprising that major initiatives cluster around the mission of those governmental agencies
supporting the Digital Library Initiatives Phase 2, predominantly in the sciences and cultural heritage. The participation
of scholarly societies and commercial publishers is evident in the sciences. Meanwhile, the cultural heritage services bring
together the museum, library, archive, and special collections sectors. Communities of practice are also forming around disciplines
(e.g., geosciences); audiences (e.g., K-12 educators); media (e.g., sheet music, images); software (e.g., eprints.org, DSpace,
Arc); or philosophies (e.g., preserving endangered languages, open access to scholarly communication).
Disciplinary differences in scholarly communication have been studied by Kling et al. [2000, 2002]; they argue that "it's
not just a matter of time" before all branches of the sciences join the preprint movement. Brown [2002, 2003] and Lawal [2002]
also survey differences in the adoption of preprints in various scientific disciplines. Articles about subject-based digital
aggregation services are only beginning to appear in disciplinary journals [Johnston 2003; Lundmark 2003]. Much of the literature
is produced by those who have created various services, e.g., Cole and other principal investigators in the special issue
devoted to OAI of Library Hi Tech [2003]. Both the Public Library of Science and DSpace have been the subject of recent mainstream newspaper coverage, focusing in part on the economic dynamics of the open access
movement. [2]
4.3 Function
Function is extracted in large part from the descriptions at each site as it relates to the service's primary mission. The
categories are:
- open access e-print archives and servers
- cross-archive search services and OAI aggregators
- from digital library collections to digital library environments
- from peer-reviewed referratories to portal services
- specialized search engines
Selected relevant published literature is discussed in each section along with disciplinary and national differences. Grouping
services by function is impeded by the issues outlined below and is discussed at greater length in the report.
Critical issues:
- the conflicting and overlapping definitions of concepts such as digital libraries, virtual libraries, portals, etc.
- the complexity of many services which don't lend themselves readily to solitary functional "encapsulation"
- the dynamic and innovative nature of these services which fuels their capacity to change functionality or scope
- the way in which successful data providers attract multiple new services, creating new levels of aggregation and customized
functionality
4.4 Audience
Audience identifies the primary targeted users as: academic community, research community, educators, digital library developers,
or "interested public," although the latter could be attached to virtually all of them.
Critical issues:
Two countervailing trends are evident: serving multiple audiences for multiple uses versus serving a specialized audience for restricted
uses. Many services purport to serve multiple audiences although they are primarily designed by and for the scholarly community.
Others expect to serve a broad and diverse set of constituents, such as the NSDL, which has three audiences: users, content developers, and supporters (financial and political). Moreover, NSDL aims to serve users who are predominantly educators as well as users who have an interest in science in general. NSDL aims to provide the technical space, training, and tools for each constituency to use its collections appropriately. As the
concept of reusable or repurposed digital assets gains acceptance, digital libraries may routinely support multiple user communities
for multiple uses.
The counter-trend is services that are tailored to the particular needs of specialists and where some (or all) resources may
only be available to members or subscribers.
4.5 Status
The services' "status" is noted as: experimental; pilot; under development; or, established. However, the latter term is used
advisedly and is probably better conveyed as "evolving" because even the longest-lived sites adapt in response to new technology
or may have unstable funding.
Critical issues:
Status is a moving target: Arc is stable in terms of its technical underpinnings, but as a cross-archive search service, it is experimental and its financial
base is uncertain. OAIster, initially grant-funded, continues to improve its search functionality "as time permits." Perseus, created more than a decade ago, describes itself as "evolving." DLESE states that it is funded through 2007. These examples are characteristic of the overall ambiguous status of most of these
services.
4.6 Size
Size is expressed in varying ways contingent on what was most readily available at the site, but including such measures as
the number of institutional members, archives groups, or records. Size is difficult to measure and interpret for the reasons
outlined below.
Critical issues:
- Size can change rapidly and growth or reduction must be interpreted with care. For example, although OAIster attempts to "de-dup" records among overlapping services, it harvests data from the Open Language Archives Community (OLAC) aggregator as well as from some of the individual repositories that comprise OLAC, such as Ethnologue and Talkbank. As a result, it is difficult to determine the actual number of "distinct" repositories covered by OAIster. This overlap also results in duplicate records when searches are performed. (OAIster is by no means an isolated example
of these problems.)
- Close examination may reveal that a handful of archives account for the preponderance of records. For example, when OCLC provided
OAI-compliant data from an extract of WorldCat's theses and dissertations (XTCat), it suddenly made available 4.3 million records for harvesting. As a result, any service provider (such as Arc) which has harvested all of these records grew tremendously in size. Meanwhile, OAIster limits its harvest of XTCat to the subset of 8,259 full-text items representing electronic theses and dissertations.
- The UIUC Gateway to Cultural Heritage Materials presents another interesting case study in changes in size. At its peak, the Gateway contained about 3.5 million metadata records, provided by a total of 39 metadata providers (both OAI-compliant and surrogates).
However, the majority of these records described non-digital content resources (i.e., print and artifacts). Moreover, it included
metadata that was made available via other means than OAI-PMH, most notably 2.4 million Dublin Core records derived from EAD
finding aids. UIUC decided to refocus its effort on metadata records describing digital resources and those derived from OAI-compliant metadata
providers only. It removed all EAD collections, which it had broken apart into multiple item-level descriptors. The 8,500
EAD (Encoded Archival Description) records generated more than two million item-level records, which were removed from the
database. [3] And when CIMI ("museum intelligence" consortium) shut down its testbed of 185,000 OAI-compliant museum records in early 2003,
UIUC's coverage was further reduced. UIUC now harvests from 25 institutions and its repository contains 413,563 records. [4]
- The paradox of size: Critical mass is important. As repositories grow in size, they become more valuable; however by "being
large and general, they are less easily tailored to individual uses" [Borgman 2003]. So at the same time that the NSDL is pushing to increase its size, it is also creating specialized portals to help different constituents filter its resources.
Wiederhold [2003] refers to the "crucial task" of reducing the "available information to actionable information, i.e., the
specific information that will cause a change in behavior, a reduction in further work, or the making of decisions" and describes
the technologies to filter information that are "rapidly moving to harder and more speculative tasks."
Meanwhile Schatz [2003] asserts: "In the future, online information will be dominated by small collections maintained and
indexed by small groups." He argues that "the Net has already made the transition from data transmission to information retrieval"
and that it is "in the process of making the transition from information retrieval to knowledge management." Whereas the Grand
Challenge in the 1990s was posed as "semantic interoperability across digital collections," Schatz proposes that the Grand
Challenge in the 2000s will be "conceptual navigation across community repositories."
Bearing these issues in mind, Table 1 is offered as a summary of key characteristics of each service. (See Appendix 2.)
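The effect of overlapping harvests on reported size can be made concrete with a small sketch. The Python example below is hypothetical (the source names, identifiers, and titles are invented) and assumes that an aggregator passes along each record's original OAI identifier unchanged, which is not guaranteed in practice; where identifiers differ, matching on titles or other metadata would be needed instead.

```python
from collections import Counter


def merge_harvests(harvests):
    """Merge harvested records, keeping one copy per OAI identifier.

    `harvests` maps a source name to a list of (identifier, title) pairs.
    Returns (unique_records, raw_count, per_source_counts).
    """
    unique = {}
    per_source = Counter()
    raw_count = 0
    for source, records in harvests.items():
        for identifier, title in records:
            raw_count += 1
            per_source[source] += 1
            # Keep the first copy seen; later copies of the same identifier are duplicates.
            unique.setdefault(identifier, title)
    return unique, raw_count, per_source


# Invented data: one member archive is harvested directly and again via an aggregator.
EXAMPLE_HARVESTS = {
    "community-aggregator": [
        ("oai:archive-a:1", "Word list"),
        ("oai:archive-a:2", "Field recording"),
        ("oai:archive-b:1", "Grammar sketch"),
    ],
    "member-archive-a": [
        ("oai:archive-a:1", "Word list"),
        ("oai:archive-a:2", "Field recording"),
    ],
}

if __name__ == "__main__":
    unique, raw_count, per_source = merge_harvests(EXAMPLE_HARVESTS)
    print("records harvested:", raw_count)
    print("distinct records:", len(unique))
    print("records per source:", dict(per_source))
```

In this invented case five harvested records reduce to three distinct items, the kind of gap between raw and distinct counts that makes reported sizes hard to compare across services.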
5.0 Identifying Clusters By Function
5.1 Challenges to Categorization
After identifying the stated purpose or core function of these services, the greatest challenge lay in attempting to devise
a broader framework which would cluster them and help to inform a comparative analysis. This exercise was hindered by several
factors including:
- The conflicting and overlapping definitions (or lack thereof) of concepts such as digital libraries, virtual libraries, portals,
gateways, archives, repositories, e-print archives, collections, digital objects, digital assets, and learning objects;
To cite a few examples: AARLIN (Australian Academic Research Library Information Network) refers to itself as a "collaborative library service," a "research portal," and a "national virtual research library system."
Meanwhile, the Open Language Archives Community is "an infrastructure for distributed archiving of language resources," a "worldwide virtual library," and a "network of
language archives conforming to the Open Archives Initiative." SMETE: Science, Math, Engineering & Technology Education Library is a "dynamic online library and portal of services." Labeling itself a "digital library," SMETE is a "collection of collections" and a "community of communities."
As far as labeling the data providers embedded within aggregators: Arc refers to them as "archives groups"; OAIster calls them "institutions"; Cyclades and UIUC's Digital Gateway to Cultural Heritage Materials refer to them as "collections." The Open Archives Initiative's registry of OAI service and data providers refers to both
the aggregators and their component entities as "repositories." The National Science Digital Library (NSDL) refers to digital library services as "collections."
- The complexity of many services which don't lend themselves readily to solitary functional "encapsulation";
The National Science Digital Library (NSDL) aims to serve three broad constituencies: the generally curious (interested in science and research information), the NSDL developer community and partners, and funding agencies and supporters. [5] A forthcoming reorganization of "nsdl.org" in October 2003 is intended to better reflect the needs of different constituents.
NSDL seeks to provide the technical space, training, and tools for each of these audiences to use its resources appropriately.
Meanwhile, NSDL covers 199 "collections," comprising 301,702 items derived from both NSF-funded projects and NSDL-selected sites. In addition, it has "services" available to help developers create digital content or to assist educators
in evaluating and selecting digital resources. NSDL will also prototype several specialized portals to satisfy different sub-groups, e.g., middle school science teachers. As
a result, NSDL serves different core functions for different audiences.
- The dynamic and innovative nature of these services which fuels their capacity to change functionality or scope;
In July 2002, the NASA Technical Reports Server (NTRS) launched the new version of its site, changing its architecture from distributed searching to metadata harvesting. Nelson
et al. [2003] discuss the impact of this change; however, from a user perspective,
several factors about the transition are noteworthy. The former collection constituted approximately 4.5 million abstracts
and 300,000 full-text publications. As of August 4, 2003 the number of records in NTRS was 553,921, of which slightly more than half are full-text. However, NASA-agency full-text records number fewer than 15,000
and it is the newly introduced non-NASA archives (Aeronautical Research Council, UK; arXiv; BioMed Central; and OSTI's Energy Citation database) that account for almost all of the full-text content. Populating the new service with NASA-agency full-text data remains
a priority. The new architecture makes it possible to search all the contents of NTRS by default. In addition, it offers both simple and "advanced search" functions. In the "advanced search" function, a search
can be limited to specified NASA or non-NASA agencies. Two final points: (1) while widely regarded as an "e-print archive,"
NTRS is only about 50% full-text, mostly harvested from external non-NASA agencies and (2) with the inclusion of non-NASA archives,
NTRS's subject scope has broadened.
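To make the shift from distributed searching to metadata harvesting more concrete, the following is a minimal Python sketch of the harvesting side under the OAI-PMH. It is not NTRS's implementation. The base URL, the sample response, and the function names are hypothetical and chosen only for illustration; the `ListRecords` verb and the `metadataPrefix` and `resumptionToken` arguments come from the protocol itself. The sketch parses a canned response, so it makes no network calls.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and unqualified Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a (hypothetical) data provider."""
    if resumption_token:
        # A resumption token is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) extracted from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", OAI_NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS),
            "title": rec.findtext(".//dc:title", namespaces=OAI_NS),
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # An empty or missing token means the list is complete.
    return records, (token.strip() if token and token.strip() else None)


# Made-up response used only to exercise the parser.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample technical report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

if __name__ == "__main__":
    print(build_list_records_url("http://archive.example.org/oai"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

In a harvesting architecture the service provider runs requests like these on a schedule and searches its own local copy of the metadata, whereas distributed searching sends each user query to every participating archive at search time.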
The Perseus Digital Library originally concentrated on the development of collections, tools, and services to support classicists. However, after fifteen
years of experience, it now comprises both third-party collections and those created for experimental purposes in seven different
subject areas. Its future research agenda will focus on designing services that work with diverse collections and audiences. [6] It serves as a research bridge between cultural heritage digital libraries and the NSDL. With Johns Hopkins University, it received NSF funding starting in January 2003 to build a service for managing authority
lists for customized linking and visualization for NSDL, based on tools already in use at Perseus. [7]
- The way in which successful data providers attract multiple new services, creating new levels of aggregation and customized
functionality;
The highly successful e-print archive and server for physics (and related disciplines), arXiv (an OAI-registered data provider), now forms the core of the Citebase repository along with two other major e-print archives -- Cogprints and BioMed Central. Meanwhile, Citebase is registered as both an OAI data provider and an OAI service provider. Citebase offers an experimental search service that includes impact analysis and reference and citation linking. ArXiv also figures prominently in Archon, which identifies itself as a "digital library that federates physics collections with varying degrees of metadata richness."
Archon will provide a unified search interface to diverse collections in physics, with sponsorship from LANL, Arc, the American
Physical Society, the CERN Document Server, OAI, and Old Dominion University. Archon is a "collection" within the NSDL.
Despite these difficulties in categorization, services were grouped together according to my understanding of their core value
and mission. While clues were taken from how the services described themselves, I didn't always adhere to their self-analysis
due to their overlapping use of terms as noted above, or because even though they referred to themselves as "digital libraries"
or "portals" they didn't exhibit the same characteristics as other entities falling within that rubric.
The resulting categories are all open to debate. The differences among categories are subtle -- a matter of nuance or interpretation -- and
their boundaries are fluid. Ultimately, the framework can best serve as an isolated effort to organize similar services together
so I could use them to exemplify certain characteristics and trends. Depending on the user's perspective and needs, any given
service could fall into a different category. In general, all the services under consideration are acknowledged to be exemplary,
strive to excel, and offer quality assurance to users both in terms of the authority of their content and the qualifications
of their producers.
5.2 Categories
5.2.1 Open Access E-Print Archives and Servers
This category includes scientific open access repositories that aim to provide access to full-text preprints, post-prints,
technical reports, or other research output. The three examples represent a range in purpose from rapid dissemination of research
findings without peer review, to public dissemination of scientific technical reports, to a publisher-based journal archiving
system intended to preserve access to digital copies of articles. They aim to enhance open access to scientific scholarly
communication and support the concept of "self-archiving" whether initiated by the author, the institution, or the publisher.
This category of services has been cited in the mainstream news media for its efforts to challenge prevailing economic and
publishing traditions.
5.2.2 Cross-Archive Search Services and OAI Aggregators
This category includes three broad-based, interdisciplinary cross-archive search services, one of which is a European collaborative
that has introduced extended services layered on top of the repository. It then considers a set of more focused OAI metadata
harvesting services and aggregators, grouped into either community-based or subject-based categories. The three community-based
aggregators each have a different approach to building communities of practice, but they all aim to develop an organized federation
of data providers who agree to adhere to certain philosophical and technical principles. These repositories serve, in large
part, as union catalogs, providing a unified search interface to data at various levels of granularity. Finally, four examples
of subject-based aggregators are discussed -- one in cultural heritage and three in the sciences. Those in the sciences illustrate
a narrowing of subject focus along
with increasingly sophisticated functionality -- ranging from a basic repository of scientific e-prints and journals, to a testbed
for e-prints with the potential for citation analysis and linking, to a federated collection that aims to serve as an authoritative
physics "digital library" with extended services. Given the short history of the Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH) on which these services are based, it is not surprising to find that most of these services have been
created over the past few years and that many are experimental.
5.2.3 From Digital Collections to Digital Library Environments
This section considers a set of services that is evolving over time to greater complexity, starting with two examples from
the cultural heritage sector that were created to heighten the use and visibility of primary resources and whose foundation
is based on digitized collections. It continues with an example of how a discipline-focused digital resource is evolving over
time into a broader testbed for research to improve digital library functionality. The Perseus Digital Library thus serves as a research bridge between cultural heritage digital collections and scientific digital libraries, which are
examined more closely. This section also covers the National Science Digital Library (NSDL) -- an extraordinarily rich and complex service -- that strives to become a comprehensive digital library for the sciences, as well
as four of its component "collections," which are independent and highly sophisticated digital libraries in and of themselves.
These four are targeted for educators in the sciences at various academic levels.
The services in this category typically have more fully developed infrastructures -- including evolving governance structures
and policies for collection development, contributing data, or privacy of use. Many require users to register in order to
obtain full benefits. They also offer a range of services such as conferences, workshops, or professional development opportunities;
e-mail news alert services; and opportunities to personalize services or to identify potential colleagues -- characteristics
that are also associated with "portals" as discussed below. Most represent a trajectory that moves beyond digital collections
to digital library environments, as characterized by Lynch [2002]. Some also begin to cross the boundaries between digital
library environments and digital learning environments as advocated by McLean and Lynch [2003].
5.2.4 From Peer-Reviewed Referratories to Portal Services
This category explores a set of Web or resource directories with different approaches to achieve quality-controlled content
for an academic clientele. [8] They are labeled as "referratories" because they don't develop content or collections of their own, but rather refer the
user to other sources of information. MERLOT has developed an advanced, distributed, national peer review system overseen by editorial boards and focused on expert-selected
"learning materials" for college and university educators. Two other multidisciplinary, academically-oriented
resource directories -- one developed by a network of U.S. librarians and the other undertaken by disciplinary-based consortia
in the UK -- use a combination of expert-selected and machine-generated selections (including OAI harvesting) to build their
"collections." Exhibiting features for personalization and collaboration, these services become "portals" to selected Internet
resources and beyond, as exemplified by the UK projects.
AmericanSouth, a project under development by a network of Southern research institutions, based at Emory University, shows the potential
of an OAI-based repository to become a "portal" for a specific user community through the implementation of customized services
that facilitate scholarly communication. It is considered in this category rather than as an OAI aggregator, because its content
will also draw on other sources such as local institutional catalogs and Internet resources.
Finally, two examples of research library portals under development, respectively in the U.S. and in Australia, are examined.
Both of them are concentrating first on developing single search capability across licensed databases and local library catalogs
relying on Z39.50 technology. In time, they may also develop the capacity to gather OAI-harvested data into their searches.
They feature personalization services characteristic of portals.
5.2.5 Specialized Search Engines
This category includes three examples in the sciences: one is a proprietary in-house system for federated searches conducted
primarily across locally loaded licensed databases and the local OPAC, and two are focused Web crawlers, capturing data from
OAI-compliant and other Internet sources. These are presented as alternatives to generic, but hugely popular search engines
such as Google or AltaVista.
Table 2: Overview of Core Functions and Services
| CORE FUNCTION |
SERVICES |
OPEN ACCESS E-PRINT ARCHIVES AND SERVERS
- Open access to full-content via the Internet
- Typically author or institutional self-archiving
- Include:
- Journal articles
- Preprints & post-prints
- Technical reports
- Book chapters
- Conference papers
- Research output, including theses and dissertations
- May or may not be refereed.
[Warner 2003 based on Pinfield et al. 2002] http://www.ariadne.ac.uk/issue31/eprint-archives/intro.html |
PHYSICS: AUTHOR SELF-ARCHIVING W/OUT PEER REVIEW FOR RAPID DISSEMINATION
- arXiv
TECHNICAL REPORTS
- NASA Technical Reports Server
VOLUNTARY PUBLISHER-BASED JOURNAL ARCHIVE OF PEER-REVIEWED ARTICLES
- PubMed Central
|
CROSS-ARCHIVE SEARCH SERVICES & AGGREGATORS
- OAI metadata harvesting services and aggregators
- Search/discover gateways
- Information retrieval systems
- Indexes w/ unified search & browse features
- Function like union catalogs w/ enhancements
- Current status predominantly experimental
- Mix of collection-level and item-level access
|
GENERAL OAI SERVICE PROVIDERS
- Arc
- OAIster
- Cyclades
COMMUNITY-BASED ARCHIVES
- NDLTD Union Catalog (Networked Digital Library of Theses & Dissertations) (XTCat) Electronic Theses/Dissertations OAI Union Catalog based at OCLC [NDLTD]
- Open Language Archives Community (OLAC)
- Sheet Music Consortium
SUBJECT-BASED AGGREGATORS
- UIUC Digital Gateway to Cultural Heritage Materials
SCIENCE & ENGINEERING ARCHIVE
- Grainger Engineering Library at UIUC
SELECTED E-PRINT REPOS W/ CITATION AND IMPACT ANALYSIS + REFERENCE & CITATION LINKING SERVICE
- Citebase
FEDERATION SERVICE FOR PHYSICS
- ARCHON |
FROM DIGITAL COLLECTIONS TO DIGITAL LIBRARY ENVIRONMENTS
- Collection of tools that make content alive
- Help user find content, manipulate, analyze, annotate, and comment on it
- Attract, create, define a community
- Collaboratories where active group annotation, analysis, and creation of new knowledge happens
- Enable and facilitate implicit communication (e.g., recommender systems)
- Sum greater than its parts
[Lynch 2002]
|
CULTURAL HERITAGE COLLECTIONS
- American Memory
- Colorado Heritage (Colorado Digitization Program)
HUMANITIES
- The Perseus Digital Library
SCIENCES
- National Science Digital Library
- SMETE Digital Library (Science, Math, Engineering & Technology Education Digital Library)
- ENC Online (Eisenhower National Clearinghouse for Mathematics and Science Education)
- BEN: A Digital Library of the Biological Sciences for Biology Teaching
- DLESE: Digital Library for Earth System Education
|
FROM RESOURCE DIRECTORIES & REFERRATORIES TO PORTAL SERVICES
- Quality-controlled subject gateways
- Resource selection, discovery, annotation
PORTAL SERVICES
- Collaborative information research service
Elements
- Intuitive and customizable Web interface
- Personalized content presentation
- Security and Authentication
- Communication and collaboration
Components
- Single-search interface
- User authentication
- Resource linking
- Content enhancement
[Boss 2002]
|
PEER REVIEWED LEARNING RESOURCES
- Merlot (Multimedia Educational Resource for Online Learning & Teaching)
EXPERT & MACHINE-GATHERED INTERNET RESOURCES
- InfoMine Scholarly Internet Resource Collections
- UK: Subject Portals Project of the Resource Discovery Network
SCHOLAR-DESIGNED OAI PORTAL
- AmericanSouth
RESEARCH LIBRARY PORTALS W/ ACCESS TO PROPRIETARY DATABASES
|
| SPECIALIZED SEARCH ENGINES
|
SCIENCES
LANL FEDERATED SEARCH IN-HOUSE PROPRIETARY + SELECTED PREPRINTS + LIBRARY CATALOG
- Flashpoint
COMPUTER SCIENCE WEB CRAWLER W/ REFERENCE LINKING, CITATION ANALYSIS, & RECOMMENDER SYSTEM
- CiteSeer (aka ResearchIndex)
ELSEVIER WEB CRAWLER: SELECTED OAI REPOS + PROPRIETARY + WEB
- Scirus |
6.0 Comparative Review By Function
6.1 Open Access E-Print Archives and Servers [9]
OPEN ACCESS E-PRINT ARCHIVES AND SERVERS
- Open access to full-content via the Internet
- Typically author or institutional self-archiving
- Include:
- Journal articles
- Preprints & post-prints
- Technical reports
- Book chapters
- Conference papers
- Research output, including theses and dissertations
- May or may not be refereed.
[Warner 2003 based on Pinfield et al. 2002]
|
PHYSICS: AUTHOR SELF-ARCHIVING W/OUT PEER REVIEW FOR RAPID DISSEMINATION
- arXiv
TECHNICAL REPORTS
- NASA Technical Reports Server
VOLUNTARY PUBLISHER-BASED JOURNAL ARCHIVE OF PEER-REVIEWED ARTICLES
- PubMed Central |
As Warner [2003] points out, definitions of e-prints vary widely from general meanings -- an e-print is a collection of digital documents -- to more restricted interpretations -- author self-archived preprints only. Warner uses the term e-print to "group together many forms of
scholarly literature for which there is open access to the full content via the Internet. E-prints may include: journal articles;
preprints; technical reports; books; theses; and dissertations." [10]
It is appropriate to begin this survey with e-print archives because the Open Archives Initiative grew "from the 1999 Santa
Fe Universal Preprint Service meeting [Ginsparg et al. 1999] and the Santa Fe Convention [Van de Sompel and Lagoze 2000],
with the intention of improving scholarly communication through improved interoperability between e-print archives." [11] Moreover, e-print repositories represent a significant percentage of OAI data providers. [12] According to a survey conducted by Warner in October 2002, 54% of registered OAI data providers include metadata about e-prints. [13]
6.1.1 Physics: arXiv
ArXiv is the earliest, largest and most successful example of a subject-based e-print archive. What began in the early 1990s "as
an experimental means of circumventing recognized inadequacies of research journals" quickly became the "primary means of
communicating ongoing research information in formal areas of high energy particle theory." [14] ArXiv is based on a process of author self-archiving without peer review. It was widely accepted by this research community because
of the pre-existing "preprint culture," which recognized the need for the rapid dissemination of research results without
awaiting the time delays involved in peer review and formal publication. In addition to physics, arXiv now also covers mathematics, nonlinear science, and computer science, all overseen by advisory boards. In 2002, there were
over 20 million full-text downloads from arXiv. [15] Over the past year, monthly submissions to arXiv have averaged between 3,000 and 3,500. (ArXiv is one of the few repositories to make usage statistics readily available at its site.) ArXiv currently has about 230,000 items, all of which are full-text articles, technical reports, or theses.
In a 2003 submission, "Can Peer Review be Better Focused?", Ginsparg [2003] discusses the characteristics of arXiv that account for its continued success: "From the outset, a variety of heuristic screening mechanisms have been in place
to ensure insofar as possible that submissions are at least of refereeable quality . . . These mechanisms are
an important -- if not essential -- component of why readers find the site so useful: though the most recently submitted articles
have not yet necessarily undergone formal review, the vast majority of the articles can, would, or do eventually satisfy editorial
requirements somewhere." [16] He goes on to suggest how various impact measures could be used in the preprint environment to bring greater efficiency to
the full peer review process by focusing it on a smaller subset of submissions, but also one with a higher likely acceptance
rate.
It is possible to search arXiv by date and sub-group within physics or by date and group for math, non-linear science, and computer science. It supports
searches by author, title, full record, comments, journal-reference, subject-class, or report number with Boolean operators.
Help and examples of search functions appear directly on the search page. There is also a forms-based interface to searching
that permits different views (new abstracts, last update, recent, etc.). A "catch-up" function allows users to review new
records, with or without abstracts, within the dates specified. There are several options to download files. The "Help" feature
contains general information as well as information about browsing and instructions for those submitting papers. There is
a FAQ. "What's new" informs users of changes to the site, e.g., on July 6, 2003: "A new and more sophisticated author registration
system has been put on-line. It provides greater administrative flexibility and better user support, including user ability
to maintain past submissions."
6.1.2 Technical Reports: NASA Technical Reports Server (NTRS)
NASA's technical reports server, NTRS, seeks to collect, disseminate, and archive the "unclassified, unlimited" NASA-authored scientific and technical literature
related to aeronautics. As discussed above, NTRS exemplifies some of the difficulties in making the technological transition from distributed searching to metadata harvesting.
Populating the database, in particular with NASA-authored data, remains a priority. The new NTRS version does provide a unified search interface to reports from ten NASA agencies and four non-NASA agencies, receiving about
6,000 to 7,000 searches monthly. Of the 555,358 records, NASA-authored reports account for less than half. Only about 50% of all items
are available in full text, and NASA reports make up less than 5% of those full-text items. Search results,
which include extensive abstracts, clearly indicate if a digital version is available, and when it is not, provide information
about ordering it. Four key NASA agencies are not part of NTRS and must be searched separately. NTRS has a useful update feature where it is possible to search the records added to all of the archives or to specific archives
on a weekly basis (up to the past four weeks) or by entering a specific date stamp. Table 3 summarizes the contents as of August 19, 2003. [17]
Table 3: NTRS Contents
| NASA ARCHIVES | NUMBER OF METADATA RECORDS |
| GENESIS (NASA Jet Propulsion Laboratory) | All full text: 27 |
| NASA Ames Research Center | Metadata indexed but full text quarantined because they haven't been reviewed. Records: 354 |
| NASA Center for AeroSpace Information (CASI) | 100 full-text documents out of 256,637 |
| NASA Goddard Institute for Space Studies | All full text: 1,335 |
| NASA Goddard Space Flight Center | Metadata indexed but full text quarantined because they haven't been reviewed. Records: 11 |
| NASA Johnson Space Center | All full text: 128 |
| NASA Kennedy Space Center | Metadata indexed but full text quarantined because they haven't been reviewed. Records: 82 |
| NASA Langley Research Center | All full text: 3,948 |
| NASA Marshall Space Flight Center | All full text: 498 |
| NASA Stennis Space Center | Metadata indexed but full text quarantined because they haven't been reviewed. Records: 39 |
| National Advisory Committee for Aeronautics (NACA) | All full text: 7,639 |
| RIACS (NASA Ames Research Center) | All full text: 61 |

| NON-NASA ARCHIVES | METADATA RECORDS |
| Aeronautical Research Council (UK) | All full text: 2,647 |
| arXiv Physics Eprint Server | All full text: 243,707 |
| BioMed Central | All full text: 17,507 |
| Energy Citation Database (OSTI) | 7,000 full-text articles out of 20,738 |

| PREVIOUSLY INCLUDED BUT NOT IN THE NEW OAI NTRS | RELATED WEB SITES |
| NASA Astrophysics Data System: (1) Astronomy & Astrophysics, (2) Physics & Geophysics, (3) Space Instrumentation, or available via (4) The Astrophysics Data System (ADS). The ADS is a NASA-funded project which maintains four bibliographic databases containing more than 3.3 million records: Astronomy and Astrophysics, Instrumentation, Physics and Geophysics, and preprints in Astronomy. The main body of data in the ADS consists of bibliographic records, which are searchable through our Abstract Service query forms, and full-text scans of much of the astronomical literature which can be browsed through our Browse interface. | Searchable at: (1) http://adsabs.harvard.edu/abstract_service.html (2) http://adsabs.harvard.edu/physics_service.html (3) http://adsabs.harvard.edu/instrumentation_service.html (4) http://adswww.harvard.edu/ |
| NASA Dryden Flight Research Center | Searchable at: http://www.dfrc.nasa.gov/DTRS/ |
| NASA Glenn Research Center | Searchable at: http://gltrs.grc.nasa.gov/ |
| NASA Jet Propulsion Laboratory | Searchable at: http://gltrs.grc.nasa.gov/ |
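The proportions quoted for NTRS (NASA records under half of the total, roughly 50% full text, NASA under 5% of the full text) can be checked against the counts in Table 3. The Python sketch below is one way to do that arithmetic. It rests on two assumptions that are not stated in the table: archives whose full text is "quarantined" are counted as contributing zero full-text items, and the "7,000" figure for the Energy Citation Database is treated as exact. The dictionary labels and the `summarize` function are illustrative names only.

```python
# (metadata records, full-text items) per archive, copied from Table 3.
# Assumption: quarantined archives contribute 0 available full-text items.
NASA = {
    "GENESIS": (27, 27),
    "Ames": (354, 0),
    "CASI": (256_637, 100),
    "Goddard Institute for Space Studies": (1_335, 1_335),
    "Goddard Space Flight Center": (11, 0),
    "Johnson": (128, 128),
    "Kennedy": (82, 0),
    "Langley": (3_948, 3_948),
    "Marshall": (498, 498),
    "Stennis": (39, 0),
    "NACA": (7_639, 7_639),
    "RIACS": (61, 61),
}
NON_NASA = {
    "Aeronautical Research Council (UK)": (2_647, 2_647),
    "arXiv": (243_707, 243_707),
    "BioMed Central": (17_507, 17_507),
    "Energy Citation Database (OSTI)": (20_738, 7_000),  # assumed exact
}


def summarize(nasa, non_nasa):
    """Total the record and full-text counts and derive the quoted shares."""
    everything = {**nasa, **non_nasa}
    total_records = sum(records for records, _ in everything.values())
    total_full = sum(full for _, full in everything.values())
    nasa_records = sum(records for records, _ in nasa.values())
    nasa_full = sum(full for _, full in nasa.values())
    return {
        "total_records": total_records,
        "total_full_text": total_full,
        "nasa_records": nasa_records,
        "nasa_full_text": nasa_full,
        "nasa_share_of_records_pct": round(100 * nasa_records / total_records, 1),
        "full_text_share_pct": round(100 * total_full / total_records, 1),
        "nasa_share_of_full_text_pct": round(100 * nasa_full / total_full, 1),
        "arxiv_share_of_full_text_pct": round(100 * non_nasa["arXiv"][1] / total_full, 1),
    }


if __name__ == "__main__":
    for key, value in summarize(NASA, NON_NASA).items():
        print(f"{key}: {value}")
```

Under these assumptions the records sum to 555,358, matching the total given in the text, with 284,597 full-text items, of which 13,736 come from NASA archives.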
6.1.3 Voluntary Publisher-based Journal Archive: PubMed Central
Launched in February 2000, PubMed Central (PMC) is a digital archive of life sciences journal literature maintained by the National Center for Biotechnology Information
(NCBI) at the U.S. National Library of Medicine. It provides free and unrestricted access to some 100,000 full-text articles
from over 130 journals. PMC contains all peer-reviewed primary research articles from every participating journal; other content is made available at
the discretion of the journal editor (e.g., letters, essays, and reviews). It strives to provide open access to this literature
in perpetuity. Journals may deposit the full text of articles with PMC and release it immediately upon publication or delay its release for a specified period. Participation in PubMed Central (PMC) is voluntary and open to any life sciences journal that either is covered by one of
the major abstracting and indexing services such as MEDLINE, Agricola, Biosis, Chemical Abstracts, EMBASE, PsycINFO or Science
Citation Index, or (if a new journal) has at least three members on its editorial board who currently are principal investigators
on research grants from major funding agencies (such as NIH) in the U.S. or abroad.
PMC provides unified search capability across more than 75 life science journals and all 57 core journals published by BioMed Central (BMC). PMC allows journals to maintain their distinct identity by supplying the journal logo at the top of each page (with a link to
the journal's own site) and by running the journal's "watermark" the length of each page. At present PMC's coverage is limited to English-language journals. All articles in PMC are also indexed in PubMed, the online index and abstracting service of the National Library of Medicine, which includes Medline. [18]
As explained by Edwin Sequeira of NCBI in a 2003 article: "The standard PMC search technique is labeled 'SmartSearch,' reflecting
the fact that it is based on an automated analysis of the title, abstract, and full text of each article. SmartSearch is intended
to increase the relevance of one's search results. It includes intelligent phrase recognition and does not search every word
in an article as a simple full-text search would do (although it is also possible to do the latter if one wishes)." PMC also offers extensive search features, including automatic term mapping that matches unqualified terms against a MeSH (Medical
Subject Headings) Translation Table, a Journals Translation Table, a Phrase List, and an Author Index. Terms can be qualified
using search field tags and date ranging. It is possible to limit your search to specific search fields, to preview the search
results before displaying the citations and to refine your search. Results can be sorted according to various options. You
can save a text file of citations on your computer with results up to a maximum of 10,000
items. There are options to print or e-mail results from the clipboard that holds up to 500 citations. These and other search
features are explained at length at the site's Help page: http://www.ncbi.nlm.nih.gov/entrez/query/Pmc/pmchelp.html.
PMC is also extending its content through a systematic scanning program of back runs of journal articles. Sequeira reports that
about a year ago: "NLM offered to scan any issues of a PMC journal that are not already available in electronic form, in return
for permanent rights to archive and distribute the scanned material freely. Almost all the current PMC journals that have
pre-electronic issues are participating in the project, as are the 20-plus specialist journals of the BMJ Publishing Group,
whose current content will be added to PMC later." The back issue digitization project is described more fully at the PMC
Web site: http://www.pubmedcentral.gov/about/scanning.html.
OAI access to PMC is anticipated by mid-September 2003, including access to much of the BMC content, as well as to the new PLoS Biology journal and any other open access journals. [19]
6.1.4 Summary of Issues
All three of these e-print archives support the concept of open access and self-archiving (by author, by agency, or by publisher).
ArXiv and NTRS are registered OAI data providers and OAI access to PMC is anticipated by mid-September 2003. While they also all aim to speed up access to research findings, each of them also
illustrates a different purpose: arXiv serves primarily for the rapid dissemination of research findings without peer review based on author self-archiving; NTRS aims to distribute scientific and technical literature quickly and widely through agency-based archiving; and PubMed Central promotes publisher-based archiving in order to preserve life sciences journal articles in electronic form. Both arXiv and PubMed Central provide access to full content only. NTRS, on the other hand, is about 50% digital full text -- most of it from arXiv -- with many NASA reports requiring a purchase in hard copy or microfiche.
These three services also highlight disciplinary differences. While physics has a tradition of distribution of preprints without
peer review, acceptance varies even among its sub-fields [Brown 2002]. Lawal [2002] discusses some of the underlying reasons
for varying rates of adoption by researchers in nine scientific disciplines including chemistry, biological sciences, engineering,
cognitive science and psychology, mathematics and computer science, physics, and astronomy. She found widest adoption in physics,
followed by mathematics, and the least in chemistry. Publishers' policies are a primary factor in chemistry's non-use of preprint
archives. Brown [2003] surveyed both the authors of e-prints appearing in the Chemistry Preprint Server (CPS), operated by Elsevier, and the editors of top chemistry journals about their acceptance of CPS e-prints. She notes that while authors found CPS "a convenient vehicle for dissemination
of research findings and for receipt of feedback before submitting to a peer-reviewed journal, reception of CPS e-prints by
editors of top chemistry journals is very poor." At the same time, she reports that "32 percent of the most highly rated,
viewed and discussed e-prints eventually appear in the journal literature, indicating the validity of the work submitted to
the CPS." Meanwhile, the two dominant publishers in chemistry -- Elsevier and the American Chemical Society -- in 2003 announced
an even closer collaboration:
Elsevier and two divisions of the American Chemical Society -- Chemical Abstracts Service (CAS) and Publications -- have announced
that they have agreed to provide linking between their services for scientists. Under the agreement:
- 1. Users of Elsevier products and services (such as Science Direct®, MDL® databases and ChemWeb) will be able to link directly
to ACS scientific journals
- 2. Users of CAS products and services (SciFinder, STN®, and others) will be able to link, via ChemPort, directly to Elsevier
scientific journals. [20]
Researchers in the life sciences adhere to the tradition of peer review prior to dissemination of research papers, but readily
deposit genetic sequences into GenBank®, the National Institutes of Health's annotated collection of all publicly available
DNA sequences. [21] The life sciences are also vigorously promoting open access to peer-reviewed literature and taking advantage of technology
and new pricing models (institution-based article-input fees as opposed to subscriber fees) to speed up dissemination. [22]
Although no examples of institutional archives were part of this review, MIT's DSpace has received attention for spurring the self-archiving movement. As reported in the New York Times, it "will have 5,000 items archived by this fall, and plans call for adding 7,500 theses later this year. MIT estimates that
its free software has been downloaded 3,400 times and says it is aware of 100 research institutions that are evaluating DSpace
with an eye toward archiving their own faculty's publications." [23] The United Kingdom has also announced its plans to develop a national archive of e-print papers available from OAI-compliant
repositories provided by UK universities and colleges. According to the UK plan:
Metadata will be harvested using the OAI protocol into a single database hosted by UKOLN at the University of Bath, and
will then be passed to external web services -- OCLC and the University of Southampton -- where the records will be enhanced
with subject classification, name authority, and citation analysis. The enhanced records will be returned to the central database
from where they may be harvested by institutions or academic subject gateways. The project is funded by the JISC FAIR program.
[See: http://www.rdn.ac.uk/projects/eprints-uk/]
Day [2003] reviews the status of institutional and subject-based repositories in the UK using Eprints.org software and corroborates
the assertion of Pinfield [2003] that more effort now needs to focus on actually populating repositories:
Setting up an institutional repository and designing collection management policies are relatively straightforward; populating
the repository is not. The content of institutional repositories needs to come largely from researchers within the institution,
and persuading them to submit this content is a major challenge. Self-archiving requires a cultural change amongst researchers
that can only be achieved through significant advocacy activity, and even then it will probably happen only gradually. [24]
While the e-print and self-archiving movement may be gaining momentum, there are still obstacles to overcome, namely acceptance
by authors in sufficient numbers to develop repositories of sufficient size to be of interest, and finding efficient ways
to manage copyright issues. Turning again to the United Kingdom, the Joint Information Systems Committee (JISC) funded a project
(through August 31, 2003), RoMEO (Rights Metadata for Open archiving), to investigate the rights issues surrounding the "self-archiving"
of research in the UK academic community under the OAI-PMH. According to RoMEO's Web site:
It will perform a series of stakeholder surveys to ascertain how 'give-away' research literature (and metadata) is used, and
how it should be protected. Building on existing schemas and vocabularies (such as Open Digital Rights Language) a series
of rights elements will be developed. A demonstrator system will then be created to show how rights metadata might be assigned,
disclosed, harvested, and displayed to end users via the OAI Protocol for Metadata Harvesting [http://www.lboro.ac.uk/departments/ls/disresearch/romeo/index.html].
Meanwhile, Van de Sompel stated in a 2003 interview that the Open Archives Initiative expects to set up a technical committee
soon in collaboration with the JISC RoMEO project, "in the realm of expressing rights statements about metadata and content
in the OAI framework."
6.2 Cross-Archive Search Services and Aggregators
This category consists of three general OAI service providers -- Arc, OAIster and Cyclades; three examples of community-based aggregators -- Networked Digital Library of Theses & Dissertations, Open Language Archives Community, and the Sheet Music Consortium; and four examples of subject-based repositories, one for cultural heritage and three in the sciences -- UIUC Digital Gateway to Cultural Heritage Materials, Grainger Engineering Library at University of Illinois at Urbana-Champaign, Citebase, and Archon. All of these services federate metadata of "varying degrees of richness" from heterogeneous sources, relying on the OAI-PMH,
and provide unified search and browse interfaces. They are all established or experimental in nature, typically with support
from external funding agencies. Most cover materials in multiple languages and formats. They represent
a mix of approaches to collection-level and item-level access. Several of the services have established special metadata standards.
Other differences are apparent in the level of sophistication of their search capabilities and their post-result processing
features.
6.2.1 General OAI Service Providers: Arc, OAIster, Cyclades
Arc, developed by Old Dominion University's Digital Library Research Group, is one of the first federated searching services
based on the OAI Protocol. [25] It serves as a technology demonstrator by harvesting from all OAI repositories without any limitations by the type or subject
of their holdings. As a result, it is the largest OAI service provider included in this review, currently harvesting from
163 archives, comprising a total of 6.4 million records, 4.3 million of which are derived from OCLC's XTCat (theses and dissertations
extracted from WorldCat). Arc serves as a testbed where Arc and other OAI service providers can experiment with the resulting federation. (For example, at present Arc is conducting a study on accession growth rates.) Arc is also "experimental" in the sense that it has no base funding to support a sustainable federation service. From a technical
standpoint, however, Arc is well established and its "code" for federated searching has proven to be extremely stable, robust, and error free. Its
harvesting and indexing software is available for download as Open Source software at sourceforge.net. [26] Arc's developers are committed to improving OAI services and are currently working on a major upgrade of the Open Source version
of Arc to enhance services for its community of users. [27]
Arc offers simple keyword searching with Boolean operators where the user can specify how to group (by Archive, Discovery Year,
or Subject) and sort the results (by Relevance Ranking or Discovery Date). Advanced searches permit Author, Title, and Abstract
searches where the user can indicate if all or any instances of the specified terms should be retrieved. Advanced searches
can also be filtered by Archive, Subject, Date Stamp, or Discovery Date. The subject filter includes an interactive feature
where the user can input a term and receive a listing of related subjects and their archive group affiliation, making it possible
to further refine the subject search. Arc also offers a "browse" feature that lists records in alphabetical order by archive group; however, browsing by subject or
year returns incomplete results. All search results can be displayed in summary or detailed views. The links on
the detail page lead the user to the particular document, which resides at the local host site. When there are multiple pages
of returns, the user can traverse them. [28] The
"Help" page provides information about how to search the site with an e-mail address for additional questions.
In practice, I encountered a number of problems in conducting searches.
- There is no overall collection policy or statement about the scope of coverage. If the user clicks on the "browse" feature,
it is possible to view all the "archive groups" in the left frame with their individual records appearing in the main body
of the page. However, the left frame listing of "archives groups" includes two consecutive alphabetical listings because those
beginning with capital letters are filed separately from those in lower case letters. As a result, at first blush, the user
would think that "arXiv" is not included in an Arc search.
- The names of the "archives groups" are typically cryptic and most of them don't carry any meaning for the general user. For
example, only specialists would know that "AIM25" provides collection-level descriptions of archives in London or that "CPS"
is the Chemistry Preprint Server. Arc makes some effort to supply the fuller names of these repositories through mouseovers on the list of abbreviated identifiers,
but this still leaves the user without many clues.
- By going to the "Administration" page, which is scarcely an intuitive choice, the user will also find a list of all the existing
archives, their "identifiers" and full names, along with the date on which they were last harvested. From this administrative
page, it is possible to link to the Web site of each of the archives, where the user can get an understanding of their scope
and coverage.
- Arc continues to harvest from both the current and prior versions of the OAI Protocol, resulting in duplicate archive groups.
For example, results are returned for two separate archives groups identified as "arXiv" (214,215 records) and "arXiv.org"
(240,164 records).
- Many returns don't actually link to full content, even at the host site. For example, the item "Harlem nocturne" only leads
to a description of Indiana University's DeVincent Sheet Music Collection without any direct link to the site. Even when the
user goes to this site and searches the database, there is no full content available, only the bibliographic record.
- It is not possible to revise a search.
- Searches return duplicate "hits" when an item is recorded by more than one repository or within a repository when it is still
represented in two versions (e.g., OAI-PMH 1.x and OAI-PMH 2.0).
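Two of the problems listed above, the split alphabetical listing and the duplicate hits, are the kind a service provider could address at indexing time. The following is a minimal sketch under stated assumptions, not a description of Arc's code: the record fields (identifier, archive, protocol_version) and the sample identifiers are hypothetical, and it assumes that an item harvested under both protocol versions keeps the same identifier.

```python
def sort_archive_groups(names):
    """Sort archive group names without filing capitals separately."""
    # str.casefold makes the comparison case-insensitive, so "arXiv"
    # files between "AIM25" and "CPS" instead of after every capital.
    return sorted(names, key=str.casefold)


def dedupe_records(records):
    """Keep one record per identifier, preferring the newer protocol version.

    Each record is a dict with the hypothetical keys "identifier",
    "archive", and "protocol_version" (a tuple such as (2, 0)).
    """
    best = {}
    for rec in records:
        key = rec["identifier"].casefold()
        kept = best.get(key)
        if kept is None or rec["protocol_version"] > kept["protocol_version"]:
            best[key] = rec
    return list(best.values())


# Made-up sample data for illustration only.
groups = ["CPS", "arXiv", "AIM25"]
harvested = [
    {"identifier": "oai:arXiv.org:hep-th/0001001", "archive": "arXiv",
     "protocol_version": (1, 1)},
    {"identifier": "oai:arXiv.org:hep-th/0001001", "archive": "arXiv.org",
     "protocol_version": (2, 0)},
    {"identifier": "oai:example.org:42", "archive": "CPS",
     "protocol_version": (2, 0)},
]

listing = sort_archive_groups(groups)
unique = dedupe_records(harvested)
```

This approach would not catch an item that different repositories record under different identifiers. Matching those would require comparing descriptive metadata such as title and author, which is considerably harder.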
OAIster, a project of the University of Michigan Digital Library Production Services, originally funded through a grant from the
Andrew W. Mellon Foundation [29], represents another broad-based OAI search service. Unlike Arc, however, it requires that the information resources described by the metadata have a corresponding Web-based digital representation
(e.g., records from Indiana's "Harlem nocturne" would not be retrieved from this site because this piece of sheet music itself
is not available in digital form). As a result of this requirement, OAIster's coverage is narrower than Arc's -- as of August 28, 2003, OAIster included over 1.5 million items from 197 "institutions." (This is an increase over the July 3rd harvest of 1.4 million records
from 189 institutions.) All searches result in links to digital objects. OAIster is exemplary in its efforts to provide context and elaboration about its scope and operation. The annotated listing of institutions
from which OAIster harvests offers the user basic information about each repository, along with the number of records harvested. Search functions
are unified into a single search page that permits varying degrees of refinement from basic keyword searching with Boolean
operators to searching within particular fields (Title, Author/Creator, Subject, Resource type). This latter category is especially
useful in that it permits the user to limit the search to all types or to text, image, audio, or video formats. Results can
be sorted by: title, author/creator, date descending or date ascending, and by hit frequency or weighted hit frequency.
Like Arc, OAIster displays results counts by institution and makes it possible to link to a specific institution's results in the left frame.
An immediate full view of each result avoids the double-clicking to "more information" required in Arc. Other strengths of OAIster's search capability are that it:
- provides the total number of returns
- permits users to revise the search
- permits post-search (re)sorting of results according to different criteria
- highlights the search term within the results
- offers ample "help" opportunities
- prominently acknowledges and explains the "duplicate records" problem
At this juncture, neither Arc nor OAIster offers post-result services such as printing, bookmarking, downloading, or incorporating the digital object into another
document or file, although OAIster notes that these are desired improvements. [30]
Some problems noted with OAIster:
- For the benefit of regular users, when updates are made to OAIster, it would be helpful if a "What's new" column informed users of institutions added or removed, along with the number of items
associated with these changes. Right now this information is purged from the database on a monthly basis so there's no record
of changes.
- The listing of "institutions" is somewhat problematic since the organization of the list is sometimes by the name of the service
rather than the institution, e.g., Theoretical and Applied Linguistics
(TAAL) Eprints Archive, University of Edinburgh, files under "Theoretical" not "University of Edinburgh."
- Individual archives within aggregators lose their identity and are not specified in the annotations by institution. For example,
the Open Language Archives Community (OLAC) has 25 registered archives. These are covered by OAIster collectively through OLAC, but the user needs to go to the OLAC site to determine what the 25 archives are. Meanwhile, in some instances the individual archives are also covered separately
by OAIster, e.g., Talkbank and Ethnologue.
OAIster invites participation from potential data providers, encouraging them to make their collections better known through OAIster and informing them about services OAIster will offer them if they need assistance in making their metadata OAI-enabled.
Cyclades, a registered OAI service provider, is a system designed to provide an "open collaborative virtual archive environment,"
which supports individual users and communities of users with the ability to conduct searches across large, heterogeneous,
multidisciplinary OAI-compliant archives. It features value-added services including ad hoc or profile-based user query and
browse functions; mechanisms to build meaningful collections dynamically; filtering and recommendation services; and community
work areas to support collaborative work. It also provides personal document and collection storage space. Users need to |