
4.0 Review of Resources


4.1 Points of Reference: Open Access and the Open Archives Initiative

While OAI and Open Access are not synonymous, the Open Access movement relies heavily on the OAI protocol as the mechanism for communicating the availability of OA resources. Publishing in Open Access journals and self-archiving in OA archives are specified by the Budapest Open Access Initiative (BOAI), and further bolstered by the Berlin Declaration, as the major ways to make manifest OA research output. Moreover, institutional repositories (typically OA and OAI-compliant) are increasingly accepted as an essential component of a university's scholarly infrastructure (Lynch 2003).

When the 2003 report was written, it was difficult to identify OA (and OAI-compliant) journals and repositories. The Directory of Open Access Journals (DOAJ), launched in May 2003, marked an initial step towards making OA journals better known, but it was still in an early stage of development. In addition, there was no easy way for authors to identify the copyright policies and self-archiving regulations of publishers. Discovery of OA repositories was even more problematic. Assembling a composite picture was painstaking and idiosyncratic, made possible only by triangulating data gathered from multiple sources: the official Open Archives Initiative's voluntary registry of OAI data and service providers, the technical OAI Repository Explorer validation system, and aggregators such as Arc and OAIster.

Noting numerous difficulties in identifying OAI-compliant repositories and the deleterious impact on data providers, service providers and their users, the 2003 DLF report called for a user-friendly comprehensive registry (Brogan 2003, 75). In the intervening years, the situation has changed markedly. The registries, directories and indexes under consideration here are the visible manifestation of OA and OAI growth.

New to the scene, the University of Illinois OAI-PMH Data Provider Registry now ably serves as a comprehensive interactive OAI identification system. Indeed, from a technical standpoint, registries have become an essential component of digital library architecture, covering a wide spectrum of functions. Two new projects, one in the UK and the other in the US, are working in tandem to develop a framework for DL service registries that will help to automate the discovery of DL content and services (see sections 2.1.1 and 3.2). Meanwhile, two OA repository registries, geared towards improving communication between developers, researchers and authors, have been developed in the UK. Concomitantly, the DOAJ has extended its services to include article-level access, and two new directories now monitor journal and publisher copyright and self-archiving permissions (one outcome of the UK's RoMEO project cited in 2003). Arc continues to serve as a test bed for improving and extending OAI applications. OAIster, on the other hand, has become the de facto leader as a global OAI service provider, dispensing item-level digital content to end-users.

In addition to discussing these major services, this section reviews three new consortial metadata aggregators (the CIC Metadata Portal and the DLF's OAI and MODS portals) and then turns to Germany as an exemplar of nationally-based OAI services. Critical issues and future directions round out the review of these services.

Interlocking Purposes

Collectively, the registries, directories and indexes under review serve the following purposes for an audience ranging from data and service providers to researchers, authors and end-users:

  • Raise awareness and visibility within the technical community so digital resources (or metadata) are publicized and harvested.
  • Offer technical validation systems to test OAI-PMH conformance.
  • Serve as a test bed for research and development to improve future OAI services.
  • Improve communication between data and service providers.
  • Provide mechanisms for the developer community to stay current (through email forums or RSS feeds).
  • Promote Open Access principles and promulgate institutional policies adhering to the BOAI and Berlin Declaration.
  • Publicize repositories upholding OA principles.
  • Monitor the status, growth and function of OA implementations across time, country, type of media and software.
  • Inform authors of OA journals or repositories where they can publish (or self-archive) their research output, thereby increasing its impact.
  • Inform authors of institutional or journal policies pertinent to self-archiving or copyright permissions.
  • Serve as a comprehensive directory of OA institutional participants and a feedback loop for constituents from developers to end-users.
  • Provide end-users with full-text article or digital object-level access to academic resources in a timely way through reliable services.
  • Monitor the impact of OA and OAI adoption and use.


Table 05: Summary of General OA and OAI Services: Size, Goal, and Core Audience (March 14, 2006)

TECHNICAL REGISTRIES

Open Archives Initiative
http://www.openarchives.org/
Size: 404 data providers and 23 service providers
Goal: Official, voluntary registry of OAI data and service providers to facilitate awareness, technical compliance, and community participation.
Core Audience: Developers

University of Illinois OAI-PMH Data Provider Registry
http://gita.grainger.uiuc.edu/registry/
Size: 1,047 repositories (955 actively responding)
Goal: Comprehensive interactive registry and database of OAI implementations for discovery, technical perusal, and community development.
Core Audience: Developers

DIRECTORIES OF OA JOURNALS AND SELF-ARCHIVING POLICIES

DOAJ: Directory of Open Access Journals
http://www.doaj.org/
Size: 2,113 journals of which 567 provide access to 90,710 articles
Goal: Authoritative, comprehensive directory of scholarly journals adhering to BOAI open access principles with growing body of article-level access.
Core Audience: Service providers (libraries, aggregators, metadata harvesters), researchers, and authors.

Publisher Copyright Policies & Self-Archiving: SHERPA/RoMEO List
http://www.sherpa.ac.uk/romeo.php
Size: 135 publishers and circa 9,000 journals
Goal: List of copyright and self-archiving policies of scholarly journals and publishers. As part of SHERPA, the British Library provides current information about the link between publishers and particular journal titles.
Core Audience: Authors

Journal Policies: Self-Archiving Policies of Journals
http://romeo.eprints.org/
Size: 129 publishers and 8,698 journal titles
Goal: Directory of scholarly journals and publisher self-archiving policies extracted from SHERPA/RoMEO data. Extensive distinctive statistical data with literature citations in support of self-archiving and OA publishing to strengthen impact of research output.
Core Audience: Authors

DIRECTORIES OF OA REPOSITORIES

ROAR: Registry of Open Access Repositories
http://archives.eprints.org
Size: 640 archives from 40 countries and 3,728,201 OAI records (from 480 archives in Celestial)
Goal: Registry to monitor overall growth in the number of e-print archives and maintain a list of GNU EPrints sites.
Core Audience: E-print community of developers and researchers.

ROARMAP: Registry of Open Access Repository Material Archiving Policies
http://www.eprints.org/openaccess/policysignup/
Size: 18 policies from 9 countries plus 1 European research agency
Goal: Directory of institutions with self-archiving policies with associated deposit growth charts, model statements and rationale in support of BOAI and Berlin Declaration.
Core Audience: OA research community.

OpenDOAR: Directory of Open Access Repositories (under development)
http://www.opendoar.org/
Size: 355 repositories
Goal: Comprehensive and authoritative list of institutional, subject- and funder-based repositories.
Core Audience: Developers and researchers.

CROSS-ARCHIVE SEARCH SERVICES AND INDEXES

Arc: Cross-Archive Search Service
http://arc.cs.odu.edu/
Size: 177 archives and more than 7 million records
Goal: The first implementation of a hierarchical OAI harvester (aggregator) serves as a cross-archive search service and R&D test bed to improve OAI services.
Core Audience: Developers

OAIster
http://www.oaister.org/
Size: 597 institutions from 42 countries and more than 7 million records
Goal: Search and discovery service providing access to OAI item-level digital objects, including some licensed restricted access materials.
Core Audience: Researchers

Technical registries serving a range of purposes are rapidly becoming key components of the standards and technology infrastructure supporting digital libraries. Facilitating interoperability through a low-barrier protocol, the official Open Archives Initiative site does not require either data or service providers to register in order to implement the protocol. Registration is optional and many developers simply do not take the time. In other instances they may deliberately choose not to register because the service is not yet in full production; they do not wish to publicize the availability of their resources; or they already have a known clientele. Registration, however, is not merely a matter of publicizing a new repository or service; it also typically entails testing archives for compliance with the OAI protocol. This helps to validate that metadata is appropriately configured to meet at least minimal standards for harvesting. OAIster, for example, requests new data providers to follow a series of steps before contacting them to harvest new content. OAIster's guidelines include official registration with the Open Archives Initiative where new data providers can obtain the OAI foundational documents, access basic OAI tools, and join community services, consisting of email forums and other registries for data and service providers. As a final step prior to contacting OAIster, new data providers are asked to email the administrator of the University of Illinois's Registry (described below) thereby helping to ensure that it has a complete listing of OAI repositories. OAIster's implementation steps help to reinforce the role and function of different OAI registries and validating services, leading to a more cohesive community of practice.

4.1.1 University of Illinois OAI-PMH Data Provider Registry

Developed under the auspices of the DLF's IMLS National Leadership Grant described earlier in this report, this registry primarily serves as a tool for OAI harvesters to discover and effectively use content in repositories upon which developers can build services. The UI Registry (announced in October 2003) strives to be comprehensive and deploys a systematic multi-faceted approach (that goes beyond self-registration) to achieve the goal of completeness (Habing et al. 2004; Shreeves et al. 2005). As of mid-May 2006, the Registry, with 1,042 repositories, is the most complete and useful OAI data provider discovery service for developers.

It automatically harvests an array of data elements from each repository, making "it possible to search for OAI repositories using various criteria and browse through different views of the registry [e.g., sets, metadata formats, records, identifiers, subjects] without any manual cataloging of the various OAI repositories" (Shreeves et al. 2005, 581). A new enhanced OAI data provider has been developed for the registry so that harvesters can obtain not only the simple Dublin Core records that describe each repository, but also the much richer information that has been created manually, along with the repository descriptions imported from OAIster (Cole and Habing 2006). [[24]] The metadata format for these richer descriptions conforms to the schema developed for UIUC's IMLS Digital Collections & Content project (see section 4.4.2). UIUC has also developed an OAI gateway application that provides a single point of harvest for all DLF-member repositories. Beyond the convenience of harvesting from a single base URL, individual repositories are organized as sets within the gateway, with their own sets organized as subsets. Because each of these sets and subsets has rich collection-level metadata derived from the registry, harvesters can easily associate collection-level metadata with individually harvested items. The DLF member OAI data providers are cataloged and browsable by GEM (Gateway to Education Materials) and LCSH (Library of Congress Subject Headings).
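To make the set-and-subset arrangement concrete, the following is a minimal sketch of how a harvester might address one repository, or one of its subsets, through a single gateway base URL. The `verb`, `metadataPrefix` and `set` arguments and the colon separator for hierarchical set names come from OAI-PMH itself; the base URL and the set names `repoA` and `maps` are hypothetical placeholders, not the gateway's actual values.

```python
from urllib.parse import urlencode

# Hypothetical gateway address, used only for illustration.
GATEWAY_BASE_URL = "http://example.org/oai-gateway"


def subset_spec(repository_set, *subsets):
    """Join a repository set and its subsets with ':', the OAI-PMH hierarchy separator."""
    return ":".join((repository_set,) + subsets)


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Harvest only the (hypothetical) "maps" subset of the (hypothetical) repository "repoA".
url = list_records_url(GATEWAY_BASE_URL, set_spec=subset_spec("repoA", "maps"))
print(url)
```

Because every harvested record carries its setSpec in the record header, a harvester working this way can look up the matching collection-level description by set name.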

The UI Registry and OAIster collaborate to improve communication between OAI data and service providers, while also enhancing their respective services. Initially OAIster provided UI with additional metadata about all of its OAI repositories (e.g., title, description, home page, and historical record counts) and now it refers new data providers to UI for registration and validation before harvesting their metadata. This helps to ensure fuller coverage via the UI Registry while also resolving some technical validation problems prior to harvesting by OAIster. OAIster also sends its historical data to the Registry on a monthly basis. This makes it possible to access growth graphs for many repositories, although it does not match ROAR's growth charts in terms of user-friendliness and access. The Registry's syndication service (RSS) alerts users to recent changes, listing modifications and new additions over the past 30 days. In addition to OAI-PMH and RSS export functionality, it also supports the SRU protocol (CQL subset). UI is also developing Web-based search and browse interfaces for an OAI service provider registry that will list services developed from harvesting data via the OAI-PMH. Eventually, UI hopes to link the OAI service providers in the database to the OAI data providers from which they harvest. Project news, presentations, and documents, including the cataloging procedures and guidelines used for the DLF collections, are available at the Registry's Web site.
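As an illustration of what SRU access with a CQL subset looks like in practice, here is a minimal sketch that assembles a searchRetrieve request. The parameter names (`operation`, `version`, `query`, `maximumRecords`) are standard SRU 1.1; the base URL is a hypothetical placeholder, and the sketch assumes only bare terms and the boolean `and`, since the indexes the Registry actually exposes are not described here.

```python
from urllib.parse import urlencode

# Hypothetical SRU endpoint, used only for illustration.
SRU_BASE_URL = "http://example.org/registry/sru"


def cql_and(*terms):
    """Combine search terms with the CQL boolean 'and', quoting multi-word phrases."""
    quoted = ['"%s"' % term if " " in term else term for term in terms]
    return " and ".join(quoted)


def search_retrieve_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


query = cql_and("dspace", "digital library")
sru_url = search_retrieve_url(SRU_BASE_URL, query)
print(query)
print(sru_url)
```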

4.1.2 DOAJ: Directory of Open Access Journals


Update Table 01: DOAJ based on DLF Survey responses, Fall 2005
DOAJ (Directory of Open Access Journals)
http://www.doaj.org/
ORGANIZATIONAL MODEL: Hosted, maintained and partly funded by Lund University Libraries Head Office. Other current sponsors: Open Society Institute, SPARC Europe, BIBSAM (National Library of Sweden), Axiell AB.
SUBJECT: Cross-disciplinary
FUNCTION: Covers free, full-text, quality-controlled scientific and scholarly journals. All subjects and languages.
PRIMARY AUDIENCE: Service Providers, Research Community
STATUS: Established
SIZE: 1,909 journals of which 467 are searchable at article level, comprising 80,687 articles.
USE: No response
ACCOMPLISHMENTS:
1. Article metadata search.
2. Journal owner admin functions.
3. OAI-harvesting on both journal and article level.
CHALLENGES:
1. Add more content.
2. Include OA articles from hybrid journals.
TOOLS OR RESOURCES NEEDED: No tools needed.
GOALS OF NEXT GENERATION RESOURCE: Dissemination

Launched in May 2003 with 350 journals, DOAJ included more than 1,900 titles by December 2005 and quickly surpassed the 2,000 mark in early 2006. Article-level searching was introduced in June 2004 and as of mid-March 2006 exceeded 80,000 articles. According to the DOAJ Web site, "The Directory aims to be comprehensive and cover all open access scientific and scholarly journals that use a quality control system to guarantee the content." It defines open access journals as those that "use a funding model that does not charge readers or their institutions for access" and its selection criteria uphold readers' rights, as put forward in the BOAI principles, to "read, download, copy, distribute, print, search, or link to the full texts of these articles." In early 2006, DOAJ updated its selection criteria based on feedback from users.

    DOAJ Selection Criteria
  • Coverage:
    • Subject: all scientific and scholarly subjects are covered.
    • Types of resource: scientific and scholarly periodicals that publish research or review papers in full text.
    • Acceptable sources: academic, government, commercial, non-profit private sources are all acceptable.
    • Level: the target group for included journals should be primarily researchers.
    • Content: a substantive part of the journal should consist of research papers. All content should be available in full text.
    • All languages.
  • Access:
    • All content freely available.
    • Registration: Free user registration online is acceptable.
    • Open Access without delay (e.g. no embargo period).
  • Quality:
    • Quality control: for a journal to be included it should exercise quality control on submitted papers through an editor, editorial board and/or peer-review.
  • Periodical:
    • The journal should have an ISSN (International Standard Serial Number, for information see http://www.issn.org/).
(Source: http://www.doaj.org/articles/questions#definition)


Table 06: DOAJ Journal Subject Coverage (March 6, 2006)

| Subject | DOAJ Titles (N = 2,081) | Percent of Total |
|---|---|---|
| Agriculture and Food Sciences | 122 | 5.9% |
| Arts and Architecture | 52 | 2.5% |
| Biology and Life Sciences | 231 | 11.1% |
| Business and Economics | 56 | 2.7% |
| Chemistry | 52 | 2.5% |
| Earth and Environmental Sciences | 159 | 7.6% |
| General Works-Multidisciplinary | 26 | 1.2% |
| Health Sciences | 711 | 34.2% |
| History and Archaeology | 93 | 4.5% |
| Languages and Literatures | 120 | 5.8% |
| Law and Political Science | 90 | 4.3% |
| Mathematics and Statistics | 95 | 4.6% |
| Philosophy and Religion | 71 | 3.4% |
| Physics and Astronomy | 77 | 3.7% |
| Science General | 6 | 0.3% |
| Social Sciences | 483 | 23.2% |
| Technology and Engineering | 159 | 7.6% |

The DOAJ subject classification is expandable and offers links from topical categories to the journal titles. The two largest sub-categories are Medicine (General) with 194 titles (in Health Sciences) and Education with 148 titles (in Social Sciences). Users can search for journals via keywords or browse by title or subject. The article database supports basic Boolean operators to connect keyword or phrase searches across all fields or limited to title, journal title, author, ISSN, keyword or abstract. A search for articles using the keyword <tsunami> retrieves 13 documents, all with 2005 and 2006 publication dates. The entries provide basic bibliographic citations with the option to view the record or the full text article.

Information about harvesting DOAJ journal and article-level metadata (initiated in July 2004) as well as restrictions on metadata usage (DOAJ is licensed under the Creative Commons Attribution-ShareAlike License) is provided at the Web site's FAQ. DOAJ supports harvesting of broad subject-based sets. DOAJ actively solicits monetary contributions from users to continue to improve its functionality and keep it in continuous operation.
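For readers unfamiliar with what set-based harvesting returns, the sketch below parses an OAI-PMH ListIdentifiers response with Python's standard library. The XML namespace and the `header`, `identifier`, `setSpec` and `resumptionToken` elements are defined by OAI-PMH; the sample response, its identifiers and the set name `subjectA` are invented for illustration and are not taken from DOAJ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response; identifiers and set names are placeholders.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:journal/1</identifier>
      <datestamp>2006-03-01</datestamp>
      <setSpec>subjectA</setSpec>
    </header>
    <header>
      <identifier>oai:example.org:journal/2</identifier>
      <datestamp>2006-03-02</datestamp>
      <setSpec>subjectA</setSpec>
    </header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
"""


def parse_list_identifiers(xml_text):
    """Return ([(identifier, setSpec), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text.strip())
    headers = [
        (
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:setSpec", namespaces=OAI_NS),
        )
        for header in root.findall(".//oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # An empty or missing token means the list is complete.
    return headers, (token or None)


headers, token = parse_list_identifiers(SAMPLE_RESPONSE)
print(headers)
print(token)
```

A harvester would repeat the request with the returned token until no token comes back.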

4.1.3 Directories of Journal and Publisher Copyright and Self-Archiving Policies

While DOAJ identifies Open Access journals and publishers it does not disclose their copyright or self-archiving policies. Authors can use the SHERPA/RoMEO List of Publisher Copyright Policies and Self-archiving to "find a summary of permissions that are normally given as part of each publisher's copyright transfer agreement." [[25]] The directory, hosted by the University of Nottingham, is searchable by journal title or publisher. Publishers are assigned a color code that reflects whether permission is granted to self-archive and at what stage in the publication process. According to the site's summary statistics in May 2006, 78 percent of the 154 publishers officially allow some form of self-archiving. An API is being developed to allow repository administrators and others to interface with the database, possibly as a stage in a repository's ingest procedure or similar process. The information is available for downloading by interested parties by special arrangement: for example, the listing hosted by Eprints.org is based on the SHERPA/RoMEO information. Reports and publications emanating from SHERPA affiliated projects and research are available from its Web site, http://www.sherpa.ac.uk/guidance/advocacy.html#reports.


Table 07: Statistics for the 135 publishers on SHERPA/RoMEO (March 2006)

| RoMEO color | Archiving policy | Publishers | % |
|---|---|---|---|
| Green | Can archive pre-print and post-print | 59 | 44 |
| Blue | Can archive post-print (i.e. final draft post-refereeing) | 30 | 22 |
| Yellow | Can archive pre-print (i.e. pre-refereeing) | 14 | 10 |
| White | Archiving not formally supported | 32 | 24 |

Source: http://www.sherpa.ac.uk/romeo.php?stats=yes (March 18, 2006)
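The color scheme in Table 07 amounts to a two-flag classification, which can be sketched as follows. The function name and flag names are illustrative only and do not describe the SHERPA/RoMEO database or its planned API; the publisher counts are those given in Table 07.

```python
def romeo_color(can_archive_preprint, can_archive_postprint):
    """Map a publisher's pre-print and post-print permissions to a RoMEO color."""
    if can_archive_preprint and can_archive_postprint:
        return "Green"
    if can_archive_postprint:
        return "Blue"
    if can_archive_preprint:
        return "Yellow"
    return "White"


# Publisher counts from Table 07 (March 2006).
PUBLISHER_COUNTS = {"Green": 59, "Blue": 30, "Yellow": 14, "White": 32}

total_publishers = sum(PUBLISHER_COUNTS.values())
percent_by_color = {
    color: round(100 * count / total_publishers)
    for color, count in PUBLISHER_COUNTS.items()
}

print(romeo_color(True, False))
print(total_publishers, percent_by_color)
```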

EPrints.org has developed a similar directory, based on SHERPA/RoMEO's data, of journals that have and have not already given "their green light to author self-archiving." Under rapid development, as of mid-March 2006, it contains 136 publishers and almost 8,900 journals. In contrast to the SHERPA/RoMEO list, journals are given one of three different color codes:

  • Green: Permits post-print self-archiving
  • Pale Green: Permits preprint self-archiving
  • Grey: Does not permit self-archiving

The site maintains summary statistics by journal as well as publisher. Amalgamating green and pale-green publishers results in 76 percent of publishers officially permitting self-archiving (the equivalent of SHERPA/RoMEO's green plus blue plus yellow publishers). In contrast to SHERPA's list, however, the EPrints.org site also provides the data based on journal titles, which yields a much higher rate of self-archiving permission: 93 percent of the 8,265 journals are listed as "green" (69 percent full green and 24 percent pale green). A more detailed statistics page highlights and updates the findings of seminal studies about self-archiving (Swan and Brown, 2004a,b; 2005; Harnad et al. 2004; and Harnad and Brody 2004) with charts depicting the current proportion of toll-access and OA articles and the current potential for immediate OA provision. [[26]]
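The equivalence between the two color schemes can be checked with a short calculation. The mapping below is an assumption inferred from the definitions given in this section (RoMEO green and blue both permit post-print archiving, yellow permits pre-print only); it is not taken from EPrints.org's own code, and the counts are the March 2006 SHERPA/RoMEO figures from Table 07 rather than EPrints.org's data.

```python
# Assumed mapping from RoMEO colors to the three EPrints.org colors.
ROMEO_TO_EPRINTS = {
    "Green": "Green",        # pre-print and post-print permitted
    "Blue": "Green",         # post-print permitted
    "Yellow": "Pale Green",  # pre-print only
    "White": "Grey",         # self-archiving not formally supported
}

# Publisher counts from Table 07 (SHERPA/RoMEO, March 2006).
ROMEO_COUNTS = {"Green": 59, "Blue": 30, "Yellow": 14, "White": 32}

eprints_counts = {}
for romeo_color, count in ROMEO_COUNTS.items():
    eprints_color = ROMEO_TO_EPRINTS[romeo_color]
    eprints_counts[eprints_color] = eprints_counts.get(eprints_color, 0) + count

total = sum(ROMEO_COUNTS.values())
permitting = eprints_counts["Green"] + eprints_counts["Pale Green"]
permitting_percent = round(100 * permitting / total)

print(eprints_counts)
print(permitting, total, permitting_percent)
```

Applied to Table 07, the amalgamated share comes to 103 of 135 publishers, or about 76 percent, which is consistent with the publisher-level figure reported for the EPrints.org list.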

Comparative Coverage: OA Journal Directories and Databases

It seems reasonable to expect that SPARC's OA journal titles would be well-represented in these OA journal directories, but a comparison of sample SPARC journal titles reveals inconsistent and incomplete coverage.


Table 08: SPARC Open Access Journals Represented in DOAJ, PubMed Central, SHERPA/RoMEO, EPrints.org List and EZB (March 18, 2006)

| SPARC Open Access Journals | DOAJ | PubMed Central | SHERPA/RoMEO Publisher Policies | EPrints.org Journal Policies | EZB |
|---|---|---|---|---|---|
| Documenta Mathematica | Journal only | N/A | Not listed | Not listed | Green |
| Economics Bulletin | Journal only | N/A | Not listed | Not listed | Green |
| Geometry & Topology and Algebraic & Geometric Topology | Not listed | N/A | Title only | Not listed | Green |
| Journal of Insect Science | Yes, with content | Immediate free and OA without delay | Title only | Not listed | Green |
| Journal of Machine Learning Research | Journal only | N/A | Yellow | Pale-Green | Green |
| New Journal of Physics | Journal only | N/A | Not listed | Not listed | Green |
| Optics Express | Journal only | N/A | Not listed | Green | Green |
| PLoS Biology; PLoS Computational Biology; PLoS Genetics; PLoS Medicine; PLoS Pathogens | Yes, all titles, with content | Immediate free and open access without delay | Green; Not listed; Not listed; Title only; Not listed | Green; Not listed; Not listed; Green; Not listed | All 5 Green |
| BioMed Central | Yes, with majority of content | Immediate free and OA without delay except for five titles with 24-month delayed access to non-research articles | Green | Green: 144 journals | Green |
| Project Euclid Journals: 6 OA titles (25 titles partially open after 3-5 years and 9 titles by subscription) | Of 6 OA titles: 4 not listed; 2 journal only | N/A | Of 6 OA titles: 1 Green; 1 title only; 4 not listed | Of 6 OA titles: 2 Green; 4 not listed | All 6 Green |

All of the PubMed Central titles indicated as free and open access by SPARC also have article-level access in DOAJ. However, with the exception of the fully-represented BioMed Central titles, coverage of other PMC titles is uneven in the two self-archiving policy directories. Of the six OA Project Euclid journal titles, only two are listed in DOAJ; one title is identified as green in SHERPA and two in EPrints.org. Among the sample OA titles, five are not listed in DOAJ; fifteen are either not listed or represented by title only (without any corresponding self-archiving policy information) in SHERPA; and twelve are not covered in EPrints.org. The German database of e-journals, EZB (Elektronische Zeitschriftenbibliothek), is the only source to contain all of SPARC's OA titles; moreover, they are correctly annotated in cases where only specific years are OA. EZB's coverage and coding scheme are described more fully below (see 4.1.9), but EZB does not include journal or publisher self-archiving policies.

4.1.4 ROAR: Registry of Open Access Repositories

Launched in fall 2003, ROAR (formerly known as the Institutional Archives Registry) has two main functions: "(1) to monitor overall growth in the number of e-print archives and (2) to maintain a list of GNU EPrints sites (the software the University of Southampton has designed to facilitate self-archiving)." [[27]] The ROAR FAQ lays out the goals for coverage, emphasizing OA and OAI-compliant research documents, predominantly preprints, postprints of peer-reviewed journal articles, or dissertations. In practice, it has few editorial exclusions. [[28]]

Beyond research papers, ROAR includes other formats; for example, the University of Southampton's Crystal Report Structure Archive (http://ebank.eprints.org/), a repository that utilizes EPrints.org software to archive datasets "generated during the course of a structure determination from a single crystal x-ray diffraction experiment." It also includes 46,000 records from the Biblioteca "Dr. Jorge Villalobos Padilla, S.J." at the Instituto Tecnológico y de Estudios Superiores de Occidente (ITESO), Mexico, which OAIster excludes because, it reports, many items refer to SFX links and hence are not really OA. As stated elsewhere in this report, there are many "grey" areas in OAI-harvesting that make it difficult to reach uniform decisions about such parameters as "freely available" or "Open Access."

ROAR is a useful tool for analyzing the characteristics, size, and growth within and across OA e-print archives around the world. Archives are classified by country, system software, and content type. Searches can be filtered by any combination of these fields (e.g., Research Cross-Institution archives using DSpace in Belgium) and sorted by Name, Datestamp, or Total OAI Records. Results provide an annotated entry about the resource with links to the source site, an estimate of the percent of its content that is freely accessible, full text summary graphs charting its growth over time, and a thumbnail of the service's Web site.

Source: http://archives.eprints.org/ (February 28, 2006)
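The combination of filters and sort orders described above can be sketched with a few lines of code. This is an illustration of the search behaviour only, not ROAR's implementation; the `Archive` fields and every sample entry are invented placeholders.

```python
from dataclasses import dataclass


@dataclass
class Archive:
    """One registry entry; all sample values below are invented for illustration."""
    name: str
    country: str
    software: str
    archive_type: str
    total_oai_records: int


SAMPLE_ARCHIVES = [
    Archive("Archive A", "Belgium", "DSpace", "Research Cross-Institution", 1200),
    Archive("Archive B", "Belgium", "DSpace", "Research Cross-Institution", 300),
    Archive("Archive C", "Belgium", "GNU EPrints", "e-Theses", 950),
    Archive("Archive D", "Sweden", "DiVA", "e-Theses", 4100),
]


def filter_archives(archives, sort_by="name", **criteria):
    """Keep archives matching every given field value, then sort by one field."""
    matches = [
        archive for archive in archives
        if all(getattr(archive, field) == value for field, value in criteria.items())
    ]
    return sorted(matches, key=lambda archive: getattr(archive, sort_by))


# Research Cross-Institution archives using DSpace in Belgium, sorted by record count.
results = filter_archives(
    SAMPLE_ARCHIVES,
    sort_by="total_oai_records",
    country="Belgium",
    software="DSpace",
    archive_type="Research Cross-Institution",
)
for archive in results:
    print(archive.name, archive.total_oai_records)
```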

The Browse feature gives composite record counts by three major parameters: country, archive type and software. Record counts are limited to those archives registered and successfully harvested by Celestial; the figures are not restricted to full-text items but reflect all metadata records.


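Table 09: ROAR Statistics of Archive Type

| Archive Type | Archives | In Celestial | Records | Mean | Median |
|---|---|---|---|---|---|
| Research Institutional or Departmental | 314 | 248 | 757,286 | 3,054 | 272 |
| e-Journal/Publication | 66 | 43 | 172,905 | 4,021 | 120 |
| Research Cross-Institution | 63 | 54 | 1,792,048 | 33,186 | 569 |
| e-Theses | 63 | 52 | 333,097 | 6,406 | 674 |
| Demonstration | 24 | 12 | 5,533 | 461 | 28 |
| Database | 11 | 5 | 2,056 | 411 | 160 |
| Other | 94 | 60 | 601,345 | 10,022 | 176 |
| TOTAL | 635 | 474 | 3,664,270 | 57,561 | 1,999 |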

Source: http://archives.eprints.org/index.php?action=browse (February 28, 2006)

ROAR's categorization of "archive types" is unique. Given its focus on e-prints, it is not surprising to find that "research institutional or departmental" deployments account for nearly half of ROAR's archives. There is little doubt that this broad category also subsumes some e-journal/publication and e-theses content. These three categories combined account for 70 percent of the archives but only 37 percent of the records, whereas "research cross-institution" accounts for less than 10 percent of the archives but nearly 50 percent of the records. The record count could be quite different if all the archives were fully represented in Celestial or if the archives in the "other" category (94) were assigned to a discrete category. [[29]]


Table 10: ROAR Statistics of System Software Deployments

| System Software (# of deployments, if readily available from software Web site) [[30]] | Software Web site | Archives | In Celestial | Records | Mean | Median |
|---|---|---|---|---|---|---|
| GNU EPrints (UK) (198) | http://www.eprints.org/software/archives/ | 196 | 176 | 120,513 | 685 | 164 |
| DSpace (USA) (136) | http://wiki.dspace.org/DspaceInstances/ | 131 | 82 | 175,227 | 2,137 | 403 |
| Bepress [Digital Commons] (44) | http://www.umi.com/proquest/digitalcommons/ | 43 | 25 | 58,178 | 2,327 | 504 |
| ETD-db (USA) | http://scholar.lib.vt.edu/ETD-db/ | 22 | 18 | 263,364 | 14,631 | 1,295 |
| OPUS: Open Publications System (Germany) (39) | http://elib.unistuttgart.de/opus/doku/about.php?la=en | 21 | 18 | 5,073 | 282 | 79 |
| DiVA (Sweden) (15) | http://www.diva-portal.org/about.xsql | 14 | 13 | 8,966 | 690 | 387 |
| CDSware (Switzerland) | http://cdsware.cern.ch/cdsware/overview.html | 8 | 5 | 103,201 | 20,640 | 3,339 |
| ARNO (Netherlands) (6) | http://arno.uvt.nl/~arno/site/ | 5 | 4 | 171,402 | 42,851 | 16,801 |
| DoKS: Document & Knowledge Sharing (Belgium) | http://doks.khk.be/wiki/index.php/Main_Page | 3 | 3 | 2,170 | 723 | 226 |
| HAL: Hyper articles en Ligne (France) | http://hal.ccsd.cnrs.fr/index.php | 3 | 3 | 52,650 | 17,550 | 1,089 |
| Fedora (USA) (32) | http://www.fedora.info/community/ | 2 | 2 | 208 | 104 | 104 |
| eDoc (Greece) | http://www.edocplus.com/company/overview.htm | 2 | 2 | 39,770 | 19,885 | 19,885 |
| MyCoRE (Germany) | http://www.mycore.de/ | 1 | 1 | 1,935 | 1,935 | 1,935 |
| Other software (various) | | 184 | 122 | 2,661,613 | 21,817 | 595 |
| TOTAL | | 635 | 474 | 3,664,270 | 146,257 | 46,806 |

Source: http://archives.eprints.org/index.php?action=browse (February 28, 2006)

ROAR offers easy access to information about which archives utilize specified system software. Almost all the archives deploying a handful of major repository software systems are fully represented in ROAR (e.g. GNU EPrints, DSpace, DiVA, ARNO, Digital Commons bepress). Although there are myriad IR systems in use worldwide, it would be helpful if more of the archives falling into the "other software" were reviewed and either placed into an existing or newly-created software category (e.g. Arc; Archimède; digitAlexandria-FreeScience and Archivemaker; DLXS; and the Public Knowledge Project's Open Journals and Open Conference Systems). At present the "other category" represents 29 percent of archives and a whopping 73 percent of ROAR's records. Among the top twenty largest archives in ROAR, thirteen presently fall into the "other software" category (e.g., CiteSeer, PubMed Central, arXiv, Library of Congress's American Memory).
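The shares quoted for the "other software" category follow directly from Table 10, as the short calculation below shows. It also checks two rows of the Mean column on the assumption (inferred from the figures, not stated by ROAR) that the mean is the record count divided by the number of archives harvested by Celestial.

```python
# Figures taken from Table 10 (ROAR, February 28, 2006).
TOTAL_ARCHIVES = 635
TOTAL_RECORDS = 3_664_270
OTHER_ARCHIVES = 184
OTHER_RECORDS = 2_661_613


def percent(part, whole):
    """Return part as a whole-number percentage of whole."""
    return round(100 * part / whole)


other_archive_share = percent(OTHER_ARCHIVES, TOTAL_ARCHIVES)
other_record_share = percent(OTHER_RECORDS, TOTAL_RECORDS)

# Assumed definition: mean = records / archives in Celestial.
eprints_mean = round(120_513 / 176)
dspace_mean = round(175_227 / 82)

print(other_archive_share, other_record_share)
print(eprints_mean, dspace_mean)
```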

As advocates of self-archiving and the Open Access principles set forth in BOAI and the Berlin Declaration, ROAR also operates a registry of institutional self-archiving policies, recently renamed ROARMAP (Registry of Open Access Repository Material Archiving Policies). As of mid-May 2006, 19 institutions in nine countries and one European-wide research institution had registered a policy commitment. Each entry includes a link to the institutional repository, its growth data, and details about its OA policy. Five institutions mandate self-archiving: CERN, University of Southampton, Queensland University of Technology, University of Minho, and University of Zurich. ROARMAP includes model self-archiving policy statements; model policies for national and private research funding agencies are also presented.

In May 2006, ROAR announced two new developments. First, in addition to the RSS, plain-text and ListFriends exports, its records (not their content) became OAI-compliant, initially available as Dublin Core. Second, as part of the Preserv project (http://preserv.eprints.org), it added support for content profiling of institutional repositories, available for most GNU EPrints and DSpace repositories. From a ROAR entry, users can follow the Preserv Profile link for any repository with a functioning (and registered) OAI interface. This generates a graph showing the breakdown of all file formats contained in the repository. Users can click on a format's red bar to obtain a complete listing of identified records.

Source: http://archives.eprint.org (May 13, 2006)
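
To make concrete what "OAI-compliant, initially available as Dublin Core" means for a harvester, the following minimal sketch builds an OAI-PMH ListRecords request. The verb and the metadataPrefix, from and set arguments are defined by the OAI-PMH protocol; the base URL and the helper name list_records_url are assumptions made for illustration and are not ROAR's actual endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec    # selective harvesting by set
    return base_url + "?" + urlencode(params)


# Hypothetical base URL, for illustration only.
url = list_records_url("http://registry.example.org/oai", from_date="2006-05-01")
print(url)
```

A service provider would fetch this URL and follow any resumptionToken in the response until the record list is complete.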

4.1.5 OpenDOAR: Directory of Open Access Repositories

Launched to the public in late January 2006 by the University of Nottingham and University of Lund (developer of DOAJ), OpenDOAR is sponsored by the Open Society Institute (OSI), the UK's Joint Information Systems Committee (JISC), the Consortium of Research Libraries (CURL, British Isles), and SPARC Europe. Created to support the Open Access movement, OpenDOAR aims to categorize and build a "comprehensive and authoritative list" of OA research archives worldwide. [[31]] Ultimately, the directory will "serve not only as a discovery tool for scholars seeking original research papers or specific digital representations, but also as a developmental tool for repository administrators and service providers who want to build new services tailored to targeted user communities" (Hubbard 2005).

OpenDOAR staff verify the data about each repository, "noting new features and directions," in order to enrich and enhance future versions of the directory service. The repositories listed in OpenDOAR have been surveyed by researchers as opposed to automatically identified and listed. This approach is valuable (although initially resource-intensive) when compared to some auto-harvested listings, according to its proponents, because roughly 40 percent of repositories surveyed have been rejected as out-of-scope or non-functional.

As of May 2006, the directory lists 380 repositories and offers repository-level keyword searching or browsing with filters by country, content type or subject. Eventually OpenDOAR expects to classify repositories by other parameters and also offer the capacity to search within repositories. Results can be presented in full or short format.


Table 11: OpenDOAR Statistics of Content Type and Subjects (February 2006)
Content Type | Count | % of Total (N=353) | Subjects | Count | % of Total (N=353)
Articles | 218 | 61.8% | Agriculture and Food Sciences | 63 | 17.8%
Books | 110 | 31.2% | Arts and Architecture | 113 | 32.0%
Chapters | 98 | 27.8% | Biology and Life Sciences | 147 | 41.6%
Conference papers | 146 | 41.4% | Business and Economics | 149 | 42.2%
Dissertations | 212 | 60.1% | Chemistry | 116 | 32.9%
Learning objects | 27 | 7.6% | Earth and Environmental Sciences | 139 | 39.4%
Multimedia | 28 | 7.9% | Health Sciences | 134 | 38.0%
Patents | 7 | 2.0% | History and Archaeology | 113 | 32.0%
Posters | 23 | 6.5% | Languages and Literatures | 133 | 37.7%
Pre-print journal articles | 89 | 25.2% | Law and Political Science | 142 | 40.2%
Presentations | 33 | 9.3% | Mathematics and Statistics | 148 | 41.9%
Reports | 148 | 41.9% | Philosophy and Religion | 110 | 31.2%
Research datasets | 3 | 0.8% | Physics and Astronomy | 124 | 35.1%
Software | 6 | 1.7% | Science General | 82 | 23.2%
Undergraduate theses | 72 | 20.4% | Social Sciences | 234 | 66.3%
Working papers | 66 | 18.7% | Technology and Engineering | 218 | 61.8%

[[32]] Source: http://www.opendoar.org/ (February 28, 2006)

Unlike ROAR, categorizations are not mutually exclusive. According to OpenDOAR's data, the vast majority of repositories represent a mix of content types (an average of 3.6 different types of materials per repository) and subjects (an average of six different subjects per repository). The utility of the present categories is questionable due to their scope and redundant use. Articles, dissertations, reports and conference papers dominate the content, with very few repositories registering datasets, software or patents. In terms of subject categories, the Social Sciences content surpasses all categories (perhaps reflecting redundancy with the Business and Economics, and Law and Political Science categories); Technology and Engineering is a close second. Most subject categories are quite evenly distributed (falling in the distribution range of 32 to 42 percent).
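
The per-repository averages can be checked against the counts in Table 11. Because a repository may carry several codes, the counts sum to more than 353, and dividing each sum by the number of repositories gives the average number of codes per repository. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Counts per category, in the order listed in Table 11 (N = 353 repositories).
content_type_counts = [218, 110, 98, 146, 212, 27, 28, 7, 23, 89, 33, 148, 3, 6, 72, 66]
subject_counts = [63, 113, 147, 149, 116, 139, 134, 113, 133, 142, 148, 110, 124, 82, 234, 218]
repositories = 353

# Categories are not mutually exclusive, so the totals exceed N.
avg_types = sum(content_type_counts) / repositories
avg_subjects = sum(subject_counts) / repositories

print(round(avg_types, 1))     # 3.6
print(round(avg_subjects, 1))  # 6.1
```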

Aligning OpenDOAR's typologies with the repository descriptions is problematic and it is hard to imagine how a system that requires a high level of OpenDOAR staff intermediation will scale up. For example, an institutional repository of a research organization in France (ALADIN) working in the "humanities and social sciences" that "will include articles, technical reports, working papers, images, videos, and more," is coded by two subjects-Earth and Environmental Sciences, and Social Sciences-and by three content types-Articles, Working Papers and Reports. This narrower categorization evidently reflects OpenDOAR's initial focus on research papers and related materials (e.g., theses); expansion of content type listings is desired and intended, given continued funding for this initiative.


Figure 10: OpenDOAR sample search result for Social Sciences
ALADIN: Accès Libre aux Archives du Dépôt Institutionnel Numérique de la MSH-Alpes
Country: France
Organization: La Maison des Sciences de l'Homme-Alpes
Subjects: Earth and Environmental Sciences --- Social Sciences
Type: Articles --- Working papers --- Reports
OAI Base URL: http://dspace.msh-alpes.prd.fr/oai/
Description: ALADIN is a pilot project for publications produced by researchers and partners of MSH-Alpes. MSH-Alpes is a public basic-research organization working in the scientific field of humanities and social sciences (and depending upon CNRS and different Grenoble universities). Ultimately this repository will include articles, technical reports, working papers, images, videos, and more.

Source: http://www.opendoar.org/ (March 2006)

The Aristotle University of Thessaloniki Document Server in Greece "contains theses, articles, papers and photos" and is coded as Articles, Dissertations, and Multimedia and with four broad subject codes. This falls far short of characterizing the repository's content or alerting users to its collections (e.g., historical collection of Greek newspapers-1800 to present, photographic archive of traditional 18th-20th century art, or archaeological events in Greek press-1832 to 1932). Nor is the repository retrieved when a user searches for Greek newspapers, newspapers Greece, newspapers or archaeology. Since there is only one repository from Greece, it can be retrieved by country.

Despite these shortcomings, OpenDOAR is in its early stages of deployment and aims eventually to serve multiple user groups "each with their own expectations, needs and perspectives" making it possible to search, filter, analyze and query the descriptions of each repository in customizable and meaningful ways. Closer collaboration-or an eventual merger-with ROAR seems desirable and would allow combining the best features of each service, as informed by user feedback.

4.1.6 Arc: Cross Archive Search Service

Update Table 02: Arc based on DLF Survey responses, Fall 2005


Arc
http://arc.cs.odu.edu/
ORGANIZATIONAL MODEL: Old Dominion University w/out base funding
SUBJECT: Multi-disciplinary
FUNCTION: Cross archive digital search service that harvests OAI-compliant repositories.
PRIMARY AUDIENCE: Research community; Digital library developers
STATUS: Experimental research service
SIZE: 7,156,192 records (64% increase) from 177 archives (8.5% increase)
USE: No response
ACCOMPLISHMENTS:
1. Maintained since 2003.
2. Successful experimentation on Lucene indexing to replace database indexing.
3. Successful experimentation on distributed storage on PC-cluster.
4. Arc open source software in SourceForge is used by other projects inside and outside ODU.
CHALLENGES:
1. Performance problems in grouping search results by archives, subjects, etc.
2. Large volume of data requires fundamental change of architecture.
3. Incremental complexity of source code calls for addressing extensibility.
TOOLS OR RESOURCES NEEDED:
1. Apache Struts framework to restructure into multi-layered MVC pattern.
2. Apache Lucene indexing framework to speed up the metadata searching and retrieving.
GOALS OF NEXT GENERATION RESOURCE:
1. Deployment of Lucene/cluster version.
2. Investigate how to provide richer service by integrating Web2.0 technology.

As was the case in 2003, users are informed that Arc "is an experimental research service of Digital Library Research group at Old Dominion University. Arc is used to investigate issues in harvesting OAI compliant repositories and making them accessible through a unified search interface. It is not a production service and may be subject to unscheduled service interruptions and anomalies." In fact, Arc was unstable during the five-month period while this report was written, making it difficult to evaluate fully. Arc researchers report that they have been working on a fast, parallel search-based, robust new version that should be available by mid-June 2006. It is based on Lucene parallel indexing.

Arc contains more than seven million metadata records, including 4.3 million from OCLC's XTCat (bibliographic records of dissertations and theses extracted from WorldCat, which has been static since its initial harvest several years ago). During the six-month period of this review, Arc remained static in size. Access to the "Administration" page that contained details about the last harvests when this service was reviewed in 2003 is now restricted and inaccessible.

With few exceptions, Arc's search and retrieval functions have not changed since the last report was released nor have the problems identified in conducting searches been addressed (further evidence that Arc is intended for R&D purposes-not for end-users). However, two new features are worth noting for their (as yet unrealized) potential usefulness. In advanced search mode, there is an option to "search the last results" or conduct a "new search." In addition, queries can be limited within a specified archive to particular "archive sets." In most instances, unfortunately, the archive only has the default option-"all sets"-available; however, two examples with "archive sets" illustrate the value of this feature. A search of the University of Nottingham's repository can be limited to one of eight constituent departmental archives; similarly the National Science Digital Library (NSDL) development site at Cornell can be filtered to eight different collections. In cases where repositories have meaningful sub-collections of materials, this filtering device would prove very useful.
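
The "archive sets" that Arc exposes correspond to the optional set structure a data provider advertises through the OAI-PMH ListSets verb. As a minimal sketch, assuming a provider returns a response shaped like the invented sample below (the set names and the parse_sets helper are illustrative, not taken from any repository discussed here), a service provider could read the available sets as follows:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented, trimmed-down ListSets response used only for illustration.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>dept:history</setSpec><setName>Department of History</setName></set>
    <set><setSpec>dept:physics</setSpec><setName>Department of Physics</setName></set>
  </ListSets>
</OAI-PMH>
"""


def parse_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text.strip())
    return [
        (s.findtext("oai:setSpec", namespaces=OAI_NS),
         s.findtext("oai:setName", namespaces=OAI_NS))
        for s in root.findall(".//oai:set", OAI_NS)
    ]


sets = parse_sets(SAMPLE_RESPONSE)
print(sets)
```

An archive that defines no sets simply returns none, which is consistent with the "all sets" default seen for most archives.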

In "Lessons learned with Arc, an OAI-PMH Service Provider," Liu et al. (2005) inform readers how Arc-which introduced the concept of "hierarchical harvesting" that formed the basis for OAI aggregators-has served as the platform for other projects including Archon (described in the 2003 DLF report and included in Appendix 3 of the current report), Kepler (enables self-archiving by means of an "archivelet"), the Networked Computer Science Technical Reference Library (NCSTRL), and DP9 (an OAI gateway service for Web crawlers). Among more recent initiatives undertaken by the Department of Computer Science at Old Dominion University, the Digital Library Grid, funded by The Andrew W. Mellon Foundation, is developing software tools that take advantage of grid computing so that costs associated with federating heterogeneous digital libraries are more effectively distributed, thereby improving sustainability. "Because of Arc's immense scale," these researchers rightfully conclude, "it has informed the community on a number of issues related to synchronization, scheduling, caching, and replication." Their current work will "merge OAI-PMH digital libraries with grid computing," helping to secure the technical architecture and infrastructure required by large-scale operations (Liu et al. 2005, 602).

4.1.7 OAIster


Update Table 03: OAIster based on DLF Survey responses, Fall 2005
OAIster
http://www.oaister.org/
ORGANIZATIONAL MODEL: U of Michigan w/ initial Mellon funding; now IMLS in collaboration w/ DLF, UIUC and Emory. New Yahoo! Search and Google partnerships.
SUBJECT: Multi-disciplinary
FUNCTION: Collection of freely available, difficult-to-access, academically-oriented digital resources which are easily searchable.
PRIMARY AUDIENCE: Academic community
STATUS: Established
SIZE: 6,000,000 records (300% increase) from 550 institutions (182% increase)
USE: Per month: 15K to 19K hits. 100s of 1,000s via Yahoo! Search
ACCOMPLISHMENTS:
1. Increase in size and use.
2. Development of OAI Best Practices.
3. Respect for OAI (e.g., most vendors incorporate it now).
4. Modifications to advanced search and inclusion of Book Bag feature.
CHALLENGES:
1. Changes in departmental focus may reduce OAIster priority.
2. Need to recruit programmer.
TOOLS OR RESOURCES NEEDED:
1. UTF-8 tools that permit harvester to verify if record is UTF-8 or not and communicate that effectively, with appropriate display, to data providers.
2. Streamlined method for maintenance and indexing.
GOALS OF NEXT GENERATION RESOURCE:
1. Z39.50/MetaLib integration (accomplished as of spring 2006).
2. Clustering analysis for better search and browse.
3. Better informed through user feedback.
4. Many interface and functionality tweaks.

With Arc serving primarily as a research test-bed, OAIster is the only large-scale OAI multidisciplinary aggregator operating as a full production service for the benefit of end-users. OAIster harvests metadata on a weekly basis and prominently notes new "institutions" and new record counts on its home page. (This was recommended in the 2003 DLF report.) Growing by leaps and bounds, as of mid-May 2006, OAIster harvested five times the number of metadata records from more than triple the number of institutions as it did in mid-2003.

A hallmark of OAIster is that it limits harvesting to OAI-compliant records that have full digital representation associated with the item (e.g., full text, digital image, etc.); however, it is important to note that OAIster's definition of "freely available" includes some full-text licensed resources. The most prominent example is the inclusion of the Institute of Physics' journal articles (210,000 records), but there are others such as African Journals Online (18,000 records). OAIster is currently re-thinking its collection parameters with the intent of broadening its scope to embrace items with restricted access to full-text. In addition to providing users with a collection development policy, it would be helpful if OAIster's search results marked items only accessible through licenses or if it permitted users to filter results by restricted versus non-restricted access.

Since 2003, three enhancements to OAIster's user interface stand out. First, "dataset" was added as a "resource type," making it possible to limit searches to this medium. A keyword search for "data" coupled with the filter to retrieve "datasets" returned 280,495 results. Searches can be refined or limited by selecting among the institutions highlighted in the left-hand frame. Twenty-one institutions hold datasets, and at a glance, it is evident that the vast majority of them (279,286) come from one source-PANGAEA: Publishing Network for Geoscientific and Environmental Data. The second enhancement dates from November 2005 when OAIster deployed a "bookbag" feature, enabling users to save records during a session and download or email them. Most recently, in March 2006, OAIster added "language" as a search field option. A search for <Afrikaans> returns one dissertation from the Netherlands but <German> returns more than 74,000 results. (More than half of these records are from Bibliotheksservice-Zentrum Baden-Wurttemberg although more than 120 different archives in OAIster hold German-language materials.)

OAIster makes a vast reservoir of digital content available, but constructing effective searches is not always straightforward, requiring, for example, an understanding of how terms are combined and nested. As evident from the following search results for dissertations on global warming, the first two terms are nested together and then coupled with the third term:

  • Global warming AND thesis OR dissertation
    • Retrieves 112,373 items
    • Interpreted as (global warming and thesis) or dissertation; thus retrieving any item tagged as a dissertation irrespective of the subject.
  • Thesis OR dissertation AND global warming
    • Retrieves 87 items
    • Interpreted as (thesis or dissertation) and global warming; thus retrieving either theses or dissertations about global warming.
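
The left-to-right grouping described above can be reproduced with a small sketch. It is not OAIster's query parser; it only assumes, as the two results suggest, that operators are applied strictly in the order entered. The toy index and record numbers are invented for illustration.

```python
def group_left_to_right(terms, operators):
    """Show how a query is parenthesized when evaluated strictly left to right."""
    expr = terms[0]
    for i, (op, term) in enumerate(zip(operators, terms[1:])):
        if i > 0:
            expr = f"({expr})"
        expr = f"{expr} {op} {term}"
    return expr


def evaluate(terms, operators, index):
    """Apply AND/OR left to right over sets of matching record ids."""
    result = set(index.get(terms[0], ()))
    for op, term in zip(operators, terms[1:]):
        hits = set(index.get(term, ()))
        result = result & hits if op == "AND" else result | hits
    return result


# Invented toy index: term -> ids of records containing it.
TOY_INDEX = {"global warming": {1, 2}, "thesis": {2, 3}, "dissertation": {1, 4, 5}}

q1 = (["global warming", "thesis", "dissertation"], ["AND", "OR"])
q2 = (["thesis", "dissertation", "global warming"], ["OR", "AND"])

print(group_left_to_right(*q1), sorted(evaluate(*q1, TOY_INDEX)))
print(group_left_to_right(*q2), sorted(evaluate(*q2, TOY_INDEX)))
```

In the toy data the first query returns every dissertation regardless of subject, while the second returns only the theses and dissertations about global warming, mirroring the difference between the 112,373 and 87 results.
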
Many entries are lengthy; users would benefit from the option to select short or full displays. The search query will also return items using the word "thesis" when it refers to an argument or proposition. If the data provider includes dissertation or thesis in the resource-type field, OAIster would normalize the metadata and these records could be retrieved by limiting the search by "text." If a record does not include those terms, of course, it will not be discoverable. OAIster's clustering effort (described below) aims to support more granular resource-type options via a drop-down menu (including "dissertation" and "thesis"). Admittedly, this is only a partial solution since OAIster must rely on what information the metadata record includes.

Many enhancements depend on the concerted efforts of data providers, achieved by conforming to accepted standards and best practices. For example, effective date searching hinges on more widespread uniformity in the metadata expressing dates. When asking, "why normalize," OAIster's Kat Hagedorn illustrates the wide variance in expressing dates in OAIster:

    WHY NORMALIZE?
  • Sample date values in OAIster:
  • <date>2-12-01</date>
  • <date>2002-01-01</date>
  • <date>0000-00-00</date>
  • <date>1822</date>
  • <date>between 1827 and 1833</date>
  • <date>18--?</date>
  • <date>November 13, 1947</date>
  • <date>SEP 1958</date>
  • <date>235 bce</date>
  • <date>Summer, 1948</date> (Hagedorn 2005b).

OAIster is exploring how to adapt CDL's date normalization utility to help overcome these inconsistencies. [[33]]
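
To illustrate why such values resist uniform date searching, here is a minimal sketch of one possible normalization rule: extract any plausible four-digit years and treat them as a (start, end) range, leaving everything else unresolved. This is not CDL's utility; the function name, the regular expression and the 1000 to 2100 plausibility window are assumptions made for this example.

```python
import re

# A run of exactly four digits, not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")


def normalize_years(value):
    """Return (first_year, last_year) found in a free-text date, or None."""
    years = [int(y) for y in YEAR.findall(value) if 1000 <= int(y) <= 2100]
    if not years:
        return None  # ambiguous or unparseable; leave for manual review
    return (min(years), max(years))


samples = [
    "2-12-01", "2002-01-01", "0000-00-00", "1822", "between 1827 and 1833",
    "18--?", "November 13, 1947", "SEP 1958", "235 bce", "Summer, 1948",
]
for s in samples:
    print(f"{s!r:26} -> {normalize_years(s)}")
```

Under this rule six of the ten sample values yield a usable year or range, while "2-12-01", "0000-00-00", "18--?" and "235 bce" remain unresolved, which is why normalization on the aggregator side can only complement more consistent practice by data providers.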

Browsing by topical categories relies on appropriate metadata subject tags from data providers. And searching within institutions/collections depends on archives providing "sets" that reflect meaningful sub-collections. For these reasons OAIster's developers are among the key proponents of improving and enriching metadata through DLF's best practices. OAIster is also experimenting with visualization and semantic clustering techniques based on work at Emory University (e.g., MetaCombine project, see SouthComb in section 4.4.9), [[34]] UIUC (e.g., refer to the prototype CIC Metadata Portal in section 4.1.8), and UC-Irvine.

Among the more vexing problems, not only for OAIster but affecting other aggregators as well, is managing duplicate records. As Khan et al. (2005) attest, duplication is easy to eradicate when two records have identical metadata fields, but difficult to detect when they differ slightly (for example, due to data entry errors or different practices in expressing an author's name). Using a subset of data from Arc as a test bed, the authors demonstrate a duplication detection algorithm they developed which might be applied to other large aggregations like OAIster.
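
The difficulty of near-duplicates can be shown with a small sketch. It is not the algorithm of Khan et al.; it substitutes a simple string-similarity ratio from Python's standard difflib module, and the records, field names and 0.9 threshold are invented for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, drop punctuation, collapse whitespace (ASCII-only simplification)."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def normalize_name(name):
    """Sort name tokens so 'Doe, Jane' and 'Jane Doe' compare equal."""
    return " ".join(sorted(normalize_title(name).split()))


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


def likely_duplicates(rec_a, rec_b, threshold=0.9):
    """Flag two records whose normalized title and creator are both very similar."""
    title_score = similarity(normalize_title(rec_a["title"]), normalize_title(rec_b["title"]))
    name_score = similarity(normalize_name(rec_a["creator"]), normalize_name(rec_b["creator"]))
    return title_score >= threshold and name_score >= threshold


# Invented records: a and b differ only in capitalization, punctuation and name order.
a = {"title": "Glacier retreat in the Alps", "creator": "Doe, Jane"}
b = {"title": "Glacier Retreat in the Alps.", "creator": "Jane Doe"}
c = {"title": "Sheet music of the 1920s", "creator": "Roe, Richard"}

print(likely_duplicates(a, b))  # True
print(likely_duplicates(a, c))  # False
```

Comparing every pair of records this way does not scale to millions of records, so a production aggregator would need blocking or indexing to limit the comparisons.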

OAIster has identified the improvements that it intends to make as time permits:

  • Show HTML embedded in records. Make HTML embedded in search results records viewable and linkable.
  • More relevancy sorting options. Potential to order results by proximity, institution frequency, among other options.
  • Date searching. Single date and date range searching.
  • Searching within institutions. Choice of institutions to search in.
  • Browsing capability. Browsing of broad topical categories of records.
  • Duplicate records. Handling of records that are the same among repositories.
  • Bugs to be fixed:
    • Highlight words or phrases in results list when punctuation exists.
    • Count resource type search hits in hit frequency and weighted hit frequency sorts.
    • Correct secondary sorting for date ascending and date descending sorts.
(Source: http://oaister.umdl.umich.edu/o/oaister/future.html.)

OAIster was among the first OAI data providers to collaborate with Yahoo! Search and Google; OAIster sends them metadata on a monthly basis. Yahoo! Search uses the complete metadata records in their search index, whereas Google uses the URLs included in the records to find pages for their search index. These partnerships facilitate deeper indexing than available via regular Web crawling. [[35]]

In March 2006, OAIster announced the availability of its metadata for use by federated search engines via SRU and created a Web page with instructions about how to use its metadata outside OAIster's interface (http://oaister.umdl.umich.edu/o/oaister/sru.html).
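
For readers unfamiliar with SRU, a request is an ordinary URL carrying a CQL query. The sketch below assembles a searchRetrieve request using parameter names from the SRU specification; the base URL is hypothetical and the helper name is illustrative, so consult OAIster's instructions page for the actual endpoint and supported indexes.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, maximum_records=10, start_record=1):
    """Build an SRU searchRetrieve URL (hypothetical helper)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,            # CQL query string
        "startRecord": start_record,   # 1-based position of the first record
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
url = sru_search_url("http://sru.example.org/search", '"global warming"')
print(url)
```

The response is an XML document that a federated search engine can merge with results from other targets.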

External referrals from general search engines may account for 20 or more times as many queries as direct OAIster searches. [[36]] While precise data on the topic are scarce, ProQuest has analyzed Web traffic to its Digital Commons repositories and reports that most users (95 percent) find their way to OAI content via general search engines. This share decreases over time as users become aware of the repository: after the first year of deployment, external referrals dropped to 75 percent. A growing number of institutional repositories, such as the University of Minho (Portugal), are starting to make OAIster directly searchable from their sites as illustrated by the screenshot below. [[37]]

Source: https://repositorium.sdum.uminho.pt/ (May 12, 2006)

The prominent inclusion of OAIster helps researchers see how their work fits into a larger scholarly communication framework, bringing increased visibility and the potential for wider impact. For instructions to replicate this integration, refer to "Using OAIster Metadata Outside this Interface" available from OAIster's home page.

4.1.8 Consortial Portals: CIC Metadata Portal, DLF Portal, DLF MODS Portal


Update Table 04: CIC Metadata Portal and DLF Portals based on DLF Survey responses, Fall 2005
CIC Metadata Portal: http://cicharvest.grainger.uiuc.edu/ and http://nergal.grainger.uiuc.edu/cgi/b/bib/oaister
DLF OAI Portal: http://www.hti.umich.edu/iml/
DLF MODS Portal: http://www.hti.umich.edu/m/mods/

ORGANIZATIONAL MODEL
CIC Metadata Portal: Collaboration with CIC member libraries.
DLF OAI Portal: DLF members and allies with OAI records.
DLF MODS Portal: DLF members and allies who publish OAI records that contain MODS metadata as well as the basic Dublin Core record.

SUBJECT
CIC Metadata Portal: Cross-disciplinary
DLF OAI Portal: Predominantly humanities and cultural heritage
DLF MODS Portal: Cultural heritage

FUNCTION
CIC Metadata Portal: Research issues relating to consortial metadata aggregation describing both freely available and restricted license content.
DLF OAI Portal: To publicize publicly-accessible holdings of DLF member institutions.
DLF MODS Portal: A testbed to demonstrate the value of MODS records in the provision of richer library services.

PRIMARY AUDIENCE
CIC Metadata Portal: Academic Community
DLF OAI Portal: Academic Community
DLF MODS Portal: Academic Community

STATUS
CIC Metadata Portal: Under Development
DLF OAI Portal: Under Development
DLF MODS Portal: Experimental

SIZE
CIC Metadata Portal: 517,000 records from 171 academic collections from 10 CIC universities
DLF OAI Portal: 883,992 records from 44 repositories
DLF MODS Portal: 253,478 records from four repositories (Indiana, LC, OCLC, U of Chicago)

USE
CIC Metadata Portal: Not available
DLF OAI Portal: Not available
DLF MODS Portal: Not available

ACCOMPLISHMENTS
CIC Metadata Portal:
1. Incorporation of rich collection descriptions into the search.
2. Generation of thumbnail images included with search results.
3. Incorporation of data from harvested resources (not just OAI) into search indexes.
4. Normalizing & enhancing metadata to support various browse & search interfaces.
DLF OAI Portal:
1. Creating it.
2. Growing it.
3. Using it to solicit feedback from scholars on ways to improve.
DLF MODS Portal:
1. Launching it.
2. Modifying it after meeting with DLF Scholars Advisory Panel in June 2005.
3. Added thumbnails; bookbag feature; improved sorting for date, title and author; simple vs. advanced searching modes.

CHALLENGES
CIC Metadata Portal:
1. Resources to maintain the service.
DLF OAI Portal:
1. Local OAI skills.
2. Willingness to make harvestable metadata a local priority.
DLF MODS Portal:
1. Getting feedback from users.
2. Getting libraries to publish MODS records.

TOOLS OR RESOURCES NEEDED
CIC Metadata Portal: Money
DLF OAI Portal: Training (which DLF is providing).
DLF MODS Portal: Programmers.

GOALS OF NEXT GENERATION RESOURCE
CIC Metadata Portal: Uncertain since it is a research project and not a production service.
DLF OAI Portal: Roll it out to the public. Grow it aggressively, both in bulk and quality.
DLF MODS Portal: To continue to prototype services as articulated by DLF user community and DLF Aquifer.

CIC Metadata Portal

Founded in 1958, the CIC is an academic consortium of the eleven institutional members of the Big Ten Athletic Conference plus the University of Illinois at Chicago and the University of Chicago. The CIC Metadata Portal is a collaborative pilot project undertaken to research issues related to aggregating metadata and testing different user interfaces. As of December 2005, the CIC metadata repository contained more than 550,000 records harvested from 187 digital collections held by eleven of the thirteen CIC member institutions. Nearly half of the records (267,000) are contributed by the University of Michigan; the University of Illinois at Urbana-Champaign accounts for another 22 percent (~125,000). Participating institutions adopt the general CIC collection policy and metadata guidelines. Resources include a wide spectrum of types of information. An estimated 70 percent of the records refer to digital objects (have a referring URL); an estimated 50 percent are restricted access, only available to those universities with licenses to access the content. [[38]]

The portal uses the University of Michigan DLXS software also deployed by OAIster, and therefore, exhibits similar advanced search functionality including searching by field, filtering by resource type, and user-control over the ways in which results are sorted. The CIC portal has several resource types not available via OAIster that allow users to limit their queries to sheet music, theses, software and Web sites (but not datasets). It also utilizes an automated process to generate thumbnails and thumbshots from the URLs pointed to in the metadata records (Foulonneau, Habing and Cole 2006). Thumbnails are provided at both the collection and item-level. As of December 2005 only an estimated 35,000 item-level records had thumbnails. [[39]]

Source: http://nergal.grainger.uiuc.edu/cgi/b/bib/oaister (April 30, 2006)

From the CIC search portal, users can conduct simple searches, view "featured collections," or browse collection-level records by institution. Unlike OAIster and the DLF portal (described below), the CIC portal has not deployed a Book Bag function that permits users to save results within a session.

The CIC is experimenting with four innovative user interfaces:

Although the CIC Metadata Portal is not a production service, it has furthered research about effective collaboration and produced a number of promising applications (Foulonneau et al. 2006).

DLF OAI Portal

The DLF OAI Portal, in an early stage of development as of May 2006, is a metadata repository containing more than one million items from 45 DLF collections/institutions. DLF's membership includes major research libraries in the United States that are leading the way in digital library innovation, along with a small but influential number of international partners. As a result, this aggregation contains some of the finest digital collections, coming from such prestigious institutions as the Library of Congress, the California Digital Library, Cornell University, Emory University, the University of Chicago, the University of Illinois, Urbana-Champaign, and the universities of Indiana, Michigan, Pennsylvania and Virginia. Once fully developed with more complete holdings from repositories at the Bibliotheca Alexandrina, the British Library, Columbia, Harvard, New York Public Library, Princeton, Stanford and Yale, this portal will offer access to a rich aggregation of premier digital collections. [[40]]

Utilizing the DLXS software, the user interface has the unadorned look and feel of OAIster. It supports simple and advanced searches (Boolean operators applied to keyword, title, author/creator, subject, and language) as well as delimiters by resource type (text, image, audio, video, and dataset).

As is the case with OAIster, "Browse Institutions" represents a mélange of both high-level composite general collection descriptions (e.g., Indiana University's Digital Library's multiple digital collections are represented by a composite entry) and specific digital collections within an institution (e.g., the University of Pennsylvania is represented as several "institutions" with separate entries for various digital projects). The descriptions represent both the specificity of information provided by the institution as well as the number of separate data repositories deployed within an institution. In short, there is one description in "Browse Institutions" for each repository in the portal. Users, however, would benefit from a more uniform representation of what constitutes a "collection." After updating its contents, the DLF Collections Registry (described in section 4.4.3) and the DLF OAI Portal need to harmonize their collection/institution descriptions. [[41]] The figures below show the difference in the way Indiana University is represented in the DLF OAI Portal (and OAIster), the DLF Collections Registry, and the CIC Metadata Repository. The user is at a loss to know how many "collections" IU's digital library hosts: three, eight or seventeen?


Figure 13a: Indiana University's digital collections (3 of them in bold typeface) as described by the DLF OAI Portal (and OAIster)
Indiana University Digital Library Program (26857 records)
http://dlib.indiana.edu/

The Indiana University Digital Library Program is dedicated to the selection, production, and maintenance of a wide range of high quality networked resources for scholars and students at Indiana University and elsewhere. The program provides OAI-enabled access to the U.S. Steel Gary Works Photograph Collection, 1906-1971, the Frank M. Hohenberger Collection and the Sam DeVincent Collection of American Sheet Music.

Sources: http://www.hti.umich.edu/i/imls/viewcolls.html; http://gita.grainger.uiuc.edu/dlfcollectionsregistry/browse/GemHostInst.asp?name=Indiana+University and http://cicharvest.grainger.uiuc.edu/colls/collections.asp (April 30, 2006).

DLF MODS Portal

The DLF MODS Portal, developed with funding from the DLF's current IMLS grant, is the testing ground for new features and functionality that are subsequently ported to OAIster. Among its accomplishments (noted in the description of OAIster above as well) are the inclusion of thumbnails, the bookbag feature, user-choice of simple or advanced searching modes, and improved capabilities for sorting results by date, title and author.

In an early stage of deployment, the DLF MODS Portal also serves as a prototype to test out the enriched Metadata Object Description Schema. The MODS element set is richer than Dublin Core but simpler than full MARC. As of mid-May 2006, this portal contains more than 250,000 MODS records from four institutions:

  • Indiana University Digital Library Program (certain sets)
  • Library of Congress Digitized Historical Collections
  • OCLC Research Publications
  • University of Chicago Library Metadata Repository (certain sets)

Source: www.hti.umich.edu/cgi/b/bib/bib-idx?c=imls;page=simple

Source: http://www.hti.umich.edu/m/mods/ (April 2006)

The screenshots above show the differences in record display for two different metadata implementations for the same object, A Yankee Trader in the Gold Rush; The Letters of Franklin A. Buck, from the Library of Congress American Memory collection. This comparison between the DLF OAI and DLF MODS portal reveals how the enriched MODS record, with its more specific tagged fields, makes possible enhanced search and retrieval functions.

The DLF Aquifer project (see section 4.4.8) will also require contributing institutions to use the MODS standard for bibliographic data. The DLF MODS Portal will continue to evolve based on needs of the DLF user community and the DLF Aquifer Project.

4.1.9 Germany: OA and OAI Access Points

DINI (Deutsche Initiative für Netzwerkinformation E.V.) in Germany exemplifies a coordinated national approach to OA and OAI adoption. In addition to organizing workshops to promote Open Access and self-archiving, DINI maintains a centralized directory of OA repositories, establishes quality control through a repository certification process, and operates an OAIster-like search engine across German OA repositories. The directory can be searched or sorted by place, university, URL, contact person, OAI interface, and DINI certification.

The DINI certificate distinguishes the repository from common institutional web servers and assures potential users and authors of digital documents that a certain level of quality in repository operation is warranted. In addition, DINI sees its certificate as an instrument to support the Open Access concept. (Dobratz and Schoger 2005)
A separate search engine, DINI OAI Search Engine (OAI-suche) for German Open Access Repositories, currently conducts searches across 50 German libraries, archives and document servers, comprising 44,336 items. Repositories are harvested on a weekly basis and statistics about the number of records and most recent harvest dates are readily available. Content is searchable by author, title, keyword, or abstract and queries can be limited by language, date, date range or archive. Users can pre-select whether results should be returned by date and they can control the number of returns per page. A search for <wirtschaft> (economics) returns 740 results with briefly annotated entries and links to full-text content.
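A weekly harvest of this kind is typically issued as an OAI-PMH ListRecords request limited to records changed since the previous run. The following is a minimal sketch of how such a request URL could be built. The endpoint URL, the function name and the seven-day window are illustrative assumptions, not details of the DINI service; only the verb, metadataPrefix and from arguments come from the OAI-PMH protocol itself.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, harvested_on, metadata_prefix="oai_dc", days_back=7):
    """Build an OAI-PMH ListRecords URL for records changed in the last `days_back` days."""
    since = harvested_on - timedelta(days=days_back)
    params = {
        "verb": "ListRecords",              # OAI-PMH verb for bulk record retrieval
        "metadataPrefix": metadata_prefix,  # oai_dc is the format every repository must support
        "from": since.isoformat(),          # selective harvesting by datestamp (YYYY-MM-DD)
    }
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(BASE_URL, date(2006, 5, 15))
print(url)
```

A repository that exposes richer records would be queried the same way with a different metadataPrefix value; the prefix a given repository uses for MODS is repository-defined and would have to be looked up with a ListMetadataFormats request.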

The Electronic Journals Library (EZB--Elektronische Zeitschriftenbibliothek), with nearly 31,000 titles (an estimated 12 percent are e-only), is arguably the world's largest database of scholarly electronic journals. Operated by the University of Regensburg, EZB represents a consortium of 343 libraries that pool bibliographic information and metadata about freely available and licensed e-journal subscriptions. Ninety-four percent of all German university libraries (n=77) participate along with 80 percent of German national and central subject libraries (e.g., constituents of the Max-Planck Institute). Full-text accessibility is indicated by color-coded dots. An estimated 41 percent of all titles are freely available in full text (i.e., "green").


Figure 17: Dot color-coding scheme in EZB

Full texts are freely accessible.

The library / research institute has a license for this journal; therefore it is accessible for the users of this institution.

The journal is not on subscription, thus full texts are not accessible. Mostly, however, tables of contents and in many cases abstracts are available free-of-charge.

The institution has no continuous subscription on this journal. Therefore, only some of the published volumes are accessible as full texts.

Source: http://rzblx1.uni-regensburg.de/ezeit/about.phtml

Journals are browsable by forty-one different subject areas or by title. Nine subject areas have 400 or more "green" titles (or 63.6 percent of the freely available full-text e-journals).


Table 12: EZB Subjects with 400 or more "green" titles
Subject                   # of Titles   % Free Full-text
Medicine                  5,525         33.1%
Economics                 2,706         39.9%
Biology                   1,587         28.5%
Political Science         1,403         55.3%
Sociology                 1,067         39.9%
History                   1,027         61.0%
Law                       1,056         55.7%
Agriculture & Forestry    880           48.8%
Education                 751           55.5%

Source: Based on data from the Electronic Journals Library: Annual Report 2005 (April 2006).
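The "400 or more" threshold can be checked by multiplying each subject's title count by its free full-text share. The sketch below does this for the nine rows of Table 12. Because the percentages in the table are rounded, the resulting counts are estimates rather than figures from the EZB annual report, and the variable and function names are illustrative.

```python
# (number of titles, percent freely available in full text), copied from Table 12.
EZB_TABLE_12 = {
    "Medicine": (5525, 33.1),
    "Economics": (2706, 39.9),
    "Biology": (1587, 28.5),
    "Political Science": (1403, 55.3),
    "Sociology": (1067, 39.9),
    "History": (1027, 61.0),
    "Law": (1056, 55.7),
    "Agriculture & Forestry": (880, 48.8),
    "Education": (751, 55.5),
}


def estimated_green_titles(total_titles, percent_free):
    """Estimate the number of freely available titles from a rounded percentage."""
    return round(total_titles * percent_free / 100)


green_counts = {
    subject: estimated_green_titles(titles, percent)
    for subject, (titles, percent) in EZB_TABLE_12.items()
}

for subject, count in green_counts.items():
    print(f"{subject}: about {count} green titles")

# Every subject in the table should clear the 400-title threshold.
print(all(count >= 400 for count in green_counts.values()))
```

The same calculation applied to Chemistry & Pharmacy (roughly 1,100 titles at 20 percent) gives about 220 titles, which is why that subject falls outside the table.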

In contrast, Chemistry & Pharmacy is represented by about 1,100 titles, but only 20 percent (221 titles) are freely available in full text.

Users can search for journals by various fields including title, keyword and publisher with the option to limit queries to specific subjects. Through the "preferences" Web page, users can select particular regions or institutions and conduct searches to display their holdings. EZB partnered with the German subject gateway, Vascoda, to incorporate e-journal titles into discipline-specific virtual libraries. [[42]]

Source: http://www.sub.uni-goettingen.de/vlib/history/ezb-journals.php (March 24, 2006)

More than forty information services incorporate EZB's content through OpenURL linking. Currently EZB is working with Vascoda to streamline authentication and permissions so only a single sign-on is required to access licensed resources. [[43]]
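As an illustration of what OpenURL linking involves, the sketch below builds a journal-level link in the key/value style of OpenURL 0.1. The resolver address, source identifier and ISSN are placeholders invented for this example; they do not describe EZB's actual linking interface, which a participating service would need to take from EZB's own documentation.

```python
from urllib.parse import urlencode

# Hypothetical link resolver address, used for illustration only.
RESOLVER_BASE = "https://resolver.example.org/openurl"


def journal_openurl(resolver_base, issn, source_id):
    """Build a journal-level OpenURL 0.1 style link from an ISSN."""
    params = {
        "sid": source_id,    # identifies the service that generated the link
        "genre": "journal",  # the link targets a journal, not a single article
        "issn": issn,        # identifier the resolver uses to find the title
    }
    return f"{resolver_base}?{urlencode(params)}"


# Placeholder values: not a real ISSN or a registered source identifier.
link = journal_openurl(RESOLVER_BASE, "0000-0000", "example-service")
print(link)
```

Because the link carries only metadata about the journal, the resolver on the receiving end can decide which copy to offer, which is what allows licensed and freely available holdings to be presented differently to users at different institutions.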

4.1.10 Current Issues and Future Directions

These services now contain a wealth of information. In general, they warrant more widespread marketing and use. At the same time, it would be beneficial to better understand the characteristics of their users and the nature of their uses.

"Open access" and "freely available" may carry different meanings in these services. Users are not as concerned about the fine points of definitions, but they would like to know the scope of coverage, what is or is not included. Items that are restricted to licensed users should be clearly indicated.

In many instances it is difficult to distinguish records representing metadata-only from those that also link to full-object representation. Users may wish to have access to the broader spectrum of resources, but should be able to decipher whether or not additional content is available and under what circumstances.

Application of visualization and clustering tools (by subject, geographic area, time period) helps users to interpret and navigate through large result sets.

The database management information from many of these resources is of great value for analyzing the growth in digital repositories worldwide. This data should be readily available for mining by any interested user, ranging from journalists to academics.

The synergistic relations between these services help to foster enhanced OAI-compliance, improved coverage, broader use of resources, and better communication between OAI data and service providers. Examples include cooperative efforts between DOAJ and OpenDOAR, OpenDOAR and ROAR, and the UI OAI Registry and OAIster. Further collaboration might lead to more uniform agreement on terminology and better delineation of service coverage while reducing redundancy (e.g., multiple technical registries for OAI-PMH and overlapping lists of publisher/journal self-archiving policies).

A recent comparative study (the first of its kind) that investigated coverage of the "OAI-PMH corpus" by three general search engines found that Yahoo indexed 65 percent, followed by Google with 44 percent, and MSN with 7 percent (McCown et al. 2005). According to the researchers, 21 percent of the resources were not indexed by any of the three search engines. The authors suggest that if these popular search engines supported OAI-PMH directly, it would increase interest in registering and implementing OAI-PMH repositories. They conclude: "Search engines would benefit by being able to index more content, and DLs would benefit by being able to share their contents with search engines without incurring web crawling overhead."

It might prove worthwhile to call a summit of the core OAI registries and general OAI search services to discuss how to better market their services, not only by extending the reach of their content into these generic popular search engines but also by attracting more users directly to their sites. This would build on various options already deployed such as RSS feeds, A9.com open search, Firefox search engine plug-in, and the development of OA toolbars like OASes, geared to academic users. [[44]]


4.2 Links in the Scholarly Communication Value Chain

Changes in the landscape of scholarly communication over the past few years come into sharp focus through a review of how e-print services are evolving. As discussed earlier in this report, in the short span of time since the original report appeared, the open access movement has gained international momentum and engendered a multitude of commitments from major funding agencies, intergovernmental organizations, private and public foundations, university and library consortia, publishers and single institutions. [[45]] Stemming in large part from self-archiving and harvesting of research output from e-print repositories, the aggregations described in this section represent various subject-based services, along with affiliated discovery and citation analysis tools. Connected together, they serve vital functions in the scholarly communication value chain supporting registration, certification, awareness, archiving and rewarding of intellectual capital (see Figure 19, Van de Sompel et al. 2004).

The specific services reviewed here include four varieties of self-archiving and aggregating content: discipline-driven, centralized, author self-archiving of preprints (arXiv); research agency-driven, centralized archiving of technical reports and harvesting of related archives (NASA Technical Reports Server and CERN Document Server); semi-mandated author or publisher centralized self-archiving of peer-reviewed journal articles (PubMed Central); and community-driven centralized deposit of domain-based literature (Open Language Archives Community). Each of these services was also reviewed in the 2003 DLF survey; the discussion here updates and expands on the earlier report.

Special consideration is given to electronic theses and dissertations (ETDs) because they represent a prevalent form of research output. Often aggregated in repositories at the institutional level, ETDs also form the basis of an international community of practice via the Networked Digital Library of Theses and Dissertations. Recent activities to coordinate ETD deployment at the national and transnational level in Europe are described. Finally, tools for discovering ETDs are discussed, most notably Elsevier's Scirus ETD search engine.

The University of Illinois's Grainger Engineering Library OAI Aggregation serves as a cross-repository niche search engine, harvesting records from more than 50 data providers including other services discussed in this report (e.g., arXiv, CDS, DOAJ, NSDL). Covering similar territory, PerX, a pilot search engine developed in the UK for engineering, is briefly described. Future DLF studies should include discussion of the U.S. Department of Energy, Office of Scientific & Technical Information (OSTI) E-Print Network Search service (http://eprints.osti.gov/). [[46]]

CiteSeer and Citebase round out this section and represent services that support reference linking and citation analysis of research literature. CiteSeer focuses on computer science, aggregating literature via Web crawling and data mining techniques in addition to supporting self-archiving, whereas Citebase covers a broader subject domain in the sciences through OAI harvesting. It is beyond the scope of this report to examine recent parallel services such as Google Scholar (http://scholar.google.com/), Microsoft Academic Search (http://academic.live.com/), and Thomson Scientific's Web Citation Index (http://scientific.thomson.com/free/essays/selectionofmaterial/wci-selection/), but it is important to note that they draw their inspiration and, to varying degrees, their core technology from CiteSeer.

4.2.1 arXiv


Update Table 05: arXiv based on DLF Survey responses, Fall 2005
arXiv
http://www.arxiv.org
ORGANIZATIONAL MODEL: Originally LANL, now Cornell with partial NSF support.
SUBJECT: Science: physics, math, non-linear science, computer science, quantitative biology
FUNCTION: Automated e-print archive server; rapid distribution system prior to peer review.
PRIMARY AUDIENCE: Research community
STATUS: Established
SIZE: 340,000 articles (nearly 50% increase)
USE: Per year: 16.8 million unique full-text downloads; Per month: 4,000 submissions
ACCOMPLISHMENTS:
  1. Creation of quantitative biology section.
  2. Established user endorsement system.
  3. New interface for computer science section (CoRR).
CHALLENGES:
  1. Continuous heavy use.
  2. Staff time & funding.
  3. Integration of legacy features/code with new developments.
TOOLS OR RESOURCES NEEDED: Money and time.
GOALS OF NEXT GENERATION RESOURCE:
  1. Reduced admin time through better facilities.
  2. Easier submission process for users.
  3. Additional features: flexible alerting, dynamic classification, etc.
  4. Better integration with other scholarly resources.

At fourteen years old, arXiv.org remains the earliest, largest and most successful example of a subject-based e-print archive, with readership and monthly submissions growing steadily. Warner reflects on "lessons learned" and charts arXiv's evolution from a "self-contained preprint redistribution service" to a key component of "an integrated global communication system" (2005, 58). ArXiv's content is integrated into federated searches and harvested by aggregators on a worldwide basis.

ArXiv was conceived as a means to formally communicate and rapidly disseminate research progress, not to replace peer-reviewed journals which are considered indispensable to certification and reward systems. Indeed, arXiv has served as a nexus of innovation by demonstrating "how conventional peer review can be implemented on top of an open access substrate," for example, through the creation of journals such as Advances in Theoretical and Mathematical Physics, Geometry and Topology, Logical Methods in Computer Science and all journals of the Institute of Mathematical Statistics (Warner, 2005, 58-59). Both the American Physical Society and the Institute of Physics (UK) accept direct electronic article submissions from arXiv.

Warner discusses the importance of "community" (through the creation of subject advisory boards) and "critical mass" to arXiv's success. To ensure high quality, relevant submissions, in January 2004, arXiv instituted an "endorsement system" that requires most new users to receive ratification from another user prior to submitting their first paper. To support this endorsement system and provide authors with a list of papers they have written, arXiv has established "authority records" that link a person's arXiv account with their papers.

In terms of rights and permissions, Warner explains that for many years "arXiv operated without any explicit statements about rights"; it was assumed that the act of submission granted arXiv the non-exclusive right to distribute the paper. Several years ago, arXiv instituted a license click-through as part of the submission process in which the author:

  • grants arXiv.org a license to distribute this article;
  • certifies the right to grant this license;
  • understands that submissions cannot be completely removed once accepted; and
  • understands that arXiv.org reserves the right to reclassify or reject any submission. (Warner 2005, 64)

Currently, other options are under consideration: either simply granting arXiv a license to distribute or agreeing that a Creative Commons license applies, which provides the requisite permissions.

ArXiv created a proxy submission site in France as part of HAL (hypertext articles online at Center for Direct Scientific Communication in Lyon) whereby submissions in relevant subject categories are automatically transferred to arXiv (unless the depositor expressly prohibits it). Similarly, documents for which the full text is already available in arXiv (or TEL-French Theses online) do not need to be uploaded again into HAL; the insertion of a link in HAL makes the file visible. [[47]]

Using arXiv as the exemplar, in "Rethinking Scholarly Communication," Van de Sompel et al. (2004) postulate new ways to combine the five functions of scholarly communication:

  • Registration, which allows claims of precedence for a scholarly finding.
  • Certification, which establishes the validity of a registered scholarly claim.
  • Awareness, which allows actors in the scholarly system to remain aware of new claims and findings.
  • Archiving, which preserves the scholarly record over time.
  • Rewarding, which rewards actors for their performance in the communication system based on metrics derived from that system. (Van de Sompel et al. 2004, citing the work of Roosendaal and Geurts 1997)

They depict the information flow of an e-print from its entry point in arXiv through "multiple services hubs that fulfill functions of the scholarly communication process." The authors illustrate how multiple players and pathways interact in the value chain of scholarly communication (Figure 19). Disciplinary archives like arXiv may serve four of the five functions, while services like Citebase (see section 4.2.10) discharge some of the reward functions through the provision of citation metrics.

Reproduced with permission of the authors.

When looking to the future, Warner suggests that it is too early to determine what impact institutional repositories will have on arXiv, speculating that the "intermediate stage will be for arXiv to act as a slave subject-based publishing venue with institutional repositories serving as the primary archives, or vice versa" (2005, 67). In the long term, the funding model of institutional repositories, which is more closely aligned with its direct beneficiaries, may prove more viable than arXiv's situation, where the Cornell community comprises only a minor constituency among arXiv's global authors and readers, but has fiduciary responsibility for operating the service with NSF contributing some research funding.

4.2.2 NTRS: NASA Technical Report Server [[49]]


Update Table 06: NTRS based on DLF Survey responses, Fall 2005
NASA Technical Reports Server (NTRS)
ntrs.nasa.gov
ORGANIZATIONAL MODEL: NASA
SUBJECT: Science: aerospace and other related scientific areas
FUNCTION: Technical Report Server to collect, archive and disseminate scientific papers.
PRIMARY AUDIENCE: Research Community; Interested Public
STATUS: Established
SIZE: 902,000 records (63% increase) of which ~495,000 full-text (~125,000 from NASA agencies; most not free).
USE: Per day: 17K unique daily visits. Per month: 30,000 full-text downloads. [[50]]
ACCOMPLISHMENTS:
  1. Improved OAI tools (e.g., OAI GW to harvest data from master archive at NASA Center for AeroSpace Information).
  2. Improved user interface.
  3. Normalized data.
CHALLENGES:
  1. Integrating video.
  2. Integrating natural language query capabilities.
  3. Indexing full text.
TOOLS OR RESOURCES NEEDED: COTS applications to meet challenges and requirements.
GOALS OF NEXT GENERATION RESOURCE:
  1. Better user interface.
  2. Improved data mining capabilities.
2. Improved data mining capabilities.

The NASA Technical Report Server (NTRS) aggregates more than 900,000 metadata records from 18 agencies, 40 percent of which are derived from four external (non-NASA) services. Among the fourteen NASA agencies covered, the Center for AeroSpace Information (CASI) is by far the largest, contributing some 540,000 metadata records, about 23 percent of which represent full-text documents. The significant growth in content aggregated by the NTRS is due primarily to an increase in records from CASI, the Jet Propulsion Laboratory (not covered in 2003), and the Department of Energy, Office of Scientific and Technical Information's "Information Bridge" (OSTI). Not only have CASI's metadata records nearly doubled, but its full-text documents have grown from 100 to more than 90,000. Although, according to its Web site, "NASA citations and full-text documents found on NTRS are unlimited, unclassified, and publicly available," most full-text technical reports are not free-of-charge, but can be ordered from NASA. Since the 2003 DLF survey, NTRS use has increased dramatically from an estimated 6,500 searches per month to 17,000 unique visits daily in late 2005.

Over the past two years, resources from one NASA agency, the Goddard Institute for Space Studies, have been removed due to unresolved copyright issues, [[51]] and another agency has been added, the Dryden Flight Research Center (589 full-text papers). As Table 13 shows, five other NASA agency sites are static; NTRS has not recorded any harvests or updates since July 2004. Correspondence with NASA officials reveals that the records for four of the agencies (GENESIS, Goddard, Kennedy and Stennis) were obtained by isolated Web crawls and that RIACS (Research Institute for Advanced Computer Science) has ceased operation of its e-prints software system. [[52]] (RIACS technical reports can be downloaded directly from its Web site.)


Table 13: NTRS Constituent Archives
NASA ARCHIVES [[53]]# Records 2006# Records 2003% Full textDownloads of full text 4/28/03 to 6/30/04Download rank = # of documents N = 312,115Most recent harvest or update
Status on 2/7/2006
GENESIS (NASA Jet Propulsion Laboratory)3727100%403112/3/2006
NASA Ames Research Center3543540%52 [[54]]147/9/2004
NASA Center for AeroSpace Information (CASI)507,371256,63723% [[55]]1,269812/6/2005
NASA Dryden Flight Research Center589N/A100%N/AN/A2/3/2006
NASA Goddard Space Flight Center1111100%1