U.S. National Archives and Records Administration (NARA) Technical Guidelines for Digitizing Archival Materials for Electronic
Access: Creation of Production Master Files -- Raster Images For the Following Record Types --
Textual, Graphic Illustrations/Artwork/Originals, Maps, Plans, Oversized, Photographs, Aerial Photographs, and Objects/Artifacts
June 2004
Written by
Steven Puglia, Jeffrey Reed, and Erin Rhodes
Digital Imaging Lab, Special Media Preservation Laboratory, Preservation Programs
U.S. National Archives and Records Administration
8601 Adelphi Road, Room B572, College Park, MD, 20740, USA
Lab Phone: 301-837-3706
Email: preserve@nara.gov
Acknowledgements: Thank you to Dr. Don Williams for target analyses, technical guidance based on his extensive experience,
and assistance on the assessment of digital capture devices. Thank you to the following for reading drafts of these guidelines
and providing comments: Stephen Chapman, Bill Comstock, Maggie Hale, and David Remington of Harvard University; Phil Michel
and Kit Peterson of the Library of Congress; and Doris Hamburg,
Kitty Nicholson, and Mary Lynn Ritzenthaler of the U.S. National Archives and Records Administration.
SCOPE:
The NARA Technical Guidelines for Digitizing Archival Materials for Electronic Access define approaches for creating digital surrogates for facilitating access and reproduction; they are not considered appropriate
for preservation reformatting to create surrogates that will replace original records. The Technical Guidelines presented here are based on the procedures used by the Digital Imaging Lab of NARA's Special Media Preservation Laboratory
for digitizing archival records and the creation of production master image files, and are a revision of the 1998 "NARA Guidelines for Digitizing Archival Materials for Electronic Access", which describes the imaging approach used for NARA's pilot Electronic Access Project.
The Technical Guidelines are intended to be informative, and not intended to be prescriptive. We hope to provide a technical foundation for digitization
activities, but further research will be necessary to make informed decisions regarding all aspects of digitizing projects.
These guidelines provide a range of options for various technical aspects of digitization, primarily relating to image capture,
but do not recommend a single approach.
The intended audience for these guidelines includes those who will be planning, managing, and approving digitization projects,
such as archivists, librarians, curators, managers, and others. Another primary audience includes those actually doing scanning
and digital capture, such as technicians and photographers.
The following topics are addressed:
- Digital Image Capture -- production master files, image parameters, digitization environment, color management, etc.
- Minimum Metadata -- types, assessment, local implementation, etc. -- we have included a discussion of metadata to ensure a
minimum complement is collected/created so production master files are usable
- File Formats, Naming, and Storage -- recommended formats, naming, directory structures, etc.
- Quality Control -- image inspection, metadata QC, acceptance/rejection, etc.
The following aspects of digitization projects are not discussed in these guidelines:
- Project Scope -- define goals and requirements, evaluate user needs, identification and evaluation of options, cost-benefit
analysis, etc.
- Selection -- criteria, process, approval, etc.
- Preparation -- archival/curatorial assessment and prep, records description, preservation/conservation assessment and prep,
etc.
- Descriptive systems -- data standards, metadata schema, encoding schema, controlled vocabularies, etc.
- Project management -- plan of work, budget, staffing, training, records handling guidelines, work done in house vs. contractors,
work space, oversight and coordination of all aspects, etc.
- Access to digital resources -- web delivery system, migrating images and metadata to web, etc.
- Legal issues -- access restrictions, copyright, rights management, etc.
- IT infrastructure -- determine system performance requirements, hardware, software, database design, networking, data/disaster
recovery, etc.
- Project Assessment -- project evaluation, monitoring and evaluation of use of digital assets created, etc.
- Digital preservation -- long-term management and maintenance of images and metadata, etc.
In reviewing this document, please keep in mind the following:
- The Technical Guidelines have been developed for internal NARA use, and for use by NARA with digitizing projects involving NARA holdings and other
partner organizations. The Technical Guidelines support internal policy directive NARA 816 -- Digitization Activities for Enhanced Access, at http://www.nara-at-work.gov/nara_policies_and_guidance/directives/0800_series/nara816.html (NARA internal link only). For digitization projects involving NARA holdings, all requirements in NARA 816 must be met or
followed.
- The Technical Guidelines do not constitute, in any way, guidance to Federal agencies on records creation and management, or on the transfer of permanent
records to the National Archives of the United States. For information on these topics, please see the Records Management
section of the NARA website, at http://www.archives.gov/records_management/index.html and http://www.archives.gov/records_management/initiatives/erm_overview.html.
- As stated above, Federal agencies dealing with the transfer of scanned images of textual documents, of scanned images of photographs,
and of digital photography image files as permanent records to NARA shall follow specific transfer guidance (http://www.archives.gov/records_management/initiatives/scanned_textual.html and http://www.archives.gov/records_management/initiatives/digital_photo_records.html) and the regulations in 36 CFR 1228.270.
- The Technical Guidelines cover only the process of digitizing archival materials for on-line access and hardcopy reproduction. Other issues must be
considered when conducting digital imaging projects, including the long-term management and preservation of digital images
and associated metadata, which are not addressed here. For information on these topics, please see information about NARA's
Electronic Records Archive project, at http://www.archives.gov/electronic_records_archives/index.html.
- The topics in these Technical Guidelines are inherently technical in nature. For those working on digital image capture and quality control for images, a basic foundation
in photography and imaging is essential. Generally, without a good technical foundation and experience for production staff,
there can be no claim about achieving the appropriate level of quality as defined in these guidelines.
- These guidelines reflect current NARA internal practices and we anticipate they will change over time. We plan on updating
the Technical Guidelines on a regular basis. We welcome your comments and suggestions.
I. INTRODUCTION
These Guidelines define approaches for creating digital surrogates for facilitating access and reproduction. They are not considered appropriate
for preservation reformatting to create surrogates that will replace original records. For further discussion of the differences
between these two approaches, see Appendix A, Digitization for Preservation vs. Production Masters.
These guidelines provide technical benchmarks for the creation of "production master" raster image (pixel-based) files. Production
masters are files used for the creation of additional derivative files for distribution and/or display via a monitor and for
reproduction purposes via hardcopy output at a range of sizes using a variety of printing devices (see Appendix B, Derivative
Files, for more information). Our aim is to use the production master files in an automated fashion to facilitate affordable
reprocessing. Many of the technical approaches discussed in these guidelines are intended for this purpose.
Production master image files have the following attributes --
- The primary objective is to produce digital images that look like the original records (textual, photograph, map, plan, etc.)
and are a "reasonable reproduction" without enhancement. The Technical Guidelines take into account the challenges involved in achieving this and will describe best practices or methods for doing so.
- Production master files document the image at the time of scanning, not what it may once have looked like if restored to its
original condition. Additional versions of the images can be produced for other purposes with different reproduction renderings.
For example, sometimes the reproduction rendering intent for exhibition (both physical and on-line exhibits) and for publication
allows basic enhancement. Any techniques that can be done in a traditional darkroom (contrast and brightness adjustments,
dodging, burning, spotting, etc.) may be allowed on the digital images.
- Digitization should be done in a "use-neutral" manner, not for a specific output. Image quality parameters have been selected
to satisfy most types of output.
If digitization is done to meet the recommended image parameters and all other requirements as described in these Technical Guidelines, we believe the production master image files produced should be usable for a wide variety of applications and meet over
95% of reproduction requests. If digitization is done to meet the alternative minimum image parameters and all other requirements,
the production master image files should be usable for many access applications, particularly for web usage and reproduction
requests for 8"x10" or 8.5"x11" photographic quality prints.
If your intended usage for production master image files is different and you do not need all the potential capabilities of
images produced to meet the recommended image parameters, then you should select appropriate image parameters for your project.
In other words, your approach to digitization may differ and should be tailored to the specific requirements of the project.
Generally, given the high costs and effort for digitization projects, we do not recommend digitizing to anything less than
our alternative minimum image parameters. This assumes availability of suitable high-quality digitization equipment that meets
the assessment criteria described below (see Quantifying Scanner/Digital Camera Performance) and produces image files that
meet the minimum quality described in the Technical Guidelines. If digitization equipment fails any of the assessment criteria or is unable to produce image files of minimum quality, then
it may be desirable to invest in better equipment or to contract with a vendor for digitization services.
II. METADATA
NOTE: All digitization projects undertaken at NARA and covered by NARA 816 -- Digitization Activities for Enhanced Access, including those involving partnerships with outside organizations, must ensure that descriptive information is prepared
in accordance with NARA 1301 Life Cycle Data Standards and Lifecycle Authority Control, at http://www.nara-at-work.gov/nara_policies_and_guidance/directives/1300_series/nara1301.html
(NARA internal link only), and its associated Lifecycle Data Requirements Guide, and added to NARA's Archival Research Catalog
(ARC) at a time mutually agreed-upon with NARA.
Although there are many technical parameters discussed in these Guidelines that define a high-quality production master image file, we do not consider an image to be of high quality unless metadata
is associated with the file. Metadata makes possible several key functions -- the identification, management, access, use,
and preservation of a digital resource -- and is therefore directly associated with most of the steps in a digital imaging
project workflow: file naming, capture, processing, quality control, production tracking, search and retrieval design, storage,
and long-term management. Although it can be costly and time-consuming to produce, metadata adds value to production master
image files: images without sufficient metadata are at greater risk of being lost.
No single metadata element set or standard will be suitable for all projects or all collections. Likewise, different original
source formats (text, image, audio, video, etc.) and different digital file formats may require varying metadata sets and
depths of description. Element sets should be adapted to fit requirements for particular materials, business processes and
system capabilities.
Because no single element set will be optimal for all projects, implementations of metadata in digital projects are beginning
to reflect the use of "application profiles," defined as metadata sets that consist of data elements drawn from different
metadata schemes, which are combined, customized and optimized for a particular local application or project. This "mixing
and matching" of elements from different schemas allows for more useful metadata to be implemented at the local level while
adherence to standard data values and structures is still maintained. Locally-created elements may be added as extensions
to the profile, data elements from existing schemas might be modified for specific interpretations or purposes, or existing
elements may be mapped to terminology used locally.
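As an illustration only, an application profile of the kind described above could be sketched as a small data structure. The element names, the "local" scheme label, and the helper functions below are hypothetical and are not drawn from any NARA system; the sketch simply shows elements from Dublin Core combined with one locally created extension and one element mapped to local terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileElement:
    """One element in a (hypothetical) application profile."""
    name: str            # name used in the local system
    source: str          # scheme the element is drawn from, e.g. "dc" or "local"
    source_term: str     # the element's name in that source scheme
    required: bool = False


# Hypothetical profile: Dublin Core elements plus one local extension.
PROFILE = [
    ProfileElement("identifier", "dc", "identifier", required=True),
    ProfileElement("caption", "dc", "title", required=True),   # DC "title" mapped to local term
    ProfileElement("creator", "dc", "creator"),
    ProfileElement("scanner_operator", "local", "scanner_operator"),  # local extension
]


def missing_required(record):
    """Return names of required profile elements that are absent or empty."""
    return [e.name for e in PROFILE if e.required and not record.get(e.name)]


def to_source_terms(record, source):
    """Re-key a local record using the terms of one source scheme."""
    return {
        e.source_term: record[e.name]
        for e in PROFILE
        if e.source == source and e.name in record
    }
```

Keeping the source scheme and source term alongside each local name is one way to customize a profile locally while still being able to export values under the standard element names.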
Because of the likelihood that heterogeneous metadata element sets, data values, encoding schemes, and content information
(different source and file formats) will need to be managed within a digital project, it is good practice to put all of these
pieces into a broader context at the outset of any project in the form of a data or information model. A model can help to
define the types of objects involved and how and at what level they will be described (i.e., are descriptions hierarchical
in nature, will digital objects be described at the file or item level as well as at a higher aggregate level, how are objects
and files related, what kinds of metadata will be needed for the system, for retrieval and use, for management, etc.), as
well as document the rationale behind the different types of metadata sets and encodings used. A data model informs the choice
of metadata element sets, which determine the content values, which are then encoded in a specific way (in relational database
tables or an XML document, for example).
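To make the idea of a data model encoded in relational tables more concrete, the following minimal sketch uses Python's built-in sqlite3 module. The table and column names are assumptions chosen for illustration, not a NARA schema. It shows description at an aggregate level and an item level (objects with an optional parent) and a separate file level, with files related to the object they represent.

```python
import sqlite3


def build_model(conn):
    """Create two hypothetical tables: described objects and their image files."""
    conn.executescript(
        """
        CREATE TABLE objects (
            object_id TEXT PRIMARY KEY,
            title     TEXT NOT NULL,
            parent_id TEXT REFERENCES objects(object_id)  -- NULL for top-level aggregates
        );
        CREATE TABLE files (
            file_id   TEXT PRIMARY KEY,
            object_id TEXT NOT NULL REFERENCES objects(object_id),
            sequence  INTEGER NOT NULL,   -- e.g. page order within the object
            role      TEXT NOT NULL       -- e.g. 'production master', 'derivative'
        );
        """
    )


def files_for_object(conn, object_id):
    """Return the file identifiers for one object, in sequence order."""
    rows = conn.execute(
        "SELECT file_id FROM files WHERE object_id = ? ORDER BY sequence",
        (object_id,),
    )
    return [row[0] for row in rows]
```

A short usage example, with made-up identifiers:

```python
conn = sqlite3.connect(":memory:")
build_model(conn)
conn.execute("INSERT INTO objects VALUES ('series-1', 'Example series', NULL)")
conn.execute("INSERT INTO objects VALUES ('item-1', 'Example letter', 'series-1')")
conn.execute("INSERT INTO files VALUES ('item-1_001.tif', 'item-1', 1, 'production master')")
print(files_for_object(conn, "item-1"))
```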
Although there is benefit to recording metadata on the item level to facilitate more precise retrieval of images within and
across collections, we realize that this level of description is not always practical. Different projects and collections
may warrant more in-depth metadata capture than others; a deep level of description at the item level, however, is not usually
accommodated by traditional archival descriptive practices. The functional purpose of metadata often determines the amount
of metadata that is needed. Identification and retrieval of digital images may be accomplished with a very small amount of metadata;
however, management of and preservation services performed on digital images will require more finely detailed metadata --
particularly at the technical level, in order to render the file, and at the structural level, in order to describe the relationships
among different files and versions of files.
Metadata creation requires careful analysis of the resource at hand. Although there are current initiatives aimed at automatically
capturing a given set of values, we believe that metadata input is still largely a manual process and will require human intervention
at many points in the object's lifecycle to assess the quality and relevance of metadata associated with it.
This section of the Guidelines serves as a general discussion of metadata rather than a recommendation of specific metadata
element sets, although several elements for production master image files are suggested as minimum-level information useful
for basic file management. We are currently investigating how we will implement and formalize technical and structural metadata
schemes into our workflow and anticipate that this section will be updated on a regular basis.
Common Metadata Types:
Several categories of metadata are associated with the creation and management of production master image files. The following
metadata types are the ones most commonly implemented in imaging projects. Although these categories are defined separately
below, there is not always an obvious distinction between them, since each type contains elements that are both descriptive
and administrative in nature. These types are commonly broken down by what functions the metadata supports. In general, the
types of metadata listed below, except for descriptive, are usually found "behind the scenes" in databases rather than in
public access systems. As a result, these types of metadata tend to be less standardized and more aligned with local requirements.
Descriptive --
Descriptive metadata refers to information that supports discovery and identification of a resource (the who, what, when and
where of a resource). It describes the content of the resource, associates various access points, and describes how the resource
is related to other resources intellectually or within a hierarchy. In addition to bibliographic information, it may also
describe physical attributes of the resource such as media type, dimension, and condition. Descriptive metadata is usually
highly structured and often conforms to one or more standardized, published schemes, such as Dublin Core or MARC. Controlled
vocabularies, thesauri, or authority files are commonly used to maintain consistency across the assignment of access points.
Descriptive information is usually stored outside of the image file, often in separate catalogs or databases from technical
information about the image file.
Although descriptive metadata may be stored elsewhere, it is recommended that some basic descriptive metadata (such as a caption
or title) accompany the structural and technical metadata captured during production. The inclusion of this metadata can be
useful for identification of files or groups of related files during quality review and other parts of the workflow, or for
tracing the image back to the original.
Descriptive metadata is not specified in detail in this document; however, we recommend the use of the Dublin Core Metadata
Element Set [1] to capture minimal descriptive metadata information where metadata in another formal data standard does not exist. Metadata
should be collected directly in Dublin Core; if it is not used for direct data collection, a mapping to Dublin Core elements
is recommended. A mapping to Dublin Core from a richer, local metadata scheme already in use may also prove helpful for data
exchange across other projects utilizing Dublin Core. Not all Dublin Core elements are required in order to create a valid
Dublin Core record. However, we suggest that production master images be accompanied by the following elements at the very
minimum:
Minimum descriptive elements
| Identifier | Primary identifier should be unique to the digital resource (at both object and file levels). Secondary identifiers might include identifiers related to the original (such as StillPicture ID) or Record Group number (for accessioned records). |
| Title/Caption | A descriptive name given to the original or the digital resource, or information that describes the content of the original or digital resource. |
| Creator | (If available) Describes the person or organization responsible for the creation of the intellectual content of the resource. |
| Publisher | Agency or agency acronym; description of responsible agency or agent. |
These selected elements serve the purpose of basic identification of a file. Additionally, the Dublin Core elements "Format"
(describes data types) and "Type" (describes limited record types) may be useful in certain database applications where sorting
or filtering search results across many record genres or data types may be desirable. Any local fields that are important
within the context of a particular project should also be captured to supplement Dublin Core fields so that valuable information
is not lost. We anticipate that selection of metadata elements will come from more than one preexisting element set -- elements
can always be tailored to specific formats or local needs. Projects should support a modular approach to designing metadata
to fit the specific requirements of the project. Standardizing on Dublin Core supplies baseline metadata that provides access
to files, but this should not exclude richer metadata that extends beyond the Dublin Core set, if available.
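The minimum element set suggested above (identifier, title/caption, creator if available, and publisher) can be sketched as a small XML record. This is an illustration under stated assumptions, not a prescribed NARA encoding: the wrapper element name "record", the function name, and the sample values are hypothetical. The namespace URI is the standard one for the Dublin Core Metadata Element Set, Version 1.1.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Creator is recorded only "if available", so it is not enforced here.
REQUIRED = ("identifier", "title", "publisher")


def dc_record(fields):
    """Serialize a dict of Dublin Core element values as a simple XML string.

    A value may be a single string or a list of strings (repeated element),
    for example a primary and a secondary identifier.
    """
    missing = [name for name in REQUIRED if not fields.get(name)]
    if missing:
        raise ValueError("missing minimum elements: " + ", ".join(missing))
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            ET.SubElement(root, "{%s}%s" % (DC_NS, name)).text = item
    return ET.tostring(root, encoding="unicode")
```

For example, `dc_record({"identifier": ["img-0001", "RG-000"], "title": "Example caption", "publisher": "NARA"})` returns a record with two `dc:identifier` elements, while omitting the title raises a `ValueError`. Richer local fields would be carried in addition to, not instead of, these elements.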
For large-scale digitization projects, only minimal metadata may be affordable to record during capture, and is likely to
consist of linking image identifiers to page numbers and indicating major structural divisions or anomalies of the resource
(if applicable) for text documents. For photographs, capturing caption information (and Still Photo identifier) is ideal.
For other non-textual materials, such as posters and maps, descriptive information taken directly from the item being scanned
as well as a local identifier should be captured. If keying captions into a database is cost-prohibitive, scan the captions
as part of the image itself where possible. Although this information will not be searchable, it will serve to provide some basis of identification
for the subject matter of the photograph. Recording of identifiers is important for uniquely identifying resources and is
necessary for locating and managing them. It is likely that digital images will be associated with more than one identifier
-- for the image itself, for metadata or database records that describe the image, and for reference back to the original.
[1] Dublin Core Metadata Initiative (http://dublincore.org/usage/terms/dc/current-elements/). The Dublin Core element set is
characterized by simplicity in creation of records, flexibility, and extensibility. It facilitates description of all types
of resources and is intended to be used in conjunction with other standards that may offer fuller descriptions in their respective
domains.
For images to be entered into NARA's Archival Research Catalog (ARC), a more detailed complement of metadata is required.
For a more detailed discussion of descriptive metadata requirements for digitization projects at NARA,
we refer readers to NARA's Lifecycle Data Requirements Guide (LCDRG), at http://www.archives.gov/research_room/arc/arc_info/lifecycle_data_requirements.doc (June 2004), and at the NARA internal link http://www.nara-at-work.gov/archives_and_records_mgmt/archives_and_activities/accessioning_processing_description/lifecycle/index.html (January 2002). The LCDRG contains data elements developed for the archival description portion of the records lifecycle, and
associates these elements with many different hierarchical levels of archival materials from record groups to items. It
also specifies rules for data entry and requires a minimum set of other metadata to be recorded for raster image
files at the file level, including technical metadata that enables images to display properly in the ARC interface.
Additionally, enough compatibility exists between Dublin Core and the data requirements that NARA has developed for archival
description to provide a useful mapping between data elements, if a digital project requires that metadata also be managed
locally (outside of ARC), perhaps in a local database or digital asset management system that supports data in Dublin Core.
Please see Appendix C for a listing of mandatory elements identified in the Lifecycle Data Requirements Guide at the record
group, series, file unit and item level, with Dublin Core equivalents.
Because ARC will be used as the primary source for descriptive information about the holdings of permanent records at NARA,
we refer readers to the LCDRG framework rather than discuss Encoded Archival Description (EAD) of finding aids. NARA has developed
its own hierarchical descriptive structure that relates to Federal records in particular, and therefore has not implemented
EAD locally. However, because of the prevalence of the use of EAD in the wider archival and digitization communities, we have
included a reference here. For more information on EAD, see the official EAD site at the Library of Congress at http://lcweb.loc.gov/ead/, as well as the Research Libraries Group's Best Practices Guidelines for EAD at http://www.rlg.org/rlgead/eadguides.html.
Administrative --
The Dublin Core set does not provide for administrative, technical, or highly structured metadata about different document
types. Administrative metadata comprises both technical and preservation metadata, and is generally used for internal management
of digital resources. Administrative metadata may include information about rights and reproduction or other access requirements,
selection criteria or archiving policy for digital content, audit trails or logs created by a digital asset management system,
persistent identifiers, methodology or documentation of the imaging process, or information about the source materials being
scanned. In general, administrative metadata is informed by the local needs of the project or institution and is defined by
project-specific workflows. Administrative metadata may also encompass repository-like information, such as billing information
or contractual agreements for deposit of digitized resources into a repository.
For additional information, see: Harvard University Library's Digital Repository Services (DRS) User Manual for Data Loading,
Version 2.04, at http://hul.harvard.edu/ois/systems/drs/drs_load_manual.pdf, particularly Section 5.0, "DTD Element Descriptions," for application of administrative metadata in a repository setting;
Making of America 2 (MOA2) Digital Object Standard: Metadata, Content, and Encoding, at http://www.cdlib.org/about/publications/CDLObjectStd-2001.pdf; and the Dublin Core initiative for administrative metadata as it relates to descriptive metadata, available in draft form at http://metadata.net/admin/draft-iannella-admin-01.txt. The Library of Congress has defined a data dictionary for various formats
in the context of METS, Data Dictionary for Administrative Metadata for Audio, Image, Text, and Video Content to Support the
Revision of Extension Schemas for METS, available at http://lcweb.loc.gov/rr/mopic/avprot/extension2.html.
Rights --
Although metadata regarding rights management information is briefly mentioned above, it is an important piece of
administrative metadata that deserves further discussion. Rights information plays a key role in the context of digital imaging
projects and will become more and more prominent in the context of preservation repositories, as strategies to act upon digital
resources in order to preserve them may involve changing their structure, format, and properties. Rights metadata will be
used both by humans to identify rights holders and legal status of a resource, and also by systems that implement rights management
functions in terms of access and usage restrictions.
Because rights management and copyright are complex legal topics, the General Counsel's office (or a lawyer) should be consulted
for specific guidance and assistance. The following discussion is provided for informational purposes only and should not
be considered specific legal advice.
Generally, records created by employees of the Federal government as part of their routine duties, works for hire created
under contract to the Federal government, and publications produced by the Federal government are all in
the public domain. However, it cannot be assumed that because NARA has physical custody of a record it also owns the
intellectual property in that record. NARA also has custody of other records, where copyright may not be so straightforward
-- such as personal letters written by private individuals, personal papers from private individuals, commercially published
materials of all types, etc. -- which are subject to certain intellectual property and privacy rights and may require additional
permissions from rights holders. After transfer or donation of records to NARA from other federal agencies or other entities,
one of three situations applies: NARA owns both the physical record and the intellectual property in the record; NARA owns the physical record but not
the intellectual property; or the record is in the public domain. It is important to establish who owns or controls both the
physical record and the copyright at the beginning of an imaging project, as this affects reproduction, distribution, and
access to digital images created from these records.
Metadata element sets for intellectual property and rights information are still in development, but they will be much more
detailed than statements that define reproduction and distribution policies. At a minimum, rights-related metadata should
include: the legal status of the record; a statement on who owns the physical and intellectual aspects of the record; contact
information for these rights holders; as well as any restrictions associated with the copying, use, and distribution of the
record. To facilitate bringing digital copies into future repositories, it is desirable to collect appropriate rights management
metadata at the time of creation of the digital copies. At the very least, digital versions should be identified with a designation
of copyright status, such as: "public domain;" "copyrighted" (and whether clearance/permissions from rights holder has been
secured); "unknown;" "donor agreement/contract;" etc.
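The minimum rights-related information listed above could be captured in a structure along the following lines. This is a sketch only: the class and field names are hypothetical, and the completeness rule in `gaps` (requiring rights holder details only for copyrighted material) is an assumption made for the example, not a legal or policy requirement. The status values are the designations named in the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class CopyrightStatus(Enum):
    """Copyright status designations named in the text above."""
    PUBLIC_DOMAIN = "public domain"
    COPYRIGHTED = "copyrighted"
    UNKNOWN = "unknown"
    DONOR_AGREEMENT = "donor agreement/contract"


@dataclass
class RightsMetadata:
    """Hypothetical minimum rights record for one digital copy."""
    status: CopyrightStatus
    physical_owner: str = ""          # who owns the physical record
    rights_holder: str = ""           # who owns the intellectual property
    rights_holder_contact: str = ""
    permission_secured: bool = False  # clearance obtained from rights holder
    restrictions: list = field(default_factory=list)  # copying, use, distribution

    def gaps(self):
        """Return the names of minimum fields that are still empty."""
        missing = []
        if not self.physical_owner:
            missing.append("physical_owner")
        # Assumption for this sketch: holder details are only expected
        # when the record is marked as copyrighted.
        if self.status is CopyrightStatus.COPYRIGHTED:
            if not self.rights_holder:
                missing.append("rights_holder")
            if not self.rights_holder_contact:
                missing.append("rights_holder_contact")
        return missing
```

Using a fixed set of status values rather than free text makes it easier for systems, as well as people, to act on access and usage restrictions.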
Preservation metadata dealing with rights management in the context of digital repositories will likely include detailed information
on the types of actions that can be performed on data objects for preservation purposes and information on the agents or rights
holders that authorize such actions or events.
For an example of rights metadata in the context of libraries and archives, a rights extension schema has recently been added
to the Metadata Encoding and Transmission Standard (METS), which documents metadata about the intellectual rights associated
with a digital object. This extension schema contains three components: a rights declaration statement; detailed information
about rights holders; and context information, which is defined as "who has what permissions and constraints within a specific
set of circumstances." The schema is available at: http://www.loc.gov/standards/rights/METSRights.xsd.
For additional information on rights management, see: Peter B. Hirtle, "Archives or Assets?" at http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.lib/2003-2; June M. Besek, Copyright Issues Relevant to the Creation of a Digital Archive: A Preliminary Assessment, January 2003 at
http://www.clir.org/pubs/reports/pub112/contents.html; Adrienne Muir, "Copyright and Licensing for Digital Preservation," at http://www.cilip.org.uk/update/issues/jun03/article2june.html; Karen Coyle, Rights Expression Languages, A Report to the Library of Congress, February 2004, available at http://www.loc.gov/standards/Coylereport_final1single.pdf; MPEG-21 Overview v.5 contains a discussion on intellectual property and rights at http://www.chiariglione.org/mpeg/standards/mpeg-21/mpeg-21.htm; for tables that reference when works pass into the public domain, see Peter Hirtle, "When Works Pass Into the Public Domain
in the United States: Copyright Term for Archivists and Librarians," at http://www.copyright.cornell.edu/training/Hirtle_Public_Domain.htm and Mary Minow, "Library Digitization Projects: Copyrighted Works that have Expired into the Public Domain" at http://www.librarylaw.com/DigitizationTable.htm; and for a comprehensive discussion on libraries and copyright, see: Mary Minow, Library Digitization Projects and Copyright at http://www.llrx.com/features/digitization.htm.
Technical --
Technical metadata refers to information that describes attributes of the digital image (not the analog source of the image)
and helps to ensure that images will be rendered accurately. It supports content preservation by providing information needed
by applications to use the file and to successfully control the transformation or migration of images across or between file
formats. Technical metadata also describes the image capture process and technical environment, such as hardware and software
used to scan images, as well as file format-specific information, image quality, and information about the source object being
scanned, which may influence scanning decisions. Technical metadata helps to ensure consistency across a large number of files
by enforcing standards for their creation. At a minimum, technical metadata should capture the information necessary to render,
display, and use the resource.
Technical metadata is characterized by information that is both objective and subjective -- attributes of image quality
that can be measured using objective tests as well as information that may be used in a subjective assessment of an image's
value. Although tools for automatic creation and capture of many objective components
are badly needed, it is important to determine what metadata should be highly structured and useful to machines, as opposed
to what metadata would be better served in an unstructured, free-text note format. The more subjective data is intended to
assist researchers in analyzing a digital resource, or imaging specialists and preservation administrators in determining
the long-term value of a resource.
In addition to the digital image, technical metadata will also need to be supplied for the metadata record itself if the metadata
is formatted as a text file or XML document or METS document, for example. In this sense, technical metadata is highly recursive,
but necessary for keeping both images and metadata understandable over time.
Requirements for technical metadata will differ for various media formats. For digital still images, we refer to the NISO
Data Dictionary -- Technical Metadata for Digital Still Images at http://www.niso.org/standards/resources/Z39_87_trial_use.pdf. It is a comprehensive technical metadata set based on the Tagged Image File Format specification, and makes use of the data
that is already captured in file headers. It also contains metadata elements important to the management of image files that
are not present in header information, but that could potentially be automated from scanner/camera software applications.
An XML schema for the NISO technical metadata has been developed at the Library of Congress called MIX (Metadata in XML),
which is available at http://www.loc.gov/standards/mix/.
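As an illustration of technical metadata that is already present in file headers, the following minimal sketch reads a few baseline tags from the first image file directory (IFD) of a TIFF byte stream using only the Python standard library. The function names and the sample header are hypothetical and are not part of the NISO data dictionary or MIX; the tag numbers and the 12-byte IFD entry layout follow the TIFF 6.0 specification. The sketch decodes only single-valued SHORT and LONG fields, so a multi-valued field (for example, BitsPerSample for an RGB image) would be skipped.

```python
import struct

# Baseline tag numbers as defined in the TIFF 6.0 specification.
TAG_NAMES = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
}


def read_tiff_header_tags(data: bytes) -> dict:
    """Return selected single-valued tags from the first IFD of a TIFF."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF: unrecognized byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF: magic number is not 42")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        entry_offset = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry_offset)
        if tag not in TAG_NAMES or count != 1:
            continue  # this sketch ignores multi-valued and unlisted tags
        if field_type == 3:  # SHORT: 16-bit value stored inline
            (value,) = struct.unpack_from(endian + "H", data, entry_offset + 8)
        elif field_type == 4:  # LONG: 32-bit value stored inline
            (value,) = struct.unpack_from(endian + "I", data, entry_offset + 8)
        else:
            continue
        tags[TAG_NAMES[tag]] = value
    return tags


def build_sample_tiff_header() -> bytes:
    """Build a header-only, little-endian TIFF fragment (hypothetical data)."""
    entries = [(256, 4, 3000), (257, 4, 2400), (258, 3, 8), (259, 3, 1)]
    header = struct.pack("<2sHI", b"II", 42, 8)  # first IFD starts at byte 8
    ifd = struct.pack("<H", len(entries))
    for tag, field_type, value in entries:
        ifd += struct.pack("<HHI", tag, field_type, 1)
        if field_type == 3:
            ifd += struct.pack("<HH", value, 0)  # SHORT value is left-justified
        else:
            ifd += struct.pack("<I", value)
    ifd += struct.pack("<I", 0)  # offset of next IFD: none
    return header + ifd


technical_metadata = read_tiff_header_tags(build_sample_tiff_header())
```

Values harvested this way could populate the corresponding elements of a technical metadata record without manual entry, subject to the caveat that header values are only as trustworthy as the software that wrote them.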
See also the TIFF 6.0 Specification at http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf as well as the Digital Imaging Group's DIG 35 metadata element set at http://www.i3a.org/i_dig35.html; and Harvard University Library's Administrative Metadata for Digital Still Images data dictionary at http://hul.harvard.edu/ldi/resources/ImageMetadata_v2.pdf.
A new initiative led by the Research Libraries Group called "Automatic Exposure: Capturing Technical Metadata for Digital Still Images" is investigating ways to automate the capture of technical metadata specified in the NISO Z39.87 draft standard. The initiative
seeks to build automated capture functionality into scanner and digital camera hardware and software in order to make this
metadata readily available for transfer into repositories and digital asset management systems, as well as to make metadata
capture more economically viable by reducing the amount of manual entry that is required. This implies a level of trust that
the metadata that is automatically captured and internal to the file is inherently correct.
See http://www.rlg.org/longterm/autotechmetadata.html for further discussion of this initiative, as well as the discussion on Image Quality Assessment, below.
Initiatives such as the Global Digital Format Registry (http://hul.harvard.edu/gdfr/) could potentially help in reducing the number of metadata elements that need to be recorded about a file or group of files
regarding file format information necessary for preservation functions. Information maintained in the Registry could be pointed
to instead of recorded for each file or batch of files.
Structural --
Structural metadata describes the relationships between different components of a digital resource. It ties the various parts
of a digital resource together in order to make a useable, understandable whole. One of the primary functions of structural
metadata is to enable display and navigation, usually via a page-turning application, by indicating the sequence of page images
or the presence of multiple views of a multi-part item. In this sense, structural metadata is closely related to the intended
behaviors of an object. Structural metadata is very much informed by how the images will be delivered to the user as well
as how they will be stored in a repository system in terms of how relationships among objects are expressed.
Structural metadata often describes the significant intellectual divisions of an item (such as chapter, issue, illustration,
etc.) and correlates these divisions to specific image files. These explicitly labeled access points help to represent the
organization of the original object in digital form. This does not imply, however, that the digital version must always imitate the
organization of the original -- especially for non-linear items, such as folded pamphlets. Structural metadata also associates
different representations of the same resource together, such as production master files with their derivatives, or different
sizes, views, or formats of the resource.
Example structural metadata might include whether the resource is simple or complex (multi-page, multi-volume, has discrete
parts, contains multiple views); what the major intellectual divisions of a resource are (table of contents, chapter, musical
movement); identification of different views (double-page spread, cover, detail); the extent (in files, pages, or views) of
a resource and the proper sequence of files, pages and views; as well as different technical (file formats, size), visual
(pre- or post-conservation treatment), intellectual (part of a larger collection or work), and use (all instances of a resource
in different formats --
TIFF files for display, PDF files for printing, OCR file for full text searching) versions.
File names and organization of files in system directories comprise structural metadata in its barest form. Since meaningful
structural metadata can be embedded in file and directory names, consideration of where and how structural metadata is recorded
should be done up front. See Section V. Storage for further discussion on this topic.
No widely adopted standards for structural metadata exist since most implementations of structural metadata are at the local
level and are very dependent on the object being scanned and the desired functionality in using the object. Most structural
metadata is implemented in file naming schemes and/or in databases that record the order and hierarchy of the parts of an
object so that they can be identified and reassembled back into their original form.
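The following minimal sketch shows how structural information carried in file names can be used to reassemble the parts of an object in their proper sequence and to associate master files with their derivatives. The naming scheme (item identifier, four-digit sequence number, and role) and the sample file names are hypothetical and are offered only as an illustration, not as a recommended scheme.

```python
import re
from collections import defaultdict

# Hypothetical scheme: <item id>_<4-digit sequence>_<role>.<extension>
NAME_PATTERN = re.compile(
    r"^(?P<item>[a-z0-9]+)_(?P<seq>\d{4})_(?P<role>master|access)\.(?P<ext>tif|jpg)$"
)


def reassemble(file_names):
    """Group files by item and return each item's pages in sequence order.

    Each page is a dict that maps a role ("master" or "access") to a file name.
    """
    items = defaultdict(dict)
    for name in file_names:
        match = NAME_PATTERN.match(name)
        if match is None:
            raise ValueError(f"file name does not follow the scheme: {name}")
        page = items[match["item"]].setdefault(int(match["seq"]), {})
        page[match["role"]] = name
    return {
        item: [pages[seq] for seq in sorted(pages)]
        for item, pages in items.items()
    }


# Files may arrive in any order; the sequence is recovered from the names.
objects = reassemble([
    "rg064a_0002_master.tif",
    "rg064a_0001_access.jpg",
    "rg064a_0001_master.tif",
    "rg064a_0002_access.jpg",
])
```

A scheme of this kind records only order and version relationships; intellectual divisions such as chapters would still need to be recorded elsewhere, for example in a database.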
The Metadata Encoding and Transmission Standard (METS) is often discussed in the context of structural metadata, although
it is inclusive of other types of metadata as well. METS provides a way to associate metadata with the digital files they
describe and to encode the metadata and the files in a standardized manner, using XML. METS requires structural information
about the location and organization of related digital files to be included in the METS document. Relationships between different
representations of an object as well as relationships between different hierarchical parts of an object can be expressed.
METS brings together a variety of metadata about an object all into one place by allowing the encoding of descriptive, administrative,
and structural metadata. Metadata and content information can either be wrapped together within the METS document, or pointed
to from the METS document if they exist in externally disparate systems. METS also supports extension schemas for descriptive
and administrative metadata to accommodate a wide range of metadata implementations. Beyond associating metadata with digital
files, METS can be used as a data transfer syntax so objects can easily be shared; as a Submission Information Package, an
Archival Information Package, and a Dissemination Information Package in an OAIS-compliant repository (see below); and also
as a driver for applications, such as a page turner, by associating certain behaviors with digital files so that they can
be viewed, navigated, and used. Because METS is primarily concerned with structure, it works best with "library-like" objects
in establishing relationships among multi-page or multi-part objects, but it does not apply as well to hierarchical relationships
that exist in collections within an archival context.
See http://www.loc.gov/standards/mets/ for more information on METS.
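As a rough illustration of how METS ties files to structure, the sketch below builds a skeletal METS document containing a file section and a physical structural map for a two-page item. The helper names, labels, and file locations are hypothetical, the document omits the header and the descriptive and administrative sections, and it has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def mets_tag(name):
    """Return a namespace-qualified METS element name."""
    return f"{{{METS_NS}}}{name}"


def build_mets(label, pages):
    """Build a skeletal METS tree; pages is a list of (label, location) in order."""
    root = ET.Element(mets_tag("mets"), {"LABEL": label})
    file_sec = ET.SubElement(root, mets_tag("fileSec"))
    file_grp = ET.SubElement(file_sec, mets_tag("fileGrp"), {"USE": "master"})
    struct_map = ET.SubElement(root, mets_tag("structMap"), {"TYPE": "physical"})
    item_div = ET.SubElement(struct_map, mets_tag("div"), {"TYPE": "item", "LABEL": label})
    for order, (page_label, location) in enumerate(pages, start=1):
        file_id = f"FILE{order:04d}"
        # The file section says where each content file lives.
        file_el = ET.SubElement(
            file_grp, mets_tag("file"), {"ID": file_id, "MIMETYPE": "image/tiff"}
        )
        ET.SubElement(
            file_el,
            mets_tag("FLocat"),
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": location},
        )
        # The structural map says where each file belongs in the object.
        page_div = ET.SubElement(
            item_div,
            mets_tag("div"),
            {"TYPE": "page", "ORDER": str(order), "LABEL": page_label},
        )
        ET.SubElement(page_div, mets_tag("fptr"), {"FILEID": file_id})
    return root


mets_root = build_mets(
    "Sample pamphlet",
    [("Cover", "file:///masters/0001.tif"), ("Page 1", "file:///masters/0002.tif")],
)
mets_xml = ET.tostring(mets_root, encoding="unicode")
```

A page-turning application could walk the structural map in ORDER sequence and resolve each file pointer through the file section, which is the separation of structure from location described above.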
Behavior --
Behavior metadata is often referred to in the context of a METS object. It associates executable behaviors with content information
that define how a resource should be utilized or presented. Specific behaviors might be associated with different genres of
materials (books, photographs, PowerPoint presentations) as well as with different file formats. Behavior metadata contains
a component that abstractly defines a set of behaviors associated with a resource as well as a "mechanism" component that
points to executable code (software applications) that then performs a service according to the defined behavior. The ability
to associate behaviors or services with digital resources is one of the attributes of a METS object and is also part of the
"digital object architecture" of the Fedora digital repository system. See http://www.fedora.info/documents/master-spec-12.20.02.pdf for a discussion of Fedora and digital object behaviors.
Preservation --
Preservation metadata encompasses all information necessary to manage and preserve digital assets over time. Preservation
metadata is usually defined in the context of the OAIS reference model (Open Archival Information System, http://ssdoo.gsfc.nasa.gov/nost/isoas/overview.html), and is often linked to the functions and activities of a repository. It differs from technical metadata in that it documents
processes performed over time (events or actions taken to preserve data and the outcomes of these events) as opposed to explicitly
describing provenance (how a digital resource was created) or file format characteristics, but it does encompass all types
of the metadata mentioned above, including rights information. Although preservation metadata draws on information recorded
earlier (technical and structural metadata would be necessary to render and reassemble the resource into an understandable
whole), it is most often associated with analysis of and actions performed on a resource after submission to a repository.
Preservation metadata might include a record of changes to the resource, such as transformations or conversions from format
to format, or indicate the nature of relationships among different resources.
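One possible way to picture a record of changes is as an append-only list of events attached to an object, as in the minimal sketch below. The class names, field names, event types, and sample values are all hypothetical; they are not drawn from the OAIS model or from any published preservation metadata element set.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _utc_now():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class PreservationEvent:
    """One action taken on an object and its outcome (hypothetical fields)."""
    event_type: str          # e.g., "backup", "migration"
    outcome: str             # e.g., "success", "failure"
    agent: str               # person or system that performed the action
    source_format: str = ""
    target_format: str = ""
    timestamp: str = field(default_factory=_utc_now)


@dataclass
class ObjectHistory:
    """Append-only record of events for a single digital object."""
    identifier: str
    events: list = field(default_factory=list)

    def record(self, event: PreservationEvent) -> None:
        self.events.append(event)

    def format_lineage(self) -> list:
        """Return the chain of formats implied by successful migrations."""
        lineage = []
        for event in self.events:
            if event.event_type == "migration" and event.outcome == "success":
                if not lineage:
                    lineage.append(event.source_format)
                lineage.append(event.target_format)
        return lineage


history = ObjectHistory("object-0001")
history.record(PreservationEvent("backup", "success", "storage system"))
history.record(
    PreservationEvent("migration", "success", "lab staff", "TIFF 5.0", "TIFF 6.0")
)
```

The sketch records what was done and with what result; it deliberately leaves out the technical and structural metadata that such events would draw on.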
Preservation metadata is information that will assist in preservation decision-making regarding the long-term value of a digital
resource and the cost of maintaining access to it, and will help both to facilitate archiving strategies for digital images
and to support and document these strategies over time. Preservation metadata is commonly linked with digital preservation
strategies such as migration and emulation, as well as more "routine" system-level actions such as copying, backup, or other
automated processes carried out on large numbers of objects. These strategies will rely on all types of pre-existing metadata
and will also generate and record new
metadata about the object. It is likely that this metadata will be both machine-processable and "human-readable" at different
levels to support repository functions as well as preservation policy decisions related to these objects.
In its close link to repository functionality, preservation metadata may reflect or even embody the policy decisions of a
repository; but these are not necessarily the same policies that apply to preservation and reformatting in a traditional context.
The extent of metadata recorded about a resource will likely have an impact on future preservation options to maintain it.
Current implementations of preservation metadata are repository- or institution-specific. We anticipate that a digital asset
management system may provide some basic starter functionality for low-level preservation metadata implementation, but not
to the level of a repository modeled on the OAIS.
See also A Metadata Framework to Support the Preservation of Digital Objects at http://www.oclc.org/research/projects/pmwg/pm_framework.pdf and Preservation Metadata for Digital Objects: A Review of the State of the Art at http://www.oclc.org/research/projects/pmwg/presmeta_wp.pdf, both by the OCLC/RLG Working Group on Preservation Metadata, for excellent discussions of preservation metadata in the context
of the OAIS model. A new working group, "Preservation Metadata: Implementation Strategies," is working on developing best
practices for implementing preservation metadata and on the development of a recommended core set of preservation metadata.
Their work can be followed at http://www.oclc.org/research/projects/pmwg/.
For some examples of implementations of preservation metadata element sets at specific institutions, see: OCLC Digital Archive
Metadata, at http://www.oclc.org/support/documentation/pdf/da_metadata_elements.pdf; Florida Center for Library Automation Preservation Metadata, at http://www.fcla.edu/digitalArchive/pdfs/Archive_data_dictionary20030703.pdf; Technical Metadata for the Long-Term Management of Digital Materials, at http://dvl.dtic.mil/metadata_guidelines/TechMetadata_26Mar02_1400.pdf; and The National Library of New Zealand, Metadata Standard Framework, Preservation Metadata, at http://www.natlib.govt.nz/files/4initiatives_metaschema_revised.pdf.
Image quality assessment (NARA-NWTS Digital Imaging Lab proposed metadata requirement) --
The technical metadata specified in the NISO Data Dictionary -- Technical Metadata for Digital Still Images contains many metadata fields necessary for the long-term viability of the image file. However, we are not convinced that
it goes far enough in providing information necessary to make informed preservation decisions regarding the value and quality
of a digital still raster image. Judgments about the quality of an image require a visual inspection of the image, a process
that cannot be automated. Quality is influenced by many factors -- such as the source material from which the image was scanned,
the devices used to create the image, any subsequent processing done to the image, compression, and the overall intended use
of the image. Although the data dictionary includes information regarding the analog source material and the scanning environment
in which the image was created, we are uncertain whether this information is detailed enough to be of use to administrators,
curators, and others who will need to make decisions regarding the value and potential use of digital still images. The value
of metadata correlates directly with the future use of the metadata. It seems that most technical metadata specified in the
NISO data dictionary is meant to be automatically captured from imaging devices and software and intended to be used by systems
to render and process the file, not necessarily used by humans to make decisions regarding the value of the file. The metadata
can make no guarantee about the quality of the data. Even if files appear to have a full complement of metadata and meet the
recommended technical specifications as outlined in these Technical Guidelines, there may still be problems with the image file that cannot be assessed without some kind of visual inspection.
The notion of an image quality assessment was partly inspired by the National Library of Medicine Permanence Ratings (see
http://www.nlm.nih.gov/pubs/reports/permanence.pdf and http://www.rlg.org/events/pres-2000/byrnes.html), a rating for resource permanence or whether the content of a resource is anticipated to change over time. However, we focused
instead on evaluating image quality and this led to the development of a simplified rating system that would: indicate a quality
level for the suitability of the image as a production master file (its suitability for multiple uses or outputs), and serve
as a potential metric that could be used in making preservation decisions about whether an image is worth maintaining over
time. If multiple digital versions of a single record exist, then the image quality assessment rating may be helpful for deciding
which version(s) to keep.
The rating is linked to image defects introduced in the creation of intermediates and/or introduced during digitization and
image processing, and to the nature and severity of the defects based on evaluating the digital
images on-screen at different magnifications. In essence, a "good" rating for image files implies an appropriate level of
image quality that warrants the effort to maintain them over time.
The image quality assessment takes into account the attributes that influence specifications for scanning a production master
image file: format, size, intended use, significant characteristics of the original that should be maintained in the scan,
and the quality and characteristics of the source material being scanned. This rating system could later be expanded to take
into account other qualities such as object completeness (are all pages or only parts of the resource scanned?); the source
of the scan (created in-house or externally provided?); temporal inconsistencies (scanned at different times, scanned on different
scanners, scan of object is pre- or post-conservation treatment?), and enhancements applied to the image for specific purposes
(for exhibits, cosmetic changes among others).
This rating is not meant to be a full technical assessment of the image, but rather an easy way to provide information that
supplements existing metadata about the format, intent, and use of the image, all of which could help determine preservation
services that could be guaranteed and associated risks based on the properties of the image. We anticipate a preservation
assessment will be carried out later in the object's lifecycle based on many factors, including the image quality assessment.
Image quality rating metadata is meant to be captured at the time of scanning, during processing, and even at the time of
ingest into a repository. When bringing batches or groups of multiple image files into a repository that do not have individual
image quality assessment ratings, we recommend visually evaluating a random sample of images and applying the corresponding
rating to all files in appropriate groups of files (such as all images produced on the same model scanner or all images for
a specific project).
Record whether the image quality assessment rating was applied as an individual rating or as a batch rating. If a batch rating,
then record how the files were grouped.
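The batch approach described above might be sketched as follows. The function names and record fields are hypothetical. The guidelines do not say how a single rating should be derived when sampled images receive different ratings; this sketch assumes the lowest rating observed in the sample is applied to the whole group, which is a conservative choice and not a stated requirement.

```python
import random


def select_sample(file_names, sample_size, seed=None):
    """Pick a random sample of files for visual review."""
    rng = random.Random(seed)
    names = list(file_names)
    return sorted(rng.sample(names, min(sample_size, len(names))))


def apply_batch_rating(file_names, sample_ratings, grouping_note):
    """Apply one rating to every file in a group, based on reviewed samples.

    sample_ratings maps each reviewed file name to the rating (2, 1, or 0)
    assigned by a reviewer. Assumption: the lowest sample rating is used.
    """
    if not sample_ratings:
        raise ValueError("at least one sampled image must be rated")
    unknown = set(sample_ratings) - set(file_names)
    if unknown:
        raise ValueError(f"rated files are not in the group: {sorted(unknown)}")
    batch_rating = min(sample_ratings.values())
    return {
        name: {
            "rating": batch_rating,
            "rating_scope": "batch",    # as opposed to "individual"
            "grouping": grouping_note,  # how the files were grouped
        }
        for name in file_names
    }


group = [f"img_{i:03d}.tif" for i in range(20)]
sample = select_sample(group, 5, seed=1)
reviewed = {name: 2 for name in sample}
reviewed[sample[0]] = 1  # reviewer found minor defects in one sampled image
records = apply_batch_rating(group, reviewed, "all images from the same model scanner")
```

Each resulting record states that the rating was applied as a batch rating and how the files were grouped, as recommended above.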
Image Quality Assessment Ratings
| Rating | Description | Use | Defect Identification |
|---|---|---|---|
| 2 | No obvious visible defects in image when evaluating the histogram and when viewed on-screen, including individual color channels, at: 100% or 1:1 pixel display (micro) and actual size (1"=1") and full image (global). | Generally, image suitable as production master file. | |
| 1 | No obvious visible defects in image when evaluating the histogram and when viewed on-screen, including individual color channels, at: actual size (1"=1") and full image (global). Minor defects visible at: 100% or 1:1 pixel display (micro). | Image suitable for less critical applications (e.g., suitable for output on typical inkjet and photo printers) or for specific intents (e.g., for access images, uses where these defects will not be critical). | Identify and record the defects relating to intermediates and the digital images -- illustrative examples: Intermediates: out of focus copy negative; scratched microfilm; surface dirt; etc. Digital images: oversharpened image; excessive noise; posterization and quantization artifacts; compression artifacts; color channel misregistration; color fringing around text; etc. |
| 0 | Obvious visible defects when evaluating the histogram and when viewed on-screen, including individual color channels, at: 100% or 1:1 pixel display (micro) and/or actual size (1"=1") and/or full image (global). | Image unsuitable for most applications. In some cases, despite the low rating, image may warrant long-term retention if: image is the "best copy available"; image is known to have been produced for a very specific output. | Identify and record the defects relating to intermediates and the digital images -- illustrative examples: Intermediates: all defects listed above; uneven illumination during photography; under- or over-exposed copy transparencies; reflections in encapsulation; etc. Digital images: all defects listed above; clipped highlight and/or clipped shadow detail; uneven illumination during scanning; reflections in encapsulation; image cropped; etc. |
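The decision logic of the ratings table can be expressed as a short function, sketched below. The reviewer still performs the visual inspection; the function only maps the recorded observations to a rating. The names are hypothetical, and one case is an assumption: the table does not say how to rate an image whose defects are minor but visible at actual size or at full image, so the sketch treats that case as a rating of 0.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Worst defect a reviewer observed at a given magnification."""
    NONE = 0
    MINOR = 1
    OBVIOUS = 2


def assess_rating(micro, actual_size, full_image):
    """Map defect observations at three views to a rating of 2, 1, or 0.

    micro:       100% or 1:1 pixel display
    actual_size: 1"=1" display
    full_image:  global view
    """
    if Severity.OBVIOUS in (micro, actual_size, full_image):
        return 0  # obvious defects at any view
    if actual_size == Severity.NONE and full_image == Severity.NONE:
        # Clean at actual size and full image; micro view decides 2 versus 1.
        return 2 if micro == Severity.NONE else 1
    # Assumption: minor defects visible beyond the micro view are rated 0,
    # since the table does not cover this case explicitly.
    return 0


clean = assess_rating(Severity.NONE, Severity.NONE, Severity.NONE)
minor_at_micro = assess_rating(Severity.MINOR, Severity.NONE, Severity.NONE)
clipped = assess_rating(Severity.NONE, Severity.NONE, Severity.OBVIOUS)
```

For ratings of 1 and 0, the specific defects observed would be recorded alongside the rating, as the Defect Identification column indicates.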
As stated earlier, image quality assessment rating is applied to the digital image but is also linked to information regarding
the source material from which it was scanned. Metadata about the image files includes a placeholder for information regarding
source material, which includes a description of whether the analog source is the original or an intermediate, and if so,
what kind of intermediate (copy, dupe, microfilm, photocopy, etc.) as well as the source format. Knowledge of deficiencies
in the source material (beyond identifying the record type and format) helps to inform image quality assessment as well.
The practicality of implementing this kind of assessment has not yet been tested, especially since it necessitates a review
of images at the file level. Until this conceptual approach gains broader acceptance and consistent implementation within
the community, quality assessment metadata may only be useful for local preservation decisions. As the assessment is inherently
technical in nature, a basic foundation in photography and imaging is
helpful in order to accurately evaluate technical aspects of the file, as well as to provide a degree of trustworthiness in
the reviewer and in the rating that is applied.
Records management/recordkeeping --
Another type of metadata, relevant to the digitization of federal records in particular, is records management metadata. Records
management metadata is aligned with administrative-type metadata in that its function is to assist in the management of records
over time; this information typically includes descriptive (and, more recently, preservation) metadata as a subset of the
information necessary to both find and manage records. Records management metadata is usually discussed in the context of
the systems or domains in which it is created and maintained, such as
Records Management Application (RMA) systems. This includes metadata about the records as well as the organizations, activities,
and systems that create them. The most influential standard in the United States on records management metadata is the Department
of Defense's Design Criteria Standard for Electronic Records Management Software Applications (DOD 5015.2) at http://www.dtic.mil/whs/directives/corres/html/50152std.htm. This standard focuses on minimum metadata elements a RMA should capture and maintain, defines a set of metadata elements
at the file plan, folder, and record levels, and generally discusses the functionality that an RMA should have as well as
the management, tracking, and integration of metadata that is held in RMAs.
Records Management metadata should document whether digital images are designated as permanent records, new records, temporary
records, reference copies, or are accorded a status such as "indefinite retention." A determination of the status of digital
images in a records management context should be made up front, at the point of creation of the image, as this may have an effect
on the level and detail of metadata that will be gathered for a digital object to maintain its significant properties and
functionality over the long term. Official designation of the status of the digital images will be an important piece of metadata
to have as digital assets are brought into a managed system, such as NARA's Electronic Records Archive (ERA), which will have
extensive records management capabilities.
In addition to a permanent or temporary designation, records management metadata should also include documentation on any
access and/or usage restrictions for the image files. Metadata documenting restrictions that apply to the images could become
essential if both unrestricted and restricted materials and their metadata are stored and managed together in the same system,
as these files will possess different maintenance, use and access requirements. Even if restricted files are stored on a physically
separate system for security purposes, metadata about these files may not be segregated and should therefore include information
on restrictions.
For digitization projects done under NARA 816 guidance, we assume classified, privacy restricted, and any records with other
restrictions will not be selected for digitization. However, records management metadata should still include documentation
on access and usage restrictions -- even unrestricted records should be identified as "unrestricted." This may be important
metadata to express at the system level as well, as controls over access to and use of digital resources might be built directly
into a delivery or access system.
In the future, documentation on access and use restrictions relevant to NARA holdings might include information such as: "classified"
(which should be qualified by level of classification); "unclassified" or "unrestricted;" "declassified;" and "restricted,"
(which should be qualified by a description of the restrictions, e.g., specific donor-imposed restrictions).
Classification designation will have an impact on factors such as physical storage (files may be physically or virtually stored
separately), who has access to these resources, and different maintenance strategies.
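A restriction element of this kind could be modeled as a small controlled vocabulary in which every record carries an explicit status and in which "classified" and "restricted" cannot be recorded without a qualifier. The sketch below is illustrative only; the class and field names are hypothetical, and the vocabulary is simply the set of example values listed above.

```python
from dataclasses import dataclass

ALLOWED_STATUSES = {
    "classified", "unclassified", "unrestricted", "declassified", "restricted",
}
# Statuses that must be qualified (level of classification, or a
# description of the restriction).
REQUIRES_QUALIFIER = {"classified", "restricted"}


@dataclass(frozen=True)
class AccessRestriction:
    """Access and use restriction recorded for an image file."""
    status: str
    qualifier: str = ""

    def __post_init__(self):
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown restriction status: {self.status!r}")
        if self.status in REQUIRES_QUALIFIER and not self.qualifier.strip():
            raise ValueError(f"status {self.status!r} requires a qualifier")


# Unrestricted records are stated explicitly rather than left blank.
open_record = AccessRestriction("unrestricted")
donor_record = AccessRestriction("restricted", "donor-imposed: no reproduction")
```

Because an empty status is rejected, a missing value cannot be mistaken for "unrestricted" when restricted and unrestricted files are managed together.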
Basic records management metadata about the image files will facilitate bringing them into a formal system and will inform
functions such as scheduling retention timeframes, how the files are managed within a system, what types or levels of preservation
services can be performed, or how they are distributed and used by researchers, for example.
Tracking --
Tracking metadata is used to control or facilitate the particular workflow of an imaging project during different stages of
production. Elements might reflect the status of digital images as they go through different stages of the workflow (batch
information and automation processes, capture, processing parameters, quality control, archiving, identification of where/media
on which files are stored); this is primarily internally-defined metadata that serves as documentation of the project and
may also serve as a statistical source of information to track and report on progress of image files. Tracking
metadata may exist in a database or via a directory/folder system.
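As a simple illustration, tracking metadata can be reduced to a status value per file plus a summary report, as in the sketch below. The stage names are hypothetical and loosely follow the workflow stages mentioned above; an actual project would define its own stages and would probably keep this information in a production database.

```python
from collections import Counter

# Hypothetical workflow stages, in the order files pass through them.
STAGES = ("batched", "captured", "processed", "quality_checked", "archived")


def advance(status_by_file, file_name):
    """Move one file to the next workflow stage and return the new stage."""
    index = STAGES.index(status_by_file[file_name])
    if index == len(STAGES) - 1:
        raise ValueError(f"{file_name} has already reached the final stage")
    status_by_file[file_name] = STAGES[index + 1]
    return status_by_file[file_name]


def progress_report(status_by_file):
    """Count files at each stage, including stages with no files."""
    counts = Counter(status_by_file.values())
    return {stage: counts.get(stage, 0) for stage in STAGES}


tracking = {"img_001.tif": "batched", "img_002.tif": "captured"}
new_stage = advance(tracking, "img_001.tif")
report = progress_report(tracking)
```

The report is the kind of statistical summary that could be used to track and report on the progress of image files through production.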
Meta-metadata --
Although this information is difficult to codify, it usually refers to metadata that describes the metadata record itself,
rather than the object it is describing, or to high-level information about metadata "policy" and procedures, most often on
the project level. Meta-metadata documents information such as who records the metadata, when and how it gets recorded, where
it is located, what standards are followed, and who is responsible for modification of metadata and under what circumstances.
It is important to note that metadata files yield "master" records as well. These non-image assets are subject to the same
rigor of quality control and storage as master image files. Provisions should be made for the appropriate storage and management
of the metadata files over the long term.
Assessment of Metadata Needs for Imaging Projects:
Before beginning any scanning, it is important to conduct an assessment both of existing metadata and metadata that will be
needed in order to develop data sets that fit the needs of the project. The following questions frame some of the issues to
consider:
- Does metadata already exist in other systems (database, finding aid, on item itself) or structured formats (Dublin Core, local
database)?
If metadata already exists, can it be automatically derived from these systems, pointed to from new metadata gathered during
scanning, or does it require manual input? Efforts to incorporate existing metadata should be pursued. It is also extremely
beneficial if existing metadata in other systems can be exported to populate a production database prior to scanning. This
can be used as base information needed in production tracking, or to link item level information collected at the time of
scanning to metadata describing the content of the resource. An evaluation of the completeness and quality of existing metadata
may need to be made to make it useful (e.g., what are the characteristics of the data content, how is it structured, can it
be easily transformed?)
It is likely that different data sets with different functions will be developed, and these sets will exist in different systems.
However, efforts to link together metadata in disparate systems should be made so that it can be reassembled into something
like a METS document, an Archival XML file for preservation, or a Presentation XML file for display, depending on what is
needed. Metadata about digital images should be integrated into peer systems that already contain metadata about both digital
and analog materials. By their nature, digital collections should not be viewed as something separate from non-digital collections.
Access should be promoted across existing systems rather than building a separate stand-alone system.
- Who will capture metadata?
Metadata is captured by systems or by humans and is intended for system or for human use. For example, certain preservation
metadata might be generated by system-level activities such as data backup or copying. Certain technical metadata is used
by applications to accurately render an image. In determining the function of metadata elements, it is important to establish
whether this information is important for use by machines or by people. If it is information that is used and/or generated
by systems, is it necessary to explicitly record it as metadata? What form of metadata is most useful for people? Most metadata
element sets include less structured, note or comment-type fields that are intended for use by administrators and curators
as data necessary for assessment of the provenance, risk of obsolescence, and value inherent to a particular class of objects.
Any data, whether generated by systems or people, that is necessary to understand a digital object, should be considered as
metadata that may be necessary to formally record. But because of the high costs of manually generating metadata and tracking
system-level information, the use and function of metadata elements should be carefully considered. Although some metadata
can be automatically captured, there is no guarantee that this data will be valuable over the long term.
- How will metadata be captured?
Metadata capture will likely involve a mix of manual and automated entry. Descriptive and structural metadata creation is
largely manual; some may be automatically generated through OCR processes to create indexes or fulltext; some technical metadata
may be captured automatically from imaging software and devices; more sophisticated technical metadata, such as image quality
assessment metadata used to inform preservation decisions, will require visual analysis and manual input.
An easy-to-use and customizable database or asset management system with a graphical and intuitive front end, preferably structured
to mimic a project's particular metadata workflow, is desirable and will make for more efficient metadata creation.
- When will metadata be collected?
Metadata is usually collected incrementally during the scanning process and will likely be modified over time. At a minimum, start
with a small element set that is known to be needed and add elements later, if necessary.
Assignment of a unique identifier or naming scheme should occur up front. We also recommend that descriptive metadata be gathered
prior to capture to help streamline the scanning process. It is usually much more difficult to add new metadata later on,
without consultation of the originals. The unique file identifier can then be associated with a descriptive record identifier,
if necessary.
A determination of what structural metadata elements to record should also occur prior to capture, preferably during the preparation
of materials for capture or during collation of individual items. Information about the hierarchy of the collection, the object
types, and the physical structure of the objects should be recorded in a production database prior to scanning. The structural
parts of the object can be linked to actual content files during capture. Most technical metadata is gathered at the time
of scanning. Preservation metadata is likely to be recorded later on, upon ingest into a repository.
- Where will the metadata be stored?
Metadata can be embedded within the resource (such as an image header or file name) or can reside in a system external to
the resource (such as a database) or both. Metadata can also be encapsulated with the file itself, such as with the Metadata
Encoding and Transmission Standard (METS). The choice of location of metadata should encourage optimal functionality and long-term
management of the data.
Header data consists of information necessary to decode the image, and has somewhat limited flexibility in terms of data values
that can be put into the fields. Header information accommodates more technical than descriptive metadata (but richer sets
of header data can be defined depending on the image file format). The advantage is that metadata remains with the file, which
may result in more streamlined management of content and metadata over time. Several tags are saved automatically as part
of the header during processing, such as dimensions, date, and color profile information, which can serve as base-level technical
metadata requirements. However, methods for storing information in file format headers are very format-specific and data may
be lost in conversions from one format to another. Also, not all applications may be able to read the data in headers. Information
in headers should be manually checked to confirm that data has transferred correctly and has not been overwritten during processing.
The presence of data in headers does not guarantee that it is unaltered or that it has been used as intended. Information
in headers should be evaluated to determine if it has value. Data from image headers can be extracted and imported into a
database; a relationship between the metadata and the image must then be established and maintained.
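The extraction of header data described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of our workflow: it reads a PNG header only because that format can be parsed with the Python standard library (TIFF headers would need a dedicated reader), and the function name `read_png_header` is hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def read_png_header(data: bytes) -> dict:
    """Return base-level technical metadata from a PNG's IHDR chunk."""
    if len(data) < 33 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file, or file is truncated")
    # The first chunk follows the signature: 4-byte length, 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("IHDR chunk missing or malformed")
    body = data[16:29]
    (stored_crc,) = struct.unpack(">I", data[29:33])
    # The checksum covers the chunk type and data; a mismatch suggests
    # the header was damaged or altered after it was written.
    if zlib.crc32(chunk_type + body) != stored_crc:
        raise ValueError("IHDR checksum mismatch")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", body[:10])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
    }
```

The returned dictionary could then be written to a database row keyed on the image's identifier, which establishes the relationship between metadata and image mentioned above.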
Storing metadata externally to the image in a database provides more flexibility in managing, using, and transforming it and
also supports multi-user access to the data, advanced indexing, sorting, filtering, and querying. It can better accommodate
hierarchical descriptive information and structural information about multi-page or complex objects, as well as importing,
exporting, and harvesting of data to external systems or other formats, such as XML. Because metadata records are resources
that need to be managed in their own right, there is certainly benefit to maintaining metadata separately from file content
in a managed system. Usually a unique identifier or the image file name is used to link metadata in an external system to
image files in a directory.
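A minimal sketch of this linkage, assuming an SQLite database and a hypothetical table named `image_metadata`, might look like the following. The table layout and function names are illustrative only; a production system would carry many more fields.

```python
import sqlite3

def create_metadata_store(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table keyed on the unique identifier."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS image_metadata (
               primary_identifier TEXT PRIMARY KEY,
               file_name          TEXT NOT NULL UNIQUE,
               title              TEXT
           )"""
    )

def register_image(conn, primary_identifier, file_name, title=None):
    """Record the link between an identifier and an image file."""
    conn.execute(
        "INSERT INTO image_metadata VALUES (?, ?, ?)",
        (primary_identifier, file_name, title),
    )

def file_for_identifier(conn, primary_identifier):
    """Return the file name linked to an identifier, or None if absent."""
    row = conn.execute(
        "SELECT file_name FROM image_metadata WHERE primary_identifier = ?",
        (primary_identifier,),
    ).fetchone()
    return row[0] if row else None
```

Declaring the identifier as the primary key and the file name as unique means the database itself rejects a second record for the same image, which helps keep the link unambiguous.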
We recommend that metadata be stored both in image headers as well as in an external database to facilitate migration and
repurposing of the metadata. References between the metadata and the image files can be maintained via persistent identifiers.
A procedure for synchronization of changes to metadata in both locations is also recommended, especially for any duplicated
fields. This approach allows for metadata redundancy in different locations and at different levels of the digital object
for ease of use (image file would not have to be accessed to get information; most header information would be extracted and
added into an external system). Not all metadata should be duplicated in both places (internal and external to the file).
Specific metadata is required in the header so that applications can interpret and render the file; additionally, minimal
descriptive metadata such as a unique identifier or short description of the content of the file should be embedded in header
information in case the file becomes disassociated from the tracking system or repository. Some applications and file formats
offer a means to store metadata within the file in an intellectually structured manner, or allow the referencing of standardized
schemes, such as Adobe XMP or the XML metadata boxes in the JPEG 2000 format. Otherwise, most metadata will reside in external
databases, systems, or registries.
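The synchronization check recommended above for duplicated fields can be sketched as a simple comparison. This is an assumption-laden illustration: the function name `find_out_of_sync` and the field names in the usage lines are hypothetical, and reading values out of the header or the database is left to other tools.

```python
def find_out_of_sync(embedded: dict, external: dict, duplicated_fields) -> dict:
    """Report duplicated fields whose embedded and external values differ.

    Returns a mapping of field name to (embedded value, external value).
    A field missing from one location is reported with None on that side.
    """
    differences = {}
    for name in duplicated_fields:
        header_value = embedded.get(name)
        database_value = external.get(name)
        if header_value != database_value:
            differences[name] = (header_value, database_value)
    return differences

# Hypothetical usage: only the fields held in both places are compared.
header = {"identifier": "0001", "description": "Map of harbor"}
database = {"identifier": "0001", "description": "Map of the harbor"}
mismatches = find_out_of_sync(header, database, ["identifier", "description"])
```

Which location is treated as authoritative when a mismatch is found is a local policy decision that this sketch does not make.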
- How will the metadata be stored?
Metadata schemes and data dictionaries define the content rules for metadata creation, but not the format in which metadata
should be stored. Format may partially be determined by where the metadata is stored (file headers,
relational databases, spreadsheets) as well as the intended use of the metadata -- does it need to be human-readable, or
indexed, searched, shared, and managed by machines? How the metadata is stored or encoded is usually a local decision. Metadata
might be stored in a relational database or encoded in XML, such as in a METS document, for example. Guidelines for implementing
Dublin Core in XML are also available at: http://dublincore.org/documents/2002/09/09/dc-xml-guidelines/.
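As an illustration of encoding metadata in XML, the following sketch serializes a few simple Dublin Core elements with the Python standard library. The Dublin Core element namespace is the published one; the `metadata` container element and the function name `to_dublin_core_xml` are assumptions made for this example, since the choice of container is left to the application.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def to_dublin_core_xml(record: dict) -> str:
    """Serialize {element name: value} pairs as simple Dublin Core XML."""
    root = ET.Element("metadata")  # hypothetical container element
    for element_name, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element_name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical record; the values are placeholders.
xml_text = to_dublin_core_xml({"identifier": "0001", "title": "Sample title"})
```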
Adobe's Extensible Metadata Platform (XMP) is another emerging, standardized format for describing where metadata can be stored
and how it can be encoded, thus facilitating exchange of metadata across applications. The XMP specification provides both
a data model and a storage model. Metadata can be embedded in the file in header information or stored in XML "packets" (these
describe how the metadata is embedded in the file). XMP supports the capture of (primarily technical) metadata during content
creation and modification and embeds this information in the file, which can then be extracted later into a digital asset
management system or database or as an XML file. If an application is XMP enabled or aware (most Adobe products are), this
information can be retained across multiple applications and workflows. XMP supports customization of metadata to allow for
local field implementation using Adobe's Custom File Info Panels application. XMP supports a number of internal schemas, such
as Dublin Core and EXIF (a metadata standard used for image files, particularly by digital cameras), as well as a number of
external extension schemas. The RLG initiative "Automatic Exposure: Capturing Technical Metadata for Digital Still Images," mentioned earlier, is considering the use of XMP to embed technical metadata in image files during capture and is developing
a Custom File Info Panel for NISO Z39.87 technical metadata. XMP does not guarantee the automatic entry of all necessary metadata
(several fields will still require manual entry, especially local fields), but allows for more complete, customized, and accessible
metadata about the file.
See http://www.adobe.com/products/xmp/main.html for more detailed information on the XMP specification and other related documents.
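Because an XMP packet is delimited by `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, an embedded packet can often be located by scanning the file's bytes. The sketch below shows that idea only; it assumes a single UTF-8 encoded packet, the function name `extract_xmp_packet` is hypothetical, and a format-aware reader is preferable wherever one is available.

```python
from typing import Optional

def extract_xmp_packet(data: bytes) -> Optional[str]:
    """Return the first XMP packet found in raw file bytes, or None."""
    start = data.find(b"<?xpacket begin=")
    if start == -1:
        return None
    end_marker = data.find(b"<?xpacket end=", start)
    if end_marker == -1:
        return None
    # Include the closing "?>" of the trailing processing instruction.
    end = data.find(b"?>", end_marker)
    if end == -1:
        return None
    # Assumption: the packet is UTF-8; other encodings are not handled.
    return data[start:end + 2].decode("utf-8", errors="replace")
```

The returned text is XML and could be passed to an XML parser or stored alongside the image's record in an external system.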
- Will the metadata need to interact or be exchanged with other systems?
This requirement reinforces the need for standardized ways of recording metadata so that it will meet the requirements of
other systems. Mapping from an element in one scheme to an analogous element in another scheme will require that the meaning
and structure of the data is shareable between the two schemes, in order to ensure usability of the converted metadata. Metadata
will also have to be stored in or assembled into a document format, such as XML, that promotes easy exchange of data. METS-compliant
digital objects, for example, promote interoperability by virtue of their standardized, "packaged" format.
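Mapping between schemes can be sketched as a lookup table, often called a crosswalk. The mapping below, from a few local placeholder field names to simple Dublin Core element names, is purely illustrative and is not an endorsed NARA crosswalk; the function name `crosswalk` is likewise hypothetical.

```python
# Hypothetical crosswalk: local field name -> Dublin Core element name.
LOCAL_TO_DC = {
    "Primary Identifier": "identifier",
    "Secondary Identifier(s)": "identifier",
    "Title": "title",
}

def crosswalk(local_record: dict, mapping: dict = LOCAL_TO_DC):
    """Convert a local record; return (mapped, unmapped) dictionaries.

    Target elements hold lists because several local fields may map to
    the same target element. Fields with no analogous element are
    returned separately so they are not silently dropped.
    """
    mapped: dict = {}
    unmapped: dict = {}
    for field_name, value in local_record.items():
        target = mapping.get(field_name)
        if target is None:
            unmapped[field_name] = value
        else:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped
```

Reporting the unmapped fields makes visible where the meaning of the data is not shareable between the two schemes, which is the condition the paragraph above warns about.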
- At what level of granularity will the metadata be recorded?
Will metadata be collected at the collection level, the series level, the imaging project level, the item (object) level,
or file level? Although the need for more precise description of digital resources exists so that they can be searched and
identified, for many large-scale digitization projects, this is not realistic. Most collections at NARA are neither organized
around nor described at the individual item level, and cannot be without significant investment of time and cost. Detailed
description of records materials is often limited by the amount of information known about each item, which may require significant
research into identification of subject matter of a photograph, for example, or even what generation of media format is selected
for scanning. Metadata will likely be derived from and exist on a variety of levels, both logical and file, although not all
levels will be relevant for all materials. Certain information required for preservation management of the files will be necessary
at the individual file level. An element indicating level of aggregation (e.g., item, file, series, collection) at which metadata
applies can be incorporated, or the relational design of the database may reflect the hierarchical structure of the materials
being described.
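One way to carry a level-of-aggregation element is sketched below. The class and function names are hypothetical, and the rule that more specific levels override more general ones is an assumption made for this example rather than a stated requirement.

```python
from dataclasses import dataclass, field
from typing import Optional

LEVELS = ("collection", "series", "item", "file")

@dataclass
class MetadataRecord:
    identifier: str
    level: str                      # level of aggregation the record applies to
    elements: dict = field(default_factory=dict)
    parent: Optional["MetadataRecord"] = None

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")

def effective_metadata(record: MetadataRecord) -> dict:
    """Merge elements from the top of the hierarchy down to this record."""
    chain = []
    node = record
    while node is not None:
        chain.append(node)
        node = node.parent
    merged: dict = {}
    # Apply the most general level first so specific levels override it.
    for node in reversed(chain):
        merged.update(node.elements)
    return merged
```

With this structure, a series title recorded once at the series level is available to every file beneath it, while file-level preservation information stays with the individual file record.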
- Will the metadata adhere to agreed-upon conventions and terminology?
We recommend that standards, if they exist and apply, be followed for the use of data elements, data values, and data encoding.
Attention should be paid to how data is entered into fields and whether controlled vocabularies have been used, in case transformation
is necessary to normalize the data.
Local Implementation:
Because most of what we scan comes to the Imaging Lab on an item-by-item basis, we are capturing minimal descriptive and technical
metadata at the item level only during the image capture and processing stage. Until a structure into which we can record
hierarchical information both about the objects being scanned and their higher-level collection information is in place, we
are entering basic metadata in files using Adobe Photoshop. Information about the file is added to the IPTC (International
Press Telecommunications Council) fields in Photoshop in anticipation of mapping these values to an external database. The
IPTC fields are used as placeholder fields only. This information is embedded in the file using Adobe XMP (Extensible Metadata
Platform: http://www.adobe.com/products/xmp/main.html). The primary identifier is automatically imported into the "File
Info" function in Photoshop from our scanning software. We anticipate implementing the Custom Panel Description File Format
feature available in XMP to define our own metadata set and then exporting this data into an asset management system, since
the data will be stored in easily migratable XML packets.
The following tables outline minimal descriptive, technical, and structural metadata that we are currently capturing at the
file level (table indicates the elements that logically apply at the object level):
Descriptive/Structural Placeholder Fields -- Logical and/or File Attributes
| Element Name | Note | Level (Object, File) of Metadata |
|---|---|---|
| Primary Identifier | Unique identifier (numerical string) of the digital image. This identifier also serves as the identifier for an associated descriptive metadata record in an external database. May be derived from an existing scheme. This identifier is currently "manually" assigned. We anticipate a "machine" assigned unique identifier to be associated with each image as it is ingested into a local repository system; this will be more like a "persistent identifier." Since multiple identifiers are associated with one file, it is likely that this persistent identifier will be the cardinal identifier for the image. | Object, File |
| Secondary Identifier(s) | Other unique identifier(s) associated with the original | Object, File |
| Title | Title [informal or assigned] or caption associated with the resource | Object |
| Record Group ID | Record Group Identifier (if known) | Object |
| Record Group Descriptor | Title of Record Group (if known) | Object |
| Series | Title of Series (if known) | Object |
|