Evaluating Metadata Quality for Digital Preservation at Harvard Library
An overview of why metadata quality matters in digital preservation, drawing on Harvard Library's experience managing its Digital Repository Service. It covers the problems encountered with user-contributed metadata and the need for systematic validation and tracking to preserve roughly 47 million files accurately.
- Metadata Quality
- Digital Preservation
- Harvard Library
- User-Contributed Metadata
- Systematic Validation
Presentation Transcript
Even More Better Metadata
SAA 2014 Panel: Metadata and Digital Preservation: How Much Do We Really Need?
Andrea Goethals, Harvard Library
How much metadata do we really need? That depends on the quality of the metadata...
Context of my remarks
- Experience developing for, and now managing, Harvard Library's Digital Repository Service (DRS)
- In production from 2000 to the present
- ~47 million files
- Recent multi-year overhaul of the repository to the new DRS
- Provided a chance to analyze metadata and rethink the approach
Prior to the new DRS
- Almost all metadata was user-contributed
- Contributor expertise ranged from professional labs to curators, archivists and other staff
- Very little validation of user-contributed metadata
- Metadata elements had grown organically rather than systematically. For example...
Some elements weren't specific enough
- File format was one of: ICC, GIF, JPEG, TIFF, TDF, TEXT, PCD, AIFF, RealAudio, APP, WAV, WFR, JP2, JPF, ZIP, GZIP, PDF
- Format variations and versions were not recorded
Some elements were too specific
- Text abstract character repertoire was one of: US-ASCII, Unicode
- Text character map was one of: ISO_646.irv:1983, UTF-8
- These weren't validated, so in reality the text could be in any character set but would be recorded as one of these regardless
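A minimal sketch of the kind of check that was missing here: confirming that a file's bytes at least decode under the character set recorded for it. The function name `check_declared_charset` is hypothetical and is not part of the DRS or any tool named in this talk.

```python
def check_declared_charset(data: bytes, declared: str) -> bool:
    """Return True if the bytes decode cleanly under the declared character set.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in that character set.
        # LookupError: Python does not recognize the character set name.
        return False
    return True


sample = "café".encode("utf-8")
print(check_declared_charset(sample, "us-ascii"))  # False: non-ASCII bytes present
print(check_declared_charset(sample, "utf-8"))     # True
```

A clean decode is only a necessary condition. Some single-byte encodings accept any byte sequence, so a check like this can catch a wrong label but cannot prove a label is right.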
Some generic elements were only tracked for certain formats
- For images only: enhancements, history, methodology, producer, production software, system
- These elements allowed free text, leading to a variety of interpretations over time
Errors in relationship metadata
- Missing relationships (e.g. referenced in the METS descriptor file but lacking explicit relationships)
- Redundant relationships (files related more than once to the same files)
- Illogical relationships (only discoverable because of redundant metadata). Examples:
  - Target images related to other target images
  - Non-target images described as target images
  - A METS descriptor file described as a scanned image
  - Objects merged into themselves
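Two of these error classes, redundant relationships and objects related to themselves, can be found mechanically. The sketch below assumes relationships are available as `(source_id, relation_type, target_id)` tuples; that representation, the relation names and the function name are illustrative assumptions, not the DRS data model.

```python
from collections import Counter


def audit_relationships(relationships):
    """Report redundant and self-referential relationships.

    relationships: list of (source_id, relation_type, target_id) tuples
    (a hypothetical representation chosen for this example).
    """
    counts = Counter(relationships)
    # A relationship recorded more than once is redundant.
    redundant = sorted(rel for rel, n in counts.items() if n > 1)
    # A relationship whose source and target are the same object is illogical.
    self_referential = sorted(rel for rel in counts if rel[0] == rel[2])
    return {"redundant": redundant, "self_referential": self_referential}


rels = [
    ("img1", "is_target_of", "img2"),
    ("img1", "is_target_of", "img2"),   # recorded twice
    ("obj9", "merged_into", "obj9"),    # merged into itself
]
print(audit_relationships(rels))
```

Missing relationships need a second source to compare against, such as the METS descriptor, so they are outside the scope of this sketch.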
Strategies in the new DRS for improving metadata
- Pull descriptive metadata from catalogs at ingest or on request
- Automated format identification, validation & metadata extraction at ingest, using FITS
- Validation when files are ingested, added or removed, or when relationship metadata is changed
- Sync with catalogs; check and improve metadata on migration
File Information Tool Set (FITS)
- Identifies many file formats
- Validates a few file formats
- Extracts metadata from files
- Aggregates metadata from many tools
- Calculates basic file info (file size, MD5, etc.)
- Outputs technical metadata
  - Community-standard metadata schemas
- Identifies problem files
  - Conflicting tool opinions on format, metadata values
  - Unidentifiable file formats
  - Encrypted, rights metadata embedded in files
File Information Tool Set (FITS)
[Diagram: any file is passed to a set of wrapped tools (JHOVE, DROID, NLNZ ME, ExifTool, File utility, FFIdent, plus Tika, OIS Audio Information, ADL Tool, OIS File Information and OIS XML Metadata). Each FITS wrapper + XSL converts its tool's output to FITS XML; a consolidator merges these into a single FITS XML document, which an exporter can convert to standard XML.]
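The consolidation step can be pictured roughly as follows. This is a simplified sketch of the idea in plain Python, not FITS code: the `ToolReport` class, the `consolidate` function and the status strings are all hypothetical, and real FITS works on XML rather than dictionaries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolReport:
    """One tool's opinion about a file (hypothetical structure)."""
    tool: str
    format_name: Optional[str]  # None when the tool could not identify the file


def consolidate(reports):
    """Merge per-tool format opinions into one record and note disagreement."""
    opinions = {}
    for report in reports:
        if report.format_name is not None:
            opinions.setdefault(report.format_name, []).append(report.tool)
    if not opinions:
        status = "unidentified"   # no tool recognized the file
    elif len(opinions) == 1:
        status = "agreed"         # every tool with an opinion said the same thing
    else:
        status = "conflict"       # tools disagree; worth flagging for review
    return {"status": status, "opinions": opinions}


reports = [
    ToolReport("tool_a", "Portable Document Format"),
    ToolReport("tool_b", "Portable Document Format"),
    ToolReport("tool_c", None),
]
print(consolidate(reports)["status"])  # agreed
```

Keeping the list of tools behind each opinion preserves provenance, which is what lets a conflict be traced back to a specific tool later.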
FITS configured to get high-quality metadata
- Metadata normalization
  - JPEG2000 = JPEG 2000 = JPEG 2000 image
  - inches = 2 = in.
- Plays to strengths of tools and downplays their weaknesses
  - Overall, trust tool x over tool y
  - Don't run tool x for format z
- Format tree (hierarchy of related formats)
  - OpenDocument is more specific than Zip
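Normalization and the format tree can be sketched together: names are first mapped to one canonical spelling, and then any candidate that is merely an ancestor of another candidate is dropped. The tables and function names below are illustrative assumptions built from the examples on this slide; they are not FITS configuration.

```python
# Hypothetical alias table: lowercase variant -> canonical name.
FORMAT_ALIASES = {
    "jpeg2000": "JPEG 2000",
    "jpeg 2000": "JPEG 2000",
    "jpeg 2000 image": "JPEG 2000",
}

# Hypothetical format tree: more specific format -> its more general parent.
FORMAT_PARENT = {"OpenDocument": "Zip"}


def normalize_format(name):
    """Map a reported format name to its canonical spelling, if one is known."""
    cleaned = name.strip()
    return FORMAT_ALIASES.get(cleaned.lower(), cleaned)


def is_more_specific(a, b):
    """True if format a is a descendant of format b in the format tree."""
    parent = FORMAT_PARENT.get(a)
    while parent is not None:
        if parent == b:
            return True
        parent = FORMAT_PARENT.get(parent)
    return False


def most_specific(candidates):
    """Normalize candidate names, then drop any that are ancestors of another."""
    names = {normalize_format(c) for c in candidates}
    return sorted(n for n in names if not any(is_more_specific(m, n) for m in names))


print(most_specific(["Zip", "OpenDocument"]))          # ['OpenDocument']
print(most_specific(["JPEG2000", "JPEG 2000 image"]))  # ['JPEG 2000']
```

If more than one name survives, the tools genuinely disagree, and that is the case where a rule such as "trust tool x over tool y" or a conflict flag would apply.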
Example of what we know about a file pre- and post-FITS adoption at ingest
Pre-FITS (user-contributed metadata):
- Format = PDF
Post-FITS adoption at ingest:
- Format = Portable Document Format
- MIME media-type = application/pdf
- Format version = 1.4
- Format registry record: Registry: PRONOM; Registry key: fmt/18
- Page count: 24
- Date created by application: 2013-04-02T17:43:27-04:00
- Title: JPCDHEP492
- Creation application: ComSquare ImPDF Library v0.89
- Admin flag: INHIBITOR
Additional strategies in the new DRS
- Move away from overly restrictive metadata elements where needed. Examples:
  - Allow free text for format names
  - Allow any text character set
- Add elements at the format-agnostic file level when they can apply to files in any format, e.g. producer or methodology
- Flag suspicious metadata (and content) for later analysis
Administrative flags
- Help pinpoint incorrect metadata, problem content, or where metadata tools need improvement
- Some examples: FAILED_METADATA_EXTRACTION, FORMAT_ID_CONFLICT, INCORRECT_METADATA, INHIBITOR, RIGHTS_METADATA
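One way to picture how such flags support later analysis is to treat them as a fixed vocabulary and tally them across files. The flag names below come from the slide; the `AdminFlag` enum, the helper functions and the sample file names are hypothetical and do not describe how the DRS stores flags.

```python
from collections import Counter
from enum import Enum


class AdminFlag(Enum):
    """Flag names taken from the slide; the enum itself is illustrative."""
    FAILED_METADATA_EXTRACTION = "FAILED_METADATA_EXTRACTION"
    FORMAT_ID_CONFLICT = "FORMAT_ID_CONFLICT"
    INCORRECT_METADATA = "INCORRECT_METADATA"
    INHIBITOR = "INHIBITOR"
    RIGHTS_METADATA = "RIGHTS_METADATA"


def flag_summary(file_flags):
    """Count how many files carry each flag.

    file_flags: mapping of file id -> set of AdminFlag (hypothetical layout).
    """
    counts = Counter()
    for flags in file_flags.values():
        counts.update(flags)
    return counts


def files_with(file_flags, flag):
    """List the ids of files carrying a given flag, for follow-up review."""
    return sorted(fid for fid, flags in file_flags.items() if flag in flags)


file_flags = {
    "report.pdf": {AdminFlag.INHIBITOR},
    "scan.jp2": {AdminFlag.FORMAT_ID_CONFLICT, AdminFlag.INHIBITOR},
    "notes.txt": set(),
}
print(flag_summary(file_flags)[AdminFlag.INHIBITOR])          # 2
print(files_with(file_flags, AdminFlag.FORMAT_ID_CONFLICT))   # ['scan.jp2']
```

A tally like this indicates whether a flag points at a few problem files or at a systematic weakness in a metadata tool.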
They said it better
- "It is quality rather than quantity that matters." (Lucius Annaeus Seneca)
- "Quality is not an act, it is a habit." (Aristotle)
- "Quality is never an accident. It is always the result of intelligent effort." (John Ruskin)