Preserv Format Profiling
Introduction
The first step in preserving digital material is to develop a policy to inform decisions and actions. For open access repositories, preservation policy is particularly difficult because of the number of stakeholders, the differing types of material and the open-ended commitment to preservation that institutions are required to make.
Institutional repositories typically cater to the needs of their faculty, which results in supporting a wide range of research and teaching activities, as well as (potentially) supporting interactions with external entities, e.g. through hosting workshop and conference materials.
The IR administrator therefore needs to formulate a preservation policy that helps to determine the types of material to accept, what metadata to capture, what preservation activities may be necessary (e.g. the migration of file formats) and whether to employ external services or to support preservation 'in-house'.
As part of developing a preservation policy we envisage a key step being to identify the type of material contained in a repository. While IR software may provide a tool to achieve this internally, we would like to - as far as possible - support preservation through service-based tools. By building services that address specific preservation needs, repository software can focus on the tasks of storage, authorisation and user interaction (which are all dependent on institutional infrastructure and culture). Service providers can provide the preservation expertise, determined by best practice, enabling administrators to focus on supporting and capturing content from their faculty.
Purpose and Aims
The motivation for building a service to identify the type of material contained in repositories is that file formats need to be 1) identified, 2) verified and 3) migrated in the event of obsolescence. While identifying the file format is not exactly the same as identifying the 'type' of material (e.g. JPEG is a file format for images, which may be astronomical data), it is at least a start towards identifying the preservation needs of a repository.
Using the OAI-PMH we have built a service that examines the content of repositories and builds a file format profile based on downloading the contained files and identifying them with the PRONOM Droid file format identification tool.
The goal of this service is to provide the administrator with a summary of the content in their repository and, when the PRONOM database supports it, to provide a 'technology watch' service that can warn administrators of file formats that are at risk of becoming inaccessible. It may also be important to identify missing data: many repositories contain 'stub' records that only contain metadata. To improve long-term access to indexed material the administrator may want to ask the depositing user to deposit the 'full-text' (or equivalent material). Even if the full-text isn't made publicly accessible, the fact that it is archived in the repository will improve its long-term accessibility (e.g. beyond the restrictions placed on it by copyright).
Using the Preserv Profile
The Preserv Profile tool is integrated into the Registry of Open Access Repositories (ROAR). Accessing the profile for a known repository requires locating the repository in ROAR and, if a profile is available, clicking the Preserv Profile link in the repository record display.
If a profile is not available no link is provided; see Limitations for why profiles for many repositories are not available.
Following the Preserv Profile link displays the profile page:
The profile page consists of brief notes describing what the profile is, a link to an alerting service, a list of OAI-PMH URLs that the profile is generated from and a histogram of the number of files identified per (PRONOM-defined) file format.
The profile is based on OAI-PMH records harvested from the repository's registered OAI interface, rather than the web view of the repository. Because of this, the URL(s) provided are the URLs of the OAI interfaces the profile is generated from, and not the URL of the web interface for the repository. As will be seen later, multiple repositories can be aggregated together to generate a single profile, in which case the URLs of all of them are shown. In this case the repository is represented by only a single OAI interface:
The file format histogram gives an instant overview of the file formats contained in the repository, rank-ordered by the total number of files found. The No files found column is purely a visual cue - it is actually the number of OAI records for which no file could be found, whereas the other columns are the number of files found (of which there will likely be more than one per record!).
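The counting behind such a histogram can be sketched as follows. This is a minimal illustration rather than the ROAR implementation: the input shape (a mapping from OAI record identifier to the list of formats identified for that record's files) and the name `build_profile` are assumptions made for the example.

```python
from collections import Counter


def build_profile(record_formats):
    """Return (format counts ranked by frequency, records with no files).

    record_formats is a hypothetical mapping: OAI identifier -> list of
    format names, one entry per file found for that record.
    """
    file_counts = Counter()
    records_without_files = 0
    for formats in record_formats.values():
        if not formats:
            # This figure counts records, whereas file_counts counts files.
            records_without_files += 1
        file_counts.update(formats)
    # most_common() returns (format, count) pairs in descending count order.
    return file_counts.most_common(), records_without_files


example = {
    "oai:example.org:1": ["Portable Document Format (1.4)", "Hypertext Markup Language"],
    "oai:example.org:2": ["Portable Document Format (1.4)"],
    "oai:example.org:3": [],  # a metadata-only 'stub' record
}
ranked, missing = build_profile(example)
```

Here three records yield three files in two formats, plus one record counted under "No files found", which is why the two kinds of column are not directly comparable.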
The distinction between records and files is due to repositories using a number of different abstract methods of representing digital objects, with the concept of a record potentially representing very different things. As we are concerned with preservation, we are primarily interested in the formats of individual files rather than the more abstract concept of the record format (e.g. an HTML record may contain files in HTML format as well as images in JPEG or PNG). We may assume an HTML file is a (mostly) preservable format; however, inline content may not be as sustainable, so it's important to show all of the files contained in the repository, and not just the formats at the abstract record level.
Each column is a clickable button that provides a complete breakdown for the files identified as that format:
Files are grouped together by OAI record; the OAI record identifiers are given in the first table column with a link to the record in the repository's OAI interface. The second column contains the URL of the identified file(s) and the last column the last-modified date as reported by the repository's web server (not the record datestamp).
If a record contains more than one file each file is shown on the breakdown page:
The current format is emboldened to distinguish it from files in other formats. In this example the record contains two files, one in Adobe PDF 1.4 format and the other in Hypertext Markup Language (i.e. HTML). A possible use of the breakdown is to list file formats identified as a preservation 'risk' and check what other parallel formats may be available. Currently this is difficult to do automatically, because the relationship between multiple files contained in a single record isn't captured (see Limitations).
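A check of this kind might be sketched as below, assuming the files have already been grouped by OAI record. The data shape and the name `parallel_formats` are illustrative assumptions. Because the relationship between files in a record isn't captured, a co-located file is only a candidate alternative, not a confirmed equivalent.

```python
def parallel_formats(record_formats, at_risk):
    """Map each record holding an at-risk format to its other formats."""
    result = {}
    for identifier, formats in record_formats.items():
        if at_risk in formats:
            # An empty list means no other format is available in the record.
            result[identifier] = sorted(set(formats) - {at_risk})
    return result


records = {
    "oai:example.org:1": ["Portable Document Format (1.4)", "Hypertext Markup Language"],
    "oai:example.org:2": ["IBM DisplayWrite Document (2)"],
    "oai:example.org:3": ["IBM DisplayWrite Document (2)", "Plain Text File"],
}
candidates = parallel_formats(records, "IBM DisplayWrite Document (2)")
```

Records that map to an empty list are the ones an administrator would most likely want to follow up first.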
Email Alerts
The alerting service is linked to from the profile page:
It sends periodic emails (at an interval set by the user) that notify the user of new records and files in the subscribed-to repository. The first part of the email consists of the date period the report covers and any errors that occurred during the OAI-PMH harvest:
The main part of the email consists of a report showing the number of new OAI records and files that have been identified in the repository:
New Metadata Records
  20  oai_dc
  20  uketd_dc
New Fulltexts
  1  Fixed Width Values Text File
  2  Hypertext Markup Language (4.01)
  1  IBM DisplayWrite Document (2)
  1  IBM DisplayWrite Document (3)
  1  MS-DOS Text File
  1  MS-DOS Text File with line breaks
  1  Macintosh Text File
  1  OLE2 Compound Document Format
  1  Plain Text File
  3  Portable Document Format (1.3)
  5  Portable Document Format (1.4)
  1  Portable Document Format - Archival (1)
  1  Rich Text Format (1.5)
  1  Rich Text Format (1.6)
  1  Tab-Delimited Text File
  1  Unicode Text File
This email alert service is a rudimentary precursor to a 'technology watch'-like service that could allow administrators to monitor and act on documents entering their repository.
How it Works
The Web interface to the Profiling tool has been described in the previous section, and is part of the ROAR. However, the data collection is part of the Celestial service (an OAI-PMH harvesting/caching tool). While the ROAR contains entries for repositories, discovering the content they contain requires a separate 'harvest' process that downloads all of the metadata records and performs any other analytical processes e.g. determining the format of files linked to in those records.
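At the protocol level, a harvest of this kind is a series of OAI-PMH ListRecords requests. The sketch below only builds the request URLs; how Celestial itself issues and schedules requests is not described here, and the function name and example base URL are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it is sent with the
        # verb only, to fetch the next page of an earlier response.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Selective harvest: only records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


first = list_records_url("http://repository.example.org/oai", from_date="2008-10-01")
following = list_records_url("http://repository.example.org/oai", resumption_token="abc")
```

The `from` argument is what makes periodic revisits cheap: only new or altered records are returned, rather than the whole repository.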
Celestial contains a registry of OAI-PMH interfaces and periodically visits each one to download new or altered records. For GNU EPrints and DSpace-based repositories the Dublin Core records are inspected to locate the URLs of the full-text(s):
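The exact rules Celestial applies are not given here. As a rough sketch, assuming that full-text links appear as http(s) URLs in the `dc:identifier` or `dc:relation` elements of an oai_dc record, candidate URLs could be extracted like this. The sample record, the choice of elements and the function name are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# A made-up oai_dc record for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example record</dc:title>
  <dc:identifier>http://repository.example.org/1/01/paper.pdf</dc:identifier>
  <dc:relation>http://repository.example.org/1/</dc:relation>
  <dc:identifier>Example, A. (2008) An example record.</dc:identifier>
</oai_dc:dc>"""


def candidate_urls(dc_xml, elements=("identifier", "relation")):
    """Return URL-like values from the chosen Dublin Core elements."""
    root = ET.fromstring(dc_xml)
    urls = []
    for name in elements:
        for node in root.iter("{%s}%s" % (DC_NS, name)):
            text = (node.text or "").strip()
            # Skip free-text values such as citations.
            if text.startswith(("http://", "https://")):
                urls.append(text)
    return urls


urls = candidate_urls(SAMPLE)
```

A real rule set would also need to tell full-text files apart from landing pages, which depends on the URL conventions of each repository software.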
If a full-text link is located it is retrieved (files larger than 2MB are ignored - see Limitations) and stored temporarily. The PRONOM Droid tool is run on the downloaded file, and uses heuristics to determine the file format and, for some formats, the file format version. Some files may match more than one file format heuristic, in which case only the first format is registered, and others may not be identified at all, in which case they are flagged as 'Unknown Format'.
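The two decisions in this step, the size cut-off and the choice of which format to register, can be sketched as below. The code does not invoke Droid: the `matches` list stands in for whatever formats the tool reports, and the function names and the exact byte value of the limit are assumptions.

```python
MAX_BYTES = 2 * 1024 * 1024  # assumed byte value of the 2MB limit


def read_limited(chunks, limit=MAX_BYTES):
    """Join byte chunks, or return None once the limit is exceeded."""
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > limit:
            return None  # too large: the file is ignored
    return bytes(received)


def registered_format(matches):
    """Pick the format to record from a list of matched format names."""
    # Only the first match is registered; no match means unknown.
    return matches[0] if matches else "Unknown Format"


small = read_limited([b"%PDF-1.4", b" ..."])
large = read_limited([b"x" * MAX_BYTES, b"x"])
chosen = registered_format(["Plain Text File", "MS-DOS Text File"])
unknown = registered_format([])
```

Reading in chunks means an oversized download can be abandoned as soon as it crosses the limit, instead of after the whole file has been transferred.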
In addition to the file format - as identified by Droid - the mime-type and last-modified-time as returned by the web server are recorded.
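Both values come from standard HTTP response headers (`Content-Type` and `Last-Modified`). A minimal sketch of extracting them, assuming the headers are available as a plain dictionary with conventionally capitalised names (the function name is hypothetical):

```python
from email.utils import parsedate_to_datetime


def file_metadata(headers):
    """Return (mime type, last-modified datetime) from response headers."""
    content_type = headers.get("Content-Type", "")
    # Drop any parameters such as "; charset=...".
    mime_type = content_type.split(";", 1)[0].strip().lower() or None
    last_modified = headers.get("Last-Modified")
    modified = parsedate_to_datetime(last_modified) if last_modified else None
    return mime_type, modified


mime, modified = file_metadata({
    "Content-Type": "application/pdf; charset=binary",
    "Last-Modified": "Sun, 12 Oct 2008 17:44:16 GMT",
})
```

Recording the server-reported mime-type alongside the identified format makes it possible to spot files whose declared type and actual content disagree.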
The file format data is stored in Celestial's database, parallel to the metadata records. ROAR accesses Celestial's database directly to retrieve the file format data and present it to the end user.
The alerting service is part of Celestial, because (currently) alerts are generated for OAI interfaces, and not repository records in ROAR (which may have more than one OAI interface).
Limitations
There are several limitations in the current profiling service.
Repository Support
Currently the GNU EPrints and DSpace repository software packages are 'wrapped': Celestial has rules to locate and identify the URLs of full-texts in Dublin Core as used by these packages. Other repository software isn't supported.
Initial support has been built for the METS metadata format; however, it isn't widely used in the repository community.
Performance Limitations
The profiling tool is essentially a web crawler, and can similarly generate a lot of traffic. In these initial test stages the maximum size of a file that will be downloaded has been constrained to 2MB or less because files and collections can potentially be very large.
Access to Hidden and Very Large Content
It is anticipated that some kind of technical agreement will be needed between repositories and services. Such an agreement would allow the preservation service to download very large objects without impacting on the repository service. If a repository has non-public items, but still wants to preserve them, it would be necessary to grant the preservation service a 'back door' to access those items.
Page last modified on Sun 12 Oct 2008 18:44:16 BST.