Preserv Format Profiling: PRONOM-ROAR
An illustrated guide
Introduction to Preservation Services
The Preserv project is investigating preservation services for institutional repositories. Service providers can supply preservation expertise, guided by best practice, enabling repository administrators to focus on supporting and capturing content. By building services that address specific preservation needs, repository software such as EPrints can, apart from offering some elementary preservation support, focus on the primary tasks of user interaction, authorisation, storage, and access.
The choice and provision of preservation services will be informed by repository policies, including policy on preservation. A key step in developing a preservation policy is to identify the types of material contained in a repository in terms of technical structure, or file formats (e.g. PDF, HTML). The National Archives (TNA) curates a database of file formats, PRONOM, which can be used to identify repository content through TNA's Digital Record Object Identification (DROID) open source software. DROID can be downloaded and applied by any repository.
Format identification is only a first step towards preservation, however. The question is what you do with this information. Format IDs need to be verified, and file formats may need to be migrated to other formats in the event of obsolescence. This is where preservation services can help.
PRONOM-ROAR: a Web-based Preservation Service
Effective preservation services will ultimately be customised to individual repositories, but simpler, initial services such as format ID can be provided to many repositories simultaneously. What's needed to supplement PRONOM-DROID is a database of repositories and a means to interact with the repositories. This is provided by the Registry of Open Access Repositories (ROAR) and the Celestial service (an OAI-PMH harvesting/caching tool), both developed by Tim Brody of the Preserv team at Southampton University.
Using the OAI-PMH we have built a service that examines the content of repositories and has produced file format profiles (Preserv profiles) for 200+ repositories based on downloading the contained files and identifying them with PRONOM-DROID. The goal of this service is to provide repository administrators with a summary of their content and, when the PRONOM database supports it, to provide a 'technology watch' service that can warn them of file formats that are at risk of becoming inaccessible.
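As an illustration of the harvesting step, the following minimal Python sketch builds OAI-PMH ListRecords request URLs. The function name and the example base URL are hypothetical; only the request arguments (verb, metadataPrefix, from, resumptionToken) come from the OAI-PMH protocol itself. This is not the code used by the Preserv service.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH interface.

    When a resumptionToken is supplied it is the only argument sent
    besides the verb, as the protocol treats it as exclusive.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Restrict the harvest to records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical base URL, used only for illustration.
print(list_records_url("http://repository.example.org/oai",
                       from_date="2006-11-01"))
```

A harvester would request the first URL, process the returned records, and repeat with each resumptionToken until the repository stops returning one.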
By demonstrating a preservation service - PRONOM-ROAR repository file format identification - based on interacting Web services, Preserv has begun to redefine the role and nature of preservation services for repositories.
Using the Preserv Profile
The Preserv Profile tool is integrated into the ROAR. To access the profile for a known repository, locate the repository in the ROAR and, if a profile is available, click the Preserv Profile link in the record for the repository. (If a profile is not available no link is provided; see Limitations, Repository Support for why profiles may not be available for some repositories.)
To reproduce the results shown in the figures in this section find the ROAR entry for Archive libre de l'Université Louis Pasteur in this list.
Following the Preserv Profile link displays the profile page, which consists of brief notes describing what the profile is, a link to an alerting service, the OAI-PMH URL that the profile is generated from, and a histogram of the number of files identified per (PRONOM-defined) file format.
Reproduce this profile.
The profile is based on OAI-PMH records harvested from the repository's registered OAI interface, rather than the Web view of the repository. The URLs shown are therefore those of the OAI interfaces, not of the Web interface. Multiple repositories can be aggregated to generate a single profile, in which case the URLs for all aggregated repositories are shown. In the example illustrated the repository is represented by a single OAI interface:
The file format histogram gives an instant overview of the file formats contained in the repository, rank-ordered by the total number of files found. (The No files found column is purely a visual cue - it is actually the number of OAI records for which no deposited file could be found.)
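The counting behind such a histogram can be sketched in a few lines of Python. The data structure (a mapping from OAI record identifier to the list of formats identified for its files) and the function name are assumptions made for illustration, not the representation used by ROAR or Celestial.

```python
from collections import Counter


def format_histogram(records):
    """Return (format, count) pairs, most frequent first.

    `records` maps an OAI record identifier to the list of format
    names identified for that record's files. A record with no files
    adds one to the 'No files found' column instead.
    """
    counts = Counter()
    for formats in records.values():
        if not formats:
            counts["No files found"] += 1  # counts records, not files
        else:
            counts.update(formats)         # counts files
    return counts.most_common()


# Hypothetical sample data.
sample = {
    "oai:example:1": ["Portable Document Format (1.4)",
                      "Hypertext Markup Language (4.01)"],
    "oai:example:2": ["Portable Document Format (1.4)"],
    "oai:example:3": [],
}
for name, count in format_histogram(sample):
    print(count, name)
```

Note that the 'No files found' count is in different units (records) from the other columns (files), which is why the text describes it as a visual cue only.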
The distinction between records and files is due to repositories using different abstract methods of representing digital objects. The concept of a record potentially represents very different things. Typically when a new item is deposited in an EPrints repository a (metadata) record is created that describes the item and links to the deposited content files. The record might link to many content files: an HTML file might contain images, or there may be a number of versions of a file.
For preservation purposes we are primarily interested in the formats used for deposited files rather than the more abstract concept of the record format. It is important to show all the files contained in the repository, and not just the formats at the record level.
Each histogram bar is clickable and provides a breakdown of the files identified with that format:
Files are grouped by OAI record; the OAI record identifiers are given in the first table column with a link to the record in the repository's OAI interface. The second column contains the Web URL of the identified file(s) and the final column gives the last-modified date as reported by the repository's Web server (not the record datestamp).
If a record contains more than one file each file is shown on the breakdown page:
The format selected from the histogram is emboldened to distinguish it from files in other formats from the same OAI record. In this example the record contains two files, one in Adobe PDF 1.4 format and the other in HTML. A possible use of the breakdown is to list file formats identified as a preservation 'risk' to check what other parallel formats may be available. Currently this is difficult to do automatically, because the relationship between multiple files contained in a single record is not captured (e.g. HTML and inline images or subsequent versions).
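The manual check described here, listing which other formats accompany an at-risk format in the same record, can be sketched as follows. The record structure (identifier mapped to a list of URL and format pairs) and the function name are hypothetical; the sketch only groups formats by record and cannot tell whether the other files are true alternatives, inline images, or later versions.

```python
def parallel_formats(records, at_risk):
    """For each record holding a file in the `at_risk` format, list
    the other formats found in the same record.

    `records` maps an OAI identifier to a list of (url, format) pairs.
    An empty list means the at-risk file has no companion formats.
    """
    result = {}
    for oai_id, files in records.items():
        formats = {fmt for _url, fmt in files}
        if at_risk in formats:
            result[oai_id] = sorted(formats - {at_risk})
    return result


# Hypothetical sample data.
sample = {
    "oai:example:1": [
        ("http://repository.example.org/1/paper.pdf",
         "Portable Document Format (1.4)"),
        ("http://repository.example.org/1/paper.html",
         "Hypertext Markup Language (4.01)"),
    ],
    "oai:example:2": [
        ("http://repository.example.org/2/notes.doc",
         "IBM DisplayWrite Document (2)"),
    ],
}
print(parallel_formats(sample, "IBM DisplayWrite Document (2)"))
```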
Email Alerts: Towards 'Technology Watch'
The alerting service linked from the profile page sends periodic emails (set by the user) that notify the user of new records and files in the subscribed-to repository. This email alert service is a rudimentary precursor to a 'technology watch'-like service that could allow administrators to monitor and act on documents entering their repository.
The first part of the email shows the date period covered by the report, and may show any errors that have occurred during the OAI-PMH harvest:
The main part of the email shows the number of new OAI records and files that have been identified in the repository during the period covered:
New Metadata Records
  20 oai_dc
  20 uketd_dc
New Fulltexts
  1 Fixed Width Values Text File
  2 Hypertext Markup Language (4.01)
  1 IBM DisplayWrite Document (2)
  1 IBM DisplayWrite Document (3)
  1 MS-DOS Text File
  1 MS-DOS Text File with line breaks
  1 Macintosh Text File
  1 OLE2 Compound Document Format
  1 Plain Text File
  3 Portable Document Format (1.3)
  5 Portable Document Format (1.4)
  1 Portable Document Format - Archival (1)
  1 Rich Text Format (1.5)
  1 Rich Text Format (1.6)
  1 Tab-Delimited Text File
  1 Unicode Text File
Harvesting Files for Identification: Celestial
The Web interface to the profiling tool is part of ROAR; data collection is by Celestial. While ROAR contains entries for repositories, discovering the repository contents requires a separate 'harvest' process that downloads all the metadata records and performs any other analytical processes, e.g. determining the format of files linked to in those records.
Celestial contains a registry of OAI-PMH interfaces and periodically visits each one to download new or altered records. For EPrints and DSpace-based repositories the Dublin Core records are inspected to locate the URLs of the full-text(s):
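The rules Celestial applies are not reproduced on this page. As a rough sketch of the idea, the Python below scans an oai_dc record for dc:identifier and dc:relation values that look like file URLs. The choice of those two elements, the "does not end in a slash" heuristic, and the sample record are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core elements used in oai_dc.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record, not taken from a real repository.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example thesis</dc:title>
  <dc:identifier>http://repository.example.org/12/1/thesis.pdf</dc:identifier>
  <dc:identifier>http://repository.example.org/12/</dc:identifier>
  <dc:relation>http://repository.example.org/12/2/thesis.html</dc:relation>
</oai_dc:dc>"""


def candidate_fulltext_urls(dc_xml):
    """Return URLs in a Dublin Core record that may point at files.

    Assumed heuristic: keep http(s) values from dc:identifier and
    dc:relation, and skip values ending in '/', which are more likely
    to be abstract pages than deposited files.
    """
    root = ET.fromstring(dc_xml)
    urls = []
    for tag in ("identifier", "relation"):
        for element in root.iter(DC + tag):
            value = (element.text or "").strip()
            if value.startswith(("http://", "https://")) \
                    and not value.endswith("/"):
                urls.append(value)
    return urls


print(candidate_fulltext_urls(SAMPLE))
```

Real repository software varies in where it places full-text links, which is why per-software rules are needed in practice.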
If a full-text link is located the file is retrieved (files larger than 2MB are ignored - see Performance Limitations) and stored temporarily. The PRONOM-DROID tool is run on the downloaded file, using heuristics to determine the file format and, for some formats, the file format version. Some files may match more than one format heuristic, in which case only the first format is registered. Other files may not be identified at all, in which case they are flagged as 'Unknown Format'.
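The decision logic in this paragraph (skip oversized files, register the first matching format, otherwise flag as unknown) can be illustrated with a toy identifier in Python. The signature table below is hand-written for this example and is far simpler than the PRONOM signatures DROID uses; the function name and the use of None for ignored files are also assumptions.

```python
MAX_BYTES = 2 * 1024 * 1024  # the 2MB download limit described above

# Hypothetical signatures, checked in order. NOT the PRONOM signature file.
SIGNATURES = [
    ("Portable Document Format (1.4)", b"%PDF-1.4"),
    ("Portable Document Format (1.3)", b"%PDF-1.3"),
    ("Portable Document Format", b"%PDF-"),  # generic, also matches the above
    ("Rich Text Format", b"{\\rtf"),
]


def identify(data):
    """Return a format name, 'Unknown Format', or None if ignored.

    Files above the size limit are not examined. When several
    signatures match, only the first one in the table is registered.
    """
    if len(data) > MAX_BYTES:
        return None
    for name, magic in SIGNATURES:
        if data.startswith(magic):
            return name
    return "Unknown Format"


print(identify(b"%PDF-1.4\n%..."))   # matches two signatures, first wins
print(identify(b"no known header"))
```

Ordering the table from most to least specific is one way to make a first-match rule return the most informative result.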
In addition to the file format - as identified by DROID - the MIME-type and last-modified-time as returned by the Web server are recorded.
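Both values come from standard HTTP response headers (Content-Type and Last-Modified). A minimal Python sketch of extracting them is shown below; the function name and the plain-dictionary input are assumptions, and the sample header values are invented.

```python
from email.utils import parsedate_to_datetime


def server_metadata(headers):
    """Extract (mime_type, last_modified) from HTTP response headers.

    `headers` is a plain dict of header name to value. Parameters such
    as '; charset=...' are dropped from the MIME type. Either value is
    None when the server did not send the corresponding header.
    """
    content_type = headers.get("Content-Type", "")
    mime_type = content_type.split(";", 1)[0].strip().lower() or None
    raw_date = headers.get("Last-Modified")
    last_modified = parsedate_to_datetime(raw_date) if raw_date else None
    return mime_type, last_modified


# Hypothetical response headers.
print(server_metadata({
    "Content-Type": "application/pdf; charset=binary",
    "Last-Modified": "Mon, 13 Nov 2006 12:51:21 GMT",
}))
```

Recording the server-reported MIME type alongside the DROID result makes it possible to spot files whose declared type disagrees with their identified format.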
The file format data are stored in Celestial's database, linked to the repository records. ROAR accesses Celestial's database directly to retrieve the file format data and present it to the end user.
The alerting service is part of Celestial, because (currently) alerts are generated for OAI interfaces, and not repository records in ROAR (which may have more than one OAI interface).
Limitations
Repository Support
Currently only the EPrints and DSpace repository software packages are 'wrapped': Celestial has rules to locate and identify the URLs of full-texts in Dublin Core as used by these packages. Other repository software is not supported.
Support has been built in to Celestial to harvest metadata in the Metadata Encoding and Transmission Standard (METS) format, although this format is not yet widely used in institutional repositories.
Performance Limitations
The profiling tool is essentially a Web crawler, and can generate a lot of Web 'traffic'. To minimize data downloads in the initial test stages the maximum size of a downloaded file was constrained to 2MB or less because files and collections can potentially be very large.
Access to Large Files and Hidden Content
It is anticipated that some kind of technical agreement will be needed between repositories and preservation services. Such an agreement would allow the service to download very large objects without impacting on the repository service. If a repository wants to preserve non-public items it would be necessary to grant the service a 'back door' to access those items.
Page last modified on 20 Mar 2010 12:23:25 GMT.







