and Droid
Adam Farquhar, 10-2010
Pronom and Droid, developed primarily at the National Archives (TNA) of the United
Kingdom, have been a key contribution to the digital preservation community.
Pronom is a registry of information about file formats. The TNA provides access to
the Pronom registry on-line at http://www.nationalarchives.gov.uk/PRONOM and
maintains the information. Droid is a software application that uses some of the file
format information to identify the type of specific digital objects. Droid is available
on-line through SourceForge at http://droid.sourceforge.net/ and is managed as an
open source project.
In October, I spent some time looking closely at both Pronom and Droid to get a
better understanding of them and evaluate ways that they could be improved. In
this series of blog posts, I’ll be reporting on the results of this investigation – which
brought several surprises along with it.
The TNA have built a lovely web interface to interact with Pronom, whose contents
are stored in a rather complex database. This is a great way to look at the
information they have on a specific file format. You can search for the format you
want and read through all of the information. As of 22-Oct-2010, Pronom has
information about 731 file formats.
The most important benefit that Pronom provides is to give each of these formats a
persistent unique identifier. This is a truly important contribution. There is no other
registry in the world that provides persistent unique identifiers to digital object
formats without regard to origin at an appropriate level of detail for digital
preservation. We can contrast Pronom and Droid with other initiatives. IANA manages
the mimetype names. Mimetype is heavily used by browsers and email applications
to decide how to display downloaded files. The mimetype is,
however, very coarse-grained. For example, there is only one mimetype for all
versions of PDF – even though the different PDF versions have substantially
different features as PDF has developed over the decades. Mimetype also covers a
relatively small number of format types, and provides no method to recognise the
formats. The Unix ‘file’ command provides a fast and robust method to identify
many types of digital objects, but it does not provide a persistent unique identifier
for each type that it recognises. The person who implements the recognition
routine for each type is free to print out whatever seems useful and there is no
guarantee that the output format will be persistent.
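
To make the granularity point concrete, here is a minimal Python sketch. It uses the standard-library `mimetypes` module as a stand-in for a mimetype lookup (it guesses from the file name only) and a hypothetical helper, `pdf_header_version`, that reads the `%PDF-x.y` marker at the start of a PDF file. The file names and bytes are made up, and none of this is part of Pronom or Droid.

```python
import mimetypes
import re

# Hypothetical helper, for illustration only: PDF files begin with a
# header such as "%PDF-1.4", which carries the version of the format.
_PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(data: bytes):
    """Return the version string from a PDF header, or None if absent."""
    match = _PDF_HEADER.match(data)
    return match.group(1).decode("ascii") if match else None


# Made-up file names and leading bytes for two different PDF versions.
samples = {
    "old_report.pdf": b"%PDF-1.2\n...",
    "new_report.pdf": b"%PDF-1.7\n...",
}

for name, head in samples.items():
    # guess_type looks only at the file name, not at the content.
    mime, _encoding = mimetypes.guess_type(name)
    # The mimetype is the same for both files; only the header differs.
    print(name, mime, pdf_header_version(head))
```

Both files get the same mimetype, while the header shows that they are different versions of the format. A registry aimed at preservation needs an identifier at that finer level.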
For me and some other users of the information, however, Pronom has two major
shortcomings. First, I don’t have much need for a handsome web interface to
access the information about a single format at a time! I need to use the
information in an automated way – and for as many formats as possible. Second,
the coverage is much more limited than it looks at first. Most of the registered
formats have only outline descriptions. This means that they provide a name and
identifier, and some useful textual information about the format, but no method for
recognising an instance of a format.
The Droid 5.0 application provides a nice user interface that enables a user to point
to a set of files or a directory, identify the likely file types, and explore them. The
format information is placed into a database and the user can run reports, filter, sort, and so on.
One can also export the results as a CSV file that can be imported into a spreadsheet
application for further analysis. In addition, there is a command-line version of the
tools to support automated processes.
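
An exported CSV file can also be processed with a short script rather than a spreadsheet. The sketch below is only an illustration: the rows are made up, and the column names `NAME`, `PUID` and `FORMAT_NAME` are assumptions, so check the header row of a real export before relying on them. `tally_by_puid` is a hypothetical helper, not part of Droid.

```python
import csv
import io
from collections import Counter

# Made-up export for illustration. The column names are assumptions.
SAMPLE_EXPORT = """NAME,PUID,FORMAT_NAME
a.pdf,fmt/18,PDF 1.4
b.pdf,fmt/18,PDF 1.4
c.png,fmt/11,PNG 1.0
d.bin,,
"""


def tally_by_puid(csv_text: str) -> Counter:
    """Count rows per identifier; rows with no identifier are grouped together."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        counts[row["PUID"] or "(unidentified)"] += 1
    return counts


for puid, count in tally_by_puid(SAMPLE_EXPORT).most_common():
    print(puid, count)
```

Grouping the unidentified rows under one label makes it easy to see how much of a collection could not be recognised.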
In order to identify file formats, Droid uses a Signature File. This is an XML file that
contains a substantial subset of the information in Pronom. It contains an element
for each of the file formats known to Pronom.
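
Because the Signature File is XML, it lends itself to the automated use described earlier. The following minimal sketch separates formats that carry a recognition signature from those that have only an outline entry. The fragment is made up, and the element and attribute names (`FileFormat`, `PUID`, `Name`, `InternalSignatureID`) are assumptions about the layout rather than a copy of a real file. A real file may also declare an XML namespace, in which case the tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration. Element and attribute names are
# assumptions, and the identifiers are placeholders, not real PUIDs.
SAMPLE = """<FFSignatureFile>
  <FileFormatCollection>
    <FileFormat ID="1" Name="Example Format A" PUID="example/1">
      <InternalSignatureID>10</InternalSignatureID>
    </FileFormat>
    <FileFormat ID="2" Name="Example Format B" PUID="example/2"/>
  </FileFormatCollection>
</FFSignatureFile>
"""


def split_by_signature(xml_text: str):
    """Return (identifiers with a signature, identifiers without one)."""
    root = ET.fromstring(xml_text)
    with_sig, without_sig = [], []
    for fmt in root.iter("FileFormat"):
        has_sig = fmt.find("InternalSignatureID") is not None
        (with_sig if has_sig else without_sig).append(fmt.get("PUID"))
    return with_sig, without_sig


recognisable, outline_only = split_by_signature(SAMPLE)
print("with signature:", recognisable)
print("outline only:", outline_only)
```

Run against a full signature file, a split like this would show how many of the registered formats can actually be recognised.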