Identifying file formats – taking a closer look at Pronom and Droid
Adam Farquhar, 10-2010
Pronom and Droid, developed primarily at the National Archives (TNA) of the United
Kingdom, have been a key contribution to the digital preservation community.
Pronom is a registry of information about file formats. The TNA provides access to
the Pronom registry on-line at http://www.nationalarchives.gov.uk/PRONOM and
maintains the information. Droid is a software application that uses some of the file
format information to identify the type of specific digital objects. Droid is available
on-line through SourceForge at http://droid.sourceforge.net/ and is managed as an
open source project.
In October, I spent some time looking closely at both Pronom and Droid to get a
better understanding of them and evaluate ways that they could be improved. In
this series of blog posts, I’ll be reporting on the results of this investigation – which
brought several surprises along with it.
The TNA have built a lovely web interface to interact with Pronom, whose contents
are stored in a rather complex database. This is a great way to look at the
information they have on a specific file format. You can search for the format you
want and read through all of the information. As of 22-Oct-2010, Pronom has
information about 731 file formats.
The most important benefit that Pronom provides is to give each of these formats a
persistent unique identifier. This is a truly important contribution. There is no other
registry in the world that provides persistent unique identifiers to digital object
formats without regard to origin at an appropriate level of detail for digital
preservation. We can contrast Pronom and Droid with other initiatives. IANA manages
the mimetype names. Mimetypes are heavily used by browsers and email applications,
for example, to decide how to display downloaded files. The mimetype is, however,
very coarse-grained. For example, there is only one mimetype for all
versions of PDF – even though the different PDF versions have substantially
different features as PDF has developed over the decades. Mimetype also covers a
relatively small number of format types, and provides no method to recognise the
formats. The Unix ‘file’ command provides a fast and robust method to identify
many types of digital objects, but it does not provide a persistent unique identifier
for each type that it recognises. The person who implements the recognition
routine for each type is free to print out whatever seems useful and there is no
guarantee that the output format will be persistent.
For me and some other users of the information, however, Pronom has two major
shortcomings. First, I don’t have much need for a handsome web interface to
access the information about a single format at a time! I need to use the
information in an automated way – and for as many formats as possible. Second,
the coverage is much more limited than it looks at first. Most of the registered
formats have only outline descriptions. This means that they provide a name and
identifier, and some useful textual information about the format, but no method for
recognising an instance of a format.
The Droid 5.0 application provides a nice user interface that enables a user to point
to a set of files or a directory, identify the likely file types, and explore them. The
format information is placed into a database, and the user can run reports, filter, sort, and so on.
One can also export the results as a csv file that can be imported into a spreadsheet
application for further analysis. In addition, there is a command-line version of the
tools to support automated processes.
In order to identify file formats, Droid uses a Signature File. This is an XML file that
contains a substantial subset of the information in Pronom. It contains an element
for each of the file formats known to Pronom.

I am particularly interested in how Droid recognises specific file formats. At the
British Library, we need this to be fairly efficient, accurate, and comprehensive.
The Droid Signature File was the starting point for my exploration. Both it and the
underlying Pronom XML (more on this later) are very clearly documented in
http://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf.
This important paper lays out the signature language, the Droid
algorithms, and more with considerable precision.
The Signature file includes several key pieces of information in addition to the
format name and identifier. First, it includes the typical file extensions for the
format. For example, PDF files typically end with a ‘pdf’ extension. Pronom calls
these ‘external signatures’. Second, it includes patterns that can be used to
recognise a file format. For example, a PDF file starts with %PDF and ends with
%%EOF. Pronom calls these ‘internal signatures’. Third, it includes some relationships
between formats. For example, PDF is a supertype of PDF 1.1, 1.2, and so on. This
means that any object that is an instance of the PDF 1.1 format is also an instance
of PDF.
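To make these three kinds of information concrete, here is a minimal Python sketch. It is not how Droid stores or matches anything: the record layout, the identifiers such as "x-pdf", and the function names are all made up for illustration, and the internal signature is reduced to a plain "starts with / ends with" check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormatRecord:
    """Hypothetical stand-in for one format entry; not the real Signature file schema."""
    identifier: str                      # made-up label, not a real Pronom identifier
    name: str
    extensions: tuple = ()               # 'external signatures'
    starts_with: Optional[bytes] = None  # simplified 'internal signature' at start of file
    ends_with: Optional[bytes] = None    # simplified 'internal signature' at end of file
    supertype: Optional[str] = None      # identifier of the more general format, if any


FORMATS = {
    "x-pdf": FormatRecord("x-pdf", "PDF", extensions=("pdf",)),
    "x-pdf-1.0": FormatRecord("x-pdf-1.0", "PDF 1.0", extensions=("pdf",),
                              starts_with=b"%PDF-1.0", ends_with=b"%%EOF",
                              supertype="x-pdf"),
    "x-pdf-1.1": FormatRecord("x-pdf-1.1", "PDF 1.1", extensions=("pdf",),
                              starts_with=b"%PDF-1.1", ends_with=b"%%EOF",
                              supertype="x-pdf"),
}


def matches_extension(record, filename):
    """External signature check: compare the file extension, ignoring case."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return ext in record.extensions


def matches_content(record, data):
    """Internal signature check, simplified to fixed bytes at the start and end."""
    if record.starts_with is None and record.ends_with is None:
        return False  # a record with no patterns cannot recognise anything
    if record.starts_with is not None and not data.startswith(record.starts_with):
        return False
    # Tolerating trailing newlines is a simplification chosen for this sketch.
    if record.ends_with is not None and not data.rstrip(b"\r\n").endswith(record.ends_with):
        return False
    return True


def with_supertypes(identifier):
    """Follow supertype links: an instance of PDF 1.1 is also an instance of PDF."""
    chain = []
    current = identifier
    while current is not None:
        chain.append(current)
        current = FORMATS[current].supertype
    return chain


def identify(data):
    """Return the identifiers of every record whose content patterns match."""
    return [rid for rid, rec in FORMATS.items() if matches_content(rec, data)]
```

Note that the generic "x-pdf" record carries no patterns of its own in this sketch, so it is only ever reached through the supertype link of a more specific match.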
When I looked more closely at the Signature file, I had two surprises. First, the
Signature file included patterns to recognise only 208 of the 731 formats – fewer than a
third. This means that the effective coverage of Droid is much smaller than I had
first expected. Second, I couldn’t make any sense of the patterns! I was expecting
to see something like
%PDF-1.0
Instead, I encountered:
<InternalSignature ID="123" Specificity="Specific">
  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1"
                 SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence>255044462D312E30</Sequence>
      <DefaultShift>9</DefaultShift>
      <Shift Byte="30">1</Shift>
      <Shift Byte="2E">2</Shift>
      <Shift Byte="31">3</Shift>
      <Shift Byte="2D">4</Shift>
      <Shift Byte="46">5</Shift>
      <Shift Byte="44">6</Shift>
      <Shift Byte="50">7</Shift>
      <Shift Byte="25">8</Shift>
    </SubSequence>
  </ByteSequence>
  <ByteSequence>
    <SubSequence MinFragLength="0" Position="1" SubSeqMinOffset="0">
      <Sequence>2525454F46</Sequence>
      <DefaultShift>6</DefaultShift>
      <Shift Byte="46">1</Shift>
      <Shift Byte="4F">2</Shift>
      <Shift Byte="45">3</Shift>
      <Shift Byte="25">4</Shift>
    </SubSequence>
  </ByteSequence>
</InternalSignature>
This was substantially more complicated than I had anticipated and it sent me back
to the definition of the Droid Signature language! It turns out that the internal
signatures in this XML document are not the patterns as held in Pronom. Instead,
they are the result of compiling those patterns into a form that can be used for
efficient pattern matching. You have to go back to Pronom to find the original
pattern.
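Even in compiled form, the byte sequences can be pulled out of an element like this with a few lines of Python. The sketch below uses the standard library's xml.etree.ElementTree on a trimmed copy of the excerpt above. It assumes the elements carry no XML namespace, as in the excerpt; a complete Signature File may need namespace-aware lookups. The function name extract_sequences is my own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above; most Shift elements are left out for brevity.
SIGNATURE_XML = """
<InternalSignature ID="123" Specificity="Specific">
  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence>255044462D312E30</Sequence>
      <DefaultShift>9</DefaultShift>
      <Shift Byte="30">1</Shift>
    </SubSequence>
  </ByteSequence>
  <ByteSequence>
    <SubSequence MinFragLength="0" Position="1" SubSeqMinOffset="0">
      <Sequence>2525454F46</Sequence>
      <DefaultShift>6</DefaultShift>
    </SubSequence>
  </ByteSequence>
</InternalSignature>
"""


def extract_sequences(xml_text):
    """Return (Reference attribute, hex string) for every Sequence element."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for byte_seq in root.iter("ByteSequence"):
        reference = byte_seq.get("Reference")  # None when the attribute is absent
        for seq in byte_seq.iter("Sequence"):
            results.append((reference, seq.text.strip()))
    return results


for reference, hex_string in extract_sequences(SIGNATURE_XML):
    print(reference, hex_string)
```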
In this very simple case, the pattern is:
255044462D312E30
Again, this is not quite what I was expecting. I needed to go back to the
documentation again to learn that this is a sequence of bytes coded as pairs of hex
digits.
25 50 44 46 2D 31 2E 30
We can use a table of character encodings to recognise this as:
% P D F - 1 . 0
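The table lookup can also be done in one line of Python. This sketch handles only plain hex pairs like the sequence above and assumes the bytes are ASCII text; the helper name decode_hex_pattern is mine.

```python
def decode_hex_pattern(hex_string):
    """Turn a string of hex digit pairs into the ASCII text it encodes."""
    return bytes.fromhex(hex_string).decode("ascii")


print(decode_hex_pattern("255044462D312E30"))  # %PDF-1.0
print(decode_hex_pattern("2525454F46"))        # %%EOF
```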
This is great, but the InternalSignature specification seems like a very complicated
way of saying “look for ‘%PDF-1.0’ at the start of the file”.
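The DefaultShift and Shift elements look less arbitrary when set beside the skip tables used by Boyer-Moore-style string searches. What follows is my inference from the numbers, not something stated in the documentation quoted here: if each byte gets a shift of "pattern length minus its last position" and every other byte gets "pattern length plus one" (the rule used in Sunday's quick-search variant), the values in the listing come out exactly. The Python sketch below computes such a table and uses it in a simple forward search; the function names are my own.

```python
def shift_table(hex_sequence):
    """Compute (default shift, {byte as hex: shift}) under the assumed rule."""
    pattern = bytes.fromhex(hex_sequence)
    length = len(pattern)
    shifts = {}
    for index, byte in enumerate(pattern):
        # A later occurrence of the same byte overwrites an earlier one,
        # so each byte ends up with the shift for its last position.
        shifts[f"{byte:02X}"] = length - index
    return length + 1, shifts


def find(data, hex_sequence):
    """Quick-search style scan: return the first match offset, or -1."""
    pattern = bytes.fromhex(hex_sequence)
    default, shifts = shift_table(hex_sequence)
    length = len(pattern)
    pos = 0
    while pos + length <= len(data):
        if data[pos:pos + length] == pattern:
            return pos
        if pos + length >= len(data):
            return -1  # no byte after the window, so nowhere left to shift to
        next_byte = data[pos + length]  # the byte just past the current window
        pos += shifts.get(f"{next_byte:02X}", default)
    return -1


print(shift_table("255044462D312E30"))
print(find(b"junk %PDF-1.0 rest", "255044462D312E30"))
```

The point of the table is that a mismatch usually lets the search jump several bytes at once instead of sliding along one byte at a time, which matters when scanning large files for a signature that is allowed to float.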
After reviewing the Droid signature file, I was convinced that I needed to go back to
the source in Pronom. The problem was how to extract all of the signatures in a
form that I could work with. While it is not obvious how to accomplish this, the
engineers who developed Pronom have made it possible. In the next post, I’ll show
how to get every bit of information out of Pronom in XML format and we’ll take a
closer look at some of the signatures.
