Project Data Archiving –

Lessons from a Case Study

March 1998

The University of Reading

Statistical Services Centre
Biometrics Advisory and
Support Service to DFID

Statistical Services Centre staff members Savitri Abeyasekera and Carlos Barahona,
involved in the case study work described below, are pleased to acknowledge their
indebtedness to members of the ELUS project team in Malawi, 1995-1997: Joanne
Bosworth, Rural Sociologist, Jasper Steele, Farming Systems Economist, and especially to
Steve Gossage, ELUS Project Team Leader, for their generous help, encouragement and
hospitality during the creation of this data archive. Thanks are also due to Harry Potter,
DFID Natural Resources Adviser in Malawi, for his support to this project.

© 1998 Statistical Services Centre, The University of Reading, UK.

1. Introduction
An integral part of many research projects is the collection of survey or experimental
data at considerable expense, time and effort. Great care may be taken to produce
good quality data at both the data collection and computerisation stages, but there is
usually little emphasis on ensuring that the data are available to other users in a form
that will allow the data to be readily understood and correctly used in subsequent
studies. Ideally the creation of an archive should be integrated with the ongoing work
of a project rather than left as an afterthought, for which time may run out once the
team has dispersed.
This brief note is intended to raise awareness of the importance of preserving project
data, to discuss characteristics that make for a good data archive, and to provide an
example of a successful archive.

2. Why preserve project data?
In the past, the generally-available records of project data have often been only in
publications where limited space was available to summarise key features relevant to
the specific slants of the papers concerned. Modern computing and data storage
facilities mean there is now no technical reason why much more detailed data should
not be preserved and readily reproduced, in a form where they can be accessed and used
by others.
As part of the case for support for certain projects, proposers may argue that the
quantitative information produced will be of relatively long-term or wide-ranging
interest; certainly, the prospect of results of only ephemeral value should not
recommend a project. This implies a duty on the project team to document and archive
the data collected in the course of the work.
Given a worthwhile project, such a record is potentially valuable to secondary users
and later workers, if they are given the opportunity to extract information in a form
where it will make their own work more effective. Guaranteeing to add this value
strengthens the case for funding the initial project.

3. What should be archived?
There are three main types of information which need to be accurately recorded:
• the project main data themselves, not just summary tables;
• the record of how and why data were acquired, and what they represent; and
• documentation about computer files which will allow later data retrieval.
To be useful beyond the project lifespan, archives need to be in an organised form, in
almost all cases computerised.
Files should be backed up, with a securely stored master version, and a system should
be set up during the project to make full or partial copies accessible to legitimate
users thereafter.
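
As an illustration only, checking a distributed copy against the securely stored master version can be sketched in Python. The function names and the use of SHA-256 checksums are assumptions made for this sketch; they are not part of any procedure described in this note.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_to_master(master: dict, copy: dict) -> dict:
    """List files that are missing from, altered in, or extra in a copy."""
    return {
        "missing": sorted(set(master) - set(copy)),
        "altered": sorted(n for n in master if n in copy and master[n] != copy[n]),
        "extra": sorted(set(copy) - set(master)),
    }
```

A manifest built from the master version can be stored with the archive documentation, so that any later copy can be checked without access to the master files themselves.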
General principles of data quality control apply at all stages, e.g. during the definition
and development of the data to be collected, data acquisition, and the creation of final
computerised datafiles.

4. What is a good data archive?
Many characteristics determine the production of a good data archive. In brief:
• Accessibility, so that users can reach the stored information via widely-available
• Ease of use, by ensuring that (i) the data archiving structure is simple so that the
relationship between the forms used in the field and the computerised information
is evident; (ii) there are clear definitions of variables stored in the archive (e.g.
units of measurement) and codes used (labels for categorical variates, etc.); and
(iii) there is consistency in names, codes, units of measurement, and abbreviations
throughout the archive.
• Reliability, with the archive as free of errors as can be managed within the
timescale and budget of the project.
• Documentation, viz. (i) procedures used for data collection, including sampling
methodology and sampling units used, (ii) the structure of the archive, e.g. how
different files link together, (iii) a list of computer files comprising the archive,
(iv) a full list of all variables including notes on how missing values are treated,
(v) summary statistics that allow the user to cross-check if the information
retrieved corresponds to that required, and (vi) relevant warnings and comments
relating to any part of the database.
• Preservation of anonymity, and of any conditions of confidentiality under which the
data sets were made available by the sources.
• Completeness as far as that is possible and useful. The archive should include a
computer file copy of (i) the field forms; (ii) the data management log-book;
(iii) descriptions of derived variables, and (iv) special comments and observations.
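
The points above on variable definitions, codes and missing values lend themselves to a simple automated check. The following is a minimal sketch in Python, assuming the data are held as a list of records and the variable list as a dictionary. The variable names, the codes and the missing-value code -9 are invented for illustration and are not taken from any real archive.

```python
def check_against_codebook(records, codebook):
    """Return a list of problems found when comparing records to a codebook.

    codebook maps each variable name to a dictionary with optional keys:
      "codes"   - mapping of permitted coded values to their labels
      "missing" - the value used to mark a missing observation
    A value of None is always treated as missing.
    """
    problems = []
    for row_number, record in enumerate(records, start=1):
        for name, value in record.items():
            if name not in codebook:
                problems.append(f"row {row_number}: variable '{name}' is not documented")
                continue
            entry = codebook[name]
            if value is None or value == entry.get("missing"):
                continue  # documented missing value, nothing to check
            codes = entry.get("codes")
            if codes is not None and value not in codes:
                problems.append(f"row {row_number}: {name}={value!r} has no value label")
    return problems


# Hypothetical codebook and data, for illustration only.
example_codebook = {
    "region": {"codes": {1: "North", 2: "Centre", 3: "South"}, "missing": -9},
    "area_ha": {"missing": -9},  # continuous variable, so no value labels
}
example_records = [
    {"region": 1, "area_ha": 12.5},
    {"region": 4, "area_ha": -9},
]
for problem in check_against_codebook(example_records, example_codebook):
    print(problem)
```

A check of this kind covers only consistency between data and documentation; it does not show that the recorded values themselves are correct.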

5. Medium of dissemination
The medium for dissemination of project data has to be considered in planning any
form of archiving. So does the choice of items to be disseminated, which may clearly
be selective, e.g. because some data are confidential.
The argument made at the present time is that:
• it is easier and cheaper to duplicate floppy disks than to photocopy lots of
reports; and
• it is easier to re-use numerical information if it is disseminated in
computer-readable form.
In future it should become possible to disseminate data on CDs, including GIS data,
and perhaps also large images with built-in software to let the user view them. At
present this more advanced work looks in many cases too expensive, and too
demanding to expect of the facilities available to project staff or those who will use the
For the moment it seems that items like aerial photographs should be lodged in a place
where they can be preserved safely for a reasonably long time, and which has the
capacity to copy negatives and positives for legitimate users. Of course details of how
to obtain copies should be part of the archive information!

6. Example of a successful data archiving exercise
The Statistical Services Centre (SSC) at the University of Reading was closely
involved with statistical aspects of the Estate Land Utilisation Study (ELUS) in
Malawi – a large nationwide survey carried out over the period from mid-1995 to mid-
1997 and funded by ODA (now DFID). As part of this involvement, a proposal was
made for archiving the large volume of detailed information collected about the socio-
economic structure and utilisation of land within the estate sector of Malawi. The
proposal was supported by the NR Adviser in Malawi and the ELUS project team.
The main survey involved a 125- and a 411-response questionnaire, while three
subsequent and more detailed sub-sample studies used longer questionnaires. Two
additional but smaller surveys were also a part of the project.
A member of SSC staff, who had considerable experience of all the computer packages
concerned, visited Malawi for three weeks in February/March 1997 to carry out the
archiving exercise. An additional week was needed after the visit to complete the
archive and its documentation. The time needed depends on the length of the
questionnaires concerned and the quality of the data available to the data archiving
consultant. While the ELUS team had paid close attention to ensuring that their main
datasets were as free as possible from errors, we note on the basis of other experiences
that data-cleaning can be immensely time-consuming.
The archiving exercise involved a one-day workshop for a few identified users of the
ELUS database. The aim was to familiarise the participants with the archive structure
and organisation and to get their views on ways to improve the archiving procedure.
Archiving the ELUS database successfully in the time span described was possible
because the work was approved, funded and completed within the life span of the
project, because SSC staff were familiar with the ELUS data structure, and because the
ELUS team cooperated in making all relevant information available on disk at the
appropriate time.

7. What does the archive comprise?
In the case of ELUS, a large A4 ringbinder contains three write-protected floppy disks
of zipped (compressed) files, and software to allow these to be automatically restored
into 15 MB of hard disk space. The data files are all included in duplicate – in two
common formats – as SPSS portable files and dBase IV files, at least one of which
should be accessible to users for many years to come. Word 6 was used for text files,
including the full description of the sampling schemes. On paper, there is a ten-page
introduction and a summary sampling report.
There are then several hundred pages giving details, for each questionnaire used, of
every file and every variable. For each file the description includes the number of
cases, the number of variables, the full list of variables declared, and of variable labels
and value labels. For each variable the description includes the variable name and
label, the minimum and maximum values and the number of valid (non-missing) cases
stored. The volume weighs 1.65 kg.
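
The per-variable description above (name, label, minimum, maximum and number of valid cases) can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used to produce the ELUS documentation; the function name, the example variable and the missing-value code -9 are hypothetical.

```python
def summarise_variable(name, label, values, missing_codes=()):
    """Describe one variable as in a printed variable list.

    Values equal to None or listed in missing_codes are excluded,
    so the count refers to valid (non-missing) cases only.
    """
    valid = [v for v in values if v is not None and v not in missing_codes]
    return {
        "name": name,
        "label": label,
        "minimum": min(valid) if valid else None,
        "maximum": max(valid) if valid else None,
        "valid_cases": len(valid),
    }


# Hypothetical variable, for illustration only.
summary = summarise_variable(
    "area_ha", "Estate area (hectares)", [12.5, -9, 30.0, None, 4.0], missing_codes=(-9,)
)
print(summary)
```

A user who retrieves a file can recompute the same summary and compare it with the printed figures to confirm that the data have been read correctly.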
Thirty copies were prepared, five including some information rated as confidential for
commercial reasons or to protect the anonymity of respondents. Both types of copy
were appropriately distributed, e.g. amongst offices of the Government of Malawi,
academic institutions, and DFID. Legitimate users and authorised researchers should
be able to find a copy of the data in a form in which they can, for example, (a) perform
further analyses; (b) with appropriate access, use the information in the archive as an
extremely detailed sampling frame, through which to revisit sub-samples of the ELUS
estates; (c) integrate ELUS data with their own later findings for longitudinal analysis.
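
Preparing two types of copy, one with and one without confidential information, can be sketched in Python as below. The sketch assumes records held as dictionaries and a known set of confidential variable names, both hypothetical. Removing named variables is only one step towards protecting anonymity and may not be sufficient on its own.

```python
def public_copy(records, confidential):
    """Return new records with the confidential variables left out.

    The original records are not modified, so the full version
    remains available for the restricted copies.
    """
    return [
        {name: value for name, value in record.items() if name not in confidential}
        for record in records
    ]


# Hypothetical record and variable names, for illustration only.
full_records = [{"estate_id": 1, "owner_name": "A. N. Other", "area_ha": 12.5}]
print(public_copy(full_records, {"owner_name"}))
```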

8. Further Information
It is hoped that this document provides some initial ideas on issues of importance in
data archiving. The Statistical Services Centre intends to work towards more detailed
guidelines on archiving procedures for potential researchers initiating projects, for their
appraisers, and perhaps for government agencies in countries where projects may be
done. We would be pleased to hear from researchers with examples or experience of
work similar to that described here, from successful – or frustrated – users of data
archives, or anyone with ideas to share on the issues involved.

