Conceptual data modelling for bioinformatics
Erich Bornberg-Bauer and Norman W. Paton
Date received (in revised form): 5th March 2002

Erich Bornberg-Bauer
is a Senior Lecturer in Bioinformatics in the School of Biological Sciences, University of Manchester, UK. His research interests past and present range from the theory of biopolymer evolution to modelling biochemical pathways, from the analysis of modular composition of genomes and proteins to the integration of biological data.

Norman W. Paton
is a Professor of Computer Science at the University of Manchester where he co-leads the Information Management Group. His research interests

Abstract
Current research in the biosciences depends heavily on the effective exploitation of huge amounts of data. These are in disparate formats, remotely dispersed, and based on the different vocabularies of various disciplines. Furthermore, data are often stored or distributed using formats that leave implicit many important features relating to the structure and semantics of the data. Conceptual data modelling involves the development of implementation-independent models that capture and make explicit the principal structural properties of data. Entities such as a biopolymer or a reaction, and their relations, eg catalyses, can be formalised
© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 2. 166–180. JUNE 2002
implementation data structure from the conceptual model, such as the creation of a collection of table definitions for use in a relational database system.

An important feature of the process illustrated in Figure 1 is that it is iterative – the construction of the conceptual model may raise issues that need to be clarified by revisiting the tasks to be supported or the sources of information to be described.

The role of the conceptual data model in the design process is to allow precise statements to be made about the data of interest in a manner that can be

well-known families are the entity–relationship (ER) models [2,3] and the object-oriented models [4–6]. In our experience the success or failure of a conceptual modelling activity is not generally dependent on the conceptual modelling notation used. Thus the sorts of issues that can appropriately influence the selection of a CDM are local experience in the use of different techniques, availability of appropriate modelling tools, and likely implementation platform. In terms of the latter, ER models are targeted principally at database applications, and the mapping of such models onto relational database systems is well understood. However, if an
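The step from a conceptual model to table definitions, mentioned above, can be illustrated with a minimal sketch. It assumes a hypothetical ER model with two entities, Enzyme and Reaction, linked by a many-to-many relationship catalyses; the table names, column names and the use of SQLite are illustrative assumptions, not the schema of any database discussed in the paper. Under the usual mapping, each entity becomes a table and the many-to-many relationship becomes a linking table that holds one foreign key per participating entity.

```python
import sqlite3

# Hypothetical ER model (not taken from the paper's figures):
#   entity Enzyme, entity Reaction, many-to-many relationship "catalyses".
# Each entity maps to a table; the relationship maps to a linking table.
SCHEMA = """
CREATE TABLE enzyme (
    enzyme_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE reaction (
    reaction_id INTEGER PRIMARY KEY,
    equation    TEXT NOT NULL
);
CREATE TABLE catalyses (
    enzyme_id   INTEGER NOT NULL REFERENCES enzyme(enzyme_id),
    reaction_id INTEGER NOT NULL REFERENCES reaction(reaction_id),
    PRIMARY KEY (enzyme_id, reaction_id)
);
"""


def build_database():
    """Create an in-memory database holding the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def reactions_catalysed_by(conn, enzyme_name):
    """Follow the catalyses relationship from an enzyme to its reactions."""
    rows = conn.execute(
        """
        SELECT r.equation
        FROM reaction AS r
        JOIN catalyses AS c ON c.reaction_id = r.reaction_id
        JOIN enzyme AS e ON e.enzyme_id = c.enzyme_id
        WHERE e.name = ?
        ORDER BY r.reaction_id
        """,
        (enzyme_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

The linking table keeps the relationship explicit in the implementation, so a query can traverse it in either direction, just as it can be read in either direction in the conceptual model.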
Figure 5: Example UML model. Three concepts (Enzyme, Protein, TertiaryStructure) from three domains (biochemistry, molecular biology and structural biology respectively) are integrated in one schema
additional properties that are not shared by all SecondaryStructureElements. In particular, Helix has the attribute type which could, for example, be α or threeten, and each Strand is related to the Sheet of which it is part.

The modelling of SecondaryStructureElements illustrates a common dilemma in modelling – whether to use the generalisation hierarchy or an attribute to categorise the members of a collection. For example, the addition of an attribute type on SecondaryStructureElement could be seen as removing the need for the subclasses Loop, Helix and Strand – the type attribute could then take on a value that indicates whether a specific SecondaryStructureElement is a Loop, a Helix or a Strand. A criterion that encourages the use of a generalisation hierarchy in this sort of situation is the presence of attributes or relationships in the subclasses that are not applicable to the superclass (eg the relationship between Strand and Sheet). As there are no attributes or relationships specific to different kinds of helices in Figure 6, the different kinds of Helix are distinguished between using the attribute type, rather than through the introduction of additional subclasses.

In terms of modelling practice, UML is relaxed in many things. For example, in Figure 6 not all relationships are given names or role names – it is hoped that the use of aggregation will allow the user to infer names such as consistsOf and isPartOf rather than these having to be given explicitly; classes are given attributes, but the types of the attributes are not specified in the diagram – in general it is good practice when modelling at least to identify the attributes associated with different classes, as this is important in clarifying exactly what data each class actually models; and no classes are given operations – the original schema was produced to describe the data and not the way the data are used, so the emphasis was not on the behaviour associated with the classes.

UML model for genome sequences
The UML model provided in Figure 7, which is originally from Paton et al. [13], can be used to describe sequence information
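The choice between a generalisation hierarchy and a categorising attribute, discussed for SecondaryStructureElement in connection with Figure 6, can be sketched in code. The class names and the Helix attribute type follow the text; the start and end attributes, the length method and the check on permitted helix kinds are assumptions added only to make the sketch runnable. Strand is a subclass because it takes part in a relationship (to Sheet) that does not apply to the superclass, whereas kinds of helix differ only in a value and so are told apart by an attribute.

```python
class SecondaryStructureElement:
    """Superclass for properties shared by all elements.

    The start/end residue positions are hypothetical attributes,
    used here only to give the superclass something to share.
    """

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def length(self):
        # Positions are treated as inclusive (an assumption).
        return self.end - self.start + 1


class Loop(SecondaryStructureElement):
    """A loop adds nothing beyond the shared properties."""


class Helix(SecondaryStructureElement):
    """Kinds of helix are distinguished by an attribute, not by subclasses."""

    KINDS = ("alpha", "threeten")

    def __init__(self, start, end, helix_type):
        super().__init__(start, end)
        if helix_type not in self.KINDS:
            raise ValueError("unknown helix type: %r" % (helix_type,))
        self.type = helix_type


class Sheet:
    """A sheet is an aggregate of strands."""

    def __init__(self):
        self.strands = []


class Strand(SecondaryStructureElement):
    """A strand is related to the Sheet of which it is part."""

    def __init__(self, start, end, sheet):
        super().__init__(start, end)
        self.sheet = sheet          # the isPartOf direction
        sheet.strands.append(self)  # the consistsOf direction
```

If different kinds of helix later acquired attributes or relationships of their own, the same criterion would argue for replacing the type attribute with subclasses of Helix.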
exists, but it may additionally be useful to have a next/previous relationship to directly associate related regions.

The class NonTranscribedRegion is used to represent features such as promoters, centromeres and telomeres, which are distinguished from one another only by the description attribute – these could generally benefit from more detailed modelling than is provided here. The DNA sequence associated with each ChromosomeFragment is modelled by recording the start and end positions of the sequence of each fragment in ChromosomeFragment. These start and end positions can be used to obtain the actual sequence of a

are shown shaded, do not contribute to the SplicedTranscripts, but several of the SplicedTranscriptComponents can potentially end up as part of differently constituted SplicedTranscripts. The many-to-many relationship between SplicedTranscriptComponent and SplicedTranscript represents the fact that a SplicedTranscript is commonly composed of more than one SplicedTranscriptComponent, and that on occasion there may be several alternative SplicedTranscripts that can result from a collection of SplicedTranscriptComponents.

This model for genome sequences illustrates the extent to which the purpose
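Two points from the genome model can be sketched together: the recording of start and end positions on ChromosomeFragment, and the many-to-many relationship between SplicedTranscriptComponent and SplicedTranscript. The class names come from the text; everything else is an assumption made for illustration, in particular that SplicedTranscriptComponent is a kind of ChromosomeFragment, that positions are 1-based and inclusive, and that the chromosome sequence is available as a plain string.

```python
class ChromosomeFragment:
    """A region of a chromosome, stored as positions rather than as sequence."""

    def __init__(self, start, end):
        # Assumption: 1-based, inclusive coordinates.
        self.start = start
        self.end = end

    def sequence(self, chromosome_sequence):
        """Recover the fragment's sequence from the chromosome sequence."""
        return chromosome_sequence[self.start - 1:self.end]


class SplicedTranscriptComponent(ChromosomeFragment):
    """A fragment that can take part in several spliced transcripts.

    Treating it as a subclass of ChromosomeFragment is an assumption.
    """

    def __init__(self, start, end):
        super().__init__(start, end)
        self.transcripts = []  # the "many" end towards SplicedTranscript


class SplicedTranscript:
    """An ordered collection of components; components may be shared."""

    def __init__(self, components):
        self.components = list(components)
        for component in self.components:
            component.transcripts.append(self)

    def sequence(self, chromosome_sequence):
        """Concatenate the component sequences in order."""
        return "".join(
            component.sequence(chromosome_sequence)
            for component in self.components
        )
```

With a toy chromosome `"AAACCCGGGTTT"` and components at 1–3, 4–6 and 10–12, one transcript built from all three components and another built from only the first and last share two components, which is the situation the many-to-many relationship is intended to capture.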
Figure 8: Illustration of alternative splicing. See the section on ‘UML model for genome sequences’ for further details
Figure 9: Mapping of Genome and Chromosome to POET Java classes
keys). By contrast, UML class diagrams are only one of a collection of diagrams that together support the object-oriented design of complete applications. As class diagrams are not targeted at any specific category of application, mapping of these diagrams onto implementation platforms can be less direct or systematic than in the