
Conceptual data modelling for bioinformatics

Erich Bornberg-Bauer and Norman W. Paton
Date received (in revised form): 5th March 2002

Abstract
Current research in the biosciences depends heavily on the effective exploitation of huge amounts of data. These are in disparate formats, remotely dispersed, and based on the different vocabularies of various disciplines. Furthermore, data are often stored or distributed using formats that leave implicit many important features relating to the structure and semantics of the data. Conceptual data modelling involves the development of implementation-independent models that capture and make explicit the principal structural properties of data. Entities such as a biopolymer or a reaction, and their relations, eg catalyses, can be formalised using a conceptual data model. Conceptual models are implementation-independent and can be transformed in systematic ways for implementation using different platforms, eg traditional database management systems. This paper describes the basics of the most widely used conceptual modelling notations, the ER (entity–relationship) model and the class diagrams of the UML (unified modelling language), and illustrates their use through several examples from bioinformatics. In particular, models are presented for protein structures and motifs, and for genomic sequences.

Erich Bornberg-Bauer is a Senior Lecturer in Bioinformatics in the School of Biological Sciences, University of Manchester, UK. His research interests past and present range from the theory of biopolymer evolution to modelling biochemical pathways, from the analysis of modular composition of genomes and proteins to the integration of biological data.

Norman W. Paton is a Professor of Computer Science at the University of Manchester where he co-leads the Information Management Group. His research interests have principally been in databases, in particular active databases, spatial databases, deductive object-oriented databases and user interfaces to databases. He is currently working on parallel object databases, spatio-temporal databases, distributed information systems and information management for bioinformatics.

Downloaded from bib.oxfordjournals.org by guest on October 21, 2010
Keywords: conceptual data model (CDM), database, entity–relationship (ER), unified modelling language (UML), biological data

Erich Bornberg-Bauer, School of Biological Sciences, University of Manchester, 2.205 Stopford Building, Oxford Road, Manchester M13 9PT, UK
Tel: +44 (0) 161 275 7396
E-mail: ebb@bioinf.man.ac.uk

INTRODUCTION
Driven by genome projects and the recent development of other new techniques, such as proteomics or ligand screening, the amount of biological data is rapidly increasing (see Baxevanis1 for an overview of current databases). However, not only are there ever-increasing quantities of data available in biology, the variety and complexity of the different information resources are also tending to increase with time.

Large centralised repositories for data, such as SWISS-PROT or Genbank, are carefully managed, often using modern data management techniques. However, the increasing prevalence of experimental techniques that generate large quantities of data means that ever more laboratories are faced with information management challenges. Managing large quantities of complex data in a systematic and efficient manner is not straightforward, and ad-hoc techniques that may have sufficed in the past will increasingly be a barrier to effective integration and analysis of experimental results in the future.

One important aspect of data management is coming to a clear understanding of the nature of the available data. What different kinds of data are generated by specific experimental techniques? How do these relate to other data produced in the laboratory or beyond? What information is derived from the primary data? What data need to be stored in perpetuity, and what can be summarised and then backed-up or discarded? What additional information is required to validate or analyse different data sets? What quantities of information are likely to be produced and will need to be stored as a result of an experimental activity? Obtaining answers to questions such as these in a systematic way is much more straightforward in the context of clear and comprehensive models of the relevant data. Conceptual data models make explicit the structural properties of data, and as such are useful for capturing, refining and communicating details about the data in a laboratory or a database.

Once a conceptual model of data has

166 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 3. NO 2. 166–180. JUNE 2002
Conceptual data modelling for bioinformatics

been produced, it tends to look like common sense. However, constructing conceptual models is a challenging, often iterative process. This paper seeks to increase the profile of conceptual modelling techniques in bioinformatics, and to collect together representative examples of conceptual models that can be used or built upon to support information management tasks in different parts of bioinformatics. The target audience is primarily (computational) biologists and bioinformaticians with minimal experience of data modelling and database design, who nevertheless find themselves involved with the development of biological databases. Thus we hope to improve the ease with which bioinformaticians can communicate with experts in data modelling, and then to judge independently if conceptual modelling is appropriate for their needs or not.

The paper is structured as follows. The next section provides some background material on conceptual models and design processes, and gives some definitions which are valid across most modelling notations. The section following on ‘Entity–relationship modelling’ gives an overview of ER modelling, and illustrates the approach using a model of sequence patterns. The section ‘Unified modelling language’ gives an overview of UML class diagrams, and illustrates such diagrams using models of protein structure and genome sequences. The final section discusses the role of conceptual modelling in bioinformatics.

CONTEXT
Definitions
A conceptual data model (CDM) provides a notation by which the structural properties of data (the structuring of data and their relationships) from a certain domain (a field of knowledge such as biochemistry or structural molecular biology) can be described in a precise but implementation-independent manner. The resulting model can then be more or less directly translated into the actual implementation, whether relational, object-oriented or semi-structured, and eventually populated, ie ‘filled’, with the actual data.

[Margin note: The principal notions in conceptual modelling are: entity type and relationships]

Many notations have been proposed for conceptual modelling, but most have two principal notions: entity types and relationships.

• Entity types: an entity type provides a description of the properties that are shared by a collection of entities in a domain. For example, Protein could be an entity type, with attributes including sequence, name, molecular weight, accession number and species. A single entity type is expected to have many instances, each of which gives values to the attributes specified in the corresponding type. For example, human α-haemoglobin and whale myoglobin are the names of two instances of the entity type Protein. The values of their attribute species are human and whale respectively.

• Relationships: a relationship represents an association between two (or more) entity types. For example, a Protein could interact with several other Proteins, or could be a member of a family. There may be different categories of relationship, which characterise the nature of the relationship. For example, there may be a notation to represent the fact that one entity type is-part-of another (eg a Beta Strand is part of a Sheet in the secondary structure of a Protein) or that one entity type is-a-kind-of another (eg an Enzyme is a kind of Protein).

(For the sake of simplicity we ignore the fact that RNA can be an enzyme.)

Once constructed, a conceptual data model of a domain describes the data in the domain, and can be seen to place constraints on the attributes and relationships that are valid within the domain. A conceptual data model is often developed in the context of an application


or a collection of tasks that is to be carried out. However, many biological databases contain data on a particular species or resulting from a particular type of experimental or analytical activity, and are thus not focused on particular uses of such data. In general, though, making clear and appropriate modelling decisions depends on the purpose for which the data are required. For example, in a protein database it may be necessary only to know the name of each species of each Protein. If so, the name of the species can be an attribute of Protein. However, if it is necessary to know more about the species (eg its taxonomic name as well as its common name, the areas in the world where it resides, etc.), then species is better modelled as an entity type, with attributes and relationships of its own.

The design process
A design activity involves a combination of a design process and a modelling language. A design process provides a sequence of steps through which the developers proceed when constructing a conceptual model. In the same way as there is no universally accepted notation for conceptual modelling, there is also no universally accepted design process.

[Margin note: There is no universally accepted standard for either the design process or the notation]

Figure 1 outlines a possible design process based on one provided in Elmasri and Navathe.2 This process depicts three tasks:

• requirements analysis – the identification of the needs of the application and the sources of information that the modelling activity seeks to support;

• conceptual design – the use of a conceptual data modelling notation to describe the data identified in requirements analysis; and

• logical design – the development of an
Figure 1: Example of a conceptual modelling process. Especially at the level of a conceptual model many inconsistencies and necessary amendments may become obvious such that a revision at this stage may be necessary and useful to avoid more demanding revisions at a later stage. See text for more details


implementation data structure from the conceptual model, such as the creation of a collection of table definitions for use in a relational database system.

An important feature of the process illustrated in Figure 1 is that it is iterative – the construction of the conceptual model may raise issues that need to be clarified by revisiting the tasks to be supported or the sources of information to be described.

The role of the conceptual data model in the design process is to allow precise statements to be made about the data of interest in a manner that can be communicated to others. The comprehensibility of a conceptual model is important, as it is used both in discussions with subject experts whose understanding of the relevant data is to be described, and by the developers of software that makes use of the data. A general remark on conceptual models is that they are usually much easier to read than to construct.

[Margin note: Conceptual data models allow precise statements]

In our experience, conceptual data modelling is often best conducted as a collaborative process, in which models are constructed incrementally, for example, on a whiteboard, so that different people can provide input on the features of the data being modelled. In general, we have been involved in modelling activities in bioinformatics in which computer scientists and biologists work together on models. This sort of joint approach is probably the most effective in practice, as experienced modellers should be able both to ask pertinent questions that guide the development of a model and avoid modelling errors. Developing models that describe all the data that are valid and relevant, while prohibiting the inclusion of data that are invalid or do not occur in practice is generally both a challenging and rewarding process.

Selection of conceptual data models
There are many different conceptual data modelling notations, although the most well-known families are the entity–relationship (ER) models2,3 and the object-oriented models.4–6 In our experience the success or failure of a conceptual modelling activity is not generally dependent on the conceptual modelling notation used. Thus the sorts of issues that can appropriately influence the selection of a CDM are local experience in the use of different techniques, availability of appropriate modelling tools, and likely implementation platform. In terms of the latter, ER models are targeted principally at database applications, and the mapping of such models onto relational database systems is well understood. However, if an implementation is likely to end up using object-oriented programming, middleware or database techniques, working from an object modelling language is likely to be most appropriate. Details on how to develop implementations from object models over several different platforms are provided in Blaha and Premerlani.7

ENTITY–RELATIONSHIP MODELLING
An ER schema consists of entity types, relationships and attributes. As described above in the section on ‘Definitions’, an entity type provides a description of the properties that are shared by a collection of instances in a domain. An entity type is drawn as a rectangle that encloses its name. For example, in Figure 2, DNA, Protein, Enzyme, Reaction and Biopolymer are all entity types. The instances of an entity type are known as entities. The attributes of an entity type indicate what values can be stored to identify or describe an instance of the type. The attributes of an entity type are depicted in ovals directly connected to the entity type. For example, the entity type Biopolymer in Figure 2 has attributes accno (for accession number), name, species and sequence.

One or more of the attributes of an entity type may be designated as the key, which is depicted by the name(s) of the


Figure 2: ER notation for some biological concepts. Reaction is related to Enzyme through a many-to-many relationship (see the section on ‘Entity–relationship modelling’ for more details). Each enzyme is a protein, which is depicted by the IsA relationship between Enzyme and Protein. Both DNA and Protein are kinds of Biopolymer as is depicted by the IsA relationship. This particular schema is not part of any published model, but has been designed for illustration purposes
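
The schema of Figure 2 can also be written down in code. The following Python sketch is illustrative only and is not part of the original model: the `Schema` container, the single `name` attribute of `Reaction`, and all accession numbers, species and sequences are assumptions made for the example. It expresses the IsA relationships as subclassing, the key `accno` as a uniqueness check, and catalysis as a many-to-many set of pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Biopolymer:
    accno: str      # key attribute (underlined in the ER diagram)
    name: str
    species: str
    sequence: str


class DNA(Biopolymer):
    """DNA IsA Biopolymer: inherits accno, name, species and sequence."""


class Protein(Biopolymer):
    """Protein IsA Biopolymer."""


class Enzyme(Protein):
    """Enzyme IsA Protein, and therefore also a Biopolymer."""


@dataclass(frozen=True)
class Reaction:
    name: str       # assumed attribute; Reaction's attributes are not listed


class Schema:
    """Hypothetical container that enforces the key and records catalysis."""

    def __init__(self):
        self.biopolymers = {}    # accno -> Biopolymer
        self.catalysis = set()   # many-to-many pairs (enzyme accno, reaction name)

    def add(self, polymer):
        # A key value may identify at most one instance of Biopolymer.
        if polymer.accno in self.biopolymers:
            raise ValueError(f"duplicate accno {polymer.accno!r}")
        self.biopolymers[polymer.accno] = polymer

    def catalyses(self, enzyme, reaction):
        # Only Enzyme takes part in catalysis, not Biopolymer in general.
        if not isinstance(enzyme, Enzyme):
            raise TypeError("catalysis relates an Enzyme to a Reaction")
        self.catalysis.add((enzyme.accno, reaction.name))


# Placeholder values, invented for illustration.
schema = Schema()
peroxidase = Enzyme("E1", "peroxidase", "species A", "SEQ1")
catalase = Enzyme("E2", "catalase", "species B", "SEQ2")
degradation = Reaction("peroxide degradation")
for enzyme in (peroxidase, catalase):
    schema.add(enzyme)
    schema.catalyses(enzyme, degradation)   # one Reaction, many Enzymes
```

Because `Enzyme` is a subclass of `Protein`, every `Enzyme` instance is also a `Protein` and a `Biopolymer`, which mirrors the deduction the IsA relationships allow in the diagram.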

attribute(s) being underlined in the diagram. For example, accno is the key of Biopolymer in Figure 2. This means that there can be at most a single instance of Biopolymer with a given accno. If no single attribute of an entity type can be used as the key, then it is possible that several attributes can be used together to uniquely identify the instances of the type, in which case that group of attributes can be underlined in the schema diagram. A key with several components is known as a compound key.

It is common for several entity types within a schema to share attributes and relationships. For example, both Protein and DNA have a species. Such sharing of attributes can be represented using is-a-kind-of relationships between entity types, which are depicted in Figure 2 by arrows from the more specialised type to the more general type through a circle containing IsA. For example, both DNA and Protein are kinds of Biopolymer in Figure 2, and Enzyme is a kind of Protein. Biopolymer is said to be the supertype of both DNA and Protein, and both DNA and Protein are subtypes of Biopolymer. Such relationships have two principal roles. Firstly, the properties of a supertype are inherited by its subtypes, thereby leading to more concise models. For example, in Figure 2, the attributes associated with Biopolymer are all inherited by Protein and DNA through the IsA relationship. Secondly, the is-a-kind-of relationship makes explicit relationships between the collections of instances of different entity types. For example, from Figure 2 one can deduce that every instance of Enzyme is also an instance of Protein, and that every instance of Protein is also an instance of Biopolymer.

Any relationship other than the is-a-kind-of relationship between two types is depicted by a rhombus that encloses the name of the relationship, and which is linked to the related entity types. Although some ER models allow a single relationship to be between more than two entity types, it is often considered good practice to use only binary relationships. The catalysis relationship in Figure 2 is an example of a binary relationship between the entity types Enzyme and Reaction that indicates which reactions are catalysed by which enzymes. A single Enzyme (eg peroxidase) may catalyse many reactions (as, for example, peroxidase acts on several substrates), and a Reaction (eg peroxide degradation) may


be catalysed by many different enzymes (eg peroxidase and catalase). Therefore, this relationship is a many-to-many relationship. This is depicted by the M and the N in the figure, which essentially denote arbitrary numbers of participants in the relationship. The cardinality specified for a relationship can be left open ended, using M or N, or can be specified to be a particular value. Where a specific value of 1 is used, this gives rise to one-to-one or one-to-many relationships, the latter in particular being very common (eg a Genome has many Chromosomes, but a Chromosome is specific to a single Genome).

Relationships can themselves have attributes. It is appropriate for a relationship to have an attribute if the value to be recorded describes the relationship, and is not an attribute of either of the participating entity types. For example, the Michaelis constant would be specific for the pair of the reactant and the enzyme involved. A relationship may relate an entity type to itself. For example, an interaction relationship could be used to indicate which Proteins are known to interact with each other.

Given the constructs described above, there are likely to be many plausible conceptual models for describing a data set. Such variety can derive from differences in the purpose to which the data are to be put, or the stylistic preferences of the modeller. As a case in point, relationships can be more or less precisely modelled. For example, in Figure 2, the catalysis relationship between the entity types Enzyme and Reaction models the role of a specific kind of Biopolymer in a Reaction. However, the fact that an Enzyme is related in some way to a Reaction could have been represented by a much more general relationship, with a name such as participatesIn between Biopolymer and Reaction. Which approach is most suitable depends on the nature of the data to be modelled and the purpose for which the resulting database is being developed.

ER model for fingerprints
Conceptualising the relationship between sequences, their similarity and function is essential to exploiting the predictive power of comparative functional inference. PRINTS is a database of fingerprints, grouped sequence patterns that are characteristic of specific protein families.8 Its major advantage lies in the possibility of selecting levels of groupings for family definitions by choosing different collections of motifs from a fingerprint. This is important because the definition of a protein family may vary with the level of stringency one decides to choose. Patterns are derived from SWISS-PROT and TrEMBL.9 PRINTS is a typical example of a database in bioinformatics that was first implemented using ASCII files, and then migrated to a relational platform for performance and functionality reasons. Part of the migration process involved the development of a conceptual model for the data. PRINTS currently contains 1,210 entries and 7,200 motifs.8

The schema consists of three basic entity types, namely Sequence, Fingerprint and Motif, as shown in Figure 3. Each Sequence may contain any number of Motifs, each of which must appear in one or more sequences (many-to-many relationship). Each Fingerprint is a collection of Motifs, but each Motif must appear in one and only one Fingerprint. This is because within PRINTS a Motif is defined as a pattern that appears in a specific Fingerprint. However, a mobile protein fragment could cause exactly the same pattern to appear in another Fingerprint, which is handled in this context by the definition of more than one Motif associated with the same pattern. The relationship between a Fingerprint and a particular Sequence represents a functional characterisation of the protein. Depending on the number of Motifs in a Fingerprint which match a particular sequence, this assignment is considered to be more or less reliable. If all motifs occur in the right order, the relationship is


Figure 3: ER model for the PRINTS-S database8
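
The constraints described for the model of Figure 3 can be sketched in code. The Python below is a hedged illustration, not the PRINTS implementation: the `PrintsModel` class, its method names, the key names and the example accessions are all hypothetical. It keeps each Motif tied to exactly one Fingerprint, records Sequence/Motif occurrences as many-to-many pairs, and stores `accuracy` on the Assignment relationship. It does not enforce the rule that every Motif must occur in at least one Sequence.

```python
from dataclasses import dataclass
from enum import Enum


class Accuracy(Enum):
    TRUE_POSITIVE = "true positive"
    TRUE_PARTIAL = "true partial"
    FALSE_PARTIAL = "false partial"


@dataclass(frozen=True)
class Motif:
    motif_id: str                # assumed key name
    fingerprint_accession: str   # each Motif belongs to exactly one Fingerprint


class PrintsModel:
    """Hypothetical in-memory holder for the three entity types."""

    def __init__(self):
        self.fingerprints = set()   # Fingerprint accessions
        self.sequences = set()      # Sequence accessions
        self.motifs = {}            # motif_id -> Motif
        self.contains = set()       # many-to-many pairs (sequence, motif_id)
        self.assignments = {}       # (sequence, fingerprint) -> Accuracy

    def add_motif(self, motif):
        if motif.fingerprint_accession not in self.fingerprints:
            raise ValueError("a Motif must appear in an existing Fingerprint")
        self.motifs[motif.motif_id] = motif

    def add_occurrence(self, sequence, motif_id):
        if sequence not in self.sequences or motif_id not in self.motifs:
            raise ValueError("unknown Sequence or Motif")
        self.contains.add((sequence, motif_id))

    def assign(self, sequence, fingerprint, accuracy):
        # accuracy describes the pair, so it is stored with the relationship.
        if sequence not in self.sequences or fingerprint not in self.fingerprints:
            raise ValueError("unknown Sequence or Fingerprint")
        self.assignments[(sequence, fingerprint)] = Accuracy(accuracy)

    def motifs_of(self, fingerprint):
        return sorted(m.motif_id for m in self.motifs.values()
                      if m.fingerprint_accession == fingerprint)


# Placeholder identifiers, invented for illustration.
model = PrintsModel()
model.fingerprints.add("FP1")
model.sequences.add("SEQ1")
model.add_motif(Motif("FP1_1", "FP1"))
model.add_motif(Motif("FP1_2", "FP1"))
model.add_occurrence("SEQ1", "FP1_1")
model.assign("SEQ1", "FP1", "true partial")
```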

considered to be a ‘true positive’. If fewer motifs are matched in the sequence the possible values are ‘true partial’ or ‘false partial’. This is denoted by an attribute (accuracy) which is associated with the Assignment relationship, which can take any of these three values. Thus accuracy is appropriately an attribute of the relationship, as the accuracy is a characteristic of the relationship between a Sequence and a Fingerprint, and not a characteristic of either of these entity types themselves.

Implementing from ER models
Although in principle the ER model is implementation platform-independent, it is most commonly used in conjunction with relational databases, and design environments supporting ER often include facilities for generating table definitions from an ER model. As the relational model provides different modelling facilities from the ER model, implementing an ER model using a relational database involves mapping the constructs of the ER model into those of the relational model.

Many constructs of the ER model can be mapped quite directly onto relational tables for implementation. For example, each entity type is represented by a table in the relational model. Thus in the relational implementation of PRINTS, there are Fingerprint, Sequence and Motif tables, as shown in Figure 4.

The attributes of an entity type are mapped to attributes of the corresponding table, and the key of an entity type generally becomes the key of its corresponding table.

The way a relationship in the ER model is mapped onto tables depends principally on the cardinality of the relationship. For example, a one-to-many relationship is represented by storing the key of the table at the 1 end of the relationship as an attribute of the table at the many end (a foreign key). For example, in Figure 3, the FingerprintAccession attribute of Motif is a foreign key of the Fingerprint table. By contrast, a

Figure 4: Examples of tables generated from the ER model of Figure 3. See section on ‘Implementing from ER models’ for further details
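
The mapping rules from the section on ‘Implementing from ER models’ can be tried out with a small script. The sketch below uses Python's standard `sqlite3` module and is an assumption-laden illustration rather than the actual PRINTS table definitions: apart from the table names and the `FingerprintAccession` attribute of Motif, every column name is invented, and the table for the Sequence/Motif relationship is omitted. It shows a one-to-many relationship as a foreign key and a many-to-many relationship (Assignment) as a table with a compound key.

```python
import sqlite3

# Column names other than FingerprintAccession are assumptions.
DDL = """
CREATE TABLE Fingerprint (
    Accession TEXT PRIMARY KEY
);
CREATE TABLE Sequence (
    Accession TEXT PRIMARY KEY
);
-- One-to-many: the key of the table at the 1 end is stored at the many end.
CREATE TABLE Motif (
    MotifId TEXT PRIMARY KEY,
    FingerprintAccession TEXT NOT NULL REFERENCES Fingerprint(Accession)
);
-- Many-to-many: a table whose compound key combines the related keys.
CREATE TABLE Assignment (
    SequenceAccession TEXT NOT NULL REFERENCES Sequence(Accession),
    FingerprintAccession TEXT NOT NULL REFERENCES Fingerprint(Accession),
    Accuracy TEXT NOT NULL
        CHECK (Accuracy IN ('true positive', 'true partial', 'false partial')),
    PRIMARY KEY (SequenceAccession, FingerprintAccession)
);
"""


def create_database():
    """Create an in-memory database with foreign key checking switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves this off by default
    conn.executescript(DDL)
    return conn


# Placeholder rows, invented for illustration.
conn = create_database()
conn.execute("INSERT INTO Fingerprint VALUES ('FP1')")
conn.execute("INSERT INTO Sequence VALUES ('SEQ1')")
conn.execute("INSERT INTO Motif VALUES ('FP1_1', 'FP1')")
conn.execute("INSERT INTO Assignment VALUES ('SEQ1', 'FP1', 'true positive')")
```

With these definitions the database itself rejects a Motif that names a non-existent Fingerprint, and a second Assignment for the same Sequence and Fingerprint pair.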


many-to-many relationship is represented by a table that has a compound key consisting of the keys of the related tables, as in the case of the table Assignment.

UNIFIED MODELLING LANGUAGE (UML)
UML5 is the industry standard object modelling language. In this paper the focus is on class diagrams, which are used to model the structural aspects of data within UML. However, UML contains many different modelling notations, which also allow the behaviour of an application to be described in different ways, so UML can be seen as providing comprehensive application modelling facilities. For example, use case diagrams can be used during requirements analysis to identify the principal categories of users making use of a system and the activities carried out by those user groups.

[Margin note: UML is a standard object modelling language]

As an object-oriented modelling language, the central notation in UML is the class diagram. A class is a description of the attributes, operations and relationships of a set of objects. As such, a class in UML is analogous to an entity type in ER modelling. The instances of a class are referred to as objects, and are analogous to entities in ER modelling.

A class is depicted as a rectangle, within which is stated the name of the class. Optionally, the names and types of attributes and operations can also be depicted within the rectangle that represents the class. For example, Figure 5 includes the class Protein, which has attributes name and accessionNumber, both of which are of type String. The class Protein also has an operation display(), which can be expected to print or draw objects that are instances of Protein. By convention, the names of classes start with capital letters, and the names of attributes start with lower case letters. Where a name of a class or attribute is constructed from several words, the later words start with capital letters, as in accessionNumber. Unlike ER models, UML class diagrams do not support keys – this is probably because typical implementation platforms for UML classes do not support keys, whereas ER models are typically mapped to relational databases for implementation, where keys have a prominent role.

UML supports the description of several different kinds of relationships between classes. An is-a-kind-of relationship is depicted by a closed headed arrow, eg as from Enzyme to Protein in Figure 5. Protein is said to be a superclass of Enzyme, and Enzyme a subclass of Protein. All the attributes, relationships and operations of a superclass are inherited by its subclasses. A class may have zero, one or more superclasses or subclasses.

Relationships (other than the

Figure 5: Example UML model. Three concepts (Enzyme, Protein, TertiaryStructure) from three domains (biochemistry, molecular biology and structural biology respectively) are integrated in one schema
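
As a rough guide to how the classes of Figure 5 might look on an object-oriented platform, the following Python sketch renders them directly. It is a hypothetical rendering under stated assumptions: attribute names are written in Python's snake_case style (`accession_number` for accessionNumber, `modelled_protein` for the role modelledProtein), the text produced by `display()` is invented, and the example values are placeholders.

```python
class Protein:
    def __init__(self, name: str, accession_number: str):
        self.name = name                          # UML attribute name: String
        self.accession_number = accession_number  # UML attribute accessionNumber: String
        # Association end with cardinality "many": zero or more structures.
        self.tertiary_structures = []

    def display(self) -> str:
        """Print a one-line description (the output format is an assumption)."""
        text = f"{self.name} ({self.accession_number})"
        print(text)
        return text


class Enzyme(Protein):
    """Enzyme is a subclass of Protein and inherits its attributes,
    relationships and operations."""


class TertiaryStructure:
    def __init__(self, modelled_protein: Protein):
        # Each TertiaryStructure is the structure of a single Protein.
        self.modelled_protein = modelled_protein
        modelled_protein.tertiary_structures.append(self)


# Placeholder values, invented for illustration.
peroxidase = Enzyme("peroxidase", "P0001")
first = TertiaryStructure(peroxidase)
second = TertiaryStructure(peroxidase)
```

The association is navigable in both directions here: from a protein to its list of structures, and from each structure back to the protein it models.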


generalisation relationship) between classes are known as associations in UML. For example, the fact that a protein may be associated with several experimentally determined tertiary structures can be represented by the association illustrated in Figure 5. The association hasDeterminedStructure links a Protein to information about its experimentally determined TertiaryStructures. It is possible to name associations in different ways. For example, when a class participates in a relationship it plays some role in the relationship; roles are the most common way of naming relationships in UML. In Figure 5, the class TertiaryStructure is fulfilling the role of the TertiaryStructure of the associated Protein. Furthermore, from the perspective of the TertiaryStructure, the class Protein is playing the role of a modelledProtein.

In analogy to the ER model, it is also possible to annotate a relationship with some information on its cardinality. For example, in Figure 5, a TertiaryStructure is the structure of a single protein (the default cardinality, and thus not shown explicitly), and a Protein may have several experimentally determined TertiaryStructures (as depicted by the 0..*, where * represents ‘many’).

As well as annotating an association with its cardinality, a further structure modelling feature of UML that is used in the subsequent examples is aggregation, which allows the representation of the part-of relationship. An aggregation is depicted by a diamond (◇) at the end of the relationship that represents the whole in the part–whole relationship. Examples of aggregations are given in Figure 6, for example to indicate that a Chain is part of a Protein and that a SecondaryStructureElement is part of a Chain.

In this and subsequent sections, UML diagrams have been drawn using the UML editor ARGO, which can be downloaded.10,11 It should be noted that UML class diagrams include more modelling facilities than are described here; for further details, see Booch et al.5

UML model for protein structure
This section describes a UML model for protein structure data, which is a simplification of the object-oriented data model provided in Gray et al.12 The original model was implemented directly using an object database system. The UML diagram is given in Figure 6. The model includes primary, secondary and tertiary structure information. The topmost class in Figure 6 is Protein, of which all other classes are either directly or indirectly components. All the relationships between classes in Figure 6 are either aggregation or generalisation relationships.

A Protein has several attributes, which indicate its name, the code by which it is identified in PDB, and its molecularWeight. Each Protein consists of one or more Chains. A Chain can be seen as consisting of a collection of Residues or as a collection of SecondaryStructureElements.

The class Residue provides both primary structure (the name of the amino acid at a particular position within the Chain) and tertiary structure information (the coordinates of the residue within the model). The x, y and z coordinates of each atom associated with the Residue are modelled using the class Coordinates, which associates the position of each atom with the name of the atom (eg ca could be used to refer to the α-carbon).

The class SecondaryStructureElement is an abstract class – this is depicted in the diagram by the fact that its name is in Courier. An abstract class is one for which no direct instance objects are ever created, but which can play a useful organisational role in the diagram. In this case, SecondaryStructureElement is the superclass of Loop, Helix and Strand, all of which can have direct instances.

(This should really be depicted in italics, but ARGO generates a Courier font in its PostScript generator.)

Two of the subclasses of SecondaryStructureElement have


Figure 6: UML model for protein structure. See the section on ‘UML model for protein structure’ for further details
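
As a rough illustration of how the classes described for Figure 6 might look in an object-oriented language, the following Java sketch covers the two aggregations (Protein to Chain, Chain to SecondaryStructureElement), the abstract superclass, the type attribute on Helix and the relationship between Strand and Sheet. It is not taken from the original schema: the field names, constructors, the describe method and the use of java.util.List are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of the protein structure classes; not the authors' code.
class ProteinModelDemo {
    public static void main(String[] args) {
        Protein protein = new Protein("example protein");
        Chain chain = new Chain("A");
        protein.addChain(chain);

        Sheet sheet = new Sheet();
        chain.addElement(new Helix("alpha"));
        chain.addElement(new Strand(sheet));
        chain.addElement(new Loop());

        for (SecondaryStructureElement e : chain.getElements()) {
            System.out.println(e.describe());
        }
    }
}

// The whole in the Protein-Chain aggregation.
class Protein {
    private final String name;
    private final List<Chain> chains = new ArrayList<>();
    Protein(String name) { this.name = name; }
    String getName() { return name; }
    void addChain(Chain c) { chains.add(c); }
    List<Chain> getChains() { return chains; }
}

// Part of a Protein, and the whole for its secondary structure elements.
class Chain {
    private final String id;
    private final List<SecondaryStructureElement> elements = new ArrayList<>();
    Chain(String id) { this.id = id; }
    String getId() { return id; }
    void addElement(SecondaryStructureElement e) { elements.add(e); }
    List<SecondaryStructureElement> getElements() { return elements; }
}

// Abstract: no direct instances, only Loop, Helix and Strand objects exist.
abstract class SecondaryStructureElement {
    abstract String describe();
}

class Loop extends SecondaryStructureElement {
    String describe() { return "Loop"; }
}

// Kinds of helix are distinguished by an attribute, not by further subclasses.
class Helix extends SecondaryStructureElement {
    private final String type; // eg "alpha" or "threeten"
    Helix(String type) { this.type = type; }
    String getType() { return type; }
    String describe() { return "Helix(" + type + ")"; }
}

class Sheet {
    private final List<Strand> strands = new ArrayList<>();
    void addStrand(Strand s) { strands.add(s); }
    List<Strand> getStrands() { return strands; }
}

// Strand has a relationship (to Sheet) that the superclass does not share,
// which is the criterion given in the text for using a subclass.
class Strand extends SecondaryStructureElement {
    private final Sheet sheet;
    Strand(Sheet sheet) { this.sheet = sheet; sheet.addStrand(this); }
    Sheet getSheet() { return sheet; }
    String describe() { return "Strand"; }
}
```

Because SecondaryStructureElement is abstract, an attempt to instantiate it directly is rejected by the compiler, which mirrors the role of the abstract class in the diagram.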

additional properties that are not shared by all SecondaryStructureElements. In particular, Helix has the attribute type which could, for example, be α or threeten, and each Strand is related to the Sheet of which it is part.

The modelling of SecondaryStructureElements illustrates a common dilemma in modelling – whether to use the generalisation hierarchy or an attribute to categorise the members of a collection. For example, the addition of an attribute type on SecondaryStructureElement could be seen as removing the need for the subclasses Loop, Helix and Strand – the type attribute could then take on a value that indicates whether a specific SecondaryStructureElement is a Loop, a Helix or a Strand. A criterion that encourages the use of a generalisation hierarchy in this sort of situation is the presence of attributes or relationships in the subclasses that are not applicable to the superclass (eg the relationship between Strand and Sheet). As there are no attributes or relationships specific to different kinds of helices in Figure 6, the different kinds of Helix are distinguished using the attribute type, rather than through the introduction of additional subclasses.

In terms of modelling practice, UML is relaxed in many things. For example, in Figure 6 not all relationships are given names or role names – it is hoped that the use of aggregation will allow the user to infer names such as consistsOf and isPartOf rather than these having to be given explicitly; classes are given attributes, but the types of the attributes are not specified in the diagram – in general it is good practice when modelling at least to identify the attributes associated with different classes, as this is important in clarifying exactly what data each class actually models; and no classes are given operations – the original schema was produced to describe the data and not the way the data are used, so the emphasis was not on the behaviour associated with the classes.

UML model for genome sequences
The UML model provided in Figure 7, which is originally from Paton et al.,13 can be used to describe sequence information

Figure 7: UML model for eukaryote genome sequence. See the section on ‘UML model for genome sequences’ for further details
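
The core of the genome model in Figure 7 can be sketched in Java as follows. The sketch shows a Genome made of Chromosomes, ChromosomeFragments that store only start and end positions and recover their sequence from the Chromosome, and a recursive next/previous link for ordering. Several details are assumptions rather than part of the published schema: positions are taken to be 1-based and inclusive, TranscribedRegion and NonTranscribedRegion are treated as subclasses of ChromosomeFragment, and all method names and the toy sequence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of the genome sequence classes; not the authors' code.
class GenomeModelDemo {
    public static void main(String[] args) {
        Genome genome = new Genome();
        Chromosome chr = new Chromosome(1, "ACGTACGTAC"); // toy sequence
        genome.addChromosome(chr);

        ChromosomeFragment first = new NonTranscribedRegion(chr, 1, 4, "promoter");
        ChromosomeFragment second = new TranscribedRegion(chr, 3, 8); // overlaps first
        first.linkNext(second);

        System.out.println(first.getSequence());           // ACGT
        System.out.println(first.getNext().getSequence()); // GTACGT
    }
}

class Genome {
    private final List<Chromosome> chromosomes = new ArrayList<>();
    void addChromosome(Chromosome c) { chromosomes.add(c); }
    List<Chromosome> getChromosomes() { return chromosomes; }
}

class Chromosome {
    private final int number;
    private final String sequence;
    Chromosome(int number, String sequence) {
        this.number = number;
        this.sequence = sequence;
    }
    int getNumber() { return number; }
    String getSequence() { return sequence; }
}

// A fragment stores positions only; the bases live on the Chromosome.
abstract class ChromosomeFragment {
    private final Chromosome chromosome;
    private final int start; // assumed 1-based, inclusive
    private final int end;   // assumed 1-based, inclusive
    private ChromosomeFragment next;
    private ChromosomeFragment previous;

    ChromosomeFragment(Chromosome chromosome, int start, int end) {
        this.chromosome = chromosome;
        this.start = start;
        this.end = end;
    }

    // Look up the relevant range in the sequence attribute of Chromosome.
    String getSequence() {
        return chromosome.getSequence().substring(start - 1, end);
    }

    // Recursive next/previous relationship; both ends are set together.
    void linkNext(ChromosomeFragment other) {
        this.next = other;
        other.previous = this;
    }
    ChromosomeFragment getNext() { return next; }
    ChromosomeFragment getPrevious() { return previous; }
}

class TranscribedRegion extends ChromosomeFragment {
    TranscribedRegion(Chromosome c, int start, int end) { super(c, start, end); }
}

// Promoters, centromeres and telomeres differ only in the description.
class NonTranscribedRegion extends ChromosomeFragment {
    private final String description;
    NonTranscribedRegion(Chromosome c, int start, int end, String description) {
        super(c, start, end);
        this.description = description;
    }
    String getDescription() { return description; }
}
```

Storing positions rather than copies of the bases keeps overlapping fragments cheap and means the Chromosome remains the single source of the sequence.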

for fully sequenced eukaryotic genomes. The model has been used as the basis for an implementation using an object database, and has been populated with sequence data from Saccharomyces cerevisiae.

A Genome consists of one or more Chromosomes, each of which has a number and a sequence. These Chromosomes in turn consist of a collection of potentially overlapping ChromosomeFragments, each of which represents a TranscribedRegion or a NonTranscribedRegion within the chromosome. The next/previous relationship on ChromosomeFragment is an example of a recursive relationship, which is used in this context to provide an ordering to the ChromosomeFragments. In fact, UML allows a constraint to be specified for a relationship, to the effect that the elements at one end of the association are {ordered}. This can be written at the end of the relationship where the ordering


exists, but it may additionally be useful to have a next/previous relationship to directly associate related regions.

The class NonTranscribedRegion is used to represent features such as promoters, centromeres and telomeres, which are distinguished only by the description attribute – these could generally benefit from more detailed modelling than is provided here. The DNA sequence associated with each ChromosomeFragment is modelled by recording the start and end positions of the sequence of each fragment in ChromosomeFragment. These start and end positions can be used to obtain the actual sequence of a ChromosomeFragment by looking up the relevant range in the sequence attribute of Chromosome.

The modelling of transcribed regions is somewhat involved, in that it is necessary to be able to capture alternative splicing.13 Figure 8 illustrates the relationship between several of the classes in the model. The top part of the figure represents the PrimaryTranscripts associated with a single TranscribedRegion, and the bottom part of the figure illustrates two SplicedTranscripts that have resulted from alternative splicing. The Introns, which are shown shaded, do not contribute to the SplicedTranscripts, but several of the SplicedTranscriptComponents can potentially end up as part of differently constituted SplicedTranscripts. The many-to-many relationship between SplicedTranscriptComponent and SplicedTranscript represents the fact that a SplicedTranscript is commonly composed of more than one SplicedTranscriptComponent, and that on occasion there may be several alternative SplicedTranscripts that can result from a collection of SplicedTranscriptComponents.

This model for genome sequences illustrates the extent to which the purpose of the model influences what is to be modelled. The model is explicitly targeted at fully sequenced genomes, and there is no attempt to describe the experimental data generated during the sequencing. An example of a conceptual model for genome mapping is given in Hu et al.14

Implementing from UML models
In principle, UML models are independent of the implementation platform to be used. In practice, mapping UML models, including class diagrams, onto object-oriented implementation

Figure 8: Illustration of alternative splicing. See the section on ‘UML model for genome sequences’ for further details
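
The many-to-many relationship between SplicedTranscriptComponent and SplicedTranscript that Figure 8 illustrates could be represented in Java along the following lines. This is only a sketch under stated assumptions: the label and name fields, the addComponent method and the component labels c1 to c3 are hypothetical, and Introns are simply left out because they do not contribute to any SplicedTranscript.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of alternative splicing; not the authors' code.
class SplicingDemo {
    public static void main(String[] args) {
        SplicedTranscriptComponent c1 = new SplicedTranscriptComponent("c1");
        SplicedTranscriptComponent c2 = new SplicedTranscriptComponent("c2");
        SplicedTranscriptComponent c3 = new SplicedTranscriptComponent("c3");

        // Two alternative transcripts built from overlapping sets of components.
        SplicedTranscript full = new SplicedTranscript("full");
        full.addComponent(c1);
        full.addComponent(c2);
        full.addComponent(c3);

        SplicedTranscript skipping = new SplicedTranscript("skipping");
        skipping.addComponent(c1);
        skipping.addComponent(c3);

        System.out.println(c1.getTranscripts().size()); // 2
        System.out.println(c2.getTranscripts().size()); // 1
    }
}

class SplicedTranscriptComponent {
    private final String label;
    // One component may end up in several differently constituted transcripts.
    private final List<SplicedTranscript> transcripts = new ArrayList<>();
    SplicedTranscriptComponent(String label) { this.label = label; }
    String getLabel() { return label; }
    List<SplicedTranscript> getTranscripts() { return transcripts; }
    void registerTranscript(SplicedTranscript t) { transcripts.add(t); }
}

class SplicedTranscript {
    private final String name;
    // One transcript is commonly composed of more than one component.
    private final List<SplicedTranscriptComponent> components = new ArrayList<>();
    SplicedTranscript(String name) { this.name = name; }
    String getName() { return name; }
    List<SplicedTranscriptComponent> getComponents() { return components; }

    // Updates both ends of the many-to-many relationship in one place.
    void addComponent(SplicedTranscriptComponent c) {
        components.add(c);
        c.registerTranscript(this);
    }
}
```

Keeping a list on each side is what makes the relationship many-to-many; a relational mapping would normally use a separate linking table for the same purpose.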


platforms is more straightforward and intuitive than mapping onto non-object-based platforms. This is not to say, however, that it cannot be done. In the same way as the section on ‘Implementing from ER models’ provided some rules to guide the mapping of ER models to the relational model, Blaha and Premerlani7 provide a comprehensive description of how to map class diagrams onto relational tables. This process is along the same lines as the process for ER models, but the absence of keys in UML models and the tendency for inheritance to be used more widely in object models often make the mapping process more involved.

Here, the process of implementing from a UML model is illustrated using the genome schema from Figure 7, which has been implemented using the object database POET.15 While POET provides more than one way of implementing database classes, one makes use of Java class definitions, which are preprocessed by the POET system. In essence, POET then allows objects that are instances of the preprocessed classes to be stored in the database. A schema fragment in POET Java is illustrated in Figure 9.

A class from the UML diagram maps into a class in Java. Each attribute from the UML class maps to an attribute in Java. Relationships in UML map onto attributes in one (or both) of the related classes. A complication here is that neither Java nor POET supports bidirectional relationships automatically, so if the implementer wants both roles of a relationship to be represented directly in the Java classes, some effort is required to keep these roles consistent with each other. For example, in Figure 7, if the chromosomes attribute of Genome is to be consistent with the fromGenome attribute of Chromosome, application programs or class methods must be programmed to support such consistency. Figure 9 also illustrates the use of the POET collection class SetOfObject, which is one of several collection classes provided with POET, following the industry standard for object databases.16

A further immediate change in perspective tends to take place when mapping from UML class diagrams onto the constructs of object-oriented programming languages. Although UML class diagrams place quite a substantial emphasis on structural aspects of the data (such as attributes and relationships), it is generally considered good practice to make the structural properties of a program class private, and to provide access to such properties only through methods (see, for example, the get and set methods in Figure 9). Languages such as Java have well-defined conventions for defining and naming methods used for accessing structural properties.

DISCUSSION
Conceptual data models are a proven technology, in that they have been in widespread use in numerous software projects for many years. However, they have not been used particularly widely with biological and biochemical data, despite the increasing number and complexity of information resources in these areas. This paper seeks to increase awareness of the role that conceptual models can play in describing and understanding the structure of biological data, and has provided example models using the two most widely used conceptual modelling notations.

While ER was developed earlier on and for relational databases, UML supports object-oriented design better.

UML v. ER and other implementational issues
The conceptual modelling notations illustrated in this paper, namely the ER model and UML class diagrams, were developed at different times for somewhat different purposes. The ER model was developed to support the design of database schemas, in particular schemas for relational databases. As such, ER models have reasonably direct and systematic mappings onto relational databases for implementation,2 and support modelling notions that are familiar from the relational model (eg

Figure 9: Mapping of Genome and Chromosome to POET Java classes
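
The consistency problem for bidirectional relationships can be made concrete with a plain Java sketch of Genome and Chromosome. It is not the POET Java schema shown in Figure 9: the standard java.util.HashSet stands in for the POET collection class SetOfObject, no database preprocessing is involved, and the method bodies are assumptions about one reasonable way to keep the chromosomes and fromGenome roles in step. The private fields with get and set methods follow the access convention described in the text.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo of a consistent bidirectional relationship.
class BidirectionalDemo {
    public static void main(String[] args) {
        Genome first = new Genome();
        Genome second = new Genome();
        Chromosome chr = new Chromosome(1);

        first.addChromosome(chr);
        System.out.println(chr.getFromGenome() == first);           // true

        chr.setFromGenome(second); // moving the chromosome updates both genomes
        System.out.println(first.getChromosomes().contains(chr));   // false
        System.out.println(second.getChromosomes().contains(chr));  // true
    }
}

class Genome {
    // java.util.HashSet is used here in place of POET's SetOfObject.
    private final Set<Chromosome> chromosomes = new HashSet<>();

    // Read-only view, so callers cannot bypass the consistency logic.
    Set<Chromosome> getChromosomes() {
        return Collections.unmodifiableSet(chromosomes);
    }

    void addChromosome(Chromosome c) {
        if (chromosomes.add(c)) {
            c.setFromGenome(this); // keep the inverse role in step
        }
    }

    void removeChromosome(Chromosome c) {
        if (chromosomes.remove(c) && c.getFromGenome() == this) {
            c.setFromGenome(null);
        }
    }
}

class Chromosome {
    private int number;
    private Genome fromGenome;

    Chromosome(int number) { this.number = number; }

    int getNumber() { return number; }
    void setNumber(int number) { this.number = number; }

    Genome getFromGenome() { return fromGenome; }

    void setFromGenome(Genome genome) {
        if (fromGenome == genome) {
            return; // nothing to do; this also stops mutual recursion
        }
        Genome old = fromGenome;
        fromGenome = genome; // update first so re-entrant calls terminate
        if (old != null) {
            old.removeChromosome(this);
        }
        if (genome != null) {
            genome.addChromosome(this);
        }
    }
}
```

Placing the bookkeeping inside the class methods, rather than in each application program, means every caller gets the same guarantee whichever end of the relationship it updates.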

keys). By contrast, UML class diagrams are only one of a collection of diagrams that together support the object-oriented design of complete applications. As class diagrams are not targeted at any specific category of application, mapping of these diagrams onto implementation platforms can be less direct or systematic than in the


narrower context within which ER is used, but it is often straightforward to map class diagrams onto object-oriented implementation platforms.

While there is now a standard definition of UML, there are many ER proposals, which themselves differ significantly, eg in terms of the ways that inheritance is supported. In this paper we have thus taken the view that it is appropriate to choose and describe a specific ER proposal, but not to elaborate on the ways this specific proposal differs from UML. We have also deliberately avoided detailed discussion of how behaviour is modelled in UML, as this is a large area for which there is no room in the paper.5

Outlook
Developing conceptual models is not straightforward, which in turn reflects the fact that obtaining a clear understanding of the semantics of a piece of data is not always easy. Obtaining a clear and agreed view of the data in a domain is challenging, as different people will see things in different ways, use terminology differently and emphasise different features. The situation is further complicated since biosciences are non-axiomatic and the views on the same or similar concepts vary strongly among different, although closely related, communities. However, conceptual models can be helpful in developing, making explicit and communicating clear and detailed descriptions of data that are available or that are about to be produced.

It is hoped that this paper can extend the use of conceptual models within bioinformatics, and thus ease the currently growing problems with managing and sharing biological data.

Acknowledgements
We are pleased to acknowledge the support of the BBSRC/EPSRC Bioinformatics Initiative in funding work on genome information management at Manchester.

References
1. Baxevanis, A. D. (2000), ‘The molecular biology database collection: an online compilation of relevant database resources’, Nucleic Acids Res., Vol. 28, pp. 1–7.
2. Elmasri, R. and Navathe, S. (2000), ‘Fundamentals of Database Systems’, 3rd edn, Addison-Wesley, Reading, MA.
3. Chen, P. (1976), ‘The entity relationship model: Toward a unified view of data’, ACM Trans. Database Systems, Vol. 1, pp. 9–36.
4. Booch, G. (1991), ‘Object-oriented Design with Applications’, Benjamin/Cummings, Redwood City, CA.
5. Booch, G., Rumbaugh, J. and Jacobson, I., Eds (1999), ‘The Unified Modelling Language User Guide’, Addison-Wesley, Reading, MA.
6. Rumbaugh, J., Blaha, M., Premerlani, W. et al., Eds (1991), ‘Object-oriented Design and Modelling’, Prentice-Hall, Englewood Cliffs, NJ.
7. Blaha, M. and Premerlani, W., Eds (1998), ‘Object-oriented Modelling and Design for Database Applications’, Prentice-Hall, Englewood Cliffs, NJ.
8. Attwood, T., Croning, M., Flower, D. et al. (2000), ‘PRINTS-S: The database formerly known as PRINTS’, Nucleic Acids Res., Vol. 28, pp. 225–227.
9. Bairoch, A. and Apweiler, R. (2000), ‘The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000’, Nucleic Acids Res., Vol. 28, pp. 45–48.
10. URL: http://argouml.tigris.org
11. Robbins, J., Hilbert, D. and Redmiles, D. (1997), ‘ARGO: A design environment for evolving software architectures’, in ‘Proc. ICSE’, ACM Press, pp. 600–601.
12. Gray, P., Paton, N., Kemp, G. and Fothergill, J. (1990), ‘An object-oriented database for protein structure analysis’, Protein Eng., Vol. 4(3), pp. 235–243.
13. Paton, N., Khan, S., Hayes, A. et al. (2000), ‘Conceptual modelling of genomic information’, Bioinformatics, Vol. 16(6), pp. 548–557.
14. Hu, J., Mungall, C., Nicholson, D. and Archibald, A. (1998), ‘Design and implementation of a CORBA-based genome mapping system prototype’, Bioinformatics, Vol. 14(2), pp. 112–120.
15. URL: http://www.poet.com
16. Cattell, R. and Barry, D. (2000), ‘The Object Database Standard: ODMG 3.0’, Morgan Kaufmann, San Diego, CA.
