
Information & Management 49 (2012) 151–163

Contents lists available at SciVerse ScienceDirect

Information & Management


journal homepage: www.elsevier.com/locate/im

Data modeling: Description or design?


Graeme Simsion, Simon K. Milton *, Graeme Shanks
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia

ARTICLE INFO

Article history:
Received 13 November 2010
Received in revised form 15 November 2011
Accepted 25 January 2012
Available online 14 February 2012

Keywords:
Conceptual data modeling
Practitioner study
Design
Analysis

ABSTRACT

Data modeling for database creation has generally been considered to be a descriptive process: the real world is observed and represented in a conceptual model that is then transformed into a logical structure for a database. This is reflected in prescriptive methods and is the dominant assumption in most studies. However, data modeling can also be considered a type of design with negotiable requirements, a creative process, and many workable solutions. Our paper discusses empirical results from almost 500 practitioners on three continents comparing data modeling to design. We found that data modeling, as practiced, was better characterized as design.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Alternative views of data modeling

Data modeling is one of the most critical activities in the implementation of an IS: it has been characterized as a process of reality mapping. This characterization has been occasionally challenged from a philosophical perspective, from observations of practice, and from empirical evidence.
This descriptive characterization also dominates the practitioner literature. In a descriptive activity, a set of artifacts may be created, and this might well be called design, but not be of sufficient importance to the overall result as to characterize the entire activity as design. In data modeling, there is choice in the selection of components (typically entities, relationships and attributes) used to represent some part of reality. The difference between description and design is in whether this selection is a trivial part of the process compared to understanding the Universe of Discourse (UoD) {descriptive type}, or whether it is the essence of the process {design type}.

1.2. Previous empirical research

We know little about how experienced data modelers approach their work or about the models that are produced for real business applications. This is because most studies have assumed the process to be descriptive and have not involved practitioners.
The descriptive characterization is embodied in most empirical studies through the use of a gold standard – a single correct solution devised by the researcher, who often embedded entity and relationship names in descriptions, thus constraining the modeling abstractions. The “business requirements” amounted to a plain language description and the participant’s task was to translate the description to the original diagram. For example, “An employee can report to only one department. Each department has a phone number.” Tasks showing these two traits mainly tested facility with modeling formalisms. Yet it is common to see conclusions that indicated that novice designers did not run into much trouble in modeling entities and attributes. In the context of the research task, modeling entities may have meant little more than identifying nouns in the description. The use of simple models and prescriptive instructions limited the scope of the design.
Most empirical studies have used students as participants; of course, this limited the difficulty of the problems posed. Of the total of 3210 participants across 59 studies that we surveyed, only 147 in nine studies had more than one year’s industry experience of data modeling. Thus most studies used unrealistically simple data models. Nevertheless, some studies have used experienced data modelers and have uncovered design behavior. Comparisons of novice and expert data modelers have revealed behaviors characteristic of designers (attempting to gain a holistic understanding, categorization of problems, pattern re-use) in the experts but not in the novices.
1.3. The research question

Our research question was: Is data modeling better characterized as description or design? Here, data model refers to a model of a specific UoD (e.g. the data model of ABC corporation’s human resources operations), while data modeling refers to the set of activities required to specify a conceptual schema that will transform into a database schema but prior to its transformation into a specific DBMS data definition.
Specifically, we examined the process of data modeling that resulted in a database design that could be implemented in a relational DBMS. We did not include other purposes of conceptual data modeling (e.g., its use in IS planning).
This is an important research question for at least three reasons:

• Most data modeling research assumes the descriptive characterization, notably in the design of experiments and in the application of ontology [2,3]. If data modeling is, in fact, design, research results need to be reinterpreted in that light.
• Data modeling education should include expert level practice.
• Creative thinking and evaluation of alternative designs are intrinsic to design processes. If data modeling is seen as a design activity rather than description, then data modeling methods should be updated to reflect a design process and explicitly include creative thinking and comparative evaluation.

* Corresponding author. Fax: +61 3 9349 4596.
E-mail address: simon.milton@unimelb.edu.au (S.K. Milton).

0378-7206/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.im.2012.01.003

2. The ideal type of ‘design’

Design is the ideal type against which we measure data modeling. Its essence has been synthesized in the form of a list of some of the important characteristics of design problems and solutions, and the design process itself. These characteristics are intended to typify design and thus to differentiate it from description.
The list is not exhaustive, and the characteristics are interrelated. Collectively, they provide an overall picture of design. The characteristics (shown in Table 1) were grouped into fourteen properties within three dimensions – Problems, Solutions (i.e., Products), and Process, following Lawson’s [1] properties of design.

Table 1
Lawson’s properties of design.

Design problems
  1. Design problems cannot be comprehensively stated
  2. Design problems require subjective interpretation
  3. Design problems tend to be organized hierarchically
The design process
  1. The process is endless
  2. There is no infallibly correct process
  3. The process involves finding as well as solving problems (including creativity)
  4. Design inevitably involves subjective value judgments
  5. Design is a prescriptive activity
  6. Designers work in the context of a need for action
Design solutions
  1. There are [sic] an inexhaustible number of different solutions
  2. There are no optimal solutions to design problems
  3. Design solutions are often holistic responses
  4. Design solutions are a contribution to knowledge
  5. Design solutions are parts of other design problems

3. Research design

The framework provides a basis for expanding the research question to 11 research sub-questions (RSQs) (see Table 2). Scope seeks to clarify what practitioners mean by data modeling. The remaining three dimensions determine whether data modeling practice has the properties of the design type: Problem deals with the negotiability of data modeling requirements, Process examines whether data modeling is creative, and Product deals with the diversity in data models produced by experienced practitioners in response to a task; this suggests that data modeling is a design process.
The RSQs were addressed with a combination of surveys, laboratory studies, and interviews. The selection of the mode for each is shown in Table 3. Semi-structured interviews (with influencers of practitioners or thought leaders), surveys (to collect the perceptions of experienced data modelers about data modeling products, processes, and problems), and laboratory studies (designed to explore diversity and style in data models by asking participants to complete modeling tasks which were examined for evidence of diversity, style, and patterns in data modeling) were chosen to provide multiple sources of data to assess the practice of data modeling against the ideal type of design.
The surveys and laboratory studies were incorporated into 12 data modeling seminars and workshops for experienced practitioners delivered by the first author in the US, UK, Scandinavia and Australia between May 2002 and November 2004 (see Appendix A for summary of participants). Figs. 1–4 summarize the responses to demographic questions.
There was a strong correlation between the two experience measures (γ = 0.65, p < 0.0005). Our study was the largest currently published; it included 381 participants with at least one year of data modeling experience. The minimum number to complete any task was 55. Three other groups participated in the research. The Practitioner thought-leaders and expert model evaluators were purposive samples. Architects and accountants (who provided a benchmark for the Characteristics of Data Modeling component) were recruited from personal and professional contact lists.

Table 2
Research sub-questions (RSQs).

Type dimension | Name | Research sub-question
General – applying to all three dimensions | Scope (RSQ1) | What do data modeling practitioners believe is the scope and role of data modeling within the database design process?
General – applying to all three dimensions | Importance (RSQ2) | Is the description/design question considered important by data modeling practitioners?
General – applying to all three dimensions | Espoused Beliefs (RSQ3) | What are the (espoused) beliefs of data modeling practitioners on the description/design question?
Problem | Perception of Problems (RSQ4) | Are data modeling problems perceived as design problems by data modeling practitioners?
Process | Methods (RSQ5) | Do database design methods used in practice support a descriptive or design characterization of data modeling?
Process | Perception of Processes (RSQ6) | Are data modeling processes perceived as design processes by data modeling practitioners?
Product | Perception of Products (RSQ7) | Are data modeling products perceived as design products by data modeling practitioners?
Product | Diversity in Conceptual Modeling (RSQ8) | Will different data modeling practitioners produce different conceptual data models for the same scenario?
Product | Diversity in Logical Modeling (RSQ9) | Will different data modeling practitioners produce different logical data models from the same conceptual model?
Product | Patterns (RSQ10) | Do data modeling practitioners use patterns when developing models?
Product | Style (RSQ11) | Do data modeling practitioners exhibit personal styles that can be identified in the data models that they create?

Table 3
Research methods used.

Research component | Method | Sub-questions addressed (listed by name)
Interviews with thought-leaders | Interviews | Importance (RSQ2), Espoused Beliefs (RSQ3), Perception of Problems (RSQ4), Perception of Processes (RSQ6), Perception of Products (RSQ7)
Scope and stages | Survey | Scope (RSQ1), Methods (RSQ5)
Espoused positions on data modeling | Survey | Importance (RSQ2), Espoused Beliefs (RSQ3)
Characteristics of data modeling | Survey | Perception of Problems (RSQ4), Perception of Processes (RSQ6), Perception of Products (RSQ7)
Diversity in conceptual modeling | Laboratory study | Diversity in Conceptual Modeling (RSQ8), Style (RSQ11)
Diversity in logical modeling | Laboratory study | Diversity in Logical Modeling (RSQ9), Style (RSQ11)
Style in data modeling | Laboratory study | Style (RSQ11), Diversity in Conceptual Modeling (RSQ8), Diversity in Logical Modeling (RSQ9), Patterns (RSQ10)
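
As an illustration only, the mapping in Table 3 can be written down as a small data structure and checked for coverage of the 11 RSQs. The sketch below is not part of the study; the names `COMPONENTS`, `uncovered_rsqs`, and `methods_for` are hypothetical, and the RSQ numbers are transcribed from Table 3.

```python
# Research components, their method, and the RSQs they address (from Table 3).
# The dictionary layout and function names are illustrative assumptions.
COMPONENTS = {
    "Interviews with thought-leaders": ("Interviews", {2, 3, 4, 6, 7}),
    "Scope and stages": ("Survey", {1, 5}),
    "Espoused positions on data modeling": ("Survey", {2, 3}),
    "Characteristics of data modeling": ("Survey", {4, 6, 7}),
    "Diversity in conceptual modeling": ("Laboratory study", {8, 11}),
    "Diversity in logical modeling": ("Laboratory study", {9, 11}),
    "Style in data modeling": ("Laboratory study", {8, 9, 10, 11}),
}


def uncovered_rsqs(components, total=11):
    """Return the RSQ numbers (1..total) that no component addresses."""
    covered = set().union(*(rsqs for _, rsqs in components.values()))
    return sorted(set(range(1, total + 1)) - covered)


def methods_for(rsq, components):
    """Return the distinct methods used to address one RSQ."""
    return sorted({method for method, rsqs in components.values() if rsq in rsqs})


if __name__ == "__main__":
    print("Uncovered RSQs:", uncovered_rsqs(COMPONENTS))
    for rsq in range(1, 12):
        print(f"RSQ{rsq}:", ", ".join(methods_for(rsq, COMPONENTS)))
```

Read this way, the perception questions (RSQ4, RSQ6, RSQ7) are each approached through both interviews and a survey, which is consistent with the stated aim of using multiple sources of data.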

4. Results

4.1. Interviews with thought leaders

Interviews were held with seventeen “thought leaders” (see Appendix B); they were conducted by the first author. These were used to confirm the currency of the research question and to clarify which aspects of the research would shed most light on the questions. All interviewees chose acknowledgment over anonymity. Interviewees were asked for their views on the research question and asked to comment on three aspects of the ideal type:

1. Whether the data modeler should challenge business requirements (Problem).
2. Whether data modeling is a creative activity (Process).
3. Whether data modeling problems have a single right answer (Product).

Analysis was based on videotape transcription and confirmation; organization of statements by common meanings; organization of meanings into themes, and then into the research framework; synthesis of views and positions; and participant review of the findings. Opinions on the research question, expressed directly and in discussion, varied widely and were summarized in Table 4. Positions were starkly articulated:

• Data modeling is not a process of creation, it is a process of discovery.
• Data modeling is certainly a descriptive activity, it’s not a design activity.
• I believe rabidly and intensely that it’s a design process.
• We’re designing (but) some of the people that we work with see us as scribes.

4.1.1. Problem: are business requirements negotiable?
Proponents of the design characterization answered that business requirements were negotiable, and that modelers should be active in exposing new ways of doing business. One view was that the business does not know what is best for it. Modelers were seen as being able to make suggestions, and to bring in their own business knowledge to provide new perspectives.
Interviewees favoring the description characterization supported the primacy of the business in determining its data model and the danger of the modeler taking that role: “What we’re modeling is what the domain expert says is right”

4.1.2. Process: is data modeling a creative activity?
Proponents of design believed that they were “creating” the objects in the model: “10–15% of entities are obvious and everyone agrees with them, but (beyond that) the actual choice of entities requires a lot of imagination and creativity.”
Some supporters of the description characterization recognized a role for creativity in peripheral areas like the layout and presentation of the model, or in the way of understanding the business.
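
To make the spread of interviewee positions concrete, the counts reported in Table 4 can be tallied by camp. This is a minimal sketch rather than part of the original analysis: the counts come from Table 4, while the grouping label "uncommitted" (for interviewees who supported neither position or whose position depended on the formalism) is an assumption introduced here for illustration.

```python
# (position, number of interviewees, camp); counts are from Table 4.
# The "uncommitted" camp label is a hypothetical grouping, not the authors'.
POSITIONS = [
    ("Strongly supports description", 5, "description"),
    ("Somewhat supports description", 1, "description"),
    ("Supports neither position more strongly than the other", 3, "uncommitted"),
    ("Position depends on modeling formalism/language", 1, "uncommitted"),
    ("Somewhat supports design", 3, "design"),
    ("Strongly supports design", 4, "design"),
]


def tally(positions):
    """Sum interviewee counts per camp."""
    totals = {}
    for _, count, camp in positions:
        totals[camp] = totals.get(camp, 0) + count
    return totals


if __name__ == "__main__":
    totals = tally(POSITIONS)
    print(totals)
    print("Total interviewees:", sum(totals.values()))
```

Under this grouping, six interviewees lean toward description and seven toward design, with four uncommitted, which matches the seventeen interviews reported.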

Fig. 1. Occupation of participants. [Chart; categories: Data Administrator, Data Modeler, Database, Data Warehouse, Enterprise Modeler, Manager, Other, Not Reported.]

Fig. 2. How participants learned. [Chart; horizontal axis: Method; vertical axis: Number of Participants.]

Fig. 3. Experience: number of models. [Chart; horizontal axis: Number of Models; vertical axis: Number of Participants.]

Fig. 4. Experience: years. [Chart; horizontal axis: Years Experience; vertical axis: Number of Participants.]



Table 4
Interviewees’ overall positions.

Position | Number of interviewees
Strongly supports description | 5
Somewhat supports description | 1
Supports neither position more strongly than the other | 3
Position depends on modeling formalism/language | 1
Somewhat supports design | 3
Strongly supports design | 4

4.1.3. Product: one right answer?
This generated strongly conflicting responses. Some who believed that requirements were negotiable were less sure that the models would vary once requirements were settled. Some held that there is a right model, allowing for variation only in notation and the naming of objects.
Those who saw differences because modeling was design spoke in terms of utility vs truth. A few raised the theoretical position of choice in classification, but most who argued against the one right answer drew on personal experience. Instructors noted that students produced different workable models in response to case study scenarios. The difficulty of integrating different databases within and across organizations was evidence that different workable models could be implemented for the same data.
The trade-off between level of generalization and enforcement of business rules was a central theme for those who believed in design. Three groups emerged: the literalists (concepts should be modeled as used in the business); the moderate abstractors (some generalizations) and the rule removers (deliberately removing business rules for representation elsewhere). Some saw these as stylistic preferences of modelers.

4.1.4. Summary
Thought leaders considered the question was important. Further, they were divided about whether they believed that data modeling was design or description. Their perceptions of whether the problems, processes, and products were best characterized as design or description also varied.

4.2. Survey: scope and stages

There were two reasons for this stage. We sought to determine what practitioners meant by data modeling before designing and interpreting survey questions, and sought responses about parts of the database design process to see whether they could individually be characterized as design or description.
Our questionnaire therefore asked participants to nominate the stages in database design, and map 26 elementary activities (such as normalization and definition of indexes) against the stages. The elementary activities served as common reference points for comparing the higher-level stages nominated by different participants. Participants were also asked to define data modeling in terms of the stages that it covered. Respondents were attendees at advanced data modeling classes in London (25) and Los Angeles (30). We found broad agreement on the scope, stages, and activities considered to be data modeling activities.
After consolidation of names, five stages accounted for 201 (83%) of the total of 243 activities cited: (1) Business Requirements Analysis; (2) Conceptual Data Modeling; (3) Logical Data Modeling; (4) Physical Data Modeling and/or Physical Database Design; and (5) Post-database-design Activities (optional). The two tasks in (4) were found to contain essentially the same activities. A stage simply named “Data Modeling” was listed on nine occasions and was in the same place in the sequence as “Logical Data Modeling”. Forty-one (75%) responses matched this overall pattern, and a further nine (16%) matched the pattern except for the omission of “Business Requirements Analysis”.
Eighty percent of responses put the activities of data modeling and responsibilities of the data modeler as (at least) the specification of an initial conceptual schema, to meet agreed business requirements, prior to any modifications to improve performance. Thus there was a broad consensus on the overall composition and sequence of the database design process. The description/design debate was thus unlikely to be a consequence of different definitions of data modeling.

4.2.1. Questions measuring stages against the ideal type ‘Design’
The data modeling scope and stages survey provided answers to the following questions.
Question 1 – Is a business requirements stage (not including entity, relationship, attribute identification) a necessary part? Including this stage prior to identifying key model components runs counter to the descriptive characterization when the UoD is mapped directly onto the data model. Requirements statements can be seen as problem statements to which the model provides a solution. A business requirements stage was nominated by 84% of our respondents of whom 46% saw it as the responsibility of the data modeler (solely or jointly); 65% the analyst and 39% the user.
Question 2 – Is entity/relationship/attribute identification part of the business requirements stage? If these are established before data modeling starts, then the data modeler cannot “create” them, and the description characterization is supported. Only 5% of respondents included identification of entities, relationships, and attributes in a business requirements stage.
Question 3 – Are there separate stages for DBMS-independent (conceptual) modeling and DBMS-specific (logical) modeling? Data modeling can be seen as an implementation-independent descriptive stage followed by transformation to a logical data model, supporting the descriptive characterization. In contrast, designers are constantly conscious of the implementation environment or “medium”. If data modeling is design, we would expect the two stages to blur and often combine. Respondents grouped most conceptual schema specification tasks into one stage, generally called Logical Data Modeling, with only entity and relationship identification being part of a Conceptual Data Modeling stage. Respondents effectively used the term conceptual modeling to describe a preliminary “sketch plan” and not a rigorous and complete product for mechanical translation into a conceptual schema. This is in line with practitioner terminology.
Question 4 – Where does view integration happen? In the descriptive characterization, where there is one right answer, view integration is relatively simple. In the design characterization, models may differ in complex ways and their integration becomes a process of negotiation. View development and integration was a median task in the sequence of the tasks classed as data modeling and it was considered ongoing rather than a discrete terminal task. No respondent nominated it as a discrete stage.
Question 5 – Where is external schema specification located and who is responsible? One approach to database design uses external schemas (views) to replicate user views. In the descriptive characterization, these are mappings from the conceptual schema that originally integrated them and the data modeler will be responsible for their definition. If, instead, external schemas are a tool for managing data independence, programming needs, and security rather than reproducing user views, we would expect them to be defined later in the process. This is indeed what we found and this activity was seen to be outside the primary responsibility of the data modeler.

4.2.2. Summary
In questions asked specifically about important aspects of the process that could illuminate whether data modeling is design or description, practitioners offered opinions that were consistent with the design type.

4.3. Survey: espoused positions on data modeling

We surveyed attendees at a one-day advanced data modeling seminar at an international practitioner convention in an attempt to determine their position on the description/design dichotomy by asking them two questions. We asked firstly, an open question: What is data modeling? and secondly, a closed question: Which better describes data modeling (a) Describing the data requirements of an organization or part of an organization? or (b) Designing data structures to meet the requirements of an organization or part of an organization? 93 respondents answered both questions. The questions were given after participants had completed a data modeling task developing a model from a business scenario. Participants were told: “We are referring to data modeling to support the development of a relational database; not enterprise data modeling or reverse-engineering”. They were not shown the closed question until after they answered the open question.
Two researchers coded the responses neutral, somewhat, or strongly for the question depending on the level of support for either the design characterization (coded as 4 & 5) or the description characterization (coded as 1 & 2). The distribution of coded responses to the open question is shown in Fig. 5. Neutral was sub-divided into both or neither (3 and 0 respectively). Inter-coder reliability (α) was 0.82 (0.7 was considered acceptable). Responses to the closed question are shown in Fig. 6. The word design (as a verb) was used in only six responses to the open question. Only 17% of responses to the open question did not embody a position. There was no significant correlation between responses and experience, method of learning, or job position.

Fig. 5. Coded responses to open question. [Chart; horizontal axis: Response (Neither, Strong Description, Somewhat Description, Both, Somewhat Design, Strong Design); vertical axis: Number of Participants.]

Fig. 6. Responses to closed question. [Chart; horizontal axis: Response (Description, Design, Both); vertical axis: Number of Participants.]

Fig. 7 compared responses to the open and closed questions: the vertical axis shows the break-up of responses to the Closed Question for the participants who gave each of the possible (coded) responses to the Open Question. Thus, participants favoring the design characterization in the open question mostly maintained that view in the closed question, but a significant number of participants whose open question answers supported description reversed it in the closed question, providing only a moderate correlation (K = 0.34, p = 0.007) between the open and closed questions when both and neither were excluded. This difference suggested that some responses to the open question may have been influenced by taught definitions of data modeling (which favor a description characterization) whilst the closed question demanded some reflection.

Fig. 7. Source of responses to closed question. [Chart; frequency of each closed-question response (Description, Design, Both) for each coded open-question response (0–5, NA).]

A facilitated discussion followed response collection. A show of hands reporting closed question answers caused surprise. The discussion which followed established that many participants had expected their own response to dominate. A second show of hands showed a close-to-unanimous view that the design/description distinction was real and important.

4.3.1. Summary
Data modeling practitioners espoused beliefs that data modeling was description in response to the open question and were evenly split between description and design in response to the closed question. The practitioners confirmed that researching the design/description dichotomy was of importance.

4.4. Survey: characteristics of data modeling

This part of our survey sought a deeper understanding of theory-in-use by addressing practitioners’ perceptions of characteristics of data modeling problems, products and processes to address RSQs 4, 6, and 7 (see Table 3). The 25 questions shown in Appendix C used a five-point Likert scale.
The survey was benchmarked using architects and accountants. They were chosen because architecture is a design discipline whereas accounting is a process of recording, classifying, reporting and communicating, and is thus a descriptive characterization. Data modelers have been compared with both architects and accountants. Responses were received from a snowball sample of 38 accountants and 21 architects, all based in Australia. The results for these two professions were then used to benchmark the results for data modeling against them.
The survey was administered to 266 attendees at seven seminars targeting data modeling practitioners in the USA, Australia, UK, and Scandinavia (the smallest 20 and the largest 90). Participants were told, before completing the survey: “We are referring to data modeling to support the development of a relational database; not enterprise data modeling or reverse-engineering”.
No significant differences were found in the results across the seminars. Scale reliability (Cronbach’s α) was 0.73. The Corrected Item-Total Correlation (CITC) was positive (showing that the questions were measuring the same underlying construct in the same direction) for all but one item. The exception was: “Data modeling is prescriptive rather than descriptive” – this had a CITC value of 0.15. Subsequent discussion has suggested that some respondents had interpreted “prescriptive” as applying to the modeling process rather than product (paradoxically supporting a descriptive characterization).
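
For readers unfamiliar with the two reliability statistics named above, the following sketch shows how Cronbach's α and a corrected item-total correlation are conventionally computed. It is an illustration under stated assumptions, not the authors' analysis: the function names are hypothetical, the textbook formulas are used (sample variances, Pearson correlation of each item with the sum of the remaining items), and the Likert responses in the demo are invented.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of respondent scores per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]       # total score per respondent
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def corrected_item_total(items, index):
    """Correlate one item with the sum of all *other* items (CITC)."""
    rest = [sum(s for j, s in enumerate(row) if j != index) for row in zip(*items)]
    return pearson(items[index], rest)


if __name__ == "__main__":
    # Invented five-point Likert responses: three items, five respondents.
    demo = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [4, 5, 3, 1, 2]]
    print("alpha:", cronbach_alpha(demo))
    for i in range(len(demo)):
        print(f"CITC item {i + 1}:", corrected_item_total(demo, i))
```

An item that runs against the rest of the scale, like the third item in the invented data, produces a low or negative CITC, which is the kind of signal that singled out the "prescriptive rather than descriptive" question.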

Table 5
Summary of responses to the data modeling questionnaire.

Design mean (overall): 3.75, t(267) = 31, p < 0.0005

Dimension / Property | Mean | Test statistic
A. Design problems | 4.11 | t(317) = 37, p < 0.0005
  1. Design problems cannot be comprehensively stated | 4.09 | t(330) = 31, p < 0.0005
  2. Design problems require subjective interpretation | 4.08 | t(335) = 31, p < 0.0005
  3. Design problems tend to be organized hierarchically | 4.18 | t(342) = 24, p < 0.0005
B. Design products | 3.60 | t(302) = 23, p < 0.0005
  1. There are an inexhaustible number of different solutions | 4.04 | t(339) = 23, p < 0.0005
  2. There are no optimal solutions to design problems | 3.90 | t(340) = 21, p < 0.0005
  3. Design solutions are often holistic responses | 2.55 | t(334) = 6.4, p < 0.0005
  4. Design solutions are a contribution to knowledge | 3.94 | t(320) = 19, p < 0.0005
C. The design process | 3.49 | t(299) = 16, p < 0.0005
  1. The process is endless | 4.00 | t(342) = 23, p < 0.0005
  2. There is no infallibly correct process | 3.34 | t(337) = 4.9, p < 0.0005
  3. The process involves finding as well as solving problems | 4.00 | t(341) = 18, p < 0.0005
  4. Design inevitably involves subjective value judgments | 3.65 | t(331) = 10, p < 0.0005
  5. Design is a prescriptive activity | 2.66 | t(323) = 6.7, p < 0.0005
  6. Designers work in the context of a need for action | 3.35 | t(337) = 5.1, p < 0.0005
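
The t statistics in Table 5 come from one-sample t-tests of each mean against the neutral Likert score of 3. As a hedged illustration of that calculation (not the authors' code), the sketch below computes the statistic and its degrees of freedom for a list of scores; the function name `one_sample_t` and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

NEUTRAL = 3  # midpoint of the five-point Likert scale


def one_sample_t(scores, mu=NEUTRAL):
    """Return (t, degrees of freedom) for H0: population mean == mu."""
    n = len(scores)
    standard_error = stdev(scores) / sqrt(n)   # sample SD / sqrt(n)
    return (mean(scores) - mu) / standard_error, n - 1


if __name__ == "__main__":
    # Invented responses from five participants to one questionnaire item.
    t, df = one_sample_t([4, 4, 5, 3, 4])
    print(f"t({df}) = {t:.2f}")
```

Because the degrees of freedom equal the number of responses minus one, a reported value such as t(267) indicates that 268 usable responses contributed to that mean.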

Table 5 shows the mean scores (maximum score of 5) for the survey at the Property, Dimension, and Overall levels. The one-sample t-test results indicate the significance of the difference between each mean and the neutral score of 3. Fig. 8 shows the frequency distribution of the Overall score; most values were above the neutral value.
These results show that modeler-espoused characteristics fit a design characterization. Data modelers also scored significantly higher than accountants in all dimensions (p < 0.01), and significantly higher than architects in the problem dimension (p < 0.01) and overall (p = 0.02).

4.4.1. Summary
The results clearly showed that participants perceived the problems, products, and processes of data modeling as fitting the design ideal type.

4.5. Laboratory study: diversity in conceptual models

Our laboratory study examined product diversity in the conceptual data models developed by experienced data modelers for a real-world problem.
The task was to develop a conceptual model for a medical research database from a description of a real business requirement. Participants were attendees at an international (practitioner-oriented) data management conference in North America; they viewed a video of the project sponsor and data administrator describing the requirements, and were given a transcript of it (see Appendix D). Participants were then given 25 min to complete the task and a further 5 min to complete a questionnaire about the process. Ninety-three models were received. Forty-nine participants responded that they understood the problem fairly well or very well, did not find it very difficult, made no guesses or only trivial guesses, and did not think their models would be much different if more time was allowed.
The first author judged 66 of the models to be workable; these were submitted by the more experienced modelers (8.5 years vs 3.5 years; two-tailed t(87) = 4.5, p < 0.001), who found the problem less difficult (difficulty rating on a 5-point Likert scale 2.6 vs 3.2; two-tailed t(89) = 3.4, p = 0.001). 88% used some variant of the "crow's foot" notation.
A reference set of standard entity names and definitions was synthesized to facilitate comparison of models.

4.5.1. Assessment of diversity
Seven measures of diversity were used. These are neither orthogonal nor exhaustive but are indicative of diversity.
Diversity measure no. 1 – Participants' perceptions of difference: On completing their model, participants paired off and compared their models. One percent of participants perceived the models as identical, six percent as identical except for naming or agreed errors, 53% as structurally different in minor ways, and 40% as structurally different in important ways.
Diversity measure no. 2 – Number of entities: Fig. 9 shows the frequency distribution of entity counts from the models. Subtypes were excluded from the count to improve comparison with the 77% of models which did not use subtypes.
Diversity measure no. 3 – Variety of entity names: The 93 models had 291 different entity names after removing synonyms. This count includes some unrecognized synonyms as well as some homonyms, whose different uses were evident from the context of relationships with other entities.
Diversity measure no. 4 – Use of nouns from the description: The frequency distribution of entity names matching nouns from the interview transcripts is shown in Fig. 10. Of the 291 different names given to entities, comparatively few came directly from the problem description; the rest had been invented by the modeler.
Diversity measure no. 5 – Variability in construct use: One concept was represented in some models as an entity (52 times) and in others as a relationship (5 times). Three concepts were shown in some models as entities and in others, explicitly or implicitly, as attributes. Correlation between the three decisions was negligible in two cases and weakly positive but not significant in the third (K = 0.14, p = 0.2).

Fig. 8. Frequency distribution of overall score – 3 is neutral. [Histogram: Overall Score (2.9 to 4.5) against Frequency.]
Fig. 9. Total number of entities in each model. [Histogram: Number of Entities (3 to 18) against Frequency.]

Fig. 10. Number of entities corresponding to nouns in the problem description. [Histogram titled "Frequency of Nouns": Entity Names Matching Nouns (0 to 7) against Frequency.]

Diversity measure no. 6 – Level of entity generalization: Three concepts were represented at different levels of generalization, though there were significant correlations between the three generalization decisions, suggesting that modelers bring personal styles to the generalization decision (0.42 ≤ γ ≤ 0.77, p < 0.02).
Diversity measure no. 7 – Holistic difference (expert assessed): Nineteen experts assessed ten selected standardized models for the viability of their implementation. Standardization involved providing a common name for the same entities, removing entities outside the defined scope, and presenting the models in a common format. Diversity was supported if more than one model was judged as being practically viable. The expert modelers (each with a minimum of 15 years' experience) gave scores for overall quality, understandability, and flexibility on a 5-point Likert scale. The inter-rater reliability, measured by Cronbach's α, was 0.92 for overall quality, 0.84 for understandability, and 0.73 for flexibility (0.7 would be considered acceptable). The level of understandability across all evaluators and models was 3.53 (s = 0.47), placing it between "neither easy nor difficult to understand" and "reasonably easy to understand".
Fig. 11 shows, for each model, the number of experts who assessed overall quality as 3 (the mid-point of the scale: "application would work with no serious problems") or more, and the number who scored it at or above the benchmark (equal to or higher than "an average model that they would expect to encounter in their work, developed in the last ten years by someone other than themselves"). Thus, between three and five models were acceptable to the majority of these experts. Evaluators were also asked to nominate the best model overall; four different models were selected.

4.5.2. Summary
The results demonstrated a diversity of objectively assessed workable solutions to the same problem, answering RSQ8. The diversity observed was consistent with the design characterization.

4.6. Laboratory study: diversity in logical models

Our laboratory study also examined product diversity in the logical data models developed by experienced data modelers for a real-world problem.
The task involved developing a logical data model to serve as the specification for a relational database from a list of 22 attributes that the user wished to record for a real business application. The attributes were presented as a single table/relation (see Appendix E). Participants were attendees at advanced data modeling seminars: for measures 1–4 these were attendees in London, U.K., and Pittsburgh, U.S.A. (40 participants in total); for measure 5 the participants included a further 58 at two other seminars.
All but two of the models produced supported the data specified by the original table and were thus workable. All but three models were fully normalized. For the few solutions that did not provide a full list of columns, straightforward assumptions were made about the columns in each table. Consistent application of these assumptions may have led to less diversity than if this task had been completed by the modelers themselves.

4.6.1. Diversity measures
Diversity measure no. 1 – Participants' perceptions of difference: Participants paired off and compared models. No participant reported the two models as identical, nine percent reported the models as identical except for naming or agreed errors, 39% as structurally different in minor ways, and 52% as structurally different in important ways.
Diversity measure no. 2 – Number of tables: Fig. 12 shows the frequency distribution of table counts in the models.
Diversity measure no. 3 – Variety of table names: After consolidation of obvious synonyms, the 39 models contained 66 different table names.
Diversity measure no. 4 – Construct variability: Across the 39 models, seven concepts were represented in more than one way: as tables in some models and as columns in others (see Appendix E).
Diversity measure no. 5 – Generalization: Examination of the models revealed five different generalization decisions: four produced a single column from two or more attributes and one changed a column's name to increase consistency with other columns (see Appendix E). A score was then calculated for each participant by totaling the number of these decisions taken by the participant. Fig. 13 shows the frequency distribution of the scores.
With one exception, generalization scores were not significantly correlated with the standard demographic groupings or with responses to the process questions. The sole significant correlation was that participants who had developed more than one model in practice had significantly higher generalization scores.

4.6.2. Summary
The diversity in logical models is evident from the results (RSQ9).
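
The generalization score used for the logical models (diversity measure no. 5) is a simple tally of the five yes/no generalization decisions. A minimal sketch, assuming each participant's model has already been coded as five booleans (the data layout, function names, and sample responses are hypothetical):

```python
from collections import Counter

N_DECISIONS = 5  # five generalization decisions in the logical modeling task


def generalization_score(decisions):
    """Count the generalization decisions a participant took (True = taken)."""
    if len(decisions) != N_DECISIONS:
        raise ValueError(f"expected {N_DECISIONS} decisions")
    return sum(1 for taken in decisions if taken)


def score_distribution(all_decisions):
    """Frequency of each score from 0 to N_DECISIONS across participants."""
    counts = Counter(generalization_score(d) for d in all_decisions)
    return {score: counts.get(score, 0) for score in range(N_DECISIONS + 1)}


# Made-up responses for three hypothetical participants
sample = [
    (True, True, True, True, True),
    (False, False, False, False, False),
    (True, False, True, False, False),
]
distribution = score_distribution(sample)
```

A distribution of this kind is what a histogram of generalization scores would plot, with one bar per score from 0 to 5.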

Fig. 11. Model acceptability. [Bar chart titled "Acceptability of Models": Number of Evaluations for each Model (1 to 10), with series ">=Benchmark" and ">=3".]
Fig. 12. Total number of tables in each model. [Histogram: Number of tables (1 to 9) against Frequency.]

Fig. 13. Frequency distribution of generalization scores. [Histogram titled "Frequency of Generalization Scores": Generalization Score (0 to 5) against Frequency of Score.]

4.7. Laboratory study: style in data modeling

Our laboratory study also examined whether modelers consistently favored higher or lower levels of generalization within and between models.
The task was to develop two models. As in the prior section, generalization scores were calculated for each model. These were then used to determine the consistency of decisions within each model, and the correlation of the scores between the two models. Participants were attendees at advanced data modeling seminars in the USA (two seminars) and in Stockholm, Sweden: a total of 91 participants. Three modeling problems were used, with each participant being assigned two. Two of the problems required the development of a conceptual data model from a text description and one required the development of a logical data model (see Appendix F).
Models that omitted any of the constructs were excluded from analysis. Each identified construct in each model was coded according to its level of generalization, using '0' for the lowest level, '1' for the next-lowest, and so on. A total generalization score was determined by adding the individual levels coded.
Four concepts in the Bank Loans and three concepts in the Family Tree solutions were identified as being subject to different levels of generalization. In each case, only two levels were found. With one exception, the decisions were logically independent. Appendix F contains the frequencies with which each decision was used. Recall from the previous section that there were five generalization decisions in the logical data-modeling task (see "diversity measure no. 5").
Combinations of the decisions resulted in ten versions of the Bank Loans model and five versions of the Family Tree model; the most popular in each case accounted for 50% of the models. Tables 6 and 7 show the correlation between each pair of generalization decisions.

Table 6
Bank loans generalization decisions.
                   | Party             | Party relationship | Transaction
Customer           | 0.77 (p < 0.0005) | 0.58 (p < 0.0005)  | 0.27 (p = 0.03)
Party              |                   | 0.75 (p < 0.0005)  | 0.28 (p = 0.03)
Party relationship |                   |                    | 0.36 (p < 0.0005)

Table 7
Family Tree generalization decisions.
             | Relationship      | Parenthood
Person       | 0.39 (p < 0.0005) | 0.37 (p < 0.0005)
Relationship |                   | 0.88 (p < 0.0005)

Covariance, measured using the Kuder-Richardson 20 (KR20) coefficient, was 0.80 for the Family Tree decisions, 0.75 for the Bank Loans decisions, and 0.68 for the logical data modeling task generalization decisions.
Generalization scores for the Family Tree and Bank Loans models were moderately positively correlated. The gamma statistic (γ = 0.69, p < 0.0005) showed a strong correlation. Thus, between-model generalization correlation was supported for the two conceptual models. There was some correlation with demographic categories, so the analysis was repeated for each of them. Table 8 shows that the correlation remained moderate and significant (p < 0.005) within all but one group.

Table 8
Correlation between generalization scores across the conceptual models.
Demographic group               | γ    | p
>10 models produced             | 0.54 | <0.0005
≤10 models produced             | 0.27 | 0.32
Occupation = data modelers      | 0.52 | <0.0005
Occupation = non data modelers  | 0.62 | 0.001
>6 years experience             | 0.56 | 0.001
<6 years experience             | 0.49 | 0.003
Total sample                    | 0.51 | <0.0005

No significant correlation was found between the generalization scores for the Annual Budget (logical) model and either the Family Tree or Bank Loans conceptual models (Family Tree: γ = 0.26, p = 0.46; Bank Loans: γ = 0.53, p = 0.86).

4.7.1. Summary
Some modelers consistently choose higher (or lower) levels of generalization than others, within both conceptual and logical models and across conceptual models. This bias (or style) is not due to their level of experience (and, by implication, expertise). Consequently, RSQ9 was answered in the affirmative, supporting the conclusion that the products of data modeling were influenced by differences in style of the practitioners that produce them.

5. Summary

Our findings were based on the four key dimensions of our design framework: general, problem, process and product. The general dimension includes scope, importance and beliefs. Data modeling was found to consist of the specification of the initial conceptual schema to meet the business requirements prior to any performance tuning (RSQ1). The research question was considered to be important (RSQ2). Data modeling practitioners were evenly divided between the belief that data modeling was design and the belief that it was description (RSQ3).
Data modeling problems were seen as having the characteristics of design problems by data modeling practitioners, significantly more so than by architects and accountants (RSQ4). Data modelers generally worked from a problem statement rather than directly from observations of the UoD (RSQ4).
The data modeling process was perceived as having the characteristics of design processes, similar to the perceptions of architects and significantly more than the perceptions of accountants (RSQ6). Consistent with the design characterization, identification of entities, relationships, and attributes was not considered to be part of the business requirements analysis. Furthermore, there was no evidence of the widely advocated "view-definition, view-integration, view-reconstruction" sequence, which requires that model differences can be reduced to reconcilable views (RSQ5).
Data modeling products were perceived to be design products by data modeling practitioners, significantly more than perceptions of accountants, and similar to perceptions of architects (RSQ7). Conceptual data models and logical data models developed in response to a common problem were found to have substantial diversity (RSQ8, RSQ9). Data modelers frequently re-used their own or others' patterns, significantly more than architects. Experienced conceptual data modelers re-used patterns much more than less-experienced data modelers (RSQ10). A significant correlation was found between the levels of generalization of entities within and between conceptual data models developed by the same modeler. This suggested that personal style, evidenced by generalization decisions, affected the data models that modelers produce (RSQ11).

6. Discussion and conclusions

Answers to our research sub-questions suggested that data modeling, while traditionally characterized as description, was better characterized as design, based on how it was practiced. We focused entirely on the design/description question and produced consistent evidence in favor of the design characterization.
For researchers, there are two implications. First, data modeling experiments must be carefully designed to take into account the likelihood of alternative solutions. Second, generalization of the results of empirical studies that use students as participants is problematic, because design skills take time to develop.
Data modeling teachers should treat data modeling as a design activity. Designing data modeling tasks by articulating a domain based on nouns and verbs that relate to the entity types and relationship types is a way to teach data modeling notation, but it does not teach the practice of data modeling.
Practitioners should be aware that their method is more consistent with data modeling as a design activity. They should be aware that alternative data modeling solutions may be useful and need to be evaluated for quality as part of the process.

Appendix A. Summary of participants

Research samples and research components.


Location | Research component | Number | Response rate (%)
DAMA/Metadata Conference, San Antonio, TX, USA | Diversity in conceptual modeling; Espoused positions on data modeling | 112 | 66%
DAMA Conference, London, UK | Diversity in logical modeling | 17 | 85%
Enterprise Data Forum, Pittsburgh, PA, USA | Diversity in logical modeling | 23 | 80%
DAMA/Metadata Conference, Orlando, FL, USA | Data modeling style | 41 | 77%
DAMA Chapter Presentation, Portland, OR, USA | Characteristics of data modeling | 54 | 90% (b)
DAMA Chapter Presentation, Phoenix, AZ, USA | Characteristics of data modeling | 28 | 90% (b)
DAMA Chapter Presentation, Des Moines, IA, USA | Characteristics of data modeling | 39 | 90% (b)
IRM Data Modeling Workshop, Stockholm, Sweden | Data modeling style and Diversity in logical modeling (a) | 28 | 70% (b)
IRM/DAMA Conference, Stockholm, Sweden | Characteristics of data modeling | 70 (c) | 90% (b)
DAMA Chapter Presentation, Sydney, Australia | Characteristics of data modeling | 20 | 90% (b)
DAMA/Data Quality Conference, London, UK | Characteristics of data modeling; Scope and stages | 25 | 83%
Wilshire Conferences Data Modeling Masterclass, Los Angeles, CA, USA | Characteristics of data modeling; Scope and stages; Data modeling style and Diversity in logical modeling (a) | 30 | 86%
Total | | 459 | 75%

(a) The Diversity in Logical Modeling task was incorporated in the Data Modeling Style task in these two locations.
(b) Estimate – exact attendee numbers not available.
(c) This group included 28 who attended the previous item.

Appendix B. List of participants in the thought leaders interviews

The participants, and their positions or roles at the time of interview were:

• Peter Aiken, data management consultant, Associate Professor at Virginia Commonwealth University.
• Richard Barker, company director, architect of the Oracle CASE tool.
• Michael Brackett, President of the International Data Management Association.
• Harry Ellis, data modeling consultant to the British Department of Defence.
• Larry English, leading proponent of data quality techniques.
• Terry Halpin, Professor at Northface University Utah.
• David Hay, independent data modeling consultant and educator.
• Steve Hoberman, global reference data manager with Mars, Inc.
• Karen Lopez, data modeling consultant and commentator.
• Dawn Michels, data modeling specialist, Vice President of Chapter Services for DAMA International.
• Terry Moriarty, president of Inastrol data modeling consultancy.
• Ronald Ross, editor of the Database Newsletter for 22 years.
• Robert Seiner, data management consultant.
• Alec Sharp, independent data and process modeling consultant.
• Len Silverston, data modeling consultant, industry educator.
• Eskil Swende, Chief Executive of the IRM group. President of the Scandinavian chapter of the Data Management Association.
• John Zachman, industry consultant and educator.

Appendix C. Survey questions – perceptions of characteristics of data modeling


Properties of design organized as a set of questions.
Design / Dimension / Property / Survey question (Q = question number)

Overall
  A. Design problems
    1. Problems cannot be comprehensively stated
      Q2. Data modeling problems are often full of uncertainties about objectives and relative priorities
      Q3. Many requirements do not emerge until some attempt has been made at developing a model
      Q4. Objectives and priorities are likely to change during the modeling process
    2. Problems require subjective interpretation
      Q5. In establishing requirements for a data model, something that seems important to one data modeler may not seem important to another data modeler
      Q6. In establishing requirements for a data model, something that seems important to one business stakeholder may not seem important to another business stakeholder
    3. Problems tend to be organized hierarchically
      Q7. Modeling problems are often symptoms of higher level problems
  B. Design products
    1. There are an inexhaustible number of different solutions
      Q9. Most data modeling problems do not have a single correct solution
      Q10. In most practical business situations, there is a wide range of possible (and workable) data models
    2. There are no optimal solutions to design problems
      Q11. Data modeling almost invariably involves compromise
      Q12. Data modelers will almost invariably appear wrong in some ways to some people
    3. Design solutions are often holistic responses
      Q13. It is not usually possible to dissect a data model and identify which piece of the model supports each piece of the business requirements
    4. Design solutions are a contribution to knowledge
      Q14. I frequently re-use patterns (structures) from other data models that I have developed myself
      Q15. I frequently re-use patterns (structures) that I have seen in models developed by others
  C. The design process
    1. The process is endless
      Q17. Identifying the end of the data modeling process (i.e. when to stop modeling) requires experience and judgment
    2. There is no infallibly correct process
      Q16. There is no infallible correct process that (if properly followed) will always produce a sound data model
    3. The process involves finding as well as solving problems
      Q21. Data modeling requires a high level of creative thinking
    4. Design inevitably involves subjective value judgments
      Q23. I find it difficult to remain dispassionate and detached in my data modeling work
    5. Design is a prescriptive activity
      Q24. Data modeling is prescriptive rather than descriptive
    6. Designers work in the context of a need for action
      Q25. The final data model is often a result of compromise decisions made on the basis of inadequate information
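
The questions in this table roll up into Property, dimension, and overall Design scores. The following is a minimal sketch of that roll-up, assuming an unweighted mean at every level (the source states only that means were taken to obtain Property scores; the use of means at the higher levels, the labels such as `A1`, and the function name are assumptions for illustration):

```python
from statistics import mean

# Question numbers per Property, grouped by dimension (from the table above).
# Labels such as "A1" are shorthand for dimension A, Property 1.
FRAMEWORK = {
    "Problem": {"A1": [2, 3, 4], "A2": [5, 6], "A3": [7]},
    "Product": {"B1": [9, 10], "B2": [11, 12], "B3": [13], "B4": [14, 15]},
    "Process": {"C1": [17], "C2": [16], "C3": [21],
                "C4": [23], "C5": [24], "C6": [25]},
}


def design_scores(answers):
    """Roll question ratings up to Property, dimension, and overall scores.

    answers maps question number -> Likert rating (1 to 5).
    Returns (property_scores, dimension_scores, overall).
    """
    property_scores = {}
    dimension_scores = {}
    for dimension, properties in FRAMEWORK.items():
        for prop, questions in properties.items():
            property_scores[prop] = mean(answers[q] for q in questions)
        dimension_scores[dimension] = mean(
            property_scores[prop] for prop in properties
        )
    overall = mean(dimension_scores.values())
    return property_scores, dimension_scores, overall


# Made-up respondent: neutral everywhere except strong agreement on Q2-Q4
all_questions = [q for props in FRAMEWORK.values() for qs in props.values() for q in qs]
answers = {q: 3 for q in all_questions}
answers.update({2: 5, 3: 5, 4: 5})
props, dims, overall = design_scores(answers)
```

Averaging Properties rather than raw questions gives each Property equal weight within its dimension, regardless of how many questions it has; averaging the questions directly would be a different (equally plausible) reading.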

The 19 Characteristics at the lowest level were derived from concepts in the descriptions of the Properties, and were operationalized as questions that could be scored on a Likert scale (the numbers, complete with gaps in the sequence, are the numbers of the corresponding questions in the resulting questionnaire). Question scores were averaged to provide a score for the higher level Property. The Property scores were then combined to provide Problem, Product, and Process scores, and ultimately an overall Design score.
There was some subjectivity in the identification of these concepts and the framing of the questions. There was no question addressing the Property (of design products) "Design solutions are parts of other design problems". This Property proved difficult to communicate in a simple question or questions and after pilot testing it was excluded. In all, six questions addressed problem, seven addressed product, and six addressed process.
Five further questions were added based on other differences between description and design. These were classified under their relevant dimensions.

Questions added to characteristics of data modeling survey.
Additional survey question | Dimension
8. Business requirements are often negotiable | Problem
18. When I am developing a data model, I sometimes produce more than one workable solution, and then choose the best one | Process
19. I often start modeling before I have a thorough understanding of business requirements | Process
20. Sometimes, even when I understand the business requirements, I find it difficult to produce a data model | Process
22. I have experienced "eureka" moments (sudden and dramatic insights or solutions to problems) in my data modeling work | Process

A further question was added as Question 1 in the survey to determine whether the most difficult part of data modeling was in understanding the business requirements. It served three purposes:

(1) To answer the question: are requirements fixed or negotiable? If requirements are negotiable, but perceived as fixed by some modelers (or vice versa), we would expect those modelers to find the task difficult.
(2) To determine whether perceived difficulty in understanding requirements correlated with other indicators of design. Incompleteness, subjectivity, and negotiability of requirements are cited as properties of design; if the task is essentially descriptive, then gaining an understanding of it is the central (and most difficult) task.
(3) In eliciting a deeper understanding of either description or design positions, to reduce the possibility that respondents would recognize the dichotomy behind the questions and answer accordingly. The question did not signal the dichotomy. It was placed first on the questionnaire.

The model was developed specifically for our research, in the absence of established measures for differentiating description and design activities. We were obliged to rely solely on the soundness of the underlying theory (and on its operationalization) when drawing conclusions from the results.
Questions were adapted, through minor re-wording, to enable them to be used with two other professional groups, viz. architects and accountants. The two groups were chosen because:

(1) Architecture is generally recognized as a design discipline and is frequently employed as a metaphor for IS tasks and deliverables.

(2) Accounting is a process of recording, classifying, reporting and A couple of the other problems we have are the need for a central
identification number that we need to generate: we can’t use (for instance)
communicating, a definition consistent with the descriptive
a Medicare number or social security number because of privacy issues.
paradigm. Data modelers have in fact been compared with Another one of the problems that we have is that a lot of these surveys are
accountants: ‘‘Just as an accountant might use a financial used multiple times – two, sometimes three times. So it’s the ability to be
model, the analyst can develop an entity model’’. able to collect data on the third survey, linking it up with the same patient
that we used for the first survey.
To encourage a focus on common tasks, the accountants’ So if you were to participate in the study, you would come into your
ante-natal visit and with the help of staff fill out (say) four or five
questions were framed in the context of preparing a set of accounts questionnaires asking you about your mood and how you’re feeling.
for a business and the architects’ questions in the context of You would then answer the same questionnaires again at your post-natal
designing a building. visit, and the reason we have the same questionnaires again is just to see
how the mood and the response has changed over a period of time.
Appendix DLaboratory materials – diversity in conceptual
modeling
Instructions and a set of ‘‘process’’ questions addressing
The problem to be analyzed was presented to the participant in assumptions, level of difficulty and use of patterns (common to
three parts: all modeling exercises used in our research) were added to the
standard demographics questionnaire.
(1) A videotaped description of the business requirements as
recorded by the project director and also by the manager Appendix ELaboratory materials – diversity in logical modeling
responsible for managing the production system. The two
stakeholders were responding independently to our request to The task was to produce a logical model based on the
tell us about a project and the data that was needed to run it. conceptual model (see Fig. E1). The logical model needed to be
(2) A verbatim transcript of the videotape (see below), with a short a workable specification for a database: a single table/relation is
glossary of terms added by the author in consultation with the needed so that data can be stored: it is already normalized. The
project director. quarterly items are not repeating groups; they are different items
(3) A list of questionnaires to be used for data collection, with with different names and meanings.
excerpts from two questionnaires. The task and associated questionnaire were administered to 96
attendees at four advanced data modeling seminars. For some of
the measures of diversity, only the 39 responses from London (a
Postnatal depression interview transcript and glossary. substantial European conference) and Pittsburgh (a substantial
Case study interview transcripts North American conference) were included.
Key Terms (as used in the transcripts):
The last three concepts in Table E1 did not directly reflect
Post-natal or post-partum – after the birth of a baby
Ante-natal – before the birth of a baby (i.e. during pregnancy) columns in the original model but were added by modelers to
Intervention – action taken by a health professional e.g. counseling, capture semantics lost when generalizing some of the original
prescription of drugs. Also used by Prof. Buist (final sentence) to mean columns. Ignoring differing levels of generalization and consider-
‘‘actions taken to educate health professionals and the public about ing only the choices of representing each concept as either a
Post-Natal Depression.’’
Screening – administering a questionnaire (to a woman participating in
column or table resulted in 19 distinct models.
the study) There were five situations in which some participants had:
Professor Anne Buist, Director, National Post-natal Depression Initiative
This project is looking at ante-natal and post-natal depression, and it’s going (a) explicitly generalized two or more of the attributes in the
to run over four years, and cover five states of Australia. It’s being funded by
original model to produce a single column, e.g. Generalizing
Beyond Blue, which is the Australian national depression institute, and it’s
going to cover somewhere between 50,000 and 100,000 women over this Budget First Quarter Material, Budget Second Quarter Material,
time period Budget Third Quarter Material and Budget Last Quarter Material
The data collection is in three kind-of-separate bits: into a single column Quarterly Material Budget plus a Quarter
Firstly across all states we’re going to be screening women at a minimum Number column to identify which quarter the amount applied
of two time points – once through the pregnancy and once post-natally.
And the data we’re collecting there will be the same in each state.
to (Decision 1 in Table E2).or
However there’s also going to be state-specific interventions for these (b) Altered columns to increase consistency: e.g. Replacing Actual
women, and that will be evaluated both pre and post intervention with Total Material with Actual Fourth Quarter Material (Decision 5 in
another set of questionnaires that women or/and the research assistants will be completing. And these may be at up to six different time points in covering through pregnancy and post-partum.
The other sort of aspect of the data collection is before we even start the study and at the end of the study we’re going to be sending questionnaires to both women who have had babies and health professionals (general practitioners, midwives and maternal child health nurses), and evaluating their understanding of post-natal depression with respect to what it is, with respect to stigma and with respect to treatment. And we’ll be evaluating that again after our four-year time period where we’re going to be doing some interventions and in particular increasing awareness of post-natal depression.

Dr Justin Biltza, Project Officer, National Post-natal Depression Initiative
So really there are two types of data that we’re collecting: the first lot being patient demographic data (name, address, date of birth, contact details), and the other set of data is based on a series of questionnaires which are either "short answer" or the selection of a score based on (say) a range from (say) "good" to "bad". There’s approximately forty questionnaires that we’re using, five which form a core key component that everyone in the survey is doing, but then each of the states has a number of individual surveys that they’re using, none of which cross over, so one of the problems that we have is ensuring that we collect all the data on all the patients.

Table E2) to make it consistent with the representation of budgeted amounts and comparable with the other (quarterly) actual material amounts. Although these decisions are not manifested as generalizations, they are based on the recognition of commonality, and thus have been treated together with the explicit generalizations.

Table E2 shows the five situations and the different decisions made by participants. Fig. E2 shows the frequency distribution of the decisions.

Nineteen of the 22 modelers who generalized Budget and Actual amounts (Gen BA) also made the other four generalizations, and this covariance amongst the decisions was supported by a Kuder-Richardson 20 (KR20) statistic of 0.68. Apparently the decisions reflected an underlying concept of propensity to generalize on the part of the modeler. Correlations between individual decisions (φ) ranged from negligible to strong, and were positive in all cases.
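
The KR20 statistic and the pairwise correlations between decisions mentioned above can be computed from a table of yes/no decisions. The Python sketch below is an illustration only: the decision matrix is invented and is not the study's data, the function names `kr20` and `phi` are hypothetical, and the population form of the variance is an assumption, since the paper does not say which form was used.

```python
from math import sqrt


def kr20(responses):
    """Kuder-Richardson 20 for rows of 0/1 scores (one row per modeler)."""
    n = len(responses)
    k = len(responses[0])
    # Proportion of "yes" (1) answers for each decision.
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    item_variance = sum(pi * (1 - pi) for pi in p)
    # Variance of each modeler's total number of "yes" decisions
    # (population variance, dividing by n: an assumption).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    total_variance = sum((t - mean_total) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - item_variance / total_variance)


def phi(x, y):
    """Phi coefficient between two equally long lists of 0/1 values."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denominator == 0:
        raise ValueError("phi is undefined when a decision never varies")
    return (n11 * n00 - n10 * n01) / denominator


# Invented example: 6 modelers x 5 decisions (1 = "Yes", 0 = "No").
decisions = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
]

reliability = kr20(decisions)
# Correlation between the first and third decision columns.
first_vs_third = phi([r[0] for r in decisions], [r[2] for r in decisions])
```

A KR20 value close to 1 indicates that modelers who made one of the decisions tended to make the others as well, which is the sense in which the statistic supports a common propensity to generalize.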

Department Number (Primary key item)
Year (Primary key item)
Approved-By
Budget-First-Quarter-Material
Budget-Second-Quarter-Material
Budget-Third-Quarter-Material
Budget-Last-Quarter-Material
Actual-First-Quarter-Material
Actual-Second-Quarter-Material
Actual-Third-Quarter-Material
Actual-Total-Material
Budget-First-Quarter-Labor
Budget-Second-Quarter-Labor
Budget-Third-Quarter-Labor
Budget-Last-Quarter-Labor
Actual-First-Quarter-Labor
Actual-Second-Quarter-Labor
Actual-Third-Quarter-Labor
Actual-Total-Labor
Budget-Other
Actual-Other
Discretionary-Spending-Limit

Fig. E1. Annual Budget conceptual model.

Table E1
Alternative representations of concepts.

Concept              Not present  As column  Literal table  Generalized table  Generalized table  Total
                                                            (in scope)         (beyond scope)
Approved by          0            13         7              1                  18                 39
Department           0            4          34             1                  0                  39
Disc spending limit  0            27         12             0                  0                  39
Year                 0            26         6              7                  0                  39
LMO-type             14           7          18             0                  0                  39
Quarter              10           13         8              7                  1                  39
BA-type              28           7          4              0                  0                  39

Table E2
Generalization choices in the logical data models.
(Columns: Decision number, Decision name, Yes, No, Both)

Decision 1. Gen QTR (generalization decision)
  Yes:  Quarterly amount columns generalized – no columns specific to a particular quarter.
  No:   Columns for individual quarters
  Both: N/A

Decision 2. Gen LMO (generalization decision)
  Yes:  Labor, material, other generalized – no columns specific to a particular type
  No:   Specific columns for Labor, Material, Other amounts
  Both: N/A

Decision 3. Gen BA (generalization decision)
  Yes:  Budget and actual columns generalized – no columns specific to a particular type.
  No:   Specific Columns for Budget and Actual amounts
  Both: N/A

Decision 4. Other QTR (consistency decision)
  Yes:  Support for quarterly values for "Other" amounts – no column for annual amount
  No:   Support only for annual values for Other amounts
  Both: Both options supported

Decision 5. Fourth QTR LM (consistency decision)
  Yes:  Direct representation of fourth quarter labor and material actual amounts
  No:   Annual totals held for Labor and Material Actual amounts
  Both: N/A
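
To make the decisions in Table E2 concrete, the Python sketch below turns one literal Annual Budget row (one column per amount, as in Fig. E1) into rows of a fully generalized amount table, that is, a "Yes" on Gen QTR, Gen LMO and Gen BA. It is a hypothetical illustration rather than any participant's model: the names `AmountRow` and `generalize` and all values are invented. It also takes the "Yes" option on Fourth QTR LM by deriving the fourth-quarter actual amount from the annual total, and it stores the annual Other amounts with no quarter.

```python
from dataclasses import dataclass
from typing import List, Optional

QUARTERS = ["First", "Second", "Third", "Last"]  # quarter names used in Fig. E1


@dataclass(frozen=True)
class AmountRow:
    """One row of a hypothetical generalized amount table."""
    department_number: int
    year: int
    quarter: Optional[int]  # None marks an annual amount
    lmo_type: str           # "Labor", "Material" or "Other" (Gen LMO)
    ba_type: str            # "Budget" or "Actual" (Gen BA)
    amount: float


def generalize(row: dict) -> List[AmountRow]:
    """Unpivot the amount columns of one literal row into AmountRow records."""
    key = (row["Department Number"], row["Year"])
    out = []
    for lmo in ("Labor", "Material"):
        # Budget amounts exist for all four quarters.
        for q_no, q_name in enumerate(QUARTERS, start=1):
            amount = row[f"Budget-{q_name}-Quarter-{lmo}"]
            out.append(AmountRow(*key, q_no, lmo, "Budget", amount))
        # Actual amounts exist for three quarters plus an annual total.
        first_three = 0
        for q_no, q_name in enumerate(QUARTERS[:3], start=1):
            amount = row[f"Actual-{q_name}-Quarter-{lmo}"]
            first_three += amount
            out.append(AmountRow(*key, q_no, lmo, "Actual", amount))
        # Fourth QTR LM = "Yes": derive quarter 4 from the annual total.
        fourth = row[f"Actual-Total-{lmo}"] - first_three
        out.append(AmountRow(*key, 4, lmo, "Actual", fourth))
    # "Other" amounts are annual only in the conceptual model.
    for ba in ("Budget", "Actual"):
        out.append(AmountRow(*key, None, "Other", ba, row[f"{ba}-Other"]))
    return out


# Invented example row; Approved-By and Discretionary-Spending-Limit are not
# amounts and would stay in a separate header table keyed by department and year.
literal_row = {
    "Department Number": 10, "Year": 2012,
    "Approved-By": "J. Smith", "Discretionary-Spending-Limit": 500,
    "Budget-First-Quarter-Labor": 100, "Budget-Second-Quarter-Labor": 110,
    "Budget-Third-Quarter-Labor": 120, "Budget-Last-Quarter-Labor": 130,
    "Actual-First-Quarter-Labor": 90, "Actual-Second-Quarter-Labor": 115,
    "Actual-Third-Quarter-Labor": 125, "Actual-Total-Labor": 450,
    "Budget-First-Quarter-Material": 50, "Budget-Second-Quarter-Material": 50,
    "Budget-Third-Quarter-Material": 60, "Budget-Last-Quarter-Material": 60,
    "Actual-First-Quarter-Material": 55, "Actual-Second-Quarter-Material": 45,
    "Actual-Third-Quarter-Material": 60, "Actual-Total-Material": 225,
    "Budget-Other": 40, "Actual-Other": 35,
}

rows = generalize(literal_row)
```

The eighteen amount columns of the literal row become eighteen rows that differ only in their quarter, type and budget/actual values, so a new quarter or amount type would need a new row rather than a new column.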

[Figure: bar chart titled "Frequency of Design Options". Y-axis: Frequency. X-axis: Design Option (Gen Qtr, Gen LMO, Gen BA, Other Qtr, Fourth Qtr LM). Series: Yes, No, Unclear, Both.]

Fig. E2. Frequency of design options.

Appendix F Laboratory materials – style in data modeling

Three data modeling problems were used in this research component.
The Annual Budget problem: participants were presented with a conceptual model and some supporting information, and asked to produce a logical data model (the model shown in Appendix E).
A Bank Loans problem: a simplified version of a real example, presented as a short plain-language description written by the author.
A Family Tree problem: it included the concept of marriage, presented as a short plain-language description written by the author.

Bank loans data modeling problem.

Bank loans
To support the business of a bank, we need to record details of personal loans, housing loans and motor vehicle finance loans. Against each loan, we need to record the details of the borrower(s), the Loan Officer who approved the loan, and (in some cases) a guarantor. We also need to record payments, drawings (initial and further borrowings) and interest transactions against each loan.

Family tree data modeling problem.

Family tree
We are developing a database to record details of a family tree. For each person of interest to us, we need to be able to record details (where known) of their mother, father, children, and marriages, and their date of birth, death and marriages.
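
The Family Tree problem admits both a literal and a generalized representation of the same facts. The Python sketch below is one hypothetical way to show the contrast; the class and field names (`PersonLiteral`, `Marriage`, `Relationship`, `to_relationships`) and the sample people are assumptions made for illustration and do not come from the study materials or from any participant's model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


# Literal design: mother and father are specific columns, marriage has its own table.
@dataclass
class PersonLiteral:
    person_id: int
    name: str
    mother_id: Optional[int] = None
    father_id: Optional[int] = None
    date_of_birth: Optional[date] = None
    date_of_death: Optional[date] = None


@dataclass
class Marriage:
    spouse1_id: int
    spouse2_id: int
    date_of_marriage: Optional[date] = None


# Generalized design: every link between two people is a Relationship row.
@dataclass
class Relationship:
    kind: str        # "mother", "father" or "marriage"
    from_id: int     # the parent, or the first spouse
    to_id: int       # the child, or the second spouse
    start_date: Optional[date] = None


def to_relationships(people: List[PersonLiteral],
                     marriages: List[Marriage]) -> List[Relationship]:
    """Re-express the literal records as generalized Relationship rows."""
    out = []
    for person in people:
        if person.mother_id is not None:
            out.append(Relationship("mother", person.mother_id, person.person_id))
        if person.father_id is not None:
            out.append(Relationship("father", person.father_id, person.person_id))
    for m in marriages:
        out.append(Relationship("marriage", m.spouse1_id, m.spouse2_id,
                                m.date_of_marriage))
    return out


# Invented sample data.
people = [
    PersonLiteral(1, "Ann"),
    PersonLiteral(2, "Bob"),
    PersonLiteral(3, "Cy", mother_id=1, father_id=2),
]
marriages = [Marriage(1, 2, date(1980, 5, 1))]

relationships = to_relationships(people, marriages)
```

Both designs record the same three facts here; the generalized one would accept a new kind of relationship as data, while the literal one would need a new column or table.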

[Figure: bar chart titled "Frequency of Design Option". Y-axis: Frequency. X-axis: Design Option (Customer, Party, Party Relationship, Transaction). Series: Not Generalised, Generalised.]

Fig. F1. Bank Loans generalization decisions.

[Figure: bar chart titled "Frequency of Design Options". Y-axis: Frequency. X-axis: Design Option (Person, Relationship, Parenthood). Series: Not Generalised, Generalised.]

Fig. F2. Family Tree generalization decisions.

Figs. F1 and F2 show the frequency with which each generalization option was used in the two models.

References

[1] B. Lawson, How Designers Think: The Design Process Demystified, 4th ed., Architectural Press, Oxford, 2005.
[2] S.K. Milton, E. Kazmierczak, An ontology of data modelling languages: a study using a common-sense realistic ontology, Journal of Database Management 15 (2), 2004, pp. 19–38.
[3] Y. Wand, R. Weber, Research commentary: information systems and conceptual modeling: a research agenda, Information Systems Research 13 (4), 2002, pp. 363–376.

Graeme Simsion is an Information Systems Consultant, Educator, and Researcher. For 20 years he was CEO of a business and information systems consultancy with offices in three Australian cities. His PhD from The University of Melbourne examined attitudes and practices of data modeling practitioners. He is the author of Data Modeling Essentials, one of the most widely used practitioner texts on the subject, Data Modeling Theory and Practice, and numerous academic and practitioner articles, and is a regular speaker at industry and academic forums. His current focus is on improving the consulting skills of business and information systems professionals.

Simon Milton is a Senior Lecturer in the Department of Computing and Information Systems at The University of Melbourne, and received his PhD from The University of Tasmania in which he reported the first comprehensive analysis of data modeling languages using a common-sense realistic ontology. Dr Milton continues his interest in the ontological foundations and practice of data modeling. He is also interested in the value and use of ontologies for business and biomedicine.

Graeme Shanks is an Australian Professorial Fellow in the Department of Computing and Information Systems at The University of Melbourne. He received his PhD from Monash University. His research interests focus on the management and impact of information systems, business analytics, data quality and conceptual modeling. Graeme has published in journals including MIS Quarterly, Journal of Information Technology, Information Systems Journal, Information & Management, Journal of the AIS, Electronic Commerce Research, Journal of Strategic Information Systems, Information Systems, Behaviour and Information Technology, Communications of the AIS, Communications of the ACM, and Requirements Engineering.
