Data Architecture
BUSINESS PROBLEM
The need for a Data Architecture
VISION AND SCOPE
SECTION 2 - FEATURES AND BENEFITS
Simplicity
Classify and Organise Data Types and Definitions
Classify and Organise Data Storage and Access
Integrated
Management
BENEFITS
Standard Data Access Interface
Reuse IT Infrastructure
Flexibility to adapt to changing data requirements
Data Integrity
Location Transparency
SECTION 3 - A FRAMEWORK FOR DEVELOPMENT OF THE ARCHITECTURE
SECTION 4 - POLICIES
Description
This Data Architecture is one of four architectural components that comprise the Enterprise
Architecture. A data architecture is a framework for managing the data of an enterprise. This
framework defines standards, procedures and guidelines for data management. There are two distinct
views of data, reflected in the distinction between data administration and database administration.
Data administration classifies and organises data into data models containing attributes, entities and
relationships. Database administration classifies and organises data into databases containing data
fields and tables.
Operations on data are the province of data administration, application development, and database
administration. Operations on data can be broadly classified into two categories: maintenance and
business. Maintenance operations facilitate the creation, update, and deletion of data; they deal with
data content. Business operations apply algorithms (calculation, derivation, aggregation) and manage
relationships between the data; they deal with information. Rules for both classes of operation are
defined to provide information to the business. Rules for data content are the province of data
administration, application development and database administration: data administration collects the
rules, while application development and database administration implement them.
Both functions define policy for data and database management, together with the standards,
procedures and guidelines which implement that policy.
This version of the data architecture focuses on a framework to describe a data architecture, data and
database management policy, and data distribution.
Limitations
This version of the data architecture is incomplete. Standards, Procedures and Guidelines to
implement policy are not defined. Technology requirements, for input to the technology architecture,
are not specified.
Approach
The approach taken is to develop the IT/IS principles for Data and Information from the IS/IT
Planning exercise with ADCO’s data administration and database administration staff. The policies
cover three subject areas: Data Administration, Data Distribution and Database Management.
The data architecture is described in four sections and a number of appendices. The first section
describes the Business Problem to be solved by the Data Architecture, the User Profiles of the users of
data, and the Vision and Scope. The second section describes the features and benefits of the data
architecture. The third section presents a framework for development of the data architecture. The
fourth section states policies for Data Management, Data Distribution and Database Management.
Bases for data distribution are discussed and a basis for data distribution is stated as policy. In this
discussion, a concept for data classification is presented which aids discussions of data distribution.
Appendix A is a discussion of data distribution scenarios. Appendix B briefly describes Microsoft’s
vision for unified facilities for data storage, exchange and information management and current and
future technologies in support of that vision. Appendix C presents notes for implementation of a Data
Warehouse.
Business Problem
The diversity of data and storage mechanisms, decentralised and autonomous data collection and
processing can combine to create an environment where the credibility of data is suspect. Each
organisational unit maintains its own data and the enterprise cannot determine which data is valid for
business decision-making. An ungoverned environment for data definition and storage results in an
inability to rely on or locate data for decision making. Lack of data definition and/or multiple
definitions can lead to misinterpretation of data, or different interpretations of the same data by
different business units. An enterprise requires many types of data to support business activities,
much of it unstructured.
Simplicity
The data architecture must be easy to understand and easy to communicate.
Management
The data architecture must define policy and standards, procedures and guidelines for data and data
storage management.
Benefits
Here is a partial list of benefits.
Reuse IT Infrastructure
Standard data storage architectures mean that IT infrastructure is reused for each new data storage
requirement instead of purchasing additional data storage and data management technologies.
Data Integrity
By exercising data management policy, data integrity can be assured.
Location Transparency
Users will not need to know where data is physically located.
[Figure: Data management framework. Business Analysis (Analysts & Data Owners) produces Functional
Requirements and a Design Specification and Rules. Activities DEFINE, DESIGN OR GATHER, EVALUATE,
BUY OR BUILD, BUILD DATA MODELS, DEPLOY and ADMINISTER lead to an implemented Database/File/Object
Store. Data Administration, Database Administration (Data Stewards) and Users administer the result.]
This framework illustrates activities, mechanisms and roles involved in data management. At the top
there are activities - Business Analysis which produces data requirements and functional
specifications, and Database design which produces schema definitions, or file designs or object
storage designs.
Data requirements are transformed into Data Definitions. From a Functional Specification Data
Access Definitions are derived. From Database/File/Object Store Design, a database
schema/file/Object Store is defined.
The mechanisms used to capture or implement these outputs are a Data Repository or modelling tools.
Applications are built or purchased. Storage definitions are implemented (deployed) as a Database
Management System, File System or Object File Store.
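The capture of these outputs could be modelled minimally as repository entries. This is a sketch only: the entry fields below are assumptions for illustration, not the schema of any particular repository product.

```python
# A minimal, hypothetical data-repository entry: data definitions,
# data access definitions and storage definitions captured together.
repository = {}

def register(name, data_definition, access_definition, storage_definition):
    """Capture the three outputs of the framework for one entity."""
    repository[name] = {
        "data_definition": data_definition,        # from data requirements
        "access_definition": access_definition,    # from functional specification
        "storage_definition": storage_definition,  # from database/file/object design
    }

register("Well",
         data_definition={"attributes": ["well_id", "name", "spud_date"]},
         access_definition={"read": ["Technical"], "write": ["Data Stewards"]},
         storage_definition={"dbms": "relational", "table": "WELL"})

print(sorted(repository["Well"]))
```

Keeping the three definitions in one entry is what lets data administration trace a storage implementation back to the business requirement that produced it.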
1. Data Ownership
A data owner will be assigned for each subject area and will be responsible for defining data
meaning, access rules and rules for data content.
2. Data Stewardship
A data steward will be assigned for each subject area and will be responsible for administration of
changes to data content, data access rules and accuracy of data according to the rules defined by
the Data Owner.
· Entity Definitions
· Attribute Definitions
· Relationship Definitions
· The same standards will be applied to data irrespective of the database product used to physically
store the instances of data
1. There will be only one copy of managed data. If values are duplicated, all copies will be
maintained to be consistent. In the case where data is duplicated, ITS and the users will
define a versioning strategy at the attribute level to achieve consistency. The case for
duplicate data must be justified by the users.
In the case of packaged software, one of the evaluation criteria will be the adherence of
the package to this principle.
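Where duplication is justified, the attribute-level versioning strategy could be sketched as follows. This is a hypothetical illustration: the attribute names and the latest-timestamp-wins rule are assumptions for the example, not defined ADCO policy.

```python
from datetime import datetime

def reconcile(copy_a, copy_b):
    """Merge two copies of the same record attribute by attribute.
    Each copy maps attribute name -> (value, last_updated); the
    versioning rule assumed here is that the later timestamp wins."""
    merged = {}
    for attr in set(copy_a) | set(copy_b):
        a = copy_a.get(attr)
        b = copy_b.get(attr)
        if a is None:
            merged[attr] = b
        elif b is None:
            merged[attr] = a
        else:
            merged[attr] = a if a[1] >= b[1] else b
    return merged

# Two duplicated copies of a (hypothetical) well record.
site = {"well_name": ("BU-101", datetime(1996, 3, 1)),
        "status":    ("active", datetime(1996, 5, 2))}
head_office = {"well_name": ("BU-101", datetime(1996, 4, 1)),
               "status":    ("suspended", datetime(1996, 4, 30))}

print(reconcile(site, head_office))
```

Resolving at the attribute level, rather than record level, is what allows different attributes of one record to be maintained at different locations without losing updates.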
Location of Data
The IS/IT Planning Exercise (H195) produced an Enterprise Data Model (“EDM”). The EDM is an
integral part of ADCO’s data architecture. The EDM is defined in terms of Subject Areas, Subject
Area Databases, Data Collections, Entities, and Entity-Relationship diagrams. (Refer to Ernst &
Young Navigator Systems Series (E&Y/NSS SM): IT/IS Planning Approach for definitions of these
terms).
· Location of Application
· Location of Users
· Organizational hierarchy
· Point of origin
Location of Application
Appendix A (Data Distribution Scenarios) asserts that “data follows function” and, to achieve locality
of reference, “function follows users”. However, this strategy may require periodic re-evaluation of
ADCO’s data distribution and fails to take into account the changing structure of future applications.
Step back in time 100 years to the oil fields of Texas, USA and ask the question “What entities could
we identify using current data modelling techniques?”. “Well”, “Pipeline”, “Lease”, “Stakeholder”,
“Distributor”, “Supplier” are a few of the entities that were present then and are present now in a data
model of an enterprise whose business is Oil Exploration and Drilling. Similarly the core business
activities do not change - Exploration, Drilling and Delivery. Unless an enterprise changes the
business it is in, say from manufacturer to distributor, the core entities and processes do not change.
What does change are the attributes we keep about each entity, and the way in which we execute the
core business processes. This is why we can buy packages for Finance, Human Resources,
Maintenance and Supply, and Oil Exploration and Drilling.
When we change the way we conduct business using information technology, we change the
applications we use. Improvement and innovation in business processes, through business process
re-engineering and workflow management, cause change to our applications. Applications driven by a
workflow management system are no longer based on activities and have a very different structure
from today’s applications. The frequency of change in the way we execute our business processes
therefore makes application location an unstable basis for data distribution.
Location of Users
Location of users is an important factor in making data distribution decisions because of technology
constraints, and more importantly to the user, because the user owns and manages the data. If we had
infinitely powerful computers, databases and networks, then data would not need to be distributed
because it would be readily accessible in an acceptable amount of time at an acceptable cost.
However, arguments based purely on technology rarely win the day since they fail to take into account
preferences, needs, wants, financial “ownership” of technology, and organisational politics.
Organizational Hierarchy
A common basis for data distribution decisions is the organizational hierarchy. This scheme is often
used where a company has an internal structure of autonomous lines of business operating under the
umbrella of the company. This may happen when one company operates in separate business sectors,
or clearly delineated segments of a business sector. (For example in Life Assurance: The business
sector is Life Assurance; The lines of business are: Individual Business; Employee Benefits (Pension
Schemes); Investments - Equity and Property; Unit Trusts or Mutual Funds). Another reason for
using the organizational hierarchy as a basis for data distribution is that a comprehensive data
modelling exercise has never been undertaken. In this scheme data is located according to the
location of the organizational units.
The disadvantage of this scheme is that it does not take into account enterprise applications - those
applications that cross vertical organizational boundaries. This often results in duplication of data in
different business units; a need to replicate data for enterprise applications to overcome performance
problems in accessing data spread over a network; and a need for location transparency so that
enterprise applications are not affected by server and database deployment decisions.
Point of Origin
In a highly centralized IT environment using mainframe databases there is no data distribution. All
data resides on the mainframe. This forced the point of origin to be the mainframe and influenced the
way we think about data placement, ownership, custodianship and stewardship. Because all data was
centralized and defined and administered by MIS (Business Analysts, Systems Analysts, Data &
Database Administrators), the business community implicitly delegated these roles to MIS.
The advent of applications designed to the client-server model, deployed in a heterogeneous
distributed computing environment, and the opportunity to move data from the mainframe cause these
issues to surface. Point of Origin can now be at a branch office stored on a local database, or even on an
individual’s personal computer with a local database. In this situation, a mainframe database might
revert to holding copies of data instead of originals for wider access. In a distributed computing
environment an understanding of ownership and stewardship is required to effect data management.
The business community has the opportunity to resume these roles which have been assumed by MIS.
· Data Collection
· Subject Database
· Types of Data
Data Collection
A Data Collection is defined as “The physical collection or means of management of one or more
entity types”. A Data Collection is characterized by Data Type, DBMS, Security, Frequency of Use,
Volume and Application Usage (Source: E&Y/NSSSM: Planning Phase Techniques).
Types of Data
A commonly-used scheme classifies data by owners in the organizational hierarchy. Using this
approach data is classified as Line-Of-Business, Divisional, Departmental, Team and Personal.
So-called “Corporate data” never gets defined because the only corporate entity above a line-of-business
is usually the board of directors, who are certainly not going to be interested in these details, nor take
on ownership and stewardship of data. Another point of view asserts that all data is corporate, and all
we need to know is who owns and maintains it. Yet another point of view asserts that “corporate”
data is determined by the degree of usage or sharing of a piece of data by applications.
For example, ADCO’s major organisational units are Business Support, Operations, Technical and
Administration (Source: IS Migration Project Repository - Hierarchy Report “Organisational Unit is
Responsible for Process”). Any data that is commonly used by all major organisational units of
ADCO is defined to be common enterprise data and will be centrally administered to maintain
consistency of data.
Examples of entities that are candidates for classification as “centrally administered data” are
Company, Contractor, and Field (Source: IS Migration Project Repository - Hierarchy Report “Subject
Area involves Entity”). Similarly, any data that is commonly used by, say, all organisational units
within Technical is defined to be data common to the Technical division of ADCO and will be
centrally administered within the Technical division of ADCO. The entity types defined by the Finder
data model are an example of data common to all organisational units within the Technical division of
ADCO.
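The usage-based classification described above can be expressed as a small function. Only the four major organisational units come from the source; the entity-to-unit usage examples are invented for illustration.

```python
MAJOR_UNITS = {"Business Support", "Operations", "Technical", "Administration"}

def classify(entity, used_by):
    """Classify an entity by the organisational units that use it:
    data used by all major units is common enterprise data; data
    used by one unit is administered within that unit."""
    units = set(used_by)
    if MAJOR_UNITS <= units:
        return "centrally administered (enterprise)"
    if len(units) == 1:
        return f"administered within {used_by[0]}"
    return "administered within divisions: " + ", ".join(sorted(units))

print(classify("Company",
               ["Business Support", "Operations", "Technical", "Administration"]))
print(classify("Seismic Survey", ["Technical"]))
```

The same rule applies recursively: within Technical, data used by all of its sub-units would be administered centrally for the division.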
Occurrences of the same data type at more than one point of origin
The method of distribution is to partition data horizontally (See Appendix A).
For example, in the field, the same data types may be used at all sites but each site maintains its own
occurrences of those data types. Another example of locally administered data is a project database in
the Finder product. Project databases are a horizontal partition, and replica, of the Finder database.
As a by-product this type of distribution solves the classic problem of relating an occurrence of data to
a user or group of users (data privacy). Defining privacy rules per occurrence of data is a massive
administrative and processing overhead and no commercial relational database products provide this
feature for obvious reasons. Implementation of privacy on an occurrence of a data type requires the
definition and storage of properties. These techniques are being introduced for object stores and these
techniques might be adopted by relational database vendors.
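Horizontal partitioning by site can be sketched in a few lines. The site and well names below are hypothetical, and a real implementation would partition at the DBMS level rather than in application code.

```python
def partition_by_site(rows, site_key="site"):
    """Horizontally partition rows: each site keeps only the
    occurrences of the data type that it originates."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[site_key], []).append(row)
    return partitions

# Same data type at every site; each site owns its own occurrences.
wells = [
    {"well": "BU-101", "site": "Bu Hasa"},
    {"well": "AS-007", "site": "Asab"},
    {"well": "BU-102", "site": "Bu Hasa"},
]
parts = partition_by_site(wells)
print(sorted(parts))
print([r["well"] for r in parts["Bu Hasa"]])
```

Because each occurrence lives only in its owning site's partition, access control falls out of data placement, which is the privacy by-product noted above.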
This ends the discussion of the rationale for data distribution and we can now state a data distribution
policy based on this rationale.
1. Decentralised Data
Locally administered data will be decentralised and will reside at its point of origin.
2. Centralized
A database server may be connected to a LAN or directly to the backbone. The location of the LAN is
determined by the type of data contained in the subject area database - centrally or locally maintained
data. Note that “central location” means one place, not necessarily head office. Also, locally
administered data may not necessarily be located at the physical location of the users who maintain
the data. These are all implementation decisions.
If all of the data maintained by a package is deemed to be locally administered data, then a copy of the
application and its database must be installed at the desired location, at extra cost. If some data is
classified as centrally administered and some is classified as locally administered then ADCO will be
dependent on the capability to disable functionality that maintains centrally administered data at local
sites, and to disable functions that maintain locally administered data at the central location.
Depending on the package’s data model this may not be possible if there are relationships between
centrally administered data and locally administered data. It is possible to implement such a solution
but only if the software is custom-built.
Another sub-optimal solution is to maintain all data centrally and use the database management
system to replicate data that must be available locally (this is not the same as locally administered
data).

[Figure: Subject area database servers attached to a LAN or to the backbone; desktop computers
access a subject area database server on the LAN, while a “super” subject area database server
attaches directly to the backbone.]
1. Whenever Service Level Agreements (“SLAs”) for Data Availability cannot be met.
To be resolved with the Data Owner and Data Steward (See Data Availability)
· Other classes....
ITS and users will jointly establish archiving rules for historical data.
3. Service Level Agreements (“SLA”) will be established for each class of data.
· Lifetime of SLA
· Frequency of Review
These will be factored by volume and number of transactions: the larger the volume and
the number of transactions, the more frequent and the longer the database maintenance
activities required to maintain optimal performance.
· Forecast Growth
· Other suggestions:
Members are Data Administration, All Data Owners, All Data Stewards.
· Set up a Knowledge Base - a user-oriented resource describing knowledge about data and data
administration procedures.
· Data Owner’s security policy will be implemented using the DBMS’s security mechanism.
Users or User Groups are registered for none, one, some or all of Select, Insert, Update and
Delete access.
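Registering users or user groups for none, one, some or all of the four access rights could be driven from a simple access matrix. The sketch below renders such a matrix as standard SQL GRANT statements; the table and group names are invented for the example.

```python
ACCESS_MATRIX = {
    # (user_group, table): rights granted by the Data Owner
    ("technical_users", "WELL"): {"SELECT"},
    ("data_stewards", "WELL"): {"SELECT", "INSERT", "UPDATE", "DELETE"},
}

def grant_statements(matrix):
    """Render the Data Owner's security policy as SQL GRANTs,
    implemented through the DBMS's own security mechanism."""
    stmts = []
    for (group, table), rights in sorted(matrix.items()):
        if rights:  # groups registered for "none" get no statement
            stmts.append(
                f"GRANT {', '.join(sorted(rights))} ON {table} TO {group};")
    return stmts

for s in grant_statements(ACCESS_MATRIX):
    print(s)
```

Driving the grants from one matrix keeps the Data Owner's policy in a single reviewable place rather than scattered across ad hoc GRANT scripts.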
· Uploads from personal databases to databases containing managed data must be vetted and
sanctioned by the data owner and steward of the target database.
Uploads will be via a secure “transaction” database containing data integrity rules.
· Downloads from databases containing managed data will be subject to security rules set by data
owner and data steward of the source database.
· Uploads from personal spreadsheets, text files and floppy disk files will not be permitted.
· Reports from databases will include the name of the database, date and time of report.
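The report-stamping rule is straightforward to apply uniformly; a minimal sketch follows (the header format itself is an assumption, not a mandated layout).

```python
from datetime import datetime

def report_header(database_name, now=None):
    """Standard header for database reports: the name of the
    database plus the date and time the report was produced."""
    now = now or datetime.now()
    return f"Database: {database_name}  Report produced: {now:%d-%b-%Y %H:%M}"

print(report_header("WELL_DB", datetime(1996, 7, 1, 9, 30)))
```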
· Frequency of backup will be determined by SLA for each subject area database.
· Database administration will provide input for LAN Server capacity planning.
· Database Integrity
A Database Maintenance schedule will be planned for each quarter and reviewed monthly.
· Risk Management
General Discussion
Physical database design is discussed in a wide body of literature and practice that will not be
reiterated here. However, one critical issue that needs to be discussed is strategies and approaches for
distributing data. Distributing database(s) has the same goals: gain locality of reference, take
advantage of available processing power, improve scalability, and improve availability.
One of the most difficult issues for database designers attempting to distribute database(s) is
autonomy, or the distribution of control of the data. This is the degree to which a single site can run
independently of other sites. The degree of autonomy in a data distribution strategy can exist at any
point along a continuum from complete coupling (total dependence) to total isolation (total
independence).
In a system with complete coupling, it is easier to present the user with the illusion of a single
integrated database. In this environment, each location in the distributed database has complete
knowledge of the state of the system and must have components that can control actions on data
spanning multiple locations. This complete knowledge is very difficult to attain and manage across
large systems. Additionally, every location must continue to have complete knowledge about other
locations, which makes ongoing changes and maintenance difficult (regression effects abound).
In a system with total isolation, each location operates with a standalone database and is unaware of
the other locations in the system. To distribute data in this model, a distribution strategy must be
layered on top of the local site, and all co-ordination and data transformations must be handled by this
layer. Note that transaction control across multiple sites in this model is extremely difficult.
The best approach lies somewhere in between complete coupling and total isolation. Based on the
software component distribution (the software components should already be as loosely coupled and
strongly cohesive as possible), it should be possible to look at sites independently and have them
voluntarily participate in areas where data needs to be shared. The independent view of the local sites
requires some modification to incorporate a knowledge level about the site(s) of distributed data, but
they only need knowledge about a limited subset rather than knowledge of all the data and
relationships in the enterprise. This information should already be captured in the transaction design
for the software components. This validates the earlier statement that if software components are
distributed, the logical sites for the data stores should follow.
So, what is to be distributed is known, but what is the best way to distribute it? Distribution must be
examined based on the characteristics of the data to be distributed. The key characteristics for this
decision are ease of partitioning, volatility, and site. Ease of partitioning refers to how easily the
database can be partitioned - vertically or horizontally. Volatility refers to how often the data itself
changes. Site refers to where the updating takes place (and again should be able to be determined
from examining the software components that use the data and how they are distributed). The
following sections summarize the basic strategies for distribution that can be applied based on these
characteristics:
Centralized
With centralized data, data is stored in a database at a central location. The database is called the
central database. “Centralized data” may mean centralized data for a whole corporation, but it might
be centralized data for a division, branch or department (or any other organizational unit or grouping).
Centralized means in one place. Centralized does not necessarily mean mainframe (although
centralized data could be implemented on a mainframe, or a super-server, or a PC Server). Along
with centralized location, this strategy also means that no copies of the data are made. When the data
cannot be partitioned, look at a centralized strategy. A centralized strategy can also be used where
there is medium to high volatility and multi-site update requirements to the same data.
[Figure: Centralized strategy - all updates and retrievals run against a single central database.
Partitioned strategy - each partition database receives the updates and retrievals for its own
partition.]
Replica
When the site of update changes to a local site, a different set of strategies is needed. Updates cannot
be allowed on both the local and central databases. In this situation, use a replica. A replica is a
partition or copy of a central database that may be updated. If there can be some delay in
synchronization of the local and central databases, use a periodic replica, where incremental images
of changes are processed in batch mode and sent back to the central database as a group. If
synchronization is critical, use a continuous replica. This is a replica that synchronizes continuously
with the central database. As each change occurs, it is replicated back to the central database as an
individual transaction. This is not a two-phase commit: the user does not have to wait for the
duplicate portion of the update. A continuous replica is often implemented with some type of store-and-forward
mechanism. A checkout replica is useful when the location of an update cannot be predetermined.
This is a replica that is flagged in the central database as checked out; no one else can update it, but
others could still view it (or extract it).
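The periodic and checkout replica styles can be sketched as follows. The change-log and flag structures are hypothetical; in practice the DBMS's own replication facilities would implement them.

```python
# Central database as a simple key-value table (illustrative only).
central = {"BU-101": "active", "AS-007": "suspended"}

def periodic_replica_sync(central_db, change_log):
    """Periodic replica: incremental changes accumulated at the
    local site are applied to the central database as one batch."""
    for key, value in change_log:
        central_db[key] = value
    change_log.clear()
    return central_db

# The local site batches its changes, then synchronizes as a group.
log = [("BU-101", "suspended"), ("BU-102", "active")]
periodic_replica_sync(central, log)
print(central)

def checkout(central_flags, key, user):
    """Checkout replica: flag the record as checked out in the
    central database so no one else can update it; others may
    still view or extract it."""
    if central_flags.get(key):
        raise RuntimeError(f"{key} already checked out by {central_flags[key]}")
    central_flags[key] = user

flags = {}
checkout(flags, "BU-101", "analyst1")
print(flags)
```

A continuous replica would differ only in sending each change as it occurs (store-and-forward) rather than accumulating a batch.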
[Figure: Replica strategy - a copy or partition of the central database receives local updates and
retrievals and is synchronized with the central database by periodic or continuous, single- or
bi-directional replication.]
· What are the trade-offs between the overhead of copying and storing multiple copies of data
versus remote data access?
· Is there a large enough time window to allow effective data extracts or replications?
· Should all distributed databases have the same schema to facilitate replication or will
extensions and optimizations be allowed?
· Can the network handle the traffic size and volume of the anticipated replication strategy?
· Can the distributed data be effectively managed (security and disaster recovery)?
A final and important consideration is the enterprise’s readiness to design, implement and manage a
distributed data environment. For an enterprise migrating from a centralized mainframe database
environment to a client-server environment, there are many other issues, over and above data
distribution, to cope with. From a risk management perspective, implementing a distributed data
policy may be one risk too many. The need for distributed data needs to be prioritized and balanced
against other needs. In an enterprise that relies primarily on purchased software, the degree of control
the enterprise has over data distribution strategy may be limited and it is likely that each package’s
database will be centralized with respect to the application package.
An automated collection mechanism is required to collect the data needed in the data warehouse. This
data may be exact copies of operational data, in which case replication technology can be used to
populate the data warehouse.
Typically, the content of a data warehouse is aggregated data and aggregation procedures can be
defined on operational databases to either populate a data warehouse directly, or to build an extract
database and replicate that to the data warehouse, again using replication technology.
Where the source database and the data warehouse are disparate, either middleware that can access a
variety of database types or export/import procedures can be used.
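The aggregation route can be sketched with an embedded database standing in for both the operational database and the warehouse. The schema, table names and figures below are invented for illustration.

```python
import sqlite3

# One in-memory database stands in for both the operational
# database and the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE production (well TEXT, month TEXT, barrels INTEGER);
    CREATE TABLE warehouse_production (month TEXT, total_barrels INTEGER);
    INSERT INTO production VALUES
        ('BU-101', '1996-06', 1200),
        ('BU-102', '1996-06', 800),
        ('BU-101', '1996-07', 1100);
""")

# Aggregation procedure defined on the operational data,
# populating the warehouse table directly.
conn.execute("""
    INSERT INTO warehouse_production
    SELECT month, SUM(barrels) FROM production GROUP BY month
""")

rows = conn.execute(
    "SELECT * FROM warehouse_production ORDER BY month").fetchall()
print(rows)
```

The same SELECT could instead populate an extract database, which replication technology then copies to the warehouse, as described above.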
Having a mass of data stored in a data warehouse is not very useful unless the data warehouse has
features that let the user display and navigate its contents.
For the more sophisticated user, who has a knowledge of the data warehouse structure and content, a
data access and reporting tool is needed to provide the facility to discover information in the data
warehouse.
[Figure: A data collection mechanism feeds the data warehouse (MIS database) from operational
databases such as the Human Resources database, the Supply & Maintenance database, and other
business unit databases.]