Data Architecture
BUSINESS PROBLEM
The need for a Data Architecture
VISION AND SCOPE
SECTION 2 - FEATURES AND BENEFITS
Simplicity
Classify and Organise Data Types and Definitions
Classify and Organise Data Storage and Access
Integrated
Management
BENEFITS
Standard Data Access Interface
Reuse IT Infrastructure
Flexibility to adapt to changing data requirements
Data Integrity
Location Transparency
SECTION 3 - A FRAMEWORK FOR DEVELOPMENT OF THE ARCHITECTURE
SECTION 4 - POLICIES
Description
This Data Architecture is one of four architectural components that comprise the Enterprise
Architecture. A data architecture is a framework for managing the data of an enterprise. This
framework defines standards, procedures and guidelines for data management. There are two distinct
views of data, reflected in the distinction between data administration and database administration.
Data administration classifies and organises data into data models containing attributes, entities and
relationships. Database administration classifies and organises data into databases containing data
fields and tables.
Operations on data are the province of data administration, application development, and database
administration. Operations on data can be broadly classified into two categories: maintenance and
business. Maintenance operations facilitate the creation, update, and deletion of data; they deal with
data content. Business operations apply algorithms (calculation, derivation, aggregation) and manage
relationships between the data; they deal with information. Rules for both classes of operation are
defined to provide information to the business. Rules for data content are the province of data
administration, application development and database administration: data administration collects the
rules, while application development and database administration implement them.
Both functions define policy for data and database management, together with the standards,
procedures and guidelines which implement that policy.
This version of the data architecture focuses on a framework to describe a data architecture, data and
database management policy, and data distribution.
Limitations
This version of the data architecture is incomplete. Standards, Procedures and Guidelines to
implement policy are not defined. Technology requirements, for input to the technology architecture,
are not specified.
Approach
The approach taken is to develop the IT/IS principles for Data and Information from the IS/IT
Planning exercise with ADCO’s data administration and database administration staff. The policies
cover three subject areas: Data Administration, Data Distribution and Database Management.
The data architecture is described in four sections and a number of appendices. The first section
describes the Business Problem to be solved by the Data Architecture, the User Profiles of the users of
data, and the Vision and Scope. The second section describes the features and benefits of the data
architecture. The third section presents a framework for development of the data architecture. The
fourth section states policies for Data Management, Data Distribution and Database Management.
Bases for data distribution are discussed and a basis for data distribution is stated as policy. In this
discussion, a concept for data classification is presented which aids discussions of data distribution.
Appendix A is a discussion of data distribution scenarios. Appendix B briefly describes Microsoft’s
vision for unified facilities for data storage, exchange and information management and current and
future technologies in support of that vision. Appendix C presents notes for implementation of a Data
Warehouse.
Business Problem
The diversity of data and storage mechanisms, decentralised and autonomous data collection and
processing can combine to create an environment where the credibility of data is suspect. Each
organisational unit maintains its own data and the enterprise cannot determine which data is valid for
business decision-making. An ungoverned environment for data definition and storage results in an
inability to rely on or locate data for decision making. Lack of data definition and/or multiple
definitions can lead to misinterpretation of data, or different interpretations of the same data by
different business units. An enterprise requires many types of data to support business activities,
much of it unstructured.
Simplicity
The data architecture must be easy to understand and easy to communicate.
Management
The data architecture must define policy and standards, procedures and guidelines for data and data
storage management.
Benefits
Here is a partial list of benefits.
Reuse IT Infrastructure
Standard data storage architectures mean that IT infrastructure is reused for each new data storage
requirement instead of purchasing additional data storage and data management technologies.
Data Integrity
By exercising data management policy, data integrity can be assured.
Location Transparency
Users will not need to know where data is physically located.
[Figure: Data management framework. Business Analysis (Analysts & Data Owners) produces Functional
Requirements and a Design Specification and Rules. Activities DEFINE, DESIGN OR GATHER, EVALUATE,
BUY OR BUILD, BUILD DATA MODELS, DEPLOY and ADMINISTER lead to an implemented Database/File/Object
Store. Data Administration, Database Administration (Data Stewards) and Users administer the result.]
This framework illustrates activities, mechanisms and roles involved in data management. At the top
there are activities - Business Analysis which produces data requirements and functional
specifications, and Database design which produces schema definitions, or file designs or object
storage designs.
Data requirements are transformed into Data Definitions. From a Functional Specification Data
Access Definitions are derived. From Database/File/Object Store Design, a database
schema/file/Object Store is defined.
The mechanisms used to capture or implement these outputs are a Data Repository or modelling tools.
Applications are built or purchased. Storage definitions are implemented (deployed) as a Database
Management System, File System or Object File Store.
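The capture of these outputs could be modelled minimally as repository entries. This is a sketch only: the entry fields below are assumptions for illustration, not the schema of any particular repository product.

```python
# A minimal, hypothetical data-repository entry: data definitions,
# data access definitions and storage definitions captured together.
repository = {}

def register(name, data_definition, access_definition, storage_definition):
    """Capture the three outputs of the framework for one entity."""
    repository[name] = {
        "data_definition": data_definition,        # from data requirements
        "access_definition": access_definition,    # from functional specification
        "storage_definition": storage_definition,  # from database/file/object design
    }

register("Well",
         data_definition={"attributes": ["well_id", "name", "spud_date"]},
         access_definition={"read": ["Technical"], "write": ["Data Stewards"]},
         storage_definition={"dbms": "relational", "table": "WELL"})

print(sorted(repository["Well"]))
```

Keeping the three definitions in one entry is what lets data administration trace a storage implementation back to the business requirement that produced it.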
1. Data Ownership
A data owner will be assigned for each subject area and will be responsible for defining data
meaning, access rules and rules for data content.
2. Data Stewardship
A data steward will be assigned for each subject area and will be responsible for administration of
changes to data content, data access rules and accuracy of data according to the rules defined by
the Data Owner.
· Entity Definitions
· Attribute Definitions
· Relationship Definitions
· The same standards will be applied to data irrespective of the database product used to physically
store the instances of data
1. There will be only one copy of managed data. If values are duplicated, all copies will be
maintained to be consistent. In the case where data is duplicated, ITS and the users will
define a versioning strategy at the attribute level to achieve consistency. The case for
duplicate data must be justified by the users.
In the case of packaged software, one of the evaluation criteria will be the adherence of
the package to this principle.
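Where duplication is justified, the attribute-level versioning strategy could be sketched as follows. This is a hypothetical illustration: the attribute names and the latest-timestamp-wins rule are assumptions for the example, not defined ADCO policy.

```python
from datetime import datetime

def reconcile(copy_a, copy_b):
    """Merge two copies of the same record attribute by attribute.
    Each copy maps attribute name -> (value, last_updated); the
    versioning rule assumed here is that the later timestamp wins."""
    merged = {}
    for attr in set(copy_a) | set(copy_b):
        a = copy_a.get(attr)
        b = copy_b.get(attr)
        if a is None:
            merged[attr] = b
        elif b is None:
            merged[attr] = a
        else:
            merged[attr] = a if a[1] >= b[1] else b
    return merged

# Two duplicated copies of a (hypothetical) well record.
site = {"well_name": ("BU-101", datetime(1996, 3, 1)),
        "status":    ("active", datetime(1996, 5, 2))}
head_office = {"well_name": ("BU-101", datetime(1996, 4, 1)),
               "status":    ("suspended", datetime(1996, 4, 30))}

print(reconcile(site, head_office))
```

Resolving at the attribute level, rather than record level, is what allows different attributes of one record to be maintained at different locations without losing updates.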
Location of Data
The IS/IT Planning Exercise (H195) produced an Enterprise Data Model (“EDM”). The EDM is an
integral part of ADCO’s data architecture. The EDM is defined in terms of Subject Areas, Subject
Area Databases, Data Collections, Entities, and Entity-Relationship diagrams. (Refer to Ernst &
Young Navigator Systems Series (E&Y/NSS SM): IT/IS Planning Approach for definitions of these
terms).
· Location of Application
· Location of Users
· Organizational hierarchy
· Point of origin
Location of Application
Appendix A (Data Distribution Scenarios) asserts that “data follows function” and, to achieve locality
of reference, “function follows users”. However, this strategy may require periodic re-evaluation of
ADCO’s data distribution and fails to take into account the changing structure of future applications.
Step back in time 100 years to the oil fields of Texas, USA and ask the question “What entities could
we identify using current data modelling techniques?”. “Well”, “Pipeline”, “Lease”, “Stakeholder”,
“Distributor”, “Supplier” are a few of the entities that were present then and are present now in a data
model of an enterprise whose business is Oil Exploration and Drilling. Similarly the core business
activities do not change - Exploration, Drilling and Delivery. Unless an enterprise changes the
business it is in, say from manufacturer to distributor, the core entities and processes do not change.
What does change are the attributes we keep about each entity, and the way in which we execute the
core business processes. This is why we can buy packages for Finance, Human Resources,
Maintenance and Supply, and Oil Exploration and Drilling.
When we change the way we conduct business using information technology, we change the
applications we use. Improvement and innovation in business processes, through business process
re-engineering and workflow management, cause change to our applications. Applications driven by a
workflow management system are no longer based on activities and have a very different structure
from today’s applications. The frequency of change in the way we execute our business processes
therefore makes application location an unstable basis for data distribution.
Location of Users
Location of users is an important factor in making data distribution decisions because of technology
constraints, and more importantly to the user, because the user owns and manages the data. If we had
infinitely powerful computers, databases and networks, then data would not need to be distributed
because it would be readily accessible in an acceptable amount of time at an acceptable cost.
However, arguments based purely on technology rarely win the day since they fail to take into account
preferences, needs, wants, financial “ownership” of technology, and organisational politics.
Organizational Hierarchy
A common basis for data distribution decisions is the organizational hierarchy. This scheme is often
used where a company has an internal structure of autonomous lines of business operating under the
umbrella of the company. This may happen when one company operates in separate business sectors,
or clearly delineated segments of a business sector. (For example in Life Assurance: The business
sector is Life Assurance; The lines of business are: Individual Business; Employee Benefits (Pension
Schemes); Investments - Equity and Property; Unit Trusts or Mutual Funds). Another reason for
using the organizational hierarchy as a basis for data distribution is that a comprehensive data
modelling exercise has never been undertaken. In this scheme data is located according to the
location of the organizational units.
The disadvantage of this scheme is that it does not take into account enterprise applications - those
applications that cross vertical organizational boundaries. This often results in duplication of data in
different business units; a need to replicate data for enterprise applications to overcome performance
problems in accessing data spread over a network; and a need for location transparency so that
enterprise applications are not affected by server and database deployment decisions.
Point of Origin
In a highly centralized IT environment using mainframe databases there is no data distribution. All
data resides on the mainframe. This forced the point of origin to be the mainframe and influenced the
way we think about data placement, ownership, custodianship and stewardship. Because all data was
centralized and defined and administered by MIS (Business Analysts, Systems Analysts, Data &
Database Administrators), the business community implicitly delegated these roles to MIS.
The advent of applications designed to the client-server model, deployed in a heterogeneous
distributed computing environment, and the opportunity to move data from the mainframe cause these
issues to surface. Point of Origin can now be at a branch office stored on a local database, or even on an
individual’s personal computer with a local database. In this situation, a mainframe database might
revert to holding copies of data instead of originals for wider access. In a distributed computing
environment an understanding of ownership and stewardship is required to effect data management.
The business community has the opportunity to resume these roles which have been assumed by MIS.
· Data Collection
· Subject Database
· Types of Data
Data Collection
A Data Collection is defined as “The physical collection or means of management of one or more
entity types”. A Data Collection is characterized by Data Type, DBMS, Security, Frequency of Use,
Volume and Application Usage (Source: E&Y/NSSSM: Planning Phase Techniques).
Types of Data
A commonly-used scheme classifies data by owners in the organizational hierarchy. Using this
approach data is classified as Line-Of-Business, Divisional, Departmental, Team and Personal.
So-called “Corporate data” never gets defined because the only corporate entity above a line-of-business
is usually the board of directors, who are certainly not going to be interested in these details, nor take
on ownership and stewardship of data. Another point of view asserts that all data is corporate, and all
we need to know is who owns and maintains it. Yet another point of view asserts that “corporate”
data is determined by the degree of usage or sharing of a piece of data by applications.
For example, ADCO’s major organisational units are Business Support, Operations, Technical and
Administration (Source: IS Migration Project Repository - Hierarchy Report “Organisational Unit is
Responsible for Process”). Any data that is commonly used by all major organisational units of
ADCO is defined to be common enterprise data and will be centrally administered to maintain
consistency of data.
Examples of entities that are candidates for classification as “centrally administered data” are
Company, Contractor, and Field (Source: IS Migration Project Repository - Hierarchy Report “Subject
Area involves Entity”). Similarly, any data that is commonly used by, say, all organisational units
within Technical is defined to be data common to the Technical division of ADCO and will be
centrally administered within the Technical division of ADCO. The entity types defined by the Finder
data model are an example of data common to all organisational units within the Technical division of
ADCO.
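The usage-based classification described above can be expressed as a small function. Only the four major organisational units come from the source; the entity-to-unit usage examples are invented for illustration.

```python
MAJOR_UNITS = {"Business Support", "Operations", "Technical", "Administration"}

def classify(entity, used_by):
    """Classify an entity by the organisational units that use it:
    data used by all major units is common enterprise data; data
    used by one unit is administered within that unit."""
    units = set(used_by)
    if MAJOR_UNITS <= units:
        return "centrally administered (enterprise)"
    if len(units) == 1:
        return f"administered within {used_by[0]}"
    return "administered within divisions: " + ", ".join(sorted(units))

print(classify("Company",
               ["Business Support", "Operations", "Technical", "Administration"]))
print(classify("Seismic Survey", ["Technical"]))
```

The same rule applies recursively: within Technical, data used by all of its sub-units would be administered centrally for the division.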
Occurrences of the same data type at more than one point of origin
The method of distribution is to partition data horizontally (See Appendix A).
For example, in the field, the same data types may be used at all sites but each site maintains its own
occurrences of those data types. Another example of locally administered data is a project database in
the Finder product. Project databases are a horizontal partition, and replica, of the Finder database.
As a by-product this type of distribution solves the classic problem of relating an occurrence of data to
a user or group of users (data privacy). Defining privacy rules per occurrence of data is a massive
administrative and processing overhead and no commercial relational database products provide this
feature for obvious reasons. Implementation of privacy on an occurrence of a data type requires the
definition and storage of properties. These techniques are being introduced for object stores and these
techniques might be adopted by relational database vendors.
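Horizontal partitioning by site can be sketched in a few lines. The site and well names below are hypothetical, and a real implementation would partition at the DBMS level rather than in application code.

```python
def partition_by_site(rows, site_key="site"):
    """Horizontally partition rows: each site keeps only the
    occurrences of the data type that it originates."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[site_key], []).append(row)
    return partitions

# Same data type at every site; each site owns its own occurrences.
wells = [
    {"well": "BU-101", "site": "Bu Hasa"},
    {"well": "AS-007", "site": "Asab"},
    {"well": "BU-102", "site": "Bu Hasa"},
]
parts = partition_by_site(wells)
print(sorted(parts))
print([r["well"] for r in parts["Bu Hasa"]])
```

Because each occurrence lives only in its owning site's partition, access control falls out of data placement, which is the privacy by-product noted above.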
This ends the discussion of the rationale for data distribution and we can now state a data distribution
policy based on this rationale.
1. Decentralised Data
Locally administered data will be decentralised and will reside at its point of origin.
2. Centralized
A database server may be connected to a LAN or directly to the backbone. The location of the LAN is
determined by the type of data contained in the subject area database - centrally or locally maintained
data. Note that “central location” means one place, not necessarily head office. Also, locally
administered data may not necessarily be located at the physical location of the users who maintain
the data. These are all implementation decisions.
If all of the data maintained by a package is deemed to be locally administered data, then a copy of the
application and its database must be installed at the desired location, at extra cost. If some data is
classified as centrally administered and some is classified as locally administered then ADCO will be
dependent on the capability to disable functionality that maintains centrally administered data at local
sites, and to disable functions that maintain locally administered data at the central location.
Depending on the package’s data model this may not be possible if there are relationships between
centrally administered data and locally administered data. It is possible to implement such a solution
but only if the software is custom-built.
Another sub-optimal solution is to maintain all data centrally and use the database management
system to replicate data that must be available locally (this is not the same as locally administered
data).

[Figure: Subject area database servers attached to a LAN or to the backbone; desktop computers
access a subject area database server on the LAN, while a “super” subject area database server
attaches directly to the backbone.]
1. Whenever Service Level Agreements (“SLAs”) for Data Availability cannot be met.
To be resolved with the Data Owner and Data Steward (See Data Availability)
· Other classes....
ITS and users will jointly establish archiving rules for historical data.
3. Service Level Agreements (“SLA”) will be established for each class of data.
· Lifetime of SLA
· Frequency of Review
These will be factored by volume and number of transactions: the larger the volume and
the number of transactions, the more frequent and the longer the database maintenance
activities required to maintain optimal performance.
· Forecast Growth
· Other suggestions:
Members are Data Administration, All Data Owners, All Data Stewards.
· Set up a Knowledge Base - a user-oriented resource describing knowledge about data and data
administration procedures.
· Data Owner’s security policy will be implemented using the DBMS’s security mechanism.
Users or User Groups are registered for none, one, some or all of Select, Insert, Update and
Delete access.
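Registering users or user groups for none, one, some or all of the four access rights could be driven from a simple access matrix. The sketch below renders such a matrix as standard SQL GRANT statements; the table and group names are invented for the example.

```python
ACCESS_MATRIX = {
    # (user_group, table): rights granted by the Data Owner
    ("technical_users", "WELL"): {"SELECT"},
    ("data_stewards", "WELL"): {"SELECT", "INSERT", "UPDATE", "DELETE"},
}

def grant_statements(matrix):
    """Render the Data Owner's security policy as SQL GRANTs,
    implemented through the DBMS's own security mechanism."""
    stmts = []
    for (group, table), rights in sorted(matrix.items()):
        if rights:  # groups registered for "none" get no statement
            stmts.append(
                f"GRANT {', '.join(sorted(rights))} ON {table} TO {group};")
    return stmts

for s in grant_statements(ACCESS_MATRIX):
    print(s)
```

Driving the grants from one matrix keeps the Data Owner's policy in a single reviewable place rather than scattered across ad hoc GRANT scripts.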
· Uploads from personal databases to databases containing managed data must be vetted and
sanctioned by the data owner and steward of the target database.
Uploads will be via a secure “transaction” database containing data integrity rules.
· Downloads from databases containing managed data will be subject to security rules set by data
owner and data steward of the source database.
· Uploads from personal spreadsheets, text files and floppy disk files will not be permitted.
· Reports from databases will include the name of the database, date and time of report.
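The report-stamping rule is straightforward to apply uniformly; a minimal sketch follows (the header format itself is an assumption, not a mandated layout).

```python
from datetime import datetime

def report_header(database_name, now=None):
    """Standard header for database reports: the name of the
    database plus the date and time the report was produced."""
    now = now or datetime.now()
    return f"Database: {database_name}  Report produced: {now:%d-%b-%Y %H:%M}"

print(report_header("WELL_DB", datetime(1996, 7, 1, 9, 30)))
```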
· Frequency of backup will be determined by SLA for each subject area database.
· Database administration will provide input for LAN Server capacity planning.
· Database Integrity
A Database Maintenance schedule will be planned for each quarter and reviewed monthly.
· Risk Management
General Discussion
Physical database design is discussed in a wide body of literature and practice that will not be
reiterated here. However, one critical issue that needs to be discussed is strategies and approaches for
distributing data. Distributing database(s) has the same goals: gain locality of reference, take
advantage of available processing power, improve scalability, and improve availability.
One of the most difficult issues for database designers attempting to distribute database(s) is
autonomy, or the distribution of control of the data. This is the degree to which a single site can run
independently of other sites. The degree of autonomy in a data distribution strategy can exist at any
point along a continuum from complete coupling (total dependence) to total isolation (total
independence).
In a system with complete coupling, it is easier to present the user with the illusion of a single
integrated database. In this environment, each location in the distributed database has complete
knowledge of the state of the system and must have components that can control actions on data
spanning multiple locations. This complete knowledge is very difficult to attain and manage across
large systems. Additionally, every location must continue to have complete knowledge about other
locations, which makes ongoing changes and maintenance difficult (regression effects abound).
In a system with total isolation, each location operates with a standalone database and is unaware of
the other locations in the system. To distribute data in this model, a distribution strategy must be
layered on top of the local site, and all co-ordination and data transformations must be handled by this
layer. Note that transaction control across multiple sites in this model is extremely difficult.
The best approach lies somewhere in between complete coupling and total isolation. Based on the
software component distribution (the software components should already be as loosely coupled and
strongly cohesive as possible), it should be possible to look at sites independently and have them
voluntarily participate in areas where data needs to be shared. The independent view of the local sites
requires some modification to incorporate a knowledge level about the site(s) of distributed data, but
they only need knowledge about a limited subset rather than knowledge of all the data and
relationships in the enterprise. This information should already be captured in the transaction design
for the software components. This validates the earlier statement that if software components are
distributed, the logical sites for the data stores should follow.
So, what is to be distributed is known, but what is the best way to distribute it? Distribution must be
examined based on the characteristics of the data to be distributed. The key characteristics for this
decision are ease of partitioning, volatility, and site. Ease of partitioning refers to how easily the
database can be partitioned - vertically or horizontally. Volatility refers to how often the data itself
changes. Site refers to where the updating takes place (and again should be able to be determined
from examining the software components that use the data and how they are distributed). The
following sections summarize the basic strategies for distribution that can be applied based on these
characteristics:
Centralized
With centralized data, data is stored in a database at a central location. The database is called the
central database. “Centralized data” may mean centralized data for a whole corporation, but it might
be centralized data for a division, branch or department (or any other organizational unit or grouping).
Centralized means in one place. Centralized does not necessarily mean mainframe (although
centralized data could be implemented on a mainframe, or a super-server, or a PC Server). Along
with centralized location, this strategy also means that no copies of the data are made. When the data
cannot be partitioned, look at a centralized strategy. A centralized strategy can also be used where
there is medium to high volatility and multi-site update requirements to the same data.
[Figure: Centralized strategy - all updates and retrievals run against a single central database.
Partitioned strategy - each partition database receives the updates and retrievals for its own
partition.]
Replica
When the site of update changes to a local site, a different set of strategies is needed. Updates cannot
be allowed on both the local and central databases. In this situation, use a replica. A replica is a
partition or copy of a central database that may be updated. If there can be some delay in
synchronization of the local and central databases, use a periodic replica, where incremental images
of changes are processed in batch mode and sent back to the central database as a group. If
synchronization is critical, use a continuous replica. This is a replica that synchronizes continuously
with the central database. As each change occurs, it is replicated back to the central database as an
individual transaction. This is not a two-phase commit: the user does not have to wait for the
duplicate portion of the update. A continuous replica is often implemented with some type of store-and-forward
mechanism. A checkout replica is useful when the location of an update cannot be predetermined.
This is a replica that is flagged in the central database as checked out; no one else can update it, but
others could still view it (or extract it).
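The periodic and checkout replica styles can be sketched as follows. The change-log and flag structures are hypothetical; in practice the DBMS's own replication facilities would implement them.

```python
# Central database as a simple key-value table (illustrative only).
central = {"BU-101": "active", "AS-007": "suspended"}

def periodic_replica_sync(central_db, change_log):
    """Periodic replica: incremental changes accumulated at the
    local site are applied to the central database as one batch."""
    for key, value in change_log:
        central_db[key] = value
    change_log.clear()
    return central_db

# The local site batches its changes, then synchronizes as a group.
log = [("BU-101", "suspended"), ("BU-102", "active")]
periodic_replica_sync(central, log)
print(central)

def checkout(central_flags, key, user):
    """Checkout replica: flag the record as checked out in the
    central database so no one else can update it; others may
    still view or extract it."""
    if central_flags.get(key):
        raise RuntimeError(f"{key} already checked out by {central_flags[key]}")
    central_flags[key] = user

flags = {}
checkout(flags, "BU-101", "analyst1")
print(flags)
```

A continuous replica would differ only in sending each change as it occurs (store-and-forward) rather than accumulating a batch.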
[Figure: Replica strategy - a copy or partition of the central database receives local updates and
retrievals and is synchronized with the central database by periodic or continuous, single- or
bi-directional replication.]
· What are the trade-offs between the overhead of copying and storing multiple copies of data
versus remote data access?
· Is there a large enough time window to allow effective data extracts or replications?
· Should all distributed databases have the same schema to facilitate replication or will
extensions and optimizations be allowed?
· Can the network handle the traffic size and volume of the anticipated replication strategy?
· Can the distributed data be effectively managed (security and disaster recovery)?
A final and important consideration is the enterprise’s readiness to design, implement and manage a
distributed data environment. For an enterprise migrating from a centralized mainframe database
environment to a client-server environment, there are many other issues, over and above data
distribution, to cope with. From a risk management perspective, implementing a distributed data
policy may be one risk too many. The need for distributed data needs to be prioritized and balanced
against other needs. In an enterprise that relies primarily on purchased software, the degree of control
the enterprise has over data distribution strategy may be limited and it is likely that each package’s
database will be centralized with respect to the application package.
An automated collection mechanism is required to collect the data needed in the data warehouse. This
data may be exact copies of operational data, in which case replication technology can be used to
populate the data warehouse.
Typically, the content of a data warehouse is aggregated data and aggregation procedures can be
defined on operational databases to either populate a data warehouse directly, or to build an extract
database and replicate that to the data warehouse, again using replication technology.
Where the source database and the data warehouse are disparate, either middleware that can access a
variety of database types or export/import procedures can be used.
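The aggregation route can be sketched with an embedded database standing in for both the operational database and the warehouse. The schema, table names and figures below are invented for illustration.

```python
import sqlite3

# One in-memory database stands in for both the operational
# database and the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE production (well TEXT, month TEXT, barrels INTEGER);
    CREATE TABLE warehouse_production (month TEXT, total_barrels INTEGER);
    INSERT INTO production VALUES
        ('BU-101', '1996-06', 1200),
        ('BU-102', '1996-06', 800),
        ('BU-101', '1996-07', 1100);
""")

# Aggregation procedure defined on the operational data,
# populating the warehouse table directly.
conn.execute("""
    INSERT INTO warehouse_production
    SELECT month, SUM(barrels) FROM production GROUP BY month
""")

rows = conn.execute(
    "SELECT * FROM warehouse_production ORDER BY month").fetchall()
print(rows)
```

The same SELECT could instead populate an extract database, which replication technology then copies to the warehouse, as described above.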
Having a mass of data stored in a data warehouse is not very useful unless the data warehouse has
features that let the user display and navigate its contents.
For the more sophisticated user, who has a knowledge of the data warehouse structure and content, a
data access and reporting tool is needed to provide the facility to discover information in the data
warehouse.
[Figure: A data collection mechanism feeds the data warehouse (MIS database) from operational
databases such as the Human Resources database, the Supply & Maintenance database, and other
business unit databases.]