
BI/EDW Reference Architecture & Cookbook

Description: This document helps the reader understand the EDW reference architecture, its components, lifecycle, key challenges and mitigations.

Author: Nitin Ambare


Reviewed by: Rahul Mehta
Keywords: EDW, Cookbook, Reference Architecture
Version 1.0

This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
Agenda

• Enterprise Data Warehouse Reference Architecture

• EDW Components

• Data Warehousing Lifecycle

• Reference Project Plan

• Governance & Team Structure

• EDW: Key Challenges & Mitigation

• Appendix

Enterprise Data Warehouse Reference Architecture
[Figure: EDW reference architecture. Columns, left to right: Healthcare Data Sources (HL7/CCDA/QRDA, EMR, RIS, Finance, Other); Data Extraction (HL7/CCDA/QRDA Parser, EMR Extract, RIS, Finance Data Extracts, Other); Transformation (Staging, ODS; 1. Cleansing, 2. Quality Mgmt., 3. Reconciliation, 4. Standardization); Enterprise DW (EDW, Master Data; Clinical, Operations, Finance, RCM); Business Intelligence & Reporting (Dashboards, Reporting). Spanning all columns: Metadata Management (8), Auditing & Logging (9), Data Security (10).]
EDW Components - Data Sources & Extraction
Type of Sources
Sources of data for the EDW are internal or external, and are often a combination of clinical and financial data. A few examples of data sources: EMRs, HL7 / CCDA / QRDA, RIS, financial sources.

Methodologies to extract data from Sources
Data is usually extracted from the sources via the following methods:
• Direct access to the source system
   - EMR / EHR / Claims Systems / EMPI
• Indirect access to the source system
   - HL7 / CCDA / QRDA file

Key Considerations
The following areas should be considered in the design:
• Data volume per hour/day/month and its monthly growth in %
• Data type: structured, semi-structured or unstructured
• Latency of data: real time versus batch
• Complexity of source access
• Incremental updated records per hour/day
• Data security requirements, if any
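The volume and growth considerations above can be turned into a rough sizing estimate. The sketch below assumes compound monthly growth; the function name and the sample figures are illustrative assumptions, not measurements from any real source system.

```python
def projected_volume_gb(current_gb: float, monthly_growth_pct: float, months: int) -> float:
    """Project data volume, assuming the same percentage growth every month."""
    return current_gb * (1 + monthly_growth_pct / 100) ** months


if __name__ == "__main__":
    # Hypothetical source: 500 GB today, growing 2% per month, planned over 36 months.
    print(round(projected_volume_gb(500, 2, 36), 1))
```

A flat growth rate is a simplification; real sources often grow in steps, for example when a new facility is onboarded.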

EDW Components - Staging Layer (1/2)
What is Staging Layer?
The staging layer is the place where data from the various source systems is loaded. The data is subsequently extracted, transformed and loaded into either the EDW or the ODS.

Why is it required?
• Data needs to be cleansed and validated before it is loaded into the DWH
• Avoids a negative impact on source database performance, since the ETL process takes a long time to process the data
• Helps troubleshoot reporting discrepancies using the source data
• Covers cases where there is no direct access to the source system and source data is instead available as CSV and Excel files
• Enables efficient loading into the DWH: source systems have different allotted times for data extraction, and partial loading of the DWH is not possible
• Supports business rules in ETL jobs that involve more than one source system

EDW Components - Staging Layer (2/2)
Why is it required?
• Creates a copy of the source data that can then be extracted, transformed and loaded to the ODS (or EDW) without impact on the source systems
• Makes a failed ETL job restartable at any time with very little extraction impact on the source system
• Serves as historical data storage when the source system retains data only for a limited time period
• Supports near real-time reporting without hitting the source system

Design Considerations / Best Practices
• Use an incremental load strategy, since this helps keep the data warehouse robust and complete
• Staging tables should have only the columns needed by the ETL load
• Store only the required ETL log information
• Purge ETL log information as per the reporting requirement
• Archive data from huge tables (e.g. data 3 years old and older) into a separate archive table in a different database
• Export staging data into a flat file, if needed
• Build a separate database for staging
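One common way to implement the incremental load strategy mentioned above is a high-water mark on a last-modified column. The following is a minimal sketch using Python's built-in sqlite3 module; the table names (patient, stg_patient) and the modified_at column are hypothetical, and it assumes timestamps are stored as ISO-8601 text so that string comparison matches time order.

```python
import sqlite3


def incremental_extract(source: sqlite3.Connection, staging: sqlite3.Connection) -> int:
    """Copy only rows modified since the last load; return the number of rows copied."""
    staging.execute(
        "CREATE TABLE IF NOT EXISTS stg_patient "
        "(patient_id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)"
    )
    # The watermark is the latest modification time already present in staging.
    (watermark,) = staging.execute(
        "SELECT COALESCE(MAX(modified_at), '') FROM stg_patient"
    ).fetchone()
    rows = source.execute(
        "SELECT patient_id, name, modified_at FROM patient WHERE modified_at > ?",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE makes a re-run after a failure safe: rows are not duplicated.
    staging.executemany(
        "INSERT OR REPLACE INTO stg_patient (patient_id, name, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    staging.commit()
    return len(rows)
```

Note that a strict "greater than" comparison can miss rows that share the exact watermark timestamp but arrive later; a production design would need to handle that case, for example by overlapping the extraction window.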

EDW Components - Extract, Transform & Load (1/2)
What is ETL?
ETL is short for extract, transform, load: three database functions combined into one tool that extracts data from one database, transforms it if needed, and finally loads it into a target database.

Design Considerations & Best Practices
Initial Load / Full Load: A full load completely deletes the existing data and reloads it from scratch. A lot of unchanged data is also deleted and reloaded in this process, but a destructive load is the easiest way to ensure the highest data integrity.

Best Practices:
• Avoid indexes on the target table during the load: do not keep indexes in place while performing a full load on tables. Create the indexes after the full load is complete
• Achieve parallelism: try to run the ETL loads for a full load in parallel, as the time taken to load each table will be significantly higher
• Alternatively, use BCP (bulk copy) or another bulk load strategy for data loading
• Use Change Data Capture (CDC) to capture incremental changes

Incremental Load: Incremental data load is the most common method of loading data into the DWH. Incremental load jobs are scheduled on a recurring basis; their frequency can range from real time to daily, weekly or monthly.
Tip: Use separate packages for the initial load and for the incremental load
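The full-load practice of loading without indexes and rebuilding them afterwards can be sketched as follows. This is an illustration with Python's built-in sqlite3 module, not a vendor-specific bulk loader such as BCP; the table and index names are hypothetical.

```python
import sqlite3


def full_load(conn: sqlite3.Connection, rows: list[tuple[int, str]]) -> None:
    """Destructive full load: empty the target, insert in bulk, then rebuild the index."""
    conn.execute("CREATE TABLE IF NOT EXISTS dim_provider (provider_id INTEGER, name TEXT)")
    # Drop the index first so it is not maintained row by row during the load.
    conn.execute("DROP INDEX IF EXISTS ix_dim_provider_id")
    # Remove all existing data; unchanged rows are reloaded as well.
    conn.execute("DELETE FROM dim_provider")
    # Batched insert in a single transaction.
    conn.executemany("INSERT INTO dim_provider VALUES (?, ?)", rows)
    # Recreate the index once, after all rows are in place.
    conn.execute("CREATE INDEX ix_dim_provider_id ON dim_provider (provider_id)")
    conn.commit()
```

Keeping this in its own function, separate from any incremental routine, follows the tip about using separate packages for initial and incremental loads.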

EDW Components - Extract, Transform & Load (2/2)
Design Considerations & Best Practices
Best Practices:
• Plan the load sequence carefully: load dimension tables first, then the dependent fact tables
• Partition large fact tables: partitioning huge tables (e.g. 50 GB) can improve performance. Inserting, updating and deleting records will be quicker, and joins on partitioned tables will also return results quicker
• Override the lookup SQL with an additional filter condition when an incremental load is performed on a huge fact table with lookup SQL
• Parallel execution: prefer parallel execution to reduce the time taken for load completion. Parallelism can be achieved by arranging ETL loads so that more than one independent table is loaded at the same time
• Checksum implementation is an option for determining changes during the incremental load of a table that has no last-modified-date column
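The checksum option mentioned in the last bullet can be illustrated with a short sketch. It assumes the checksum of each previously loaded row is stored keyed by the business key; the function names and the choice of SHA-256 are illustrative assumptions, not a prescribed implementation.

```python
import hashlib


def row_checksum(row: dict, columns: list[str]) -> str:
    """Hash the tracked columns of a row in a fixed order."""
    # None is rendered as an empty string; values containing '|' could collide,
    # so a real implementation may need a safer encoding.
    payload = "|".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(source_rows, stored_checksums, key, columns):
    """Split source rows into inserts and updates by comparing checksums."""
    inserts, updates = [], []
    for row in source_rows:
        checksum = row_checksum(row, columns)
        previous = stored_checksums.get(row[key])
        if previous is None:
            inserts.append(row)   # key not seen before
        elif previous != checksum:
            updates.append(row)   # key known, content changed
    return inserts, updates
```

Rows whose checksum matches the stored value are skipped, which is what allows an incremental load without a last-modified-date column.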

EDW Components - ODS Layer (1/2)
What is ODS?
The ODS (Operational Data Store) is a database used to bring multiple sources together into a single normalized, cleansed and integrated database to support operational reporting of granular data on a real-time or near real-time basis.
It differs from a standard Enterprise Data Warehouse because an ODS typically does not support the extent of history, volumes, complexity of data or cross-functional analysis and reporting that an Enterprise Data Warehouse supports.
An ODS (when present in an organization) is often a key source for the Data Warehouse.

What should ODS be used / not used for?
ODS is used for:
• Reporting on the lowest-granularity data for business operations
• Supporting real-time or near real-time reporting requirements
ODS should not be used for:
• Complex queries against summary-level or aggregated data
• Historical and trend analysis reporting, usually on a large volume of data
• Full integration and transformation, as data warehouse tables exist for that purpose

EDW Components - ODS Layer (2/2)
Design Considerations / Best Practices
• ODS data model: The data model design needs to be based on the expectations from the ODS. A typical ODS design is a hybrid between OLAP and OLTP
• Change Data Capture: Change Data Capture (CDC) is an approach to data integration based on the identification, capture and delivery of the changes made to data sources
• Data Lineage: With multiple sources, targets, and reporting & consolidation systems with embedded business rules, it is important to manage the data lifecycle to avoid data integrity issues. Data lineage is a methodology and approach for enabling more effective use of business information, and it brings data lifecycle transparency to all stakeholders
• Data Security: Security needs to be implemented at the application / visualization layer, the data layer, or a combination of the two, in collaboration with the organization's security policies and functions
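To make the Change Data Capture idea above concrete, the sketch below applies an ordered list of captured change events to a target table held in memory. The event layout (op, key, row) is an assumption made for illustration; real CDC tools each define their own change record format.

```python
def apply_changes(target: dict, changes: list[dict]) -> dict:
    """Apply ordered change events to a keyed in-memory target table."""
    for change in changes:
        key, op = change["key"], change["op"]
        if op in ("insert", "update"):
            # Both operations leave the target holding the latest row image.
            target[key] = change["row"]
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target
```

Only the changed rows travel from source to target, which is what keeps CDC-based ODS refreshes close to real time.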

EDW Components - Data Warehouse (EDW) (1/2)
What is Data Warehouse?
An Enterprise Data Warehouse (EDW) is a relational data repository that aggregates data from operational systems and is used for analytical reporting to provide a 360° view across the enterprise. Building an EDW is an iterative process; the EDW can be used to measure the effectiveness of business initiatives and to provide evidence to support business decisions.
The EDW is not used for operational reporting.
In recent times, big data systems have started to co-exist with the EDW: the big data system handles storage and processing of complex and unstructured data, and the EDW consolidates aggregated data.

Architecture options
• The top-down approach recommends building an EDW first and then deriving data marts
• The bottom-up approach recommends building data marts from operational systems and then consolidating the data marts to build the EDW
• The EDW landscape comprises staging (temporary storage for operational data), the ODS (for operational reporting) and data marts (aggregated subsets of EDW data)
• Reporting needs drive the decision to include or exclude the ODS and data marts in the architecture
• An atomic data model normalized up to third normal form is considered suitable for the EDW
• A dimensional data model is suitable for data marts, where reporting happens and fewer joins are required
• The processes involved are extraction, transformation & loading (ETL) of operational data into the EDW

EDW Components - Data Warehouse (EDW) (2/2)
Architecture options
• The ETL approach is followed with a Massively Parallel Processing (MPP) database, where multi-node processing is leveraged

Design Considerations / Best Practices
Data Modelling:
• Conceptual data model to design EDW subject areas and entities
• Logical data model to add entity attributes and define entity relationships
• Physical data model to implement in a relational database with keys and constraints
• Consider historical "point in time" reporting when building slowly changing dimensions (SCD) and snapshot tables for KPIs
Data Governance:
• Set up a governing council for proactive data governance
• Define data quality rules, and set alerts and notifications to publish ongoing data quality
• Build automated reconciliation of master and transactional entities
• Set up control reporting to proactively manage data loading issues such as performance and count mismatches
Database:
• Set up a process to archive EDW data and maintain an optimum data volume
• Leverage table partitioning for transaction/fact tables
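The "point in time" reporting consideration under Data Modelling is commonly met with a Type 2 slowly changing dimension, where each change closes the current row and adds a new version. The sketch below is one possible in-memory illustration; the column names (valid_from, valid_to), the open-ended date and the function names are assumptions.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # conventional "still current" end date (assumption)


def apply_scd2(dimension: list[dict], key: str, incoming: dict, as_of: date) -> None:
    """Expire the current version of a changed member and append a new version."""
    current = next(
        (r for r in dimension if r[key] == incoming[key] and r["valid_to"] == OPEN_END),
        None,
    )
    if current is not None:
        tracked = {k: v for k, v in current.items() if k not in ("valid_from", "valid_to")}
        if tracked == incoming:
            return  # nothing changed, keep the current version
        current["valid_to"] = as_of  # close the old version
    dimension.append({**incoming, "valid_from": as_of, "valid_to": OPEN_END})


def version_at(dimension: list[dict], key: str, value, when: date):
    """Point-in-time lookup: return the version valid on the given date, if any."""
    for r in dimension:
        if r[key] == value and r["valid_from"] <= when < r["valid_to"]:
            return r
    return None
```

Validity intervals here are half-open (valid_from inclusive, valid_to exclusive), so exactly one version matches any given date.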

EDW Components - Data Mart (1/2)
What is Data Mart?
• Granular-level data for a specific business/subject area, presented in a dimensional data model for end-user queries.
• It typically represents a subset of the Data Warehouse.
• There are multiple schools of thought on how data marts can be leveraged for enterprise, cross-functional reporting and analysis:
   - Subject-area data marts are derived from a normalized Enterprise Data Warehouse (as represented in the conceptual architecture diagram under "Enterprise Data Warehouse Reference Architecture")
   - A series of data marts representing different subject areas is created directly by integrating data from multiple source systems. Multiple data marts 'connected' via 'conformed' dimensions provide enterprise, cross-functional reporting and analysis capabilities.

Why is it required?
• Lower implementation cost (when you create data marts directly instead of a data warehouse)
• Contains only business-essential data, and is less cluttered
• Easy access to frequently needed data for a business line
• Improves end-user response time

EDW Components - Data Mart (2/2)
Why is it required?
• Provides data in a form that matches the collective view of a group of users
• Potential users are more clearly defined than in an enterprise data warehouse
• Demonstrates the viability and ROI (return on investment) potential of an application prior to migrating it to the Enterprise Data Warehouse.

Design Considerations / Best Practices
• Keep only the required data in the data mart, built as per specific business use cases
• Set up an archival process so that fact tables keep only the required data
• Base the data model on the reporting requirements rather than on the EDW data model
• Use a mostly de-normalized data model so that reports need fewer joins
• Use star schema and snowflake models to balance reporting performance and storage
• Keep dimension tables to an optimum number of dimension members to avoid performance issues
• Resolve many-to-many relationships using bridge tables
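The bridge-table practice in the last bullet can be sketched with a small example in which one visit can carry several diagnoses and one diagnosis can appear on several visits. The schema and sample rows below are hypothetical and use Python's built-in sqlite3 module.

```python
import sqlite3


def build_mart() -> sqlite3.Connection:
    """Create a tiny in-memory mart with a bridge between visits and diagnoses."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fact_visit (visit_id INTEGER PRIMARY KEY, charge REAL);
        CREATE TABLE dim_diagnosis (diagnosis_id INTEGER PRIMARY KEY, code TEXT);
        -- Bridge table: one row per (visit, diagnosis) pair.
        CREATE TABLE bridge_visit_diagnosis (
            visit_id INTEGER REFERENCES fact_visit (visit_id),
            diagnosis_id INTEGER REFERENCES dim_diagnosis (diagnosis_id),
            PRIMARY KEY (visit_id, diagnosis_id)
        );
        INSERT INTO fact_visit VALUES (1, 100.0), (2, 250.0);
        INSERT INTO dim_diagnosis VALUES (10, 'E11.9'), (20, 'I10');
        INSERT INTO bridge_visit_diagnosis VALUES (1, 10), (1, 20), (2, 20);
    """)
    return conn


def visits_per_diagnosis(conn: sqlite3.Connection) -> list:
    """Count visits for each diagnosis code through the bridge table."""
    return conn.execute("""
        SELECT d.code, COUNT(*) AS visits
        FROM bridge_visit_diagnosis b
        JOIN dim_diagnosis d ON d.diagnosis_id = b.diagnosis_id
        GROUP BY d.code
        ORDER BY d.code
    """).fetchall()
```

One design caveat: summing a fact measure such as charge through the bridge counts a visit once per diagnosis, so additive measures usually need a weighting factor or a distinct-visit calculation.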

EDW Components - BI Reporting (1/2)
What is BI Reporting?
• Analytical reports provide valuable business information and help discover business insights. A dashboard is the preferred way to showcase trends and distributions. BI reporting provides drill-down/drill-through from aggregated data to the underlying granular data.
• Self-service BI empowers business users to create and modify dashboards and reports with minimal help from IT.

Why is it required?
• To measure the performance of business initiatives
• Support business decisions by analyzing historical information
• Support compliance reporting, e.g. Meaningful Use reporting
• Performance measurement, e.g. year-on-year, year-to-date
• Provide an executive summary
• Identify business insights, trends and distributions
• Collaborate on information with business users
• Quick access to enterprise-wide historical data

EDW Components - BI Reporting (2/2)
Design Considerations / Best Practices
Best Practices:
• Design the semantic layer with the required aggregations and hierarchies in dimensions
• Incorporate User Experience (UX) design before developing a report, in order to select appropriate visualizations
• Identify dimension data-level security and single sign-on requirements
• Reporting tools should leverage LDAP and Kerberos security
• Allocate sufficient memory for reporting; self-service BI is more memory intensive than canned reports
• Allow super users to control access rights and users
• Leverage the database for pre-aggregation and avoid calculation while rendering the report
• For information-intensive reports, provide a summarized report first, with the ability to drill down to the granular level
• Provide standard export formats, e.g. Excel, PDF, CSV
• Manage reporting operations with a dashboard about overall reporting: top 10 longest-running reports, top 20 most-used reports, top 10 active users
• Provide a breakdown of timing to identify where a reporting issue lies, e.g. at database level, rendering in browser, or calculation at report level
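The pre-aggregation practice above (calculate in the database, not while rendering) can be sketched as a summary table that is rebuilt during ETL and then read directly by reports. The table and column names are hypothetical, and sqlite3 stands in for the reporting database.

```python
import sqlite3


def refresh_monthly_summary(conn: sqlite3.Connection) -> None:
    """Rebuild a pre-aggregated table so reports avoid runtime calculation."""
    conn.executescript("""
        DROP TABLE IF EXISTS agg_claims_monthly;
        CREATE TABLE agg_claims_monthly AS
        SELECT substr(claim_date, 1, 7) AS claim_month,   -- 'YYYY-MM'
               COUNT(*)                 AS claim_count,
               SUM(amount)              AS total_amount
        FROM fact_claim
        GROUP BY substr(claim_date, 1, 7);
    """)
```

A report then selects from agg_claims_monthly, and drill-down to the granular level queries fact_claim only for the month the user opens.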

EDW Components - Master Data Management (1/2)
What is MDM?
Master Data Management (MDM) enables business users to consolidate master data, federate it and propagate it to other systems. It is used to resolve multiple versions, standardize the data and implement rules to correct it.
In addition, MDM provides basic reporting and workflow management to support collaboration.
There are three MDM architectural patterns:
• Virtual implementation: MDM stores references to master data
• Physical implementation: MDM stores, authors and alters master data
• Hybrid implementation: combines the advantages of virtual and physical implementations

How is it different in the context of a DW?
The EDW leverages well-governed master data, which minimizes data quality issues. It allows historical changes to be tracked and slowly changing dimensions to be built.

EDW Components - Master Data Management (2/2)
Design Considerations / Best Practices
• Perform data profiling before implementing MDM; it provides an overall assessment of source data quality
• MDM should perform minimal data quality processing; data quality should instead be handled in a DQ application
• MDM should focus on matching and merging of master data
• Keep master entities to an optimum data volume, since large volumes may cause performance issues

MDM is out of scope for the EDW cookbook.


EDW Components - Meta Data Repository (1/2)
What is Meta Data Repository?
A repository of data about data: information about the data stored in the enterprise data warehouse.

Why is it required?
It provides the following information about the data:
• Data definitions
• The origin of data
• The structure of data
• Rules for the selection and transfer of data
• Qualitative and quantitative data about data
Additionally, it supports the information needs of:
• System developers
• Data administrators
• System administrators
• Users
• Applications on the data warehouse

Key Metadata Components
• Technical description such as file format, field names and meanings, methodology, and instructions for proper use
• Source/authorship of data
• Contact information for questions

EDW Components - Meta Data Repository (2/2)
Categories of Metadata
• Business metadata: contains data ownership information, business definitions and changing policies
   - Users: primarily end users, managers, business analysts, power users, regular users, casual users, senior managers, etc.
• Technical/administration metadata: includes all information necessary for setting up and using a DW, e.g. information about source databases, DW schemas, dimensions, hierarchies, predefined queries, physical organisation, rules and scripts for extraction, transformation and load, and back-end and front-end tools
   - Users: project manager, DW administrator, metadata manager, DB admin, business analyst, data quality analyst
• Operational metadata: information collected during the operation of the DW, e.g. usage statistics, error reports, currency of data and data lineage. Currency of data indicates whether the data is active, archived or purged. Lineage of data is the history of the data's migration and the transformations applied to it

EDW Implementation – Lifecycle (1/6)

Phases: Requirement Gathering & Analysis → Design & Planning → Development → Testing → Deployment → Support
Phase: Requirement Gathering & Analysis

Key Activities
• Gather business use cases for enterprise reporting / analytics
• Meet with business groups to understand reporting / analytics requirements
• Understand difficulties with the current structure: cleansing, transformation needs, modification effort
• Understand compliance needs for data availability
• Analyze non-functional requirements: performance, security
• Analyze required subject areas
• Understand existing source systems and map subject areas
• Study existing reports and collect data elements

Entry Criteria
• Project kick-off
• Key business users identified
• Onboarding of business analysts & DW architect

Exit Criteria
• Requirement sign-off
• Project milestones identified

Deliverables
• Document with functional requirements and business use cases
• Document with security & performance requirements

EDW Implementation – Lifecycle (2/6)
Phase: Design & Planning

Key Activities
• Detailed source analysis: perform data profiling
• High-level design
   - Database data model
   - ETL
   - Reporting
• Define overall BI solution architecture
• Design access & security mechanism
• Create data model / enhance existing data model
• Demonstrate solution architecture to architecture team
• Perform capacity planning
• Technology & tools selection: database, ETL & reporting
• Deployment strategy: cloud, on premise, etc.

Entry Criteria
• Requirement sign-off
• Development & QA teams onboarding

Exit Criteria
• Baseline design
• Development & QA environments
• De-identified data for development and QA

Deliverables
• Architecture document
• High Level Design Document
   - Database
   - ETL
   - Reporting
   - BI
• Source to target mapping document
• Project plan
• Coding guidelines
• Code review checklist
• QA strategy document

The design process is explained in detail in the Appendix. Please refer to Appendix A for details.
EDW Implementation – Lifecycle (3/6)
Phase: Development

Key Activities
• Develop application prototype
   - Report wireframes: UX design
   - Proof of concept
• Construct EDW components: database, ETL & reporting
• Perform unit testing and capture unit test results
• Perform code reviews: self & peer reviews

Entry Criteria
• Architecture document
• Requirement document
• Project plan
• Teams onboarded

Exit Criteria
• Fully functional OLAP application
• QA environment setup
• Staging environment with production-size volume

Deliverables
• Low-level design document
• Source code
   - ETL with auditing, logging, error handling
   - Physical data model
   - Reports
• Unit-tested artefacts & evidence
• Release readiness checklist
• Release & deployment document

EDW Implementation – Lifecycle (4/6)
Phase: Testing

Key Activities
• Deploy in QA environment
• ETL testing
   - End-to-end data
   - Transformation
   - Data quality rules
   - Integration testing
   - Performance testing
   - Security testing
   - Ongoing regression testing
• Report testing
   - Report content
   - Security
   - Performance
• Deploy ETL & reports in staging environment
   - Performance testing
   - Security testing
   - User acceptance testing

Entry Criteria
• Development complete
• QA & staging environments set up

Exit Criteria
• Performance benchmarking
• UAT sign-off
• Production environment set up

Deliverables
• QA sign-off document
• Requirement traceability matrix
• Test execution reports
• Performance testing execution report
• Security testing execution report
• Go-live plan
EDW Implementation – Lifecycle (5/6)
Phase: Deployment

Key Activities
• Production deployment
   - Database
   - ETL
   - Reports
• Historical data load
• Setting up incremental data load
• Setting up service accounts for ETL and reports
• Setting up user access
   - Database
   - Reports
• Setting up super users and administrators
• Setting up alerts & notifications
• Production sanity testing

Entry Criteria
• UAT sign-off

Exit Criteria / Deliverables
• Historical data loading
• Configured incremental data loading

EDW Implementation – Lifecycle (6/6)
Phase: Support

Key Activities
• Prepare user manual for support work
• Define SLA metrics for support work
• Onboard support team
• Define change management process
• Define escalation matrix
• Put a governance council in place to govern data: data quality, data availability, data security
• Define Level 1, 2 & 3 support
   - Level 1: break-fix
   - Level 2: adding preventive measures
   - Level 3: provide support on business issues

Entry Criteria
• Functional production environment

Exit Criteria / Deliverables
• Weekly project SLA metrics

Reference Project Plan for EDW Implementation (1/5)
# | Prerequisite | Task | Owner | Weeks
1 | - | Staffing Business Analyst & Data Architect | Project Lead | 4

Requirement Gathering & Analysis
2 | 1 | Requirement analysis - functional & non-functional | Business Analyst | 4
2.1 |  | Review current reporting & reporting databases |  | 1
2.2 |  | Interview business users and collect limitations of existing applications |  | 1
2.3 |  | Collect input for future state of BI application |  | 1
2.4 |  | Prepare requirement analysis document |  | 1
3 | 2 | Staffing of DW architect, development & QA teams | Project Lead | 4

Design & Planning
4 | 2 | Data profiling – source data quality assessment | Data Architect | 3
4.1 |  | Profile identified source systems |  | 1
4.2 |  | List down possible fields, keys |  | 1
4.3 |  | Prepare reports with cleansing & data quality rules |  | 1
5 | 2 | Data modelling – Staging database | Data Architect | 1
Reference Project Plan for EDW Implementation (2/5)
# | Prerequisite | Task | Owner | Weeks
6 | 4 | Data Modelling – logical and physical EDW, ODS | Data Architect | 4
6.1 |  | Create logical data model of EDW |  | 1
6.2 |  | Create physical data model of EDW |  | 2
6.3 |  | Create ODS data model |  | 1
7 | 6 | Data Modelling – Data mart | Data Architect | 3
7.1 |  | Identify dimensions |  | 1
7.2 |  | Identify measures & KPIs |  | 1
7.3 |  | Create data mart model with relations |  | 1
8 | 6 | ETL Design – Auditing & logging, error handling | ETL Architect | 3
8.1 |  | Design ETL auditing & logging |  | 1
8.2 |  | Design ETL error handling |  | 1
8.3 |  | Create model ETL to be replicated |  | 1

Development
9 | 8 | ETL development – create metadata | ETL Architect | 1
10 | 9 | ETL development – source to staging & ODS | ETL Architect | 2

Reference Project Plan for EDW Implementation (3/5)
# | Prerequisite | Task | Owner | Weeks
10.1 |  | ETL Source to Staging |  | 1
10.2 |  | ETL Source to ODS with KPIs |  | 1
11 | 10 | ETL development – cleansing & data quality | ETL Architect | 2
11.1 |  | Create data cleansing & DQ metadata |  | 1
11.2 |  | Create ETL with cleansing and DQ checks |  | 1
12 | 10 | ETL development – transformation | ETL Architect | 2
13 | 9 | ETL development – reconciliation | ETL Architect | 2
14 | 13 | ETL – control reporting | ETL Architect | 2
15 | 12 | ETL – EDW to Data Mart | BI Architect | 3
15.1 |  | ETL for dimension tables with SCD |  | 1
15.2 |  | ETL for fact tables with measures |  | 2

Testing
16 | 15 | QA ETL – Source to data mart | QA Lead | 3
16.1 |  | QA source to staging |  | 1

Reference Project Plan for EDW Implementation (4/5)
# | Prerequisite | Task | Owner | Weeks
16.2 |  | QA Staging to EDW |  | 1
16.3 |  | QA EDW to Data mart |  | 1
17 | 12 | Report Development – create wireframes | BI Architect | 3
18 | 17 | Report Development – create MOLAP database | BI Architect | 2
19 | 18 | Report Development – canned reports, self-service BI | BI Architect | 6
20 | 18 | Report Development – configure security | BI Architect | 1
21 | 20 | QA Reporting | QA Lead | 3

Deployment
22 | 21 | Staging deployment | ETL Architect, BI Architect | 1
23 | 22 | Staging – historical data load | ETL Architect | 2
24 | 23 | Staging – incremental data load | ETL Architect | 1
25 | 23 | QA – performance | QA Lead | 2
26 | 23 | QA – security | QA Lead | 2
27 | 24 | UAT | Business Analyst | 4

Reference Project Plan for EDW Implementation (5/5)
# | Prerequisite | Task | Owner | Weeks

Support
28 | 27 | Production Go Live | ETL Architect, BI Architect | 2
29 | 28 | Share project support metrics | Project Lead | 2
30 | 28 | Training IT staff to support ETL & reporting | Project Lead | 2
31 | 28 | Share ETL & reporting support user manual | Project Lead | 2
EDW Implementation: Project Governance Process
Area: Daily Scrum call (15-30 min, daily)
• Coverage: daily Scrum updates (plan / activity review)
• CT participants: Project Team
• Client participants: Project Management

Area: Project Dashboard (weekly)
• Coverage: weekly dashboard, task completion, deliverable quality, issue management and resolution
• CT participants: Project Management (mandatory), Program Management (optional)
• Client participants: Project Management (mandatory), Program Management (optional)

Area: Project Review (monthly)
• Coverage: monthly progress; plan vs. actual reports (resources, schedule); scope changes
• CT participants: Program Management (mandatory), Executive Sponsors (optional)
• Client participants: Program Management (mandatory), Executive Sponsors (optional)

Area: Engagement Review (quarterly)
• Coverage: relationship review, program updates, feedback
• CT participants: Program Management and Executive Sponsors (mandatory)
• Client participants: Program Management and Executive Sponsors (mandatory)

Communication Tools
• Common knowledge repository for process assets, guidelines and reference implementations
• Telephone number / conferencing
• Project Lead: mobile email / phone access
• Secure email for all team members
• In-person meetings
EDW Implementation: Team Structure
Proposed Scrum Team (CitiusTech – Client Scrum Team)
• Project Lead / Scrum Master: the PL also plays the role of the Scrum Master
• Business Analyst: each Scrum team has 1 BA for finalizing requirements and story grooming
• Developers: dev engineers include BI engineers, ETL architect, data architect & BI architect
• QA Analysts: each Scrum team has at least 1 QA engineer

Staffing Approach:
• Project Management: experienced Project Lead / Scrum Master with strengths in the relevant technology and domain
• Business Analysis: Domain Consultant / Business Analyst with strong expertise in relevant healthcare workflows and technologies / standards
• Architecture: each team has a Data Architect / ETL Architect / BI Architect along with 3 developers
• Composition: 3:1 ratio of development to QA resources
• Onsite & Offshore: 1 person onsite as a SPOC
EDW Implementation: Key Role & Responsibility (1/2)
Roles: High-Level Description / Key Responsibilities

Project Manager
• Define, plan, coordinate, control and review all project activities

Business Analyst
• Requirements clarification
• Work with the business owner to resolve data gaps / data quality issues, if any
• Coordinate UAT and user training
• Participate in functional testing
• Review test scenarios

Data Architect
• Provide a technical overview of the source systems in scope
• Provide the data dictionary for all the feeds coming out of the source systems
• Collaborate with the Product engineering team to understand the proposed solution architecture
• Guide the team in deployment processes in Client environments
• Participate in performance optimization, server sizing and security reviews

ETL/DW Architect
• Define the overall data warehouse architecture and standards
• Participate in architectural meetings and analyse all technical requirements of BI applications
• Evaluate and select infrastructure components including hardware, DBMS, networking facilities, and ETL (extract, transform and load) software
• Responsible for high-level technical design

EDW Implementation: Key Role & Responsibility (2/2)
Roles: High-Level Description / Key Responsibilities

BI Architect
• Provide BI & reporting strategy
• Create wireframes and UX design
• Design MOLAP layer for reporting
• Responsible for delivering reports & collaboration strategy

ETL Developer
• Develop ETL
• Perform unit testing

Report Developer
• Develop reports
• Create/modify MOLAP database

EDW Key Challenges and Mitigation (1/4)
Key Challenge: Data Reconciliation
Reconciliation is the process of ensuring correctness and consistency of data in a data warehouse. Key reasons:
• Complexity of the development
• The reconciliation process must also comply with performance requirements

Mitigation plan: develop an automated reconciliation framework
• Master data reconciliation: reconcile master entities by collecting counts from the source system and the EDW, e.g. counts of patients, providers, ICD codes, CPT codes
• Transaction data reconciliation: reconcile counts of transaction data between the source and the EDW, e.g. visit count, claim count, claim amount
• Logging framework: account for scenarios such as rejection in the DQ process, merging in transformation, and inserts and updates in loading
• Reconciliation report: generate and share a reconciliation report to highlight any issue in the process
• Retain source data in staging for issue resolution
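The count-based reconciliation described above can be sketched as a comparison of per-entity counts from the source and the EDW, adjusted for rows that the logging framework recorded as rejected. The function name, the dictionary layout and the status labels are assumptions made for illustration.

```python
from typing import Optional


def reconcile_counts(source: dict, edw: dict, rejected: Optional[dict] = None) -> list:
    """Compare per-entity counts and flag entities whose counts do not reconcile."""
    rejected = rejected or {}
    report = []
    for entity in sorted(set(source) | set(edw)):
        src = source.get(entity, 0)
        tgt = edw.get(entity, 0)
        rej = rejected.get(entity, 0)
        # Rows rejected by DQ checks are expected to be missing from the EDW.
        difference = src - rej - tgt
        report.append({
            "entity": entity,
            "source": src,
            "rejected": rej,
            "edw": tgt,
            "difference": difference,
            "status": "OK" if difference == 0 else "MISMATCH",
        })
    return report
```

The returned list can feed the reconciliation report directly; a production framework would also account for rows merged during transformation, as noted in the logging bullet.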

EDW Key Challenges and Mitigation (2/4)
Key Challenge: Data Quality (DQ)
Inconsistent data, duplicates, logic conflicts and missing data are the key challenges in DQ, due to:
• Disparate data sources from legacy systems
• Source systems with many ad-hoc changes
Poor data quality results in faulty reporting and analytics for decision making.

Mitigation plan
• Perform data profiling: before data mapping and ETL development, perform exhaustive source data profiling with the available tools to identify DQ issues upfront
• Data governance: set up a strong data governance framework
   - Data Governance Council
   - Data Stewards
   - Master data management
• DQ reporting: set up report collaboration and alerts related to DQ issues
• Add enough DQ checks at each stage of ETL

EDW Key Challenges and Mitigation (3/4)
Key Challenge: Performance
A data warehouse must be carefully designed to meet overall performance requirements for fast execution of reports and dashboards. The initial overall design must be carefully thought through to provide a stable foundation from which to start.

Mitigation plan:
▪ Hardware sizing – Perform capacity planning to withstand growth for the next 3-5 years. For cloud deployment, many options are available for processing & storage. Placing database & BI servers in the same data center reduces network latency
▪ Database sizing – Consider the data growth pattern when designing databases for processing & reporting. For a processing-intensive EDW, consider MPP databases and use ELT instead of ETL. Analytical databases with column-oriented storage are best suited for reporting where only selected data elements are requested
▪ Database design – Consider fact table partitioning, file groups, data archival requirements, and star & snowflake schemas in the reporting database
▪ Reporting performance issues – Keep only the required data in reporting databases, e.g. create data marts with the required data volume. Avoid KPI calculation while browsing reports; perform calculations at database level
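As a small illustration of calculating KPIs at database level rather than while a report is browsed, the sketch below precomputes an aggregate table once and lets the report read it directly. SQLite stands in for the reporting database; the table names `fact_claim` and `mart_claim_kpi`, the columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_claim "
    "(claim_id INTEGER, dept TEXT, month TEXT, amount REAL, denied INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_claim VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Cardiology", "2024-01", 100.0, 0),
        (2, "Cardiology", "2024-01", 300.0, 1),
        (3, "Radiology", "2024-01", 200.0, 0),
        (4, "Radiology", "2024-01", 200.0, 0),
    ],
)

# Precompute KPIs during the load window so reports only read aggregated rows.
conn.execute(
    """
    CREATE TABLE mart_claim_kpi AS
    SELECT dept,
           month,
           COUNT(*)                     AS claim_count,
           SUM(amount)                  AS total_amount,
           1.0 * SUM(denied) / COUNT(*) AS denial_rate
    FROM fact_claim
    GROUP BY dept, month
    """
)

# The report query is now a plain lookup with no runtime calculation.
kpis = conn.execute(
    "SELECT dept, claim_count, total_amount, denial_rate "
    "FROM mart_claim_kpi ORDER BY dept"
).fetchall()
print(kpis)
```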

EDW Key Challenges and Mitigation (4/4)
Key Challenge: Auditing & Logging

Mitigation plan:
▪ Leverage built-in features of tools – ETL tools have built-in features to collect row counts, time taken, users & service accounts used, and events such as warnings & errors
▪ Control reporting – Create reports to show ETL details, e.g. read, insert & update counts in every step of ETL. Create an ETL performance report

Key Challenge: Data Security

Mitigation plan:
▪ Multi-tenancy – Set up role-based security so that users access intended data only
▪ Encryption – Prevent unintended data access by encrypting and masking sensitive information
▪ Leverage existing security –
  • Use LDAP security to authenticate and authorize users
  • Kerberos – Ticket-based authentication among nodes over a non-secure network
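Data masking of sensitive fields, as mentioned above, might look like the sketch below. It covers masking and pseudonymization only, not encryption at rest or in transit. The field names, the rule table and the hard-coded key are assumptions for illustration; a real deployment would take the key from a managed secret store.

```python
import hashlib
import hmac

# Hypothetical key for the example only; never hard-code secrets in practice.
SECRET_KEY = b"replace-with-managed-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Keyed hash: the same input maps to the same token, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask_tail(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def mask_record(record: dict, rules: dict) -> dict:
    """Apply per-field masking rules; fields without a rule pass through unchanged."""
    masked = dict(record)
    for field, rule in rules.items():
        if field in masked and masked[field] is not None:
            masked[field] = rule(str(masked[field]))
    return masked


# Assumed rule set: which field gets which masking function.
RULES = {"ssn": mask_tail, "mrn": pseudonymize}

patient = {"mrn": "MRN-000123", "ssn": "123456789", "visit_count": 3}
masked_patient = mask_record(patient, RULES)
print(masked_patient)
```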

EDW Reference Architecture – Design Process (1/7)
Process steps: Requirement Gathering → Data Profiling → EDW Modelling → Data Mapping → ETL Design → Data Mart Design → Reporting Design

▪ As-is Assessment:
  • Assess the existing report delivery model
  • Explore existing reporting data and the data acquisition process
  • Explore current business use cases for reporting
▪ Requirement Analysis:
  • Collect business use cases for enterprise reporting
  • Meet with business groups to understand analytics requirements
  • Understand difficulties with the current structure – cleansing, transformation needs, modification effort
  • Understand compliance needs for data availability
  • Analyze non-functional requirements – performance, security
▪ Data Element Analysis:
  • Analyze required subject areas
  • Understand existing source systems and map subject areas
  • Study existing reports and collect data elements

Deliverables:
▪ Requirement document with overall reporting requirements – canned reports, self-service BI, predictive analytics
▪ Requirement document with business use cases along with user groups
▪ Gap analysis document – current vs. future state

EDW Reference Architecture – Design Process (2/7)
▪ Profiling data sources
  • Identify source tables/files for subject areas through discussion with source system SMEs
  • Perform data profiling on current data – NULL/empty value distribution, unique values, min/max length, patterns in values
  • Identify relationships among tables/files – functional dependencies, possible candidate & unique keys
  • Collect data volume statistics – archived & initial data size, incremental data volume
  • Identify the "Last Modified Date" column for Change Data Capture (CDC)

Deliverables:
▪ Source data quality assessment report
▪ Entity Relationship diagram with keys
▪ Required cleansing rules
▪ Required data quality rules
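The column-level profiling measures listed above (NULL/empty distribution, unique values, min/max length, value patterns) could be computed as in this rough sketch. The function names and the sample ZIP-code column are hypothetical; dedicated profiling tools would normally do this at scale.

```python
import re
from collections import Counter


def shape_of(value: str) -> str:
    """Reduce a value to its pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile_column(values: list) -> dict:
    """Collect basic profiling statistics for one column of raw values."""
    non_empty = [str(v) for v in values if v is not None and str(v).strip() != ""]
    lengths = [len(v) for v in non_empty]
    return {
        "row_count": len(values),
        "null_or_empty": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "top_patterns": Counter(shape_of(v) for v in non_empty).most_common(3),
    }


# Hypothetical sample column containing a NULL, an empty string and a malformed value.
zip_codes = ["02139", "10001", None, "", "1000A", "02139"]
stats = profile_column(zip_codes)
print(stats)
```

A rare pattern such as `9999A` in a column that is otherwise `99999` is a typical candidate for a cleansing or data quality rule.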

EDW Reference Architecture – Design Process (3/7)
▪ EDW Modelling
  • Create a conceptual data model with the subject areas and entities involved
  • Create a logical data model (LDM) with the entities and attributes involved
  • Create a physical data model with tables, columns, keys and constraints
  • Evaluate healthcare data models e.g. HL7 RIM, S&I
  • Evaluate commercial data models e.g. IBM UDMH
▪ ODS Modelling
  • Create the ODS data model based on operational reporting requirements, preferably as close to the source system as possible
▪ Staging Modelling
  • Create the staging database model – replicate source structure with audit data elements
  • Design audit & logging tables
▪ Metadata
  • Create domain values e.g. age buckets, codes for unknown, not available, not applicable

Deliverables:
▪ Logical data model for EDW
▪ Physical data model for EDW
▪ ODS data model
▪ Staging data model
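Domain values such as age buckets and codes for unknown values can be held as metadata and applied by a small lookup, as in the sketch below. The bucket boundaries and the code values `UNK`, `NAV` and `NAP` are illustrative assumptions, not values defined by this document.

```python
# Assumed domain codes for values that cannot be classified.
DOMAIN_CODES = {
    "unknown": "UNK",
    "not_available": "NAV",
    "not_applicable": "NAP",
}

# Assumed age buckets as (inclusive lower bound, inclusive upper bound, label).
AGE_BUCKETS = [
    (0, 17, "0-17"),
    (18, 39, "18-39"),
    (40, 64, "40-64"),
    (65, 200, "65+"),
]


def age_bucket(age):
    """Return the bucket label for a whole-number age, or the unknown code."""
    if age is None:
        return DOMAIN_CODES["unknown"]
    for low, high, label in AGE_BUCKETS:
        if low <= age <= high:
            return label
    # Out-of-range ages (e.g. negative) are treated as unknown rather than rejected.
    return DOMAIN_CODES["unknown"]


for sample in (5, 18, 65, None, -1):
    print(sample, age_bucket(sample))
```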
EDW Reference Architecture – Design Process (4/7)
▪ Data Mapping
  • Map source data fields to EDW fields
  • Define transformations for fields
  • Add lookups to derive standard values
  • Use cleansing metadata lookups to cleanse fields
  • Define a strategy to maintain history of changes
  • Define data quality rules for fields

Deliverables:
▪ Source to EDW mapping document
▪ Cleansing metadata
▪ Required transformations
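A source-to-EDW mapping with transformations, a standardization lookup and data quality rules can be expressed as metadata and applied generically. The sketch below assumes hypothetical source fields (`pat_id`, `sex`, `dob`), target fields and rules purely for illustration.

```python
# Assumed lookup that derives standard gender codes from free-form source values.
GENDER_LOOKUP = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Mapping metadata: (source field, EDW field, transformation).
MAPPING = [
    ("pat_id", "patient_id", str.strip),
    ("sex", "gender_code", lambda v: GENDER_LOOKUP.get(v.strip().lower(), "UNK")),
    ("dob", "birth_date", lambda v: v.strip()[:10]),  # keep the YYYY-MM-DD part
]

# Data quality rules per EDW field; each returns True when the value passes.
DQ_RULES = {
    "patient_id": lambda v: v != "",
    "birth_date": lambda v: len(v) == 10,
}


def map_record(source_row: dict):
    """Return the mapped EDW row and the list of fields that failed a DQ rule."""
    target = {}
    for src, dst, transform in MAPPING:
        target[dst] = transform(source_row.get(src) or "")
    failed = [field for field, rule in DQ_RULES.items() if not rule(target[field])]
    return target, failed


good_row, good_failures = map_record(
    {"pat_id": " P001 ", "sex": "Female", "dob": "1980-02-14T00:00:00"}
)
bad_row, bad_failures = map_record({"pat_id": "", "sex": "x"})
print(good_row, good_failures)
print(bad_row, bad_failures)
```

Because the mapping lives in data rather than code, adding a field or changing a lookup does not require changing `map_record`.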

EDW Reference Architecture – Design Process (5/7)
▪ ETL Design
  • Evaluate ETL tools based on business requirements
  • Create ETL for historical/archival data load
  • Create a model ETL with auditing, logging & error handling
  • Create ETL metadata
    o List of individual ETL
    o Scheduling information
    o Parallel execution of ETL
    o Key parameters – source & destination connection
    o Backdated subset ETL execution scenario
  • Create ETL control reporting with performance-related parameters
  • Consider ETL execution in DEV, QA and PROD scenarios with different logging levels
  • Adhere to compliance – keep log history
  • ELT – works well with an MPP database

Deliverables:
▪ ETL Design document
▪ ETL Control reporting
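A model ETL step with auditing, logging and error handling, as called for above, might be wrapped like this. The `StepAudit` record, the `run_step` wrapper and the in-memory target are assumptions made for the sketch; an ETL tool's built-in logging would usually supply the same counts.

```python
import logging
import time
from dataclasses import dataclass


@dataclass
class StepAudit:
    step: str
    status: str = "RUNNING"
    rows_read: int = 0
    rows_inserted: int = 0
    rows_updated: int = 0
    seconds: float = 0.0
    error: str = ""


def run_step(name, load_fn, rows, audit_log, log_level=logging.INFO):
    """Run one load step and always append an audit record, even on failure."""
    logger = logging.getLogger("etl")
    logger.setLevel(log_level)  # e.g. DEBUG in DEV, WARNING in PROD
    audit = StepAudit(step=name, rows_read=len(rows))
    started = time.perf_counter()
    try:
        audit.rows_inserted, audit.rows_updated = load_fn(rows)
        audit.status = "SUCCESS"
    except Exception as exc:
        audit.status = "FAILED"
        audit.error = str(exc)
        logger.error("step %s failed: %s", name, exc)
    finally:
        audit.seconds = time.perf_counter() - started
        audit_log.append(audit)
    return audit


def upsert_into(target: dict):
    """Build a loader that inserts new ids and updates existing ones."""
    def load(rows):
        inserted = updated = 0
        for row in rows:
            if row["id"] in target:
                updated += 1
            else:
                inserted += 1
            target[row["id"]] = row
        return inserted, updated
    return load


# Hypothetical target table and input rows.
target = {1: {"id": 1, "name": "old"}}
audit_log = []
ok = run_step(
    "load_patient",
    upsert_into(target),
    [{"id": 1, "name": "new"}, {"id": 2, "name": "b"}],
    audit_log,
)
bad = run_step("load_claims", lambda rows: 1 / 0, [], audit_log)
for entry in audit_log:
    print(entry)
```

The accumulated `audit_log` is the raw material for control reporting: read, insert and update counts plus duration per step.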

EDW Reference Architecture – Design Process (6/7)
▪ Data Mart Design
  • Design subject area/department-specific data marts
  • Leverage the database for KPI calculations, since runtime calculation slows down reporting
  • Choose a star or snowflake schema based on dimension data volume
  • Consider dimension data level security when designing dimensions
▪ MOLAP
  • Leverage partitions & aggregations for measure groups
  • Leverage caching features of MOLAP technology
  • ROLAP, MOLAP and HOLAP – consider data refresh frequency and aggregations to define storage
  • Compatibility with intended reporting tools

Deliverables:
▪ Data mart design document
▪ MOLAP design

EDW Reference Architecture – Design Process (7/7)
▪ Canned reporting
  • Consider UX – user experience for reports
  • Create wireframes of reports
  • Demonstrate wireframes to business users
  • Evaluate reporting platforms to finalize the reporting tool
  • Configure auditing and logging of reports – reporting history storage & archival as per compliance
  • Configure user role access
  • Configure report subscription & export schedule
  • Create filter sets used across reports – reporting frequency, last 30, 60, 90 days
▪ Self-service BI
  • Configure data sources with the fields required by business users
  • Rename attribute & KPI captions as per the vocabulary used
  • Create template reports for quick adoption of the application

Deliverables:
▪ Wireframes of canned reports
▪ Model for self-service BI

EDW Reference Architecture – Design Components (1/3)
Design Components and Key Design Considerations

Healthcare data sources
▪ Integration of disparate sources – authorship & integration of business entities
▪ Disparate formats – standardized formats (HL7/CCD), customized formats (flat file/Excel/API), structured formats (RDBMS)

Extraction – healthcare data parser
▪ Performance – Sequential/parallel processing, daily/hourly file volume, size of individual files, required data elements from each file
▪ Scalability – Ability to support future growth
▪ File archival – Archive files for troubleshooting

Extraction – database extract
▪ Change data capture – Incremental data extraction
▪ Performance – Extraction of historical & archived data volume
▪ Extraction window – Any time restriction from the source system for extraction
▪ Latency to extract data from source
▪ Restartability – Ability to re-extract data in case of failure, or when the business needs historical data for a certain duration

Transformation – Cleansing & Quality Management
▪ Metadata driven – Add/update cleansing metadata, configure quality rules
▪ Reconciliation – Data extracted, rejected, updated and inserted

EDW Reference Architecture – Design Components (2/3)
Design Components and Key Design Considerations

Data Reconciliation
▪ Completeness – Ability to reconcile source and target database at entity level, automation of the reconciliation approach
▪ Master data reconciliation
▪ Transaction data reconciliation

Data Standardization
▪ Performance – Fewer performance issues with fewer distinct values

Transformation
▪ Late binding – Ability to modify KPI calculation/attribute binding with ease

Staging Area
▪ Impact on source systems – Move and transform raw data in staging, intermediate storage for data massaging
▪ Transformation effort – Transformation effort for data from multiple source systems
▪ Proactive performance monitoring – ETL control reporting
▪ Data retention for troubleshooting

Operational Data Store – ODS
▪ Historical data requirement
▪ Refresh frequency

EDW Reference Architecture – Design Components (3/3)
Design Components and Key Design Considerations

EDW
▪ Historical data reporting – Ability to generate point-in-time reporting
▪ Enterprise-wide reporting – e.g. 360° patient view
▪ Reporting need – Analytical, operational, predictive

Data Marts
▪ Segregation by subject area or department
▪ Aggregations
▪ Partition by reporting period e.g. yearly, quarterly

Master Data Management (MDM)
▪ Matching & merging
▪ Reporting
▪ User access and security

Metadata Management
▪ Security – Access restriction to super users
▪ Historical change – Capture changes for any troubleshooting with new metadata

Auditing & Logging
▪ Control reporting
▪ Proactive performance monitoring

Data Security
▪ Multi-tenancy
▪ Encryption, data masking
▪ Dimension data level security
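Point-in-time reporting, listed under the EDW considerations above, is often supported by keeping validity dates on dimension rows. The sketch below assumes a Type 2 slowly changing dimension, which is one common technique and not something this document prescribes; the provider rows, column names and open-ended date are hypothetical.

```python
from datetime import date

# Assumed convention: the current row carries a far-future end date.
OPEN_END = date(9999, 12, 31)

provider_history = [
    {
        "provider_id": "P1",
        "specialty": "Cardiology",
        "valid_from": date(2020, 1, 1),
        "valid_to": date(2022, 6, 30),
    },
    {
        "provider_id": "P1",
        "specialty": "Internal Medicine",
        "valid_from": date(2022, 7, 1),
        "valid_to": OPEN_END,
    },
]


def as_of(history, provider_id, when):
    """Return the dimension row that was valid for the provider on the given date."""
    for row in history:
        if row["provider_id"] == provider_id and row["valid_from"] <= when <= row["valid_to"]:
            return row
    return None


print(as_of(provider_history, "P1", date(2021, 5, 1))["specialty"])
print(as_of(provider_history, "P1", date(2023, 1, 1))["specialty"])
```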

THANK YOU
