
Data warehouse KT: A data warehouse is the repository of an organization's electronically stored data.

DW work means extracting, transforming and loading data. DART is the name of the DW at SVB and it stands for Decision Analysis and Reporting Tool. In SVB we pull from around 60 source systems. The data warehouse includes BI tools (for reporting) and tools to extract, transform and load data.

ETL stands for Extract -> Transform -> Load.

Extract: In the extract phase, data is cleaned and pulled from the source systems into the staging tables. Data cleanup tasks include trimming data or converting it to the required formats. The data in the source systems may be in SQL Server, Oracle, Excel, Sybase or any other system. The extracted data is loaded into staging tables in the DW (Oracle 9i). Most of the extract jobs are truncate-and-load, which means the staging tables are emptied before the data is loaded into them. One exception is a table in MCA that is append-and-load, which means the data is added to the staging tables without deleting the existing data. The extract may fail if there are duplicates; the approach for handling them is described under Troubleshooting below.

Transform: Transformation happens in consolidation / analytics, where data from different source systems is consolidated and business logic is applied.

Load: Load means loading data into the data marts.

A dimension is a context. SCD stands for slowly changing dimension, and there are three types:
1. Type 1, which stores only the current status
2. Type 2, which stores all history
3. Type 3, which stores current + previous (only one level; for example, if you are storing by month then you keep the current month and the previous month only)
At SVB we follow Type 2.

A fact consists of dimensions and measures.
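To illustrate the Type-2 approach described above, a minimal sketch of a slowly changing dimension follows (all table, column and sequence names here are hypothetical, not the actual DART schema):

-- Hypothetical Type-2 dimension: every change creates a new row, so the full history is kept.
CREATE SEQUENCE seq_customer_key;

CREATE TABLE customer_dim (
  customer_key   NUMBER        PRIMARY KEY,  -- surrogate key
  customer_id    VARCHAR2(20),               -- natural / business key
  customer_name  VARCHAR2(100),
  address        VARCHAR2(200),
  effective_from DATE,
  effective_to   DATE,                       -- NULL for the current version
  current_flag   CHAR(1)                     -- 'Y' for the current version
);

-- When an attribute changes: close the current row and insert a new version.
UPDATE customer_dim
   SET effective_to = SYSDATE, current_flag = 'N'
 WHERE customer_id = '1001' AND current_flag = 'Y';

INSERT INTO customer_dim (customer_key, customer_id, customer_name, address,
                          effective_from, effective_to, current_flag)
VALUES (seq_customer_key.NEXTVAL, '1001', 'Acme Corp', 'New address',
        SYSDATE, NULL, 'Y');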

Consider the example of a sale. Here SALE is the fact and all the attributes of the sale are DIMENSIONS; the number of products and the cost are MEASURES. The dimensions surrounding the SALE fact are:

Customer (Who)
Product (What)
Channel (Where)
Time (When)
Incentive / Promotion
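A minimal star-schema sketch of this sale example (table and column names are illustrative only, not the DART model):

-- Dimension tables give the context (who, what, where, when, promotion)...
CREATE TABLE dim_customer (customer_key NUMBER PRIMARY KEY, customer_name VARCHAR2(100));
CREATE TABLE dim_product  (product_key  NUMBER PRIMARY KEY, product_name  VARCHAR2(100));
CREATE TABLE dim_channel  (channel_key  NUMBER PRIMARY KEY, channel_name  VARCHAR2(50));
CREATE TABLE dim_time     (time_key     NUMBER PRIMARY KEY, calendar_date DATE);
CREATE TABLE dim_promo    (promo_key    NUMBER PRIMARY KEY, promo_name    VARCHAR2(100));

-- ...while the fact table holds the measures (number of products, cost)
-- plus one foreign key per dimension.
CREATE TABLE sale_fact (
  customer_key  NUMBER REFERENCES dim_customer (customer_key),
  product_key   NUMBER REFERENCES dim_product  (product_key),
  channel_key   NUMBER REFERENCES dim_channel  (channel_key),
  time_key      NUMBER REFERENCES dim_time     (time_key),
  promo_key     NUMBER REFERENCES dim_promo    (promo_key),
  product_count NUMBER,   -- measure
  sale_cost     NUMBER    -- measure
);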

DART Process: The DW database environments available are
1. Sandbox (used only by developers)
2. Dev / QA
3. UAT
   a. UAT-F-DWAPP01 (BODI server)
   b. DWSGNTDEV (Sagent server)
4. Prod

Various production servers:
a. SCL-F-DWAPP01 (BODI server)
b. DWSGTPRD (Sagent server)
c. DWDB1SOSCL (DB server); database: DWSTGPRD
d. DWTLSPRD (repository info)

Environment    BODI Server     Sagent Server   DB Server     Database
Production     SCL-F-DWAPP01   DWSGTPRD        DWDB1SOSCL    DWSTGPRD
Development    UAT-F-DWAPP01   DWSGNTDEV       DWDB1SOADS    Toolsdev
UAT            -               -               -             -
Sandbox        -               -               -             -

Posting Date: It is the date we get from CBS (value = system date).

Load Date: The load date is the next business day minus one, except when that date rolls into the next month, in which case the load date is the last day of the current month. In practice this means the load date equals the calendar date on most weekdays and jumps to Sunday on a Friday. For example:

Scenario 1
Date   Day   Load Date
1      Mon   1
2      Tue   2
3      Wed   3
4      Thu   4
5      Fri   7
6      Sat   -
7      Sun   -
8      Mon   8
9      Tue   9

Scenario 2
Date   Day   Load Date
25     Mon   25
26     Tue   26
27     Wed   27
28     Thu   28
29     Fri   30
30     Sat   -
1      Sun   -
2      Mon   2
3      Tue   3
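A minimal PL/SQL sketch of this rule (the function name is hypothetical and weekends are the only non-business days assumed; the real DART load-date procedure is not reproduced here):

-- Illustration of: load date = (next business day) - 1, capped at month end.
CREATE OR REPLACE FUNCTION calc_load_date (p_date IN DATE) RETURN DATE IS
  v_next_bus DATE := p_date + 1;
  v_load_dt  DATE;
BEGIN
  -- find the next business day (skip Saturday and Sunday)
  WHILE TO_CHAR(v_next_bus, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') IN ('SAT', 'SUN') LOOP
    v_next_bus := v_next_bus + 1;
  END LOOP;
  -- load date = next business day minus one
  v_load_dt := v_next_bus - 1;
  -- if that rolls into the next month, cap it at the last day of the current month
  IF TRUNC(v_load_dt, 'MM') > TRUNC(p_date, 'MM') THEN
    v_load_dt := LAST_DAY(TRUNC(p_date));
  END IF;
  RETURN v_load_dt;
END calc_load_date;
/

-- Example call (date value is made up):
SELECT calc_load_date(TO_DATE('2009-06-05', 'YYYY-MM-DD')) FROM dual;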

The following tables are used for scheduling the jobs. These tables are present in the DWADMIN schema.
1. etl_log - a log of all the jobs that run. The columns available in this table include:
   a. source
   b. load_dt
   c. start_time
   d. end_time
2. etl_dependencies - a list of all jobs and their dependencies, with an indicator of whether each dependency is required:
   a. source
   b. dependent
   c. required
   d. load_dt
   e. complete
3. etl_exec_list - an entry for every job with its execution details.

The procedures used for scheduling the jobs are ETL_Start and ETL_End.
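For illustration, a couple of ad-hoc queries against these DWADMIN tables (a sketch only; the load date value is made up and the 'Y'/'N' flag values in etl_dependencies are assumptions):

-- Jobs that ran for a given load date, with their start and end times
SELECT source, load_dt, start_time, end_time
  FROM dwadmin.etl_log
 WHERE load_dt = TO_DATE('2009-06-05', 'YYYY-MM-DD')
 ORDER BY start_time;

-- Required dependencies that have not yet completed for that load date
SELECT source, dependent
  FROM dwadmin.etl_dependencies
 WHERE load_dt = TO_DATE('2009-06-05', 'YYYY-MM-DD')
   AND required = 'Y'
   AND complete = 'N';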

Timelines
9:30 PM PST: Posting Date - SVB Dictionary pull
10:10 PM PST: Load Date
10:20 PM PST: Trigger staging jobs

Types of triggers:
Event based: The job is triggered after another event completes. Most of the staging jobs have an event based trigger.
Time based: The job is triggered at a specific time. Examples of time based jobs are Stucky and PSHR.
Time and file based: The job is triggered after a specific time, provided the trigger file has arrived. Examples of time and file based jobs are CBS and PSGL.
File based: The job is triggered when the trigger file arrives. An example of a file based job is SAM/IPS.

The SLA for the CBS jobs is that the trigger file should be available by 12:30 AM PST.

Transform: Transformation happens in consolidation / analytics, where data from different source systems is consolidated and business logic is applied. The consolidation job runs in two phases:

Consolidation, after the post loads, triggers Analytics, and Analytics triggers Consolidation again (from grossloans).

SAGENT
Sagent jobs are pure SQL or PL/SQL, while BODI is more advanced and is GUI based. Sagent is an ETL tool and it needs a source and a target to create a plan.
To load a source: use SELECT 1 FROM dual.
To execute a procedure: use the command EXEC <procedure_name> from the SQL*Plus command prompt, or an anonymous block of the form BEGIN <procedure_name>; END;
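For example (the procedure name below is hypothetical, purely to show the two invocation forms):

-- From the SQL*Plus prompt
EXEC refresh_stage_counts;

-- Equivalent anonymous PL/SQL block
BEGIN
  refresh_stage_counts;
END;
/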

80% of the plans are of the simple type:
Source -> Target
The remaining 20% of the plans are of the type:
Source 1 + Source 2 -> Join -> Expression -> Target

After plans are created in Sagent or BODI they are automated (scheduled) using Sagent Automation and BPM (the new scheduler, Redwood, will be in use before the end of the year). After staging is complete it triggers consolidation; after the consolidation post loads it triggers analytics, which again triggers consolidation.

The trigger file from CBS (12:30 AM) triggers the CBS staging job -> which triggers the job CBS2PSGL -> which generates the trigger file for DART2PSGL (1:30 AM) -> which is followed by processing at PSGL (about 3 hours) -> PSGL generates a trigger file which triggers the GL2DART job (3:45 AM), after which PSGL staging starts. PSGL data is loaded by Sagent and then by BODI into:
GLMART (PSGL COA) - starts only after 12:00 at night
FINMART (CBS COA) - starts only after PSGL completes in BODI as well

CBS has 3 charts of accounts (CBS COA): 1. Account 2. RC 3. Dept

PSGL has 5 charts of accounts (PSGL COA): 1. Account 2. Product 3. Business Unit 4. Project Id 5. Department

After FINMART is complete, the following exports are triggered:
1. Essbase
2. ALLL (BODI)
3. PGA (Sagent)
4. Cash Recon (Sagent)
5. Certification old (Sagent)

Essbase is a reporting tool and stores data in 3 dimensions. It has 3 cubes, namely CurrYears, ALLYears and PSAllYears. First the cubes are built and backed up, followed by the backup of DWTLSPRD (repository info) and DWSTGPRD (DART backup). Currently we take a COLD backup (the DB is completely shut down during the backup), but in the future we may have hot backups (where the DB is up and running while it is backed up). After the backup is complete it places a trigger file to start the DART exports.

Troubleshooting:
In Sagent, the duplicates are first deleted and then loaded.
In BODI:
1. The constraints are disabled
2. The data is loaded
3. The duplicates are deleted
4. The constraints are enabled back
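A sketch of that BODI sequence in Oracle SQL (table, constraint and key column names are hypothetical; this is not the exact DART script):

-- 1. Disable the primary key constraint so the load will accept duplicates
ALTER TABLE stg_gross_loans DISABLE CONSTRAINT pk_stg_gross_loans;

-- 2. (The data is loaded by the BODI job at this point)

-- 3. Delete the duplicate rows, keeping one row per key
DELETE FROM stg_gross_loans a
 WHERE a.ROWID NOT IN (SELECT MIN(b.ROWID)
                         FROM stg_gross_loans b
                        GROUP BY b.acct_no, b.load_dt);

-- 4. Re-enable the constraint
ALTER TABLE stg_gross_loans ENABLE CONSTRAINT pk_stg_gross_loans;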

Constraints can be set either at the column level or at the table level (e.g. a check constraint). At the table level, the PK (Primary Key) constraint ensures that there are no duplicates in one column or a combination of columns and does not allow NULLs, whereas the UK (Unique Key) constraint ensures that there are no duplicates in one column or a combination of columns but allows NULLs.

Staging: In this phase no transformation takes place and no business logic is applied. Most of the staging jobs run on BODI (Business Objects Data Integrator), with the exception of the PSGL staging job, which is on both BODI and Sagent.

How BODI jobs and Sagent plans are scheduled:
1. Create a job in BODI.

2. In Sagent Designer, create a plan for each BODI job that executes the job's batch file on the BODI server.

3. Create an automation task in Sagent Automation to schedule the Sagent plans designed in Sagent Designer. Some BODI jobs are scheduled using BPM instead.

A Sagent automation entry can consist of a task and a trigger: a task calls a Sagent plan, while a trigger calls another Sagent automation task. A BODI job consists of dataflows, workflows, scripts and the job itself.

Generic ETL job flow:
1. ETL Start
2. Check whether there is an entry in the ETL_log table
3. If No: check whether all the dependencies are completed
4. If Yes: update ETL_log and execute the job
5. Update ETL_log and ETL_dependencies
6. ETL End
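As a rough illustration of steps 2-4 of this flow, a PL/SQL sketch follows (the real ETL_Start procedure in DWADMIN is not reproduced here; the function name, the 'Y'/'N' flag values and the exact column usage are assumptions based on the table descriptions above):

CREATE OR REPLACE FUNCTION etl_can_start (p_source  IN VARCHAR2,
                                          p_load_dt IN DATE) RETURN VARCHAR2 IS
  v_already_ran NUMBER;
  v_pending     NUMBER;
BEGIN
  -- Is there already an entry in etl_log for this job and load date?
  SELECT COUNT(*) INTO v_already_ran
    FROM dwadmin.etl_log
   WHERE source = p_source
     AND load_dt = p_load_dt;
  IF v_already_ran > 0 THEN
    RETURN 'N';   -- job already logged, do not start it again
  END IF;

  -- Are all required dependencies complete?
  SELECT COUNT(*) INTO v_pending
    FROM dwadmin.etl_dependencies
   WHERE source = p_source
     AND load_dt = p_load_dt
     AND required = 'Y'
     AND complete = 'N';
  IF v_pending > 0 THEN
    RETURN 'N';   -- wait for the remaining dependencies
  END IF;

  -- Record the start in etl_log and allow the job to run
  INSERT INTO dwadmin.etl_log (source, load_dt, start_time)
  VALUES (p_source, p_load_dt, SYSDATE);
  RETURN 'Y';
END etl_can_start;
/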

Baseview: The database and server details used by the Sagent plans are specified in the baseview configuration. To configure the baseview go to: Sagent Designer -> Tools -> Baseview Editor.

To check which BODI jobs are scheduled on the BODI server, log in to the server and type at in the command prompt. The syntax to delete a scheduled BODI job is: at <jobid> /delete

Script to identify duplicates in the primary key column(s):
Select <pk column(s)>, count(*) from <table_name> group by <pk column(s)> having count(*) > 1

The SLA for the DART DB (DWSTGPRD) to be available after backup is 7:00 AM. If an expected delay could push the DART DB availability beyond 7:00 AM, we disable the ESSBASE backup (which runs for around 3 hours) and allow the rest of the backups to complete. After the DART DB is up again, we manually trigger the ESSBASE backup.

Steps to deactivate the ESSBASE backup and allow the rest of the backups to continue:
1. Deactivate the ESSBASE backup
2. Update the backup dependencies table (set essbase, psallyrs, curyear and allyears to Y)
3. Trigger the ESSBASE backup manually later: on the BODI server, go to the path e:\sagent\scripts\backups

Important FACT tables:
Gross Assets (uses the CBS chart of accounts)
Gross Liabilities (uses the CBS chart of accounts)
Balance FACT (uses the PSGL chart of accounts)
Daily Balance FACT
Revenues
Loans

Forward mapping: CBS (3 COA) -> PSGL (5 COA)
Reverse mapping: PSGL (5 COA) -> CBS (3 COA)

Steps to filter data from the source when doing an append:

Step 1: First identify the duplicates in the source using the following query:
Select <pk column1>, <pk column2>, <pk column3>, count(*) from <source_table_name> group by <pk column1>, <pk column2>, <pk column3> having count(*) > 1

Step 2: Identify how many rows have been loaded into the target by the plan in question. Run the following query to see how many rows have already been loaded into the target table for the current loaddt, and compare it with the count retrieved by the source query:
Select count(*) from <target_table> where loaddt = <current load dt>

Step 3: If the two counts in Step 2 are the same, we can delete all the rows loaded into the target table for the current loaddt:
delete from <target_table> where loaddt = <current load dt>

Step 4: If the two counts in Step 2 are NOT the same, we have to identify the application id from the source query and include it in the query to find how many rows have been loaded for that application id and loaddt:
Select count(*) from <target_table> where loaddt = <current load dt> and applid = <applid>

Step 5: If the count retrieved by the query in Step 4 matches the count retrieved by the source query, you can delete the already loaded rows in the target table using the following query:
delete from <target_table> where loaddt = <current load dt> and applid = <applid>

Step 6: In Sagent Designer create a new plan and modify the source query of the original plan to ignore the duplicates identified in Step 1; the duplicate rows are then updated manually.

Query to count the rows without the duplicates:
Select count(*) from <source_table> where loaddt = <current load dt> and <pk1>||<pk2>||<pk3> <> <duplicates identified in Step 1>

Query to count the rows with the duplicates:
Select count(*) from <source_table> where loaddt = <current load dt> and <pk1>||<pk2>||<pk3> = <duplicates identified in Step 1>

To ignore the duplicates, add the following condition to the source query:
where <pk1>||<pk2>||<pk3> <> <duplicates identified in Step 1>
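For example, if Step 1 found a duplicate whose three key columns concatenate to 'CBS|123456|001' (the table name, column names and values below are all made up for illustration), the modified source query could look like:

-- Hypothetical modified source query that skips the duplicate key found in Step 1
SELECT *
  FROM stg_source_table
 WHERE loaddt = TO_DATE('2009-06-05', 'YYYY-MM-DD')
   AND applid || '|' || acct_no || '|' || branch_cd <> 'CBS|123456|001';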

Check logs: Verify that the return value is zero. The check-logs step scans the .log files for any failures and, if one is found, sends an email with the subject "Sagent production : plan failure" with the log attached.
