Course Analytics Data Warehouse Pilot

Extract, Transformation, and Load (ETL) Design Document

Prepared for:

The Ohio State University

Prepared by:

Richard Menezes, Covansys
Gary Grismore, Ohio State University


July 24, 2002


DOCUMENT REVISION LIST


Client: The Ohio State University
Project: Course Analytics Data Warehouse Pilot
Project Code: OSU003
Document Name: Extract, Transformation and Load (ETL) Design Document

Version # | Revision Date | Revision Description | Author
--------- | ------------- | -------------------- | ------
1.0 | 7/24/2002 | Initial Release | Richard Menezes, Gary Grismore
1.1 | 10/3/2002 | Revised ETL process for current data. | Kandiah Ravindran, Gary Grismore
1.3 | 10/7/2002 | Reformatted document | Don Rohde
1.4 | 05/24/2004 | Update Operational Process Sections (5.1 and 5.2) | Gary Grismore

Covansys Corporation Confidential Documentation

October 7, 2002


Table of Contents
1. Introduction
   1.1. Purpose of Document
   1.2. Background
2. High-Level ETL Architecture
3. Detailed ETL Architecture
   3.1. ETL Mapping Tables
   3.2. Instructional_Course_Section Dimension
   3.3. Student Dimension
   3.4. Section_Fact_Table
4. Unit and String Testing Results
5. Operational Process Flow
   5.1. Explanation of Operation Flow
   5.2. Course Analytics Operational Process


1. Introduction
1.1. Purpose of Document
The purpose of this document is to detail the extract, transformation, and load (ETL) design and operation flow of the processes required to load course and student information into the Course Analytics data mart. It covers the following aspects of the design and operation of the ETL processes that load current registration cycle-point information:
- High-level overview of the ETL architecture that supports loading current information into the Course Analytics data mart
- Detailed ETL architecture that identifies data sources, transformations, and procedures
- Unit and string testing results
- Operational process flow

1.2. Background
The Ohio State University engaged Covansys to manage the Course Analytics Data Warehouse Pilot project. As part of this project, data from existing OSU systems needed to be extracted, transformed, and loaded into the Course Analytics data mart for both ongoing (current) and previous (historical) course registration cycles. During the analysis of business requirements and information needs, it became clear that no single source of data could provide both current and historical course analytics information. The decision was made to initially design and develop the ETL processes to load the current course registration cycle information into the Course Analytics data mart. The historical ETL process would be developed later in the project if time permitted.


2. High-Level ETL Architecture


There are four sources of data used to populate the Course Analytics data mart. The course information contains data about courses and student enrollment. The course waitlist information contains data on students whose requests for course enrollment could not be fulfilled and who are waiting to be enrolled in the course during a future registration cycle. The student information contains specific data related to the student. The mapping database holds assorted information that is used to validate values and provide enterprise reference data used in the ETL process.

These data sources are cleansed, integrated, and loaded into a staging area. The staging area contains information for the current load cycle and is used to create a clean and consolidated view of the data. The staging area also allows errors to be resolved before data is loaded into the Course Analytics data mart, which helps isolate errors and minimizes the risk of corrupting the data mart. Information stored in the staging area is then processed and loaded into the Course Analytics data mart.

A high-level overview of the ETL architecture for loading current information is presented in Figure 1. This architecture represents the environment that was used to support the Course Analytics Data Warehouse Pilot project. It is expected to change as OSU evolves its data warehouse architecture and infrastructure environment.


[Figure 1 diagram. Sources: Course Information, Course Waitlist Information, Student Information, Mapping Information. These feed the Data Warehouse Staging Area through Informatica, and the staging area feeds the Course Analytics Data Mart through Informatica. Hosts labeled in the diagram: ES-OURDW, DWDEV (shown twice), and GEORGE, all Windows 2000.]

Figure 1 High-level ETL Architecture for Current Data


3. Detailed ETL Architecture


The detailed ETL architecture provides a specific description of the process steps required to transform the four data sources referenced above, load them into the staging database, and then load the staged data into the Course Analytics data mart. Loading course enrollment information into the Course Analytics data mart is one of the goals of the Course Analytics Data Warehouse Pilot project. The Course Analytics data mart was designed to provide an analytical base of data to answer a specific set of business questions. Information requirements were identified to support answering these business questions. The resulting data model of the Course Analytics data mart is represented in Figure 2.
Dimensional Data Model - July 17,
Instructional_Course_Section
  Instrnl_Crse_Sect_Key: int
  Instrnl_Cmps_Num: char(1)
  Instrnl_Cmps_Abbrv: char(3)
  Instrnl_Cmps_Nam: varchar(20)
  Instrnl_VP_Coll_Num: char(2)
  Instrnl_VP_Coll_Nam: varchar(25)
  Instrnl_Fisc_Unt_Cd: char(4)
  Instrnl_Fisc_Unt_Desc: varchar(25)
  Acad_Coll_Cd: char(3)
  Acad_Unt_Num: char(3)
  Acad_Unt_Abbrv: char(8)
  Acad_Unt_Desc: varchar(45)
  Crse_Tier_Lev_Num: char(1)
  Crse_Tier_Lev_Desc: varchar(45)
  UG_Cd: char(1)
  Prof_Cd: char(1)
  Grad_Cd: char(1)
  Crse_Num: char(7)
  Crse_Num_Pfx: char(1)
  Crse_Num_Main: char(3)
  Crse_Num_Sfx: char(1)
  Crse_Num_Decml: char(2)
  Crse_Ttl: varchar(18)
  Crse_Aggr_1: char(5)
  Crse_Aggr_2: char(6)
  Crse_GEC_Flg: char(1)
  Crse_GEC_Cd: char(3)
  Crse_Cap_Fund_Flg: char(1)
  Crse_St_Shr_Elig_Cd: char(1)
  Crse_St_Shr_Lev_Cd: char(1)
  Crse_Subj_Cd: char(6)
  Crse_Cred_Hrs: varchar(5)
  Crse_Grd_Opt_Cd: char(3)
  Crse_Sect_Qty: smallint
  Prim_Sect_Call_Num: char(5)
  Prim_Sect_Chk_Dgt: char(1)
  Prim_Sect_Typ: char(3)
  Prim_Instr_SSN: char(9)
  Prim_Instr_Frst_Nam: varchar(12)
  Prim_Instr_Mid_Nam: varchar(12)
  Prim_Instr_Last_Nam: varchar(16)
  Prim_Instr_Ttl: varchar(35)
  Prim_Instr_Fisc_Unt_Cd: char(4)
  Prim_Instr_FTE: char(3)
  Secnd_Sect_Call_Num: char(5)
  Secnd_Sect_Chk_Dgt: char(1)
  Secnd_Sect_Typ: char(3)
  Secnd_Sect_Lmt_Sz: smallint
  Secnd_Instr_SSN: char(9)
  Secnd_Instr_Frst_Nam: varchar(12)
  Secnd_Instr_Mid_Nam: varchar(12)
  Secnd_Instr_Last_Nam: varchar(16)
  Secnd_Instr_Ttl: varchar(35)
  Secnd_Instr_Fisc_Unt_Cd: char(4)
  Secnd_Instr_FTE: char(3)
  Eff_Beg_Dt: datetime
  Eff_End_Dt: datetime

Registration_Cycle
  Reg_Time_Key: int
  Fisc_Yr: char(4)
  Fisc_Perd_Num: char(1)
  Reg_Yr: char(4)
  Reg_Perd_Num: char(1)
  Reg_Perd_Cd: char(2)
  Reg_Sub_Perd: char(1)
  Qtr_Cyc_Pnt: char(2)
  Qtr_Cyc_Pnt_Desc: varchar(40)
  Qtr_Cyc_Eff_Dt: datetime

Student
  Stdnt_Key: int
  Stdnt_SSN: char(9)
  Stdnt_Old_SSN: char(9)
  Stdnt_Gndr_Cd: char(1)
  Stdnt_Rpt_Ethncy_Cd: char(1)
  Stdnt_Rpt_Ethncy_Shrt_Desc: char(16)
  Stdnt_Rpt_Ethncy_Lng_Desc: varchar(30)
  Stdnt_Brth_Dt: datetime
  Stdnt_PT_FT_Cd: char(1)
  Stdnt_Rnk: char(1)
  Stdnt_Rpt_Rnk: char(2)
  Stdnt_Rpt_Rnk_Desc: varchar(33)
  Stdnt_Rpt_Cls_Cd: char(1)
  Stdnt_Rpt_Cls_Desc: char(20)
  Stdnt_Enroll_Proj_Rnk: char(2)
  Stdnt_Lev_Cd: char(1)
  Stdnt_Hon_Flg: char(1)
  Stdnt_Athlt_Flg: char(1)
  Stdnt_Schlr_Flg: char(1)
  Stdnt_Mult_Maj_Qty: tinyint
  Stdnt_Enroll_Stat_Cd: char(1)
  Stdnt_Enroll_Stat_Desc: varchar(35)
  Stdnt_Fee_Paid_Flg: char(1)
  Stdnt_OSU_Qtrs: smallint
  Stdnt_Cum_Hrs: smallint
  Stdnt_Cum_Pnts: decimal(5,1)
  Stdnt_Cum_GPA: decimal(3,2)
  Stdnt_Qtr_Hrs: smallint
  Stdnt_Qtr_Pnts: decimal(5,1)
  Stdnt_Qtr_GPA: decimal(3,2)
  Stdnt_Atmpt_Crse_Hrs: smallint
  Stdnt_Fail_Crse_Hrs: smallint
  Stdnt_Earn_Hrs: smallint
  Stdnt_Maj_Cd: char(3)
  Stdnt_Maj_Abbrv: varchar(8)
  Stdnt_Maj_VP_Coll_Num: char(2)
  Stdnt_Maj_Coll_Cd: varchar(3)
  Stdnt_Maj_Coll_Nam: varchar(25)
  Stdnt_Maj_Coll_Fisc_Unt_Cd: char(4)
  Stdnt_Maj_Coll_Fisc_Unt_Desc: varchar(25)
  Stdnt_Oth_Declrtn_Typ: char(1)
  Stdnt_Oth_Declrtn_Typ_Desc: varchar(20)
  Stdnt_Oth_Declrtn_Cd: char(3)
  Stdnt_Oth_Declrtn_Abbrv: varchar(8)
  Stdnt_Oth_Declrtn_VP_Coll_Num: numeric(2)
  Stdnt_Oth_Declrtn_Coll_Cd: char(3)
  Stdnt_Oth_Declrtn_Coll_Nam: varchar(25)
  Stdnt_Oth_Declrtn_Coll_Fisc_Unt_Cd: char(4)
  Stdnt_Oth_Declrtn_Coll_Fisc_Unt_Desc: varchar(25)
  Eff_Beg_Dt: datetime
  Eff_End_Dt: datetime

Section_Fact_Table
  Reg_Time_Key: int
  Enroll_Coll_Key: int
  Instrnl_Crse_Sect_Key: int
  Stdnt_Key: int
  Stdnt_Enroll_Qty: smallint
  Crse_Wtlst_Dmd: smallint
  Grd_Pnts: decimal(3,1)
  Cred_Hrs: smallint
  Ltr_Grd: char(2)
  Qual_Pnt: decimal(3,1)

Enrollment_College
  Enroll_Coll_Key: int
  Enroll_Cmps_Num: char(1)
  Enroll_Cmps_Abbrv: char(3)
  Enroll_Cmps_Nam: varchar(20)
  Enroll_Coll_Cd: char(3)
  Enroll_Coll_Nam: varchar(25)
  Enroll_Secnd_Coll_Cd: char(3)
  Enroll_Secnd_Coll_Nam: varchar(25)
  Eff_Beg_Dt: datetime
  Eff_End_Dt: datetime

Figure 2 Course Analytics Data Mart Model


There are four dimension tables and one fact table. The Instructional_Course_Section table represents the hierarchy of the instructional side of the university. This includes the VP college, campus, fiscal unit, academic unit, course, section, and instructor information associated with a university instructional offering. The Student table represents the information that describes the student's enrollment, majors, minors, academic performance and progress, etc. The Enrollment_College table represents all possible combinations of primary and secondary colleges within a university college. The Registration_Cycle table is the directory of university registration cycle-points that occur during a university academic period over a period of years.

The Section_Fact table represents a student's enrollment and performance within a specific course/section. The dimension tables provide specific reference to a given student's enrollment based on the instructional college hierarchy, the enrollment college, and a specific point within the registration cycle of an academic period. Information contained in the Section_Fact table includes an indication of enrollment, waitlist, grade points achieved, credit hours earned, and letter grade achieved. Because waitlist information is recorded at the course level for a student, waitlist information is associated with a section number of all zeros. Student enrollment is associated with a valid university section number.

In order to load information into the Course Analytics data mart, there are six areas that must be completed. These areas are:
1. Registration Cycle Dimension
2. Enrollment College Dimension
3. ETL Mapping tables used for reference and validation
4. Instructional Course/Section Dimension
5. Student Dimension
6. Section_Fact table

The Registration Cycle dimension is a static table and does not change from one load cycle to the next. There is no automated process used to populate this dimension. This dimension will be populated directly using a spreadsheet given by the Office of Enrollment Services.

The Enrollment College Dimension was created using all the possible combinations of primary and secondary colleges in the staging area. Any changes for this dimension should be made in the staging area, which will then be propagated to the Enrollment College Dimension. A data steward will have to be assigned to maintain this data in the staging area.


3.1. ETL Mapping tables


The information contained in the mapping tables serves as a point of validation in the ETL process for coded fields being loaded into the database. Additionally, the mapping tables provide standard reference information associated with the code values of these fields. Figure 3 outlines the detailed processes used to build the different mapping (lookup) tables that will be used in the data warehouse. In essence, these mapping tables provide additional information about certain elements in the data mart. For example, the GEC_Courses table contains information about which courses are categorized as general education curriculum (GEC) courses. Some of the mapping tables reside in the source system itself and hence were not replicated in the staging area.

[Figure 3 diagram. Recoverable flows:
- Perm_Academic_Unit_Mf -> filter: perm_academic_unit_mf.deactivation_mo_yr_qtr is null -> convert to character and pad zeros in academic_unit_num -> Perm_Acad_Unit_Map
- Campus_Agg_Map -> filter: campus_code < '9' -> Campus_Map
- Fiscal_Dept_Map -> convert osu_fiscal_unit_num and vice_president_coll_code to character; pad zeros in osu_fiscal_unit_num_char -> Fiscal_Unit_Map
- College_Map -> College_Map
- Funded courses flow (elements: Funded_Courses, Route Courses, Instructional_Course_Section_Staging): where Qtr_Yr is not null, Rtrim and Ltrim Course Number, convert Fiscal_Year to char, and populate the course funding indicator; where Qtr_Yr is null, create rows for four quarters.]

Figure 3 ETL Mapping Tables
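
The Perm_Acad_Unit_Map flow in Figure 3 (keep rows with no deactivation quarter, then convert the unit number to a zero-padded character value) can be sketched in Python as follows. This is an illustration only, not the Informatica mapping itself: the function name is hypothetical, rows are modeled as dictionaries, and the pad width of 3 is an assumption taken from the Acad_Unt_Num: char(3) column in Figure 2.

```python
def build_perm_acad_unit_map(rows, width=3):
    """Keep active academic units and zero-pad the unit number as text.

    rows  -- dicts with 'academic_unit_num' and 'deactivation_mo_yr_qtr'
    width -- assumed pad width (hypothetical; based on char(3) in Figure 2)
    """
    mapped = []
    for row in rows:
        # Filter from Figure 3: deactivation_mo_yr_qtr is null (None here).
        if row["deactivation_mo_yr_qtr"] is not None:
            continue
        out = dict(row)
        # Convert to character and pad with leading zeros.
        out["academic_unit_num"] = str(row["academic_unit_num"]).zfill(width)
        mapped.append(out)
    return mapped
```

For example, an active unit numbered 7 would be stored as '007', while a unit with any deactivation value would be dropped from the map.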


3.2. Instructional_Course_Section Dimension


The Instructional_Course_Section information is built from information contained in the Csection_Current, Ccourse_Current, and CInstructor files. Specific information from these files is extracted and consolidated to give a complete picture of Course, Section, and Instructor information. The Extraction_Cycle table contains the time period (Year_Qtr) for which information from the source systems needs to be extracted. Figures 4 and 5 outline the detailed processes used to load the Instructional Course Dimension.

[Figure 4 diagram. Recoverable elements:
- CSECTION_Current: filter (CSECTION.Campus < 9 and NOT CSECTION.Call_Check = 'A') OR (CSECTION.Campus < 9 AND Call_Check = 'A' and Call NOT IN (SELECT Parent FROM CSECTION)); Extrctn_Year_Qtr = CSection.Year_Qtr (Extraction_Cycle); Ltrim and Rtrim Course Number
- CCOURSE_Current: filter CCOURSE.Dpt_Number < '999'; join on Year_Qtr (Extraction_Cycle); Ltrim and Rtrim Course Number and Course Title; populate Course_Lvl, Ugrad_Cd, Grad_Cd, and Prof_Cd; decode Dpt_Fiscal
- Join on Campus, College, Dept Number, Course Number
- Lookup tables: Campus_Map, Perm_Acad_Unit_Map, Crse_Tier_Lev_Map, Fiscal_Unit_Map, GEC_Courses, Funded_Courses
- Looked-up values: Acad. Unit Abbrv Desc., Acad. Unit Sched Desc., Fisc. Unit Desc, VP College Number, VP College Name, Campus Abbrv Desc., Campus Name, GEC Code, Funded Courses
- Aggregate courses and create dummy sections for waitlisted students
- Populate Crse_Aggr_1, Crse_Aggr_2, Crse_Cap_Fund_Flg, Crse_Sect_Qty
- Populate GEC Flag
- Target: Instructional_Course_Section_Staging]

Figure 4 - First part of the Instructional_Course_Section ETL process
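
The CSECTION filter shown in Figure 4 can be read as: keep sections on campuses below 9, and among those keep every section whose check digit is not 'A', plus any 'A' section whose call number is never used as another section's Parent. A minimal Python sketch of that predicate follows. The function name and the dictionary row layout are hypothetical, Campus is treated as an integer, and an empty Parent is assumed to be stored as '' rather than NULL.

```python
def select_sections(sections):
    """Apply the Figure 4 CSECTION filter to a list of section dicts.

    Each dict is assumed to carry 'Call', 'Call_Check', 'Campus', 'Parent'.
    """
    # Equivalent of the subquery: SELECT Parent FROM CSECTION
    parents = {s["Parent"] for s in sections}
    selected = []
    for s in sections:
        if not s["Campus"] < 9:
            continue  # both branches of the filter require Campus < 9
        if s["Call_Check"] != "A" or s["Call"] not in parents:
            selected.append(s)
    return selected
```

Under this reading, the only rows excluded within campuses below 9 are 'A' sections that act as a parent of some other section.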


[Figure 5 diagram. Recoverable elements:
- Input: Instructional_Course_Section_Staging
- Router: route different types of sections into four groups, each with lookups against CInstructor and CSection:
  - Parent = '' AND Mobility Group is Null: lookup Instructor Information
  - Parent > '' AND Mobility Group is Null: lookup Primary Instructor, Secondary Instructor, Primary Section Type
  - Parent > '' AND MB_Group = '1': lookup Primary Instructor, Secondary Instructor, Primary Section Type (concat MB_Group 1 and 2 Section Type)
  - Parent = '' AND MB_Group = '1': lookup Primary Instructor, Primary Section Type (concat MB_Group 1 and 2 Section Type)
- Using the Informatica Wizard, populate the Slowly Changing Dimension
- Target: Instructional_Course_Section]

Figure 5 - Second part of the Instructional_Course_Section ETL process
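
The router in Figure 5 splits sections into four groups based on whether the section has a parent and whether it belongs to mobility group '1'. The sketch below shows one way to express that routing in Python. The group labels are hypothetical names chosen for illustration, None stands in for a SQL NULL mobility group, and any other mobility group value is assumed to fall outside the four groups shown in the figure.

```python
def route_section(parent, mb_group):
    """Return a label for the Figure 5 router group, or None if unrouted.

    parent   -- Parent call number; '' means the section has no parent
    mb_group -- mobility group code; None models a SQL NULL
    """
    has_parent = parent > ""
    if mb_group is None:
        # Parent > '' / Parent = '' with Mobility Group is Null
        return "child_no_mobility" if has_parent else "standalone_no_mobility"
    if mb_group == "1":
        # Parent > '' / Parent = '' with MB_Group = '1'
        return "child_mobility_1" if has_parent else "standalone_mobility_1"
    return None  # not one of the four groups shown in Figure 5
```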


3.3. Student Dimension


The Student information is built from information contained in the Enrollment and Crse_Grade files. Specific information from these files is extracted and consolidated to give a complete picture of Student information. Several mapping (lookup) tables such as Active_Perm_Academic_Unit_Mf, Enroll_Status_Map, College_Map, and Fiscal_Unit_Map help provide additional information about each student. Figures 6 and 7 outline the detailed processes used to load the Student Dimension.

[Figure 6 diagram. Recoverable elements:
- Sources: enrollment, crse_grade, personal, ssn_change_event, dept2, University_calendar_mf, Extraction_Cycle
- Join conditions:
  - Extrctn_Year_Qtr = enrollment.yyyyq_code
  - enrollment.ssn = personal.ssn
  - enrollment.ssn = dept2.ssn
  - Extrctn_Year_Qtr = enrollment.yyyyq_code; Enrollment_yyyyq_code = yyyyq_code
  - Extrctn_Year_Qtr = yyyyq_code
  - personal.ssn = ssn_change_event.ssn
  - Extrcn_Year_Qtr = dept2.enroll_yyyyq_code AND dept2_type_code in ('2','3','4','5','6','7','8','9')
- Expression: populate level, reported rank, ethnicity, part-time/full-time, projected rank, reported class code, reported class code description, etc.
- Filter: option_code is null or option_code != 'R' AND mobility_grp_code is null or mobility_grp_code != '2' AND call_num is not null and call_num not like 99% final_grade not in ('K','KM','KD','E','R') AND drop_date is null or drop_date > effective_date (or 14th day date if 15th Day or EOQ+30. Add_date also checked if 15th Day of EOQ+30)
- Aggregator: counts by ssn (Stdnt_Mult_Maj_Qnty); populate part-time/full-time hours
- Aggregator: sum credit hours by ssn
- Override Major Code in exception cases
- Lookup: Stdnt_Maj_Cd
- Lookup: Stdnt_Enroll_Stat_Desc
- Lookup: Stdnt_Hon_Flag, Stdnt_Sch_Flag, Stdnt_Athl_Flag
- Lookup tables: active_perm_academic_unit_mf, enroll_status_map, current_UHS_view, current_SPT_view, current_SCH_view]

Figure 6 - First part of the Student ETL process


[Figure 7 diagram. Recoverable elements:
- Lookup Stdnt_Maj_Coll_Cd and Stdnt_Maj_Coll_Nam
- Override Stdnt_Maj_Coll_Fisc_Unt_Code in exception cases
- Lookup Stdnt_Maj_Coll_Fisc_Unt_Desc and Stdnt_Maj_VP_Coll_Num
- Lookup Stdnt_Oth_Declrtn_Cd
- Override Stdnt_Oth_Declrtn_Cd in exception cases
- Lookup Stdnt_Oth_Declrtn_Coll_Cd and Stdnt_Othr_Declrtn_Coll_Nam
- Override Stdnt_Othr_Declrtn_Coll_Fisc_Unt_Code in exception cases
- Lookup Stdnt_Oth_Declrtn_Coll_Fisc_Unt_Desc and Stdnt_Oth_Declrtn_VP_Coll_Num
- Student_Dimension_Staging
- Using the Informatica Wizard, populate the Slowly Changing Dimension
- Target: Student
- Lookup tables: College_map and fiscal_unit_map (each shown twice), current_secondary_major, current_minor, current_AOI_less_than_900, current_AOI_greater_than_900, current_specialization]

Figure 7 - Second part of the Student ETL process


3.4. Section_Fact_Table
The Section_Fact_Table information is built from information contained in the Enrollment, Crse_Grade, and Waitlist files. Specific information from these files is extracted and consolidated to determine student enrollment and waitlist status in different instructional course sections. Figure 8 describes the process that determines all the information required to pick up the surrogate keys from the corresponding dimensions. This information is stored in the Section_Fact_Staging_Table. Figure 9 describes the process that uses that information to determine the surrogate keys and populate them into the Section_Fact_Table.

[Figure 8 diagram. Recoverable elements:
- crse_grade branch:
  - Extrctn_Year_Qtr = enrollment.yyyyq_code (Extraction_Cycle)
  - Filter: WHERE (mobility_grp_code is null or mobility_grp_code != '2') AND call_num is not null AND call_num not like 99% AND (final_grade is null OR final_grade not in ('K','KD','EM','KM')) AND drop_date is null
  - Filter: include only appropriate TERM records for SU Quarter, and include only records with add_date <= 14th Day and drop_date > 14th Day for Final Fifteenth Day and EOQ+30 cycle points (University_calendar_mf)
  - Rtrim final_grade
  - Lookup point hour value (grade_map)
  - Expression: calculate grade points, quality points, Stdnt_Enroll_Qty = 1
  - Aggregate records by ssn, course, call_number. Sum grade points, credit_hours, quality points; calculate avg. grade point
  - Expression: calculate letter grade, quality points
- Enrollment branch:
  - Extrctn_Year_Qtr = enrollment.yyyyq_code (Extraction_Cycle)
  - Lookup: permacun_coll_code, secondary_coll_code_abbrev (active_perm_academic_unit_mf, secondary_college_mf)
  - Expression: calculate college_code
  - Join crse_grade.ssn = enrollment.ssn
- waitlist.request branch:
  - Extrctn_Year_Qtr = request.yyyyq_code (Extraction_Cycle)
  - Filter: result = 'C' and call_num is not null
  - Expression: reformat and trim course_number
  - Join request.ssn = enrollment.ssn
- Target: Section_Fact_Staging]

Figure 8 - First part of the Section_Fact ETL process


[Figure 9 diagram. Recoverable elements:
- Dimension tables used: Student, Enrollment_College, Instructional_Course_Section
- Join: ssn = Stdnt_SSN
- Expression: Convert UVC to USS
- Lookup Enroll_Coll_Key
- Lookup Instrnl_Crse_Sect_Key
- Decision: if any key value is null or Last_name = Cancelled
  - Yes: route error records to Section_Fact_Table_Rejects
  - No: load into Section_Fact_Table]

Figure 9 - Second part of the Section_Fact ETL process
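
The key lookup and reject routing in Figure 9 can be illustrated with the following Python sketch. It is a simplified model, not the Informatica mapping: the function name, the staged-row field names (ssn, college_code, call_num, last_name), and the use of plain dictionaries as lookup tables are all assumptions made for the example. Only the rule itself (reject when any key is null or Last_name is Cancelled) comes from the figure.

```python
def load_section_fact(staged_rows, student_keys, enroll_coll_keys, course_section_keys):
    """Resolve surrogate keys and split rows into loaded and rejected lists.

    The three *_keys arguments are dicts mapping a natural key to a
    surrogate key (hypothetical stand-ins for the dimension lookups).
    """
    loaded, rejects = [], []
    for row in staged_rows:
        keys = {
            "Stdnt_Key": student_keys.get(row["ssn"]),
            "Enroll_Coll_Key": enroll_coll_keys.get(row["college_code"]),
            "Instrnl_Crse_Sect_Key": course_section_keys.get(row["call_num"]),
        }
        missing_key = any(value is None for value in keys.values())
        cancelled = row.get("last_name") == "Cancelled"
        if missing_key or cancelled:
            rejects.append(row)  # Section_Fact_Table_Rejects
        else:
            loaded.append({**row, **keys})  # Section_Fact_Table
    return loaded, rejects
```

Keeping rejected rows in their own list mirrors the separate Section_Fact_Table_Rejects target, so failed lookups can be inspected without touching the fact table.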


4. Unit and String Testing Results


As part of the implementation, the ETL team conducted a series of tests to determine the integrity and quality of data in the Course Analytics data mart. The results of these tests are outlined in Table 1. The ETL jobs and processes were tested using data from Spring 2002. The data in all the dimensions were validated and/or accounted for against the source systems.
Name of Source | Selection Criteria | Number of Source Records | Name of Target | Number of Target Records | Comments
-------------- | ------------------ | ------------------------ | -------------- | ------------------------ | --------
CSection | Year_Qtr = 20022, Campus < '9', Dpt_Number < '999', Call_Check <> 'A', MB_Group is Null or '1' | 19750 | Instructional_Course_Section | 19750 |
CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group is Null, Parent = '' | 18056 | Instructional_Course_Section | 18056 |
CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group is Null, Parent > '' | 1776 | Instructional_Course_Section | 1776 |
CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group = '1', Parent > '' | 104 | Instructional_Course_Section | 104 |
CSection/Ccourse | Campus < '9', Year_Qtr = 20022, Dpt_Number < '999', Call_Check <> 'A', MB_Group = '1', Parent = '' | 55 | Instructional_Course_Section | 55 |
Student_Current.enrollment | Enroll_Yyyyq_Code = 20022 | 61,657 | Student | 61,657 |
Student_Current.crse_grade | Enroll_Yyyyq_Code = 20022, Call_Num is not Null, Mobility_grp_code <> 2, Final_Grade <> (K,KD,KM,EM), (Drop_Date is Null or Drop_Date > Extrctn_Cycle_Pt_Eff_Dt) | 164,932 | Section_Fact_Table | 163,506 | 1,426 records were rejected because corresponding information was missing from Instructional_Course_Section. Basically, the check digit in CSection was non-numeric (A).
Waitlist.Request | Yyyyq_code = 20022, Call_Num is not null, Result = C | 5,937 | Section_Fact_Table | 5,870 | 67 records were rejected because corresponding information was missing from Instructional_Course_Section. Basically, the check digit in CSection was non-numeric (A).

Table 1 Unit and String Test Results
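
The record counts in Table 1 reconcile when rejected rows are taken into account: 164,932 - 1,426 = 163,506 for crse_grade and 5,937 - 67 = 5,870 for the waitlist. A small Python sketch of that check is shown below. The counts are transcribed from Table 1; the function and variable names are hypothetical and are not part of the original test procedure.

```python
# Counts transcribed from Table 1 (Spring 2002 test load).
CHECKS = [
    # (source, target, source_records, target_records, rejected)
    ("Student_Current.enrollment", "Student", 61657, 61657, 0),
    ("Student_Current.crse_grade", "Section_Fact_Table", 164932, 163506, 1426),
    ("Waitlist.Request", "Section_Fact_Table", 5937, 5870, 67),
]


def unaccounted(source_records, target_records, rejected):
    """Rows neither loaded nor rejected; 0 means the load fully reconciles."""
    return source_records - target_records - rejected


def reconcile(checks):
    """Map each 'source -> target' pair to its unaccounted row count."""
    return {
        f"{src} -> {tgt}": unaccounted(s, t, r)
        for src, tgt, s, t, r in checks
    }
```

A nonzero value for any pair would indicate rows that were dropped without being written to either the target or a reject table.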


5. Operational Process Flow


The operational process to load a Registration Cycle Point into the Course Analytics data mart is a manual procedure with specific steps that need to be completed. Eventually, this process can be automated and scheduled within the OSU Data Center Operations.

5.1. Explanation of Operation Flow
The process to load the Course Analytics data mart is a ten-step process. The following information explains what is accomplished with each step.

Step  Name/Description

1. Insert a record in the Extraction_Cycle table
   (The Extraction_Cycle table is joined to the course and student mappings to pull the records for a particular registration cycle. This process truncates the Extraction_Cycle table and inserts one record that contains the proper registration cycle record key for the data to be loaded.)

2. Move all source data to staging database
   (This step moves all data from the source databases to the staging area (cpstage). These tables will be backed up from the staging area when the process is complete.)

3. Run the preliminary updates for mapping (lookup) tables
   (The mapping tables are truncated and current values are loaded into perm_academic_unit_map, campus_map, fiscal_unit_map, and college_map.)

4. Copy staging tables to flat files
   (This step copies relevant tables in cpstage to the DW_Staging_Backups directory on the D: drive of DWETL.)

5. Load Instructional_Course_Section_Staging table
   (This step loads the Instructional_Course_Section_Staging table. The Ccourse_Current and Ccourse_Section tables are used as input. The Extraction_Cycle table is joined in, all the lookups are performed, and four groups of records are output. The output records are processed through the router, and each output group is then transformed and inserted into the Instructional_Course_Section table.)

6. Load Student_Dimensional_Stage table
   (This step loads the Student_Dimensional_Stage table. This table is identical to the Student table with the exception of the Stdnt_Key, Eff_Beg_Dt, and Eff_End_Dt fields. These fields are generated by the Slowly Changing Dimension load step.)

7. Load Section_Fact_Stage table
   (This step loads the Section_Fact_Stage table. This table has all the fact data, plus the fields necessary to do lookups against the dimension tables to populate the dimensional key fields.)


8. Load Instructional_Course_Section table
   (This step is called the slowly changing dimension mapping. The Type 2 Dimension/Effective Date Range mapping is used to update the slowly changing dimension table. For each source row with a matching primary key in the target, the Expression compares user-defined source and target columns. If those columns do not match, the Expression marks the row as changed. Each time the Informatica server inserts a changed dimension row, it updates the previous version in the target, using the current date to fill the end date column. A Sequence Generator creates a primary key for each new row to be inserted. It uses the current date to indicate the start of the effective date range. The transformation leaves the end date null, which indicates that the new row contains current dimension data.)

9. Load Student table
   (This step uses the Student_Dimension_Stage table as input and performs a Type 2 slowly changing dimension update to the Student table where appropriate.)

10. Load Section_Fact_Table
   (This step loads the Section_Fact_Table in the Course Analytics data mart using the Section_Fact_Stage table as input. Lookups are performed against the various dimension tables to determine the surrogate key values.)
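
The Type 2 effective-date-range update described for the dimension loads can be sketched in Python as follows. This is a minimal illustration of the general technique, not the mapping generated by the Informatica wizard: rows are plain dictionaries, the surrogate key column name and the function signature are hypothetical, and the caller supplies the load date and the next available key.

```python
from datetime import date


def apply_scd2(target, source_rows, natural_key, compare_cols, today, next_key):
    """Apply a Type 2 update to `target` in place; return the next unused key.

    target       -- list of dimension rows with Eff_Beg_Dt and Eff_End_Dt
    source_rows  -- staged rows, assumed unique on `natural_key`
    compare_cols -- columns compared to decide whether a row has changed
    """
    # Current versions are the rows whose end date is still null.
    current = {r[natural_key]: r for r in target if r["Eff_End_Dt"] is None}
    for src in source_rows:
        existing = current.get(src[natural_key])
        if existing is not None:
            if all(existing[c] == src[c] for c in compare_cols):
                continue  # unchanged: keep the current version as-is
            existing["Eff_End_Dt"] = today  # close the previous version
        new_row = dict(src)
        new_row["surrogate_key"] = next_key  # sequence-generated key
        new_row["Eff_Beg_Dt"] = today        # start of effective range
        new_row["Eff_End_Dt"] = None         # null end date marks current row
        target.append(new_row)
        next_key += 1
    return next_key
```

With this approach, history is preserved because changed rows are never overwritten: the old version is closed with an end date and a new version is appended under a fresh surrogate key.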


5.2. Course Analytics Operational Process
This process is used to load the Course Analytics data mart.

Environment:
- All Informatica Workflows, Sessions, and Mappings exist in the ca_production folder of the etl_repository Informatica Repository (etl_repository DB on DWPROD server).
- Production JCL to submit these tasks exists on US.PANLIB: WPJ148Zn02, where n = 1 to 8, corresponding to the step numbers below. Job WPJ148ZA02 can be used to execute all 8 steps at once.
- Copies also exist in DW.CNTL.CA.LOADS. The US.PANLIB versions are authoritative.

Step  Name/Description

1. Insert a record in the Extraction_Cycle table (J148Z102)
   Edit JCL: Change // SET VALUE=nnn, where nnn = reg_time_key of current cycle point.
   Execute Workflow s_m_extraction_cycle_load:
   a. s_m_extraction_cycle_load

2. Move all source data to staging database (J148Z202)
   Execute Workflow b_source_to_staging:
   a. s_CCOURSE_CURRENT_landing
   b. s_CSECTION_CURRENT_landing
   c. s_m_source_to_staging_student_current
   d. s_m_source_to_staging_waitlist
   e. s_m_source_to_staging_mapfiles
   f. s_Course_Daily_Lookup_Files

3. Run the preliminary updates for mapping (lookup) tables (J148Z302)
   Execute Workflow b_preliminary_mappings:
   a. s_Mapping_Staging
   b. s_Funded_Courses
   c. s_GEC_Courses

4. Backup staging tables to flat files (J148Z402)
   Edit JCL: Change // SET VALUE=nnn, where nnn = reg_time_key of current cycle point.
   Execute Workflow b_backup_staging_to_files:
   a. s_m_staging_to_files_course_daily
   b. s_m_staging_to_files_mapfiles
   c. s_m_staging_to_files_student_current
   d. s_m_staging_to_files_waitlist

5. Load Dimension Staging Tables (J148Z502)
   Execute Workflow b_staging_tables:
   a. s_Instructional_Course_Section_Staging_Current
   b. s_m_student_staging
   c. s_m_Section_Fact_Table_Staging

6. Load Instructional_Course_Section table (J148Z602)
   Execute Workflow s_Instructional_Course_Section_SCD:
   a. s_Instructional_Course_Section_SCD

7. Load Student table (J148Z702)
   Execute Workflow s_m_Student_SCD:
   a. s_m_Student_SCD

8. Load Section_Fact_Table (J148Z802)
   Execute Workflow s_m_Section_Fact_Load:
   a. s_m_Section_Fact_Load
