
What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data. OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. OLAP systems, on the other hand, are deliberately denormalized for fast data retrieval through SELECT operations.

Explanatory Note: In a department store, when we pay at the check-out counter, the salesperson keys all the data into a "Point-Of-Sale" machine. That data is transaction data, and the related system is an OLTP system. On the other hand, the manager of the store might want to view a report on out-of-stock materials so that he can place purchase orders for them. Such a report will come out of an OLAP system.

What is a data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in a data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.

What is the ER model?

The ER model, or entity-relationship model, is a methodology of data modeling whose goal is to normalize the data by reducing redundancy. This is different from dimensional modeling, where the main goal is to improve the data retrieval mechanism.

What is dimensional modeling?

A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements along with the foreign keys from the dimension tables that qualify the data. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval. Ralph Kimball is one of the strongest proponents of this very popular data modeling technique, which is often used in many enterprise-level data warehouses.

What is a dimension?

A dimension is something that qualifies a quantity (measure). For example, consider this: if I just say "20kg", it does not mean anything. But if I say, "20kg of Rice (product) was sold to Ramesh (customer) on 5th April (date)", then that makes meaningful sense. The product, customer and date are dimensions that qualify the measure - 20kg. Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is a fact?

A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive Measures: Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. 5% profit margin, revenue-to-asset ratio etc. Non-numerical data can also be a non-additive measure when stored in fact tables, e.g. some kind of varchar flags in the fact table.

Semi-additive Measures: Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances does not give a useful result, but the MAX() or MIN() balance might be useful. Or consider a price rate or currency rate: SUM() is meaningless on a rate, but an average might be useful.

Additive Measures: Additive measures can be used with any aggregation function, like SUM(), AVG() etc. An example is sales quantity.

At this point, I will request you to pause and make some time to read this article on "Classifying data for successful modeling". This article helps you to understand the differences between dimensional data and factual data from a fundamental perspective.

What is a star schema?

This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like a star, hence the name.

Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will be the measure here, and the keys from the customer, product and time dimension tables will flow into the fact table.
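A minimal SQL sketch of such a star schema follows; the table and column names here are illustrative assumptions, not from any specific system:

-- Dimension tables: their primary keys qualify the measures
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  VARCHAR(100));
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, calendar_date DATE);

-- Central fact table: foreign keys from every dimension plus the measure
CREATE TABLE sales_fact (
    customer_key   INTEGER REFERENCES customer_dim (customer_key),
    product_key    INTEGER REFERENCES product_dim  (product_key),
    time_key       INTEGER REFERENCES time_dim     (time_key),
    sales_quantity INTEGER
);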

What is a snowflake schema?


This is another logical arrangement of tables in dimensional modeling where a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables. Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will be the measure here, and keys from the customer, product and time dimension tables will flow into the fact table. Additionally, all the products can be further grouped under different product families stored in a separate table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snowflake schema, as the product table is further snowflaked into product family.

Note: Snowflaking increases the degree of normalization in the design.
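A hedged sketch of that snowflaked product dimension (the names are illustrative assumptions):

-- The product dimension is normalized one level further into product_family
CREATE TABLE product_family_dim (
    family_key  INTEGER PRIMARY KEY,
    family_name VARCHAR(100)
);

CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    family_key   INTEGER REFERENCES product_family_dim (family_key)
);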

What are the different types of dimension?


In a data warehouse model, a dimension can be of the following types:

1. Conformed Dimension
2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimensions as:

1. Unchanging or static dimension (UCD)
2. Slowly changing dimension (SCD)
3. Rapidly changing dimension (RCD)

What is a 'Conformed Dimension'?


A conformed dimension is a dimension that is shared across multiple subject areas. Consider a 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.

Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.

What is a degenerated dimension?


A degenerated dimension is a dimension that is stored in the fact table itself and does not have its own dimension table. A dimension key such as a transaction number, receipt number, invoice number etc. has no further associated attributes and hence cannot be designed as a dimension table.
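A small illustrative sketch (the column names are assumptions):

CREATE TABLE sales_fact (
    date_key       INTEGER,
    product_key    INTEGER,
    invoice_number VARCHAR(20),  -- degenerated dimension: kept in the fact, no dimension table
    sales_amount   DECIMAL(12,2)
);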

What is a junk dimension?


A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that they can be removed from other tables and "junked" into an abstract dimension table. These junk dimension attributes might not be related to one another. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not otherwise fit into the other dimension tables. One may want to read an interesting document on this, De-clutter with Junk (Dimension).

What is a role-playing dimension?


Dimensions are often reused for multiple applications within the same database with different contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale" as well as "Date of Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.

What is SCD?
SCD stands for slowly changing dimension, i.e. a dimension where the data changes slowly. These can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Types 1, 2 and 3 are the most common.

What is a rapidly changing dimension?


This is a dimension where data changes rapidly.

Describe the different types of slowly changing dimension (SCD)


Type 0: A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation. It just means that, even if the values of the attributes change, no history is kept: the table is never updated and continues to hold the data as originally loaded.

Type 1: A Type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with the recent data whenever there is a change, and because of this update, we lose the previous values.

Type 2: A Type 2 dimension table tracks historical changes by creating separate rows in the table with different surrogate keys. Consider a customer C1 that is under group G1 first and is later changed to group G2. There will then be two separate records in the dimension table, like below:

Key  Customer  Group  Start Date    End Date
1    C1        G1     1st Jan 2000  31st Dec 2005
2    C1        G2     1st Jan 2006  NULL

Note that separate surrogate keys are generated for the two records. The NULL end date in the second row denotes that this record is the current record. Also note that, instead of start and end dates, one could keep a version number column (1, 2 etc.) to denote the different versions of the record.

Type 3: A Type 3 dimension stores the history in a separate column instead of separate rows. So unlike a Type 2 dimension, which grows vertically, a Type 3 dimension grows horizontally. See the example below:

Key  Customer  Previous Group  Current Group
1    C1        G1              G2

This is only good when you do not need to store many consecutive histories and when the date of change is not required to be stored.

Type 6: A Type 6 dimension is a hybrid of Types 1, 2 and 3 (1+2+3) which acts very similarly to Type 2, except that you add one extra column to denote which record is the current record:

Key  Customer  Group  Start Date    End Date       Current Flag
1    C1        G1     1st Jan 2000  31st Dec 2005  N
2    C1        G2     1st Jan 2006  NULL           Y

What is a mini dimension?


Mini dimensions can be used to handle a rapidly changing dimension scenario. If a dimension has a large number of rapidly changing attributes, it is better to separate those attributes into a different table called a mini dimension. This is done because if the main dimension table is designed as SCD Type 2, the table will soon grow too large and create performance issues. It is better to segregate the rapidly changing attributes into a different table, thereby keeping the main dimension table small and performant.

What is a fact-less-fact?
A fact table that does not contain any measure is called a factless fact. This table will only contain keys from different dimension tables. It is often used to resolve a many-to-many cardinality issue.

Explanatory Note: Consider a school where a single student may be taught by many teachers and a single teacher may have many students. To model this situation in a dimensional model, one might introduce a factless fact table joining the teacher and student keys. Such a fact table will then be able to answer queries like:

1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers?
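As an illustration, the second question could be answered with a query like the following (the table and column names are assumptions):

-- Which teacher teaches the maximum number of students?
SELECT   teacher_key,
         COUNT(DISTINCT student_key) AS num_students
FROM     teaching_fact          -- factless fact: only dimension keys, no measures
GROUP BY teacher_key
ORDER BY num_students DESC;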

What is a coverage fact?


A factless fact table can only answer 'optimistic' (positive) queries, but cannot answer a negative query. Consider the illustration in the example above. A factless fact containing the keys of teachers and students cannot answer queries like:

1. Which teacher did not teach any student?
2. Which student was not taught by any teacher?

Why not? Because the factless fact table only stores the positive scenarios (a student being taught by a teacher); if there is a student who is not being taught by any teacher, then that student's key does not appear in the table, thereby reducing the coverage of the table. A coverage fact table attempts to answer this - often by adding an extra flag column, where flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, consider a class with 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
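A hedged sketch of such a negative query against that coverage fact (the names are illustrative):

-- Which teacher did not teach any student?
-- A teacher whose flag is 0 on every combination never taught anyone.
SELECT   teacher_key
FROM     coverage_fact
GROUP BY teacher_key
HAVING   MAX(flag) = 0;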

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time, and they vary with respect to time. Now, it may happen that the business is not able to capture all of its measures for every point in time. Those unavailable measurements can either be kept empty (NULL) or be filled with the last available measurement. The first case is an example of an incident fact, and the second is an example of a snapshot fact.

What is aggregation and what is the benefit of aggregation?


A data warehouse usually captures data with the same degree of detail as is available in the source. This "degree of detail" is termed granularity. But not all reporting requirements from that data warehouse need the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions regarding the products they sell, and those data are captured in a data warehouse. Each shop manager can access the data warehouse and see which products are sold, by whom, and in what quantity on any given date. Thus the data warehouse helps the shop managers with detail-level data that can be used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which particular salesgirl in London sold the highest number of chopsticks or which shop is the best seller of brown bread. All he is interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe the year-on-year sales growth in Eastern Europe. Such data is aggregated in nature, because sales of goods in Eastern Europe is derived by summing up the individual sales data from each shop in Eastern Europe. Therefore, to support different levels of data warehouse users, data aggregation is needed.
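For illustration, such an aggregation could look like this (the region and date columns are assumptions):

-- Aggregate detail-level shop sales up to region/year level for the CEO
SELECT   d.region,
         t.calendar_year,
         SUM(f.sales_amount) AS total_sales
FROM     sales_fact f,
         shop_dim   d,
         time_dim   t
WHERE    f.shop_key = d.shop_key
AND      f.time_key = t.time_key
GROUP BY d.region, t.calendar_year;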

What is slicing-dicing?
Slicing means showing a slice of the data, given a certain dimension (e.g. Product), member value (e.g. Brown Bread) and measure (e.g. sales). Dicing means viewing that slice with respect to different dimensions and at different levels of aggregation. Slicing and dicing operations are part of pivoting.

What is drill-through?
Drill-through is the process of going to the detail-level data from summary data. Consider the above example on retail shops. If the CEO finds out that sales in Eastern Europe have declined this year compared to last year, he might then want to know the root cause of the decrease. For this, he may start drilling through his report to more detailed levels and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the declined sales. The method he followed to obtain the details from the aggregated data is called drill-through.

The 101 Guide to Dimensional Data Modeling


Last Updated on Tuesday, 26 June 2012 16:51 Written by Akash Mitra

I know I should have written this article long before, but as they say, "better late than never". In this multi-part tutorial we will learn the basics of dimensional modeling and we will see how to use this modeling technique in real-life scenarios. At the end of this tutorial you will be a confident dimensional data modeler.

Prerequisite: No previous knowledge of dimensional modeling (or any other modeling) is required for this tutorial. However, I assume that you already know what a data warehouse is, that you have working knowledge of databases, and preferably that you have seen or worked on a data warehousing project before.

What is dimensional modeling?


Ok, so let's get started. Dimensional modeling is one of the methods of data modeling that helps us store the data in such a way that it is relatively easy to retrieve it from the database. All modeling techniques give us different ways to store the data, and different ways of storing data give us different advantages. For example, ER modeling gives us the advantage of storing data in such a way that there is less redundancy. Dimensional modeling, on the other hand, gives us the advantage of storing data in such a fashion that it is easier to retrieve information once the data is stored in the database. This is the reason why dimensional modeling is used mostly in data warehouses built for reporting. On the other side, a dimensional model is not a good solution if the primary purpose of your data modeling is to reduce storage space, reduce redundancy, speed up loading time etc. Later in the tutorial we will learn why that is so.

I encourage you to read Ralph Kimball's book on this subject. If you don't already have the book, consider buying it.

Goals and Benefits of Dimensional Modeling


1. Faster Data Retrieval
2. Better Understandability
3. Extensibility

Now that we know the reasons behind creating a dimensional model, let's find out what exactly is done in this type of model. In a dimensional model, everything is divided into 2 distinct categories - dimensions and measures. Anything we try to model must fit into one of these two categories. So let's say I want to store information on how many burgers and fries are getting sold per day from a single McDonalds outlet; we will have to first classify this data into dimensions and measures. And then we will have 2 different categories of tables (i.e. dimension tables and a measure table, a.k.a. fact table) to store them. If you want to understand how to classify data into dimensions and facts in greater detail, please read Classify data for successful modeling. In the following examples we will choose a practical business scenario and see how to identify dimensions and facts to model the scenario.

Step by Step Approach to Dimensional Modeling

Objective: Our objective is to create a data model that can store how many burgers and fries are getting sold from a specific McDonalds outlet per day.

Modeling Approach: The whole modeling approach is divided into 5 steps (the last one optional), as depicted below.

Step 1: Identify the dimensions

Dimensions are the objects or context. That is, dimensions are the 'things' about which something is being spoken. In the above statement, we are speaking about 3 different things - some "food", some specific McDonalds "store" and some specific "day". So we have 3 dimensions - "food" (e.g. burgers and fries), "store" and "day". Burgers and fries are 2 different members of the "food" dimension. We will have to create separate tables for separate dimensions.

Step 2: Identify the measures

Measures are the quantifiable subjects, and these are often numeric in nature. In the above statement, the number of burgers/fries sold is a measure. Measures are not stored in the dimension tables. A separate table is created for storing measures. This table is called the fact table.

Step 3: Identify the attributes or properties of dimensions

Now that we have decided we need 3 tables to store the information of the 3 dimensions, next we need to know what properties or attributes of each dimension we need to store in our tables. This is important, since knowing the properties lets us decide what columns are required in each dimension table. As you might have guessed, each dimension might have a number of different properties, but for a given context, not all of them are relevant to us. As an example, let's take the dimension "food". We can think of many different attributes of food - e.g. the name of the food, the price of the food, total calories in the food, the color of the food and so on. But as I said, we need to check which of these attributes are relevant to us - that is, which of these attributes are required for reporting on this data. For the given statement above, we just need one attribute of the "food" dimension - the name of the food (burger or fries). So the structure of our food dimension will be rather simple, like below:
KEY  NAME
---  ------
1    Burger
2    Fries

Similarly, the structure of our store and day dimensions will be like this:
Store

KEY  NAME
---  -------
1    Store 1
...  ...

Day
KEY  DAY
---  -----------
1    01 Jan 2012
2    02 Jan 2012
3    03 Jan 2012
...  ...

As I said, this is a really super-simplified structure, as we are only interested in basic attributes. But in a complex scenario, we might need to add tens or hundreds of columns to each dimension table if those attributes are required for reporting. Also note that each dimension table above has a key column. The key is a not-null and unique column which helps us identify each record of the table.

Step 4: Identify the granularity of the measures

I need to explain what is meant by "granularity". Granularity refers to the lowest (or most granular) level of information stored in any table. Let's take this example: if I say a specific McDonalds store sells 200 burgers on a specific day and 5000 burgers in a specific month, then in the first case the granularity of my information is daily, whereas in the second case the granularity of my information is monthly. It is important to identify the granularity of the information required. In our case, we need the information on a daily basis. But if my requirement were "to store how many burgers and fries are getting sold from a specific McDonalds outlet per month", the required granularity would change from daily to monthly. Why is this important? Because this information helps us decide what columns are required in our fact table. For example, since in our case the granularity is food sold per store per day, we will need to add the key columns from the food, store and day dimensions to the fact table, like below:
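(The original diagram is represented here as a minimal SQL sketch; the table and column names are illustrative assumptions.)

-- Fact table at food / store / day granularity
CREATE TABLE sales_fact (
    food_key  INTEGER,  -- foreign key to the food dimension
    store_key INTEGER,  -- foreign key to the store dimension
    day_key   INTEGER,  -- foreign key to the day dimension
    quantity  INTEGER   -- the measure: number of items sold
);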

Step 5: History Preservation (Optional)

If you have followed all the above steps till now, you have designed 3 dimension tables and 1 fact table. The fact table stores the "number" of food sold in the "Quantity" column against given store, food and day columns. These store/food/day columns are basically foreign key columns pointing to the primary keys of the respective dimension tables. This kind of schema is called a "Star Schema" because of its star-like formation.

The above schema is certainly capable of storing all the information that we intended to store in our dimensional model. However, there is a subtle problem: we are not sure what would happen if any attribute of any dimension got changed in the future. Let's say McDonalds decided to change the name of the food "burger" to "jumbo burger" for some promotional reason. If they did that, they would update the burger record in the dimension table and change the name to "jumbo burger". So far so good. But the problem is, we would lose the old information once they changed the name. This means that if you looked at the data in the model after the change, you would not know that until now the product was called "burger" and not "jumbo burger". This is a problem if one of the objectives of your modeling is to store history. Fortunately, this can be solved by designing the dimension tables as "slowly changing dimensions". Identifying which dimensions are slowly changing (or fast changing, or unchanging) is the last and final step of modeling. In part 3 of this tutorial we will see how to handle these different types of dimensions. But before that, let's continue our discussion of the various schemas in dimensional modeling.

Requirement of different design schemas


In dimensional modeling, we can create different schemas to suit our requirements. We need various schemas to accomplish several things, like accommodating the hierarchies of a dimension or maintaining change histories of information. In this article we will discuss 3 different schemas - Star, Snowflake and Conformed - and we will also discuss how hierarchical information is modeled in these schemata. We will reserve the discussion on maintaining change histories for our next article.

Storing hierarchical information in dimension tables

From our previous article, we already know what a dimension is. Simply put, a dimension is something that qualifies a measure (number). For example, if I say, "McDonalds sell 5000" - that won't make any sense. But if I say, "McDonalds sell 5000 burgers per month" - then that makes perfect sense. Here, "burger" and "month" are members of dimensions, and they qualify the number 5000 in this sentence. It is important to notice that "burger" and "month" are not dimensions themselves - they are just members of the dimensions "food" and "time" respectively. "Burger" is just one of many different foods that McDonalds sell, and "month" is just one of the different units by which time is measured. Typically a dimension will have several members, and those members will be stored in separate rows of the dimension table. So the "food" dimension table of McDonalds will have one row for burger, one row for fries, one row for drinks etc. Similarly, a "time" dimension may contain 12 different months as its members.

Often we may find that there are hierarchical relations among the members of a dimension. That is, certain members of the dimension can be grouped under one group, whereas other members can be grouped under a separate group. Consider this: french fries and twister fries are both "fries" and hence can be grouped under the same group "fries". Similarly, chicken burger and fish burger can both be grouped as "burger".

[Figure: hierarchy diagram showing "French Fries" and "Twister Fries" grouped under "Fries"]

This type of hierarchical relation can be stored in the model by following two different approaches. We can either store it in the same "food" dimension table (the star schema approach) or we can create a separate dimension table in addition to the "food" dimension - just to store the types of the foods (the snowflake schema approach).

STAR SCHEMA DESIGN


A star schema is the simplest kind of schema, where one fact table is present at the centre of the schema surrounded by multiple dimension tables. In a star schema, all the dimension tables are connected only to the fact table, and no dimension table is connected to any other dimension table.

Benefit of Star Schema Design

A star schema provides a de-normalized design. The star schema is probably the most popular schema in dimensional modeling because of its simplicity and flexibility. In a star schema design, any information can be obtained by traversing a single join, which means this type of schema is ideal for information retrieval (faster query processing).

Here, note that all the hierarchies (or levels) of the members of a dimension are stored in the single dimension table. That means, let's say, if you wish to group veggie burger and chicken burger into a "burger" category and french fries and twister fries into a "fries" category, you have to store that category information in the same dimension table.

Storing Hierarchy in Star Schema

As depicted above, we store hierarchical information in a flattened pattern in the single dimension table in a star schema. So our food dimension table will look like this:
KEY  NAME            TYPE
1    Chicken Burger  Burger
2    Veggie Burger   Burger
3    French Fries    Fries
4    Twister Fries   Fries

SNOWFLAKE SCHEMA DESIGN

A snowflake schema is just like a star schema, but with one difference: here, one or more dimension tables are connected to other dimension tables as well as to the central fact table. See the example of a snowflake schema below. Here we are storing the information in 2 dimension tables instead of one: we are storing the food type in one dimension (the "type" table shown below) and the food in the other dimension. This is a snowflake design.
TYPE
====
KEY  TYPE_NAME
1    BURGER
2    FRIES

FOOD
====
KEY  TYPE_KEY  NAME
1    1         Chicken Burger
2    1         Veggie Burger
3    2         French Fries
4    2         Twister Fries

If you are familiar with the concept of data normalization, you can see that snowflaking actually increases the level of normalization of the data. This has an obvious disadvantage in terms of information retrieval, since we need to read more tables (and traverse more SQL joins) in order to get the same information. For example, if you wish to find out all the foods and food types sold from store 1, the SQL queries for the star and snowflake schemata will be like below:

SQL Query For Star Schema


SELECT DISTINCT f.name, f.type
FROM   food f, sales_fact t
WHERE  f.key = t.food_key
AND    t.store_key = 1

SQL Query For SnowFlake Schema


SELECT DISTINCT f.name, tp.type_name
FROM   food f, type tp, sales_fact t
WHERE  f.key = t.food_key
AND    f.type_key = tp.key
AND    t.store_key = 1

As you can see in this example, compared to the star schema, the snowflake schema requires one more join (to connect one more table) to retrieve the same information. This is why the snowflake schema is not good performance-wise. Then why do we use a snowflake schema? Let me give a quick and short answer to that; I won't explain it in detail right now, but I will leave it to you for your comprehension. Suppose we have another fact table with granularity store, food type and day. This fact will use the key of the "type" dimension table instead of the "food" dimension table. Unless you have this dimension table in your schema, you won't get the "type" key. This is the reason we need to snowflake the "food" dimension into the "type" dimension.
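For illustration, that coarser-grained fact might look like this (a sketch only; the names are assumptions):

-- Fact at store / food-type / day granularity: it references
-- the snowflaked "type" dimension directly, not "food"
CREATE TABLE sales_by_type_fact (
    type_key  INTEGER,  -- foreign key to the TYPE dimension
    store_key INTEGER,
    day_key   INTEGER,
    quantity  INTEGER
);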

History Preserving in Dimensional Modeling


Last Updated on Saturday, 21 July 2012 10:26 Written by Akash Mitra

In our earlier article we saw how to design a simple dimensional data model for a point-of-sale system (as an example we took the case of a McDonald's fast-food shop). In this article we will begin with the same model and see how we may enhance it to store historical changes in the attributes of a dimension table.

Nothing Lasts Forever


One of the important objectives while doing data modeling is to develop a model which can capture the states of the system with respect to time. You know, nothing lasts forever! Product prices change over time; people change their addresses, marital status, employers and even their names. If you are doing data modeling for a data warehouse - where we are particularly interested in historical analysis - it is crucial that we develop some method of capturing these changes in our data model. As an example, let's say we store the price of products in the "Food" dimension table that we created earlier, and we want to be able to capture the historical changes in "Food" price. In this article we will see what changes we need to make in our data model to be able to do this.

Note: The simple "Food" dimension we created earlier did not have any "Price" information. But to illustrate the point of this article, we will add a "Price" column to our "Food" dimension table. So henceforth our "Food" dimension table will look like this:
KEY  NAME            TYPE_KEY  PRICE
1    Chicken Burger  1         3.70
2    Veggie Burger   1         3.20
3    French Fries    2         2.00
4    Twister Fries   2         2.20

In case you have not read my previous article and are wondering what "TYPE_KEY" means: this is a foreign key coming from another table that contains the type of the food, e.g. Burger, Fries etc. Also notice that the above table only tells us the price of the food as of the current point in time. It does not tell us what the price was, let's say, 6 months ago. If the price of Veggie Burger changes from $3.20 to $3.25 tomorrow, the new price will be updated in the table, and then we will have no way to know what the earlier price was. So our objective is to change the above table structure in such a way that we can store all the historical and future prices of the foods.

Types of Changing Dimensions


There are a few different ways to store the historical changes of values in a data model, and the particular way you adopt will depend on the type of changing dimension. For example, some dimensions can change quite rapidly, some dimensions do not change at all, but most dimensions change very slowly. That is why we differentiate dimensions into the 3 types described below.

Unchanging Dimension

There are some dimensions that do not change at all. For example, let's say you have created a dimension table called "Gender". Below are the structure and data of this dimension table:
ID  VALUE
1   Male
2   Female

The "Value" column in the above dimension is the attribute of this table that won't normally change. This is an unchanging dimension - "male" will be always called "male" and "female" will be always called "female". Off course, for some crazy reason, one may wish to change the texts "Male"/"Female" to something else e.g. "man" / "woman". But that's really not a change that we should be concerned about as such changes do not alter the "meaning" of the attribute (the words man / male still mean the same thing). So if some changes need to be done, we can simply update the "Value" column in dimension table. For all practical intent and purpose, this dimension remains as an "Unchanging dimension". Slowly Changing Dimension Here comes the most popular dimension - "slowly changing dimension". These are the dimensions where one or more attributes can change slowly with respect to time. Look at the "food" dimension from our earlier example. "Price" is one such attribute which is variable in this dimension. But "price" of french fries or burgers do not change very often, may be they change once in a season. This is an example of slowly changing dimension. Let me give you one more example. Let's say you have created a dimension table on employees. And in the "employee" dimension you have a column called "Marital_Status". This can definitely change (from unmarried to married for example) with respect to time. But again, like the previous example, this is a slowly changing attribute. Doesn't change so often. Later in the article, we will see how to make necessary changes in our dimension table design to store history for such slowly changing dimensions. Rapidly Changing Dimensions If you design a dimension table that has a rapidly changing attribute, then your dimension table will become rapidly changing dimension. As for example, let's say you have a "Subscriber" dimension where you store the details of all the subscribers to a particular pre-paid mobile service plan. You have a "status" column in the "Subscriber" dimension table which can have several different values based on the current account balance of the subscriber. For example, if your balance is less than $0.1, the status becomes "No Outgoing call". If your balance is less than $5, the status becomes "Restricted Call Service". If your balance is less than $10, the status becomes "No Long Distance Call" and if the balance is greater than $10 then status becomes "Full Service", etc. Every month, the status of

any subscriber keeps on changing multiple times based on his or her account balance thereby making the "Subscribers" dimension one rapidly changing dimension. One must remember the way we design a rapidly changing dimension is often quite different from the way we design a slowly changing dimension. In the next article however, we will only look into designing of slowly changing dimension.

Dimensional Modeling Approach for Various Slowly Changing Dimensions


Last Updated on Sunday, 22 July 2012 07:59 Written by Akash Mitra

In our earlier article we discussed the need to store historical information in dimension tables. We also learnt about the various types of changing dimensions. In this article we will pick the "slowly changing dimension" only and learn in detail about the various types of slowly changing dimensions and how to design them. Slowly changing dimensions, referred to as SCD henceforth, can be modeled in basically 3 different ways, based on whether we want to store full history, partial history or no history. These different types are called Type 2, Type 3 and Type 1 respectively. Next we will learn about them in detail.

SCD Type 1
As mentioned above, we design a dimension as SCD Type 1 when we do not want to store the history. That is, whenever some values are modified in the attributes, we just want to update the old values with the new values, and we do not care about storing the previous history. We do not store any history in SCD Type 1.

Please mind, this is not the same as the "unchanging dimension" discussed in the previous article. In the case of an unchanging dimension, we assume that the values of the attributes of that dimension will not change at all. On the other hand, in the case of an SCD Type 1 dimension, we assume that the values of the attributes will change slowly; however, we are not interested in storing those changes. We are only interested in storing the current or latest value, so every time it changes, we update the old value with the new one.

Handling SCD Type 1 Dimension in ETL Process

Technically, from an ETL design perspective (now, if you don't know what ETL is, you don't have to bother about this paragraph - you can go to the next section), SCD Type 1 dimensions are loaded using a "merge" operation, also known as "UPSERT", an abbreviation of "update else insert". SCD Type 1 dimensions are loaded by merge operations.

In "UPSERT" method, each row coming from the source is compared will all the records present in the target dimension table based on the natural key and checked if the source record already exists in the target or not. If the row exists in the target, the target row is updated with new values coming from source system. However if the row is not present in the target system, the source row is inserted in the target table. In pure ANSI SQL syntax, there is a particular statement that help you achieve the UPSERT operation. It's called "MERGE" statement
MERGE INTO Target_Dimension_Table tgt
USING source_table src
ON (tgt.natural_key = src.natural_key)
WHEN MATCHED THEN UPDATE SET
     tgt.column1 = src.value1,
     tgt.column2 = src.value2, ...
WHEN NOT MATCHED THEN INSERT
     (tgt.column1, tgt.column2, ...)
     VALUES (src.value1, src.value2, ...);

As is obvious from this example, you have to store the natural key of the data in the target dimension table in order to perform this comparison. Later, I will write a separate article on ETL architecture design where I will talk about this in more detail. But from a modeling perspective, please note that as a data modeler you should add one extra column to your target dimension table as a placeholder to store the natural key of the data.

SCD Type 2
Arguably, this is the most popular type of slowly changing dimension, so we will try to learn it as clearly as possible. Let me come one step backward here and remind you again what our objective is. As you may recall, in the previous articles we learnt how the values of the attributes (or columns) in a dimension table change with time, and we are trying to store the histories of such changes for the purpose of analysis. In Type 1, we were not storing any history. Now we are going to learn how we may design a dimension table so that we can store the full history and extract the history of changes as and when we require it. We will take our "Food" dimension table as an example here, where "Price" is a variable factor.
KEY  NAME            TYPE_KEY  PRICE
1    Chicken Burger  1         3.70
2    Veggie Burger   1         3.20
3    French Fries    2         2.00
4    Twister Fries   2         2.20

Design of SCD Type 2 Dimension

In order to design the above table as SCD Type 2, we will have to add 3 more columns to it: "Date From", "Date To" and "Latest Flag". These columns are called Type 2 metadata columns. See below:
KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Latest_FLG
1    Chicken Burger  1         3.70   01-Jan-11  31-Dec-99  Y
2    Veggie Burger   1         3.20   01-Jan-11  31-Dec-99  Y
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y

Notice how the values of these 3 new columns are populated. In the very beginning, when any new record is loaded into the table, we default the value of "Date From" to the date of the day of loading, "Date To" to some far-future date (e.g. 31st December 2099), and "Latest Flag" to "Y".

What is the meaning of these 3 metadata columns? They basically tell us whether a particular record in the table is the latest or not, and what the time period was during which the record was latest (also known as the active period). For example, the data in the above table says that all 4 records are latest (active), and that they are active from the day of loading (in this case 1st January 2011) until an indefinite future date (31st December 2099).

But how do these columns help us store the change history? Let's assume today is 15 March 2011, and McDonald's has decided to increase the price of "Veggie Burger" from $3.20 to $3.25. If this happens, we will not straight away update the price from $3.20 to $3.25. Instead, to store this new information (and also retain the old information), we will insert a new record into the "Food" dimension table, which will look like below:
KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Latest_FLG
1    Chicken Burger  1         3.70   01-Jan-11  31-Dec-99  Y
2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  N
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y
5    Veggie Burger   1         3.25   15-Mar-11  31-Dec-99  Y

Observe the change in the records with Keys 2 and 5. Record 2, which was the original record for the veggie burger, has now been updated: its latest flag has become 'N' and its "Date To" value has changed to "14-Mar-11". This means Record 2 is no longer latest or active (Latest Flag = "N"), and that it was active during the period 1st Jan 2011 (Date From) to 14 Mar 2011 (Date To). So, if Record 2 is not active, what is the latest record for "Veggie Burger" now? Record 5! Its latest flag is set to "Y", and it says that the record has been active since 15 March 2011.
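In SQL terms, an ETL process might issue something like the following pair of statements to effect this change (a sketch only; a real load would match on the natural key and derive the dates from the load run):

-- Expire the old version of the record
UPDATE food
SET    date_to    = '14-Mar-11',
       latest_flg = 'N'
WHERE  name = 'Veggie Burger'
AND    latest_flg = 'Y';

-- Insert the new version with a fresh surrogate key
INSERT INTO food (key, name, type_key, price, date_from, date_to, latest_flg)
VALUES (5, 'Veggie Burger', 1, 3.25, '15-Mar-11', '31-Dec-99', 'Y');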

This record will remain active many years into the far-off future (until 31 Dec 2099), or at least until a new record is inserted again with latest flag Y and this record is updated with latest flag N. So, let's say that on 20 Dec 2011, McDonald's decides to change the price of Veggie Burger back to $3.20 and to increase the price of the chicken burger from $3.70 to $3.80. We will then see 2 more new records in the table, as below:
KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Latest_FLG
1    Chicken Burger  1         3.70   01-Jan-11  19-Dec-11  N
2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  N
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y
5    Veggie Burger   1         3.25   15-Mar-11  19-Dec-11  N
6    Chicken Burger  1         3.80   20-Dec-11  31-Dec-99  Y
7    Veggie Burger   1         3.20   20-Dec-11  31-Dec-99  Y

As you can see from the design above, it is now possible to go back to any date in history and figure out what the value of the "Price" attribute of the "Food" dimension was at that point in time.

Surrogate key for SCD Type 2 dimension

Note from the above example that each time we generate a new row in the dimension table, we also assign a new key to the record. This is the key that flows down to the fact table in a typical star schema design. The values of this key - the numbers 1, 2, 3, ..., 7 etc. - do not come from the source systems. Instead, those numbers are sequential running numbers generated automatically at the time of inserting the records. These numbers are unique, so as to uniquely identify each record in the table, and are called the "surrogate key" of the table. Obviously, multiple surrogate keys may be related to the same item; however, each key relates to one particular state of that item in time. In the above example, keys 2, 5 and 7 are all linked to "Veggie Burger", but they represent the state of the record in 3 different time spans. It is worth noting that there will be only one record with latest flag = "Y" among the multiple records of the same item.

Alternate Design of SCD Type 2: Addition of Version Number

A slight variation of the SCD Type 2 design is possible where we store version numbers for the records. The initial record is called version 1, and as and when new records are generated, we increment the version number by 1. In this design pattern, the record with the highest version is always the latest record. If we use this design for our earlier example, the dimension table will look like this:
KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Version
1    Chicken Burger  1         3.70   01-Jan-11  19-Dec-11  1
2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  1
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  1
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  1
5    Veggie Burger   1         3.25   15-Mar-11  19-Dec-11  2
6    Chicken Burger  1         3.80   20-Dec-11  31-Dec-99  2
7    Veggie Burger   1         3.20   20-Dec-11  31-Dec-99  3

Of course, we can also keep the "Latest Flag" column in the above table if we wish.

Handling SCD Type 2 Dimension in ETL Process

Again, if you do not know what ETL is, you can safely skip this section. But if you have some ETL background, then I suppose you have already pinpointed the fact that, unlike SCD Type 1, Type 2 requires you to insert new records in the table as and when any attribute changes. This is obviously different from SCD Type 1, where we were only updating the record. Here, we need to update the old record (changing the latest flag from "Y" to "N" and updating the "Date To") as well as insert a new record. Like before, we can use the natural key to first check whether the source record exists in the target or not. If not, we simply insert the record in the target with a new surrogate key. If it already exists in the target, we check whether any attribute value has changed between source and target; if not, we can ignore the source record. If yes, we update the existing record's flag to "N" and insert a new record with a new surrogate key. As I mentioned before, I will write a separate article on the ETL handling later.

Performance Considerations of SCD Type 2 Dimension

SCD Type 2 tables, by design, tend to increase the volume of the dimension tables considerably. Think of this: let's say you have an "employee" dimension table which you have designed as SCD Type 2. The employee dimension has 20 different attributes, and there are 10 attributes in this table which change at least once a year on average (e.g. employee grade, manager's name, department, salary, band, designation etc.). This means that if you have 1,000 employees in your company, at the end of just one year you are going to have 10,000 records in this dimension table (assuming on average 10 attribute changes per year, resulting in 10 different rows per employee). As you can see, this is not a very good thing performance-wise, as it can considerably slow down the loading of your fact table, since you will need to "look up" this dimension table during fact loading.

One may argue that even if we have 10,000 records, we will actually have only 1,000 records with Latest_Flag = 'Y', and since we will only look up records with Latest_Flag = 'Y', the performance will not deteriorate. This is not entirely true. While the Latest_Flag = 'Y' filter may decrease the size of the lookup cache, the database will generally need to do a full table scan (FTS) to identify the latest records. Moreover, in many cases the ETL developer will not be able to make use of the Latest_Flag = 'Y' filter if the transactional records do not always belong to the latest time (e.g. late-arriving fact records, or loading the fact table at a later point in time - month-end load, week-end load etc.). In those cases, putting a Latest_Flag = 'Y' filter will be functionally incorrect, as you should determine the correct return key on the basis of the "Date To" and "Date From" columns. (If you do not understand what I am talking about in this paragraph, just ignore me for now. I am going to explain these things later in some other article.)
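For illustration, such a date-based dimension lookup might look like this (a sketch; ":natural_key" and ":transaction_date" are assumed bind variables):

-- Return the surrogate key whose active period covers the transaction date,
-- instead of relying on Latest_Flag = 'Y'
SELECT key
FROM   food
WHERE  name = :natural_key
AND    :transaction_date BETWEEN date_from AND date_to;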

SCD Type 3

As I mentioned before, the Type 3 design is used to store partial history. Although theoretically it is possible to use the Type 3 design to store full history, that is not practical. So, what is the Type 3 design? In the Type 2 design above, we saw that whenever the values of the attributes change, we insert new rows into the table. In the case of Type 3, however, we add a new column to the table to store the history. So let's say we have a table where we initially have 2 columns - "Key" and "Attribute".
KEY  ATTRIBUTE
1    A
2    B
3    C

If record 1 changes its attribute from A to D, we will add one extra column to the table to store this change.
KEY  ATTRIBUTE  ATTRIBUTE_OLD
1    D          A
2    B
3    C

If the record changes its attribute value again, we will again have to add a column to store the history of the changes.
KEY  ATTRIBUTE  ATTRIBUTE_OLD  ATTRIBUTE_OLD_1
1    E          D              A
2    B
3    C

Isn't SCD Type 3 very cumbersome?

As you can see, storing the history by changing the structure of the table in this way is quite cumbersome, and after the attributes have changed a few times, the table will become unnecessarily big and fat and difficult to manage. But that does not mean the SCD Type 3 design methodology is completely unusable. In fact, it is quite usable in the particular circumstance where we just need to store partial history information. Think about a special circumstance where we only need to know the "current value" and "previous value" of an attribute. That is, even though the value of that attribute may change numerous times, at any time we are only concerned with its current and previous values. In such circumstances, we can design the table as Type 3 and keep only 2 columns - "current value" and "previous value" - like below.
KEY  Current_Value  Previous_Value
1    D              A
2    B
3    C

I can't find a very good generic example of this scenario right away; however, I can give you one example from one of my previous projects in the telecom domain, wherein a certain calculated field in a report depended on the latest and previous values of the customer status. That calculated attribute was called "Churn Indicator" (churn in the telecom business generally means giving up a telephone connection), and the rule to populate the churn indicator was (in a very, very simplified form) like below:
Churn Indicator = "Voluntary Churn" (if customer's current status = 'Inactive' and previous status = 'Active') = "Involuntary Churn", (if customer's current status = 'Inactive' and previous status = 'Suspended')

As you can guess, in order to find out the correct value of the churn indicator, you do not need to know the complete history of changes of a customer's status. All you need to know are the current and previous statuses. In this kind of partial-history scenario, the SCD Type 3 design is very useful. Note that, compared to SCD Type 2, Type 3 does not increase the number of records in the table, thereby easing performance concerns.
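A hedged SQL sketch of that rule against such a Type 3 customer dimension (the table and column names are assumptions):

-- Derive the churn indicator from the current and previous status columns
SELECT key,
       CASE
           WHEN current_status = 'Inactive' AND previous_status = 'Active'
                THEN 'Voluntary Churn'
           WHEN current_status = 'Inactive' AND previous_status = 'Suspended'
                THEN 'Involuntary Churn'
       END AS churn_indicator
FROM   customer_dim;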

Performance Considerations for Dimensional Modeling


Last Updated on Tuesday, 11 September 2012 05:43 Written by Akash Mitra

The performance of a data warehouse is as important as the correctness of its data, because unacceptable performance may render the data warehouse useless. There is increasing awareness of the fact that it is much more effective to build in performance from the beginning than to tune performance at the end. In this article we have a few points that you may consider for optimally building the data model of a data warehouse. We will only consider performance considerations for dimensional modeling.

Good design, cleanly crafted code and optimal logic/algorithms will give you far better performance than anything you can achieve by augmenting your hardware. You can spend literally millions of dollars on hardware, but you still can't cut a problem down as efficiently with hardware as you can by means of good design. So stop wasting money on better and bigger hardware; instead, invest more in a good data architect/modeler.

Surrogate Key is not mandatory


A surrogate key is used in place of a natural key because the natural key may not be trusted (in the sense that a natural key may fail to adhere to the fundamental properties of a key, e.g. uniqueness, not-null-ability etc.), and the natural key may not be standardized in terms of size and data type with the other keys of your data warehouse. But this does not mean that it is mandatory to replace all your natural keys with standardized surrogate keys. What if your natural key already has the properties of your data-warehouse-level candidate keys? Do you still want to replace it with a surrogate key? Consider the fact that the introduction of surrogate keys brings an additional burden on data loading. This is because the introduction of surrogate keys in dimension tables necessitates an additional lookup operation when doing the fact data loading, and a lookup is a very costly affair, if not the most costly affair. A lot of the time, the source of your dimension data is an RDBMS source system. Since the system is an RDBMS, you are already assured about the quality of the keys, and the source key may already be numeric in nature, just like the other keys of your data warehouse. Don't let this natural key die in your ETL process in favor of a standardized surrogate key. Bring it as-is to your data warehouse and, when doing reporting, base your join on this key.

KISS your data model


Not literally. For, KISS is a very important design principle which stands for Keep It Simple and Sweet (although some people favor another 4-letter word for the second S). Ask yourself if you really require the plethora of attributes that you have kept in your dimension table. Or do you really need that additional table that you are thinking of introducing into your model? Don't do it because you can do it; do it only when you have to do it. Such a minimalistic approach will make a lot of difference at the end and get you through the tough times. Keep it simple - nothing succeeds like simplicity.

If you don't believe me, let me enlighten you with one example. Suppose you have decided to keep one column in your fact table just because it's available in the source and you think it's good to have this information in your data warehouse because it may be required in the future, although you are fully aware of the fact that there is no explicit reporting requirement on this column as of now. We will call this column A, and we will assume the data type of the column is NUMBER(10). The mere introduction of this column, which currently serves no purpose other than giving you a false sense of wholesomeness, will require:

- Additional programming effort: mapping effort from source to staging and from staging to data warehouse.
- Additional testing effort: 1 additional test case to be written and tested by the testers to check the correctness of the data.
- Additional space requirement: if your fact table is going to contain 100 million records in the next 2-3 years (which is not a big number by any standard), it's going to take the following extra space:
extra space = space for storing 1 number X no. of records in table
            = (ROUND((p + s) / 2) + 1) bytes X 100,000,000
              (where p = precision and s = 1 if the number is negative, 0 otherwise)
            = (ROUND((10 + 0) / 2) + 1) bytes X 100,000,000

= 600,000,000 bytes, i.e. over half a gigabyte. And since most data warehouses have a mirrored backup, the actual space requirement will be double this (nearly 1.2 gigabytes).

Now think about this for a moment. If such an innocent-looking NUMBER(10) column can add up to so much extra space and processing burden, then consider what may happen if you keep introducing a lot of such attributes with data types like VARCHAR(50 or more) or DATE etc. If you can resist the urge of such frivolity and take a considerate, minimalist approach, you can easily save 1TB of space and numerous coding/testing efforts throughout the building process.

Reduce Complexity
This is similar to the above point, where we stressed the simplicity of data models. But it is not exactly the same. While simplicity will make your model considerably more lightweight, reducing complexity will make it easier to maintain. There are several ways of reducing complexity in a data model.

Reducing pre-calculated values

We often make the mistake of creating extra columns in our tables that contain the result of a complex calculation. Let me give you a simple example. You need a measure called Yearly Bonus which is determined using the below rule:
For grade A employees,    bonus = 0.83 X yearly salary
For grade B, C employees, bonus = 0.69 X yearly salary
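For concreteness, here is a hedged SQL sketch of how this bonus could be computed on the fly at query time (the table and column names are assumptions):

-- Compute the bonus at query time instead of storing it as a column
SELECT emp_key,
       yearly_salary,
       CASE
           WHEN grade = 'A'         THEN 0.83 * yearly_salary
           WHEN grade IN ('B', 'C') THEN 0.69 * yearly_salary
       END AS yearly_bonus
FROM   employee_dim;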

Removing Unnecessary Constraints
It is not a good idea to put database-level constraints such as check constraints, not-null constraints etc. on tables meant for batch data loading. They make things slower (although slightly) and create maintenance problems. Instead, try to ensure data quality in the ETL process itself. I know the whole idea of maintaining primary key/foreign key constraints in a data warehouse is itself widely debated and there are different schools of thought on this. But as far as I am concerned, I would probably not enforce integrity through database-level keys when I can easily enforce it in my ETL logic (batch processing logic).

Stop snow-flaking
Why do you snow-flake when de-normalization is the goal of dimensional data modeling? Clarity, brevity, reducing data redundancy - all these arguments look very lame when you compare them against the cost of maintaining a foreign-key relation (an additional lookup operation) during data loading. Only snowflake if you intend to provide an aggregated table (fact table) with the same granularity as that of your snow-flaked dimension.
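For instance (all names here are illustrative), a snow-flaked design would split the category out into its own table, forcing an extra lookup during loading, whereas the denormalized version simply folds the category attributes into the dimension:

-- Denormalized product dimension: category attributes held inline,
-- instead of a separate PRODUCT_CATEGORY table referenced by a foreign key
CREATE TABLE product_dim (
  product_key   NUMBER(10),
  product_name  VARCHAR2(100),
  category_code VARCHAR2(10),
  category_name VARCHAR2(50)
);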

Choose the attributes of SCD Type 2 dimensions judiciously


In a typical implementation, SCD Type 2 tables preserve history by adding a new row to the table whenever a change happens. Because of this property, SCD Type 2 tables tend to grow larger and larger day by day, thereby degrading performance. This is a problem because the growth of dimension tables hurts performance even more than the growth of fact tables, as dimension tables are typically conformed and used across different schemas. In order to control this growth, it is very important that we include only those columns in the SCD implementation whose history we really want to preserve. This means that if you have one hundred columns/entities in the source system for a specific table, don't start tracking SCD changes for all of them. Rather, carefully consider the list of columns where you really need to preserve history, track only those as Type 2 in your SCD implementation, and treat the other columns as Type 1 (so that even if they change, no row is added to the table).
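A minimal sketch of such a selective Type 2 load, in Oracle-style SQL (the CUSTOMER_DIM and CUSTOMER_STG tables, their columns and the sequence are all hypothetical; here ADDRESS is tracked as Type 2 while EMAIL is treated as Type 1):

-- Step 1: expire the current row only when a tracked (Type 2) column changed
UPDATE customer_dim d
SET    d.eff_to_date  = TRUNC(SYSDATE) - 1,
       d.current_flag = 'N'
WHERE  d.current_flag = 'Y'
AND    EXISTS (SELECT 1 FROM customer_stg s
               WHERE  s.customer_id = d.customer_id
               AND    s.address    <> d.address);   -- only ADDRESS triggers a new version

-- Step 2: insert a fresh current version for customers with no current row
INSERT INTO customer_dim
       (customer_key, customer_id, address, email,
        eff_from_date, eff_to_date, current_flag)
SELECT customer_dim_seq.NEXTVAL, s.customer_id, s.address, s.email,
       TRUNC(SYSDATE), DATE '9999-12-31', 'Y'
FROM   customer_stg s
WHERE  NOT EXISTS (SELECT 1 FROM customer_dim d
                   WHERE  d.customer_id  = s.customer_id
                   AND    d.current_flag = 'Y');

-- Step 3: EMAIL is Type 1, so overwrite it in place across all versions; no new row
UPDATE customer_dim d
SET    d.email = (SELECT s.email FROM customer_stg s
                  WHERE  s.customer_id = d.customer_id)
WHERE  EXISTS (SELECT 1 FROM customer_stg s
               WHERE  s.customer_id = d.customer_id);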

Don't Create a Snapshot Fact and a SCD Type 2 table for the same purpose
This is admittedly a little complex to explain. A snapshot fact always shows the latest (or last known) state of the measures. The latest records of an SCD Type 2 dimension do the same; the only difference is that a fact shows the state of the measures whereas an SCD Type 2 table shows the state of the attributes. But there are cases where this difference becomes very blurry. Where do you store attributes such as the number of telephone lines of a customer, or the number of bank accounts of a customer? Is the number of telephone lines an attribute of the customer itself? Not quite. So we do not store it in the dimension table; we need to store it in a fact table. But here comes the problem: all such records in the fact table get different surrogate keys for the same customer, since the key comes from an SCD Type 2 table. Because of such key variance, it is impossible to join two such facts in a single query if those facts were not loaded at the same time. To work around the key variance, the dimension table needs to be joined twice in the middle, and this causes a lot of performance issues.
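To make the key-variance problem concrete, here is a hedged sketch (TELECOM_FACT, BANKING_FACT and a CUSTOMER_DIM carrying a natural CUSTOMER_ID are all hypothetical). Because the two facts were loaded at different times, they may hold different surrogate keys for the same customer, so the SCD Type 2 dimension has to be joined twice in the middle:

SELECT d1.customer_id, t.no_of_lines, b.no_of_accounts
FROM   telecom_fact t
JOIN   customer_dim d1 ON d1.customer_key = t.customer_key
JOIN   customer_dim d2 ON d2.customer_id  = d1.customer_id  -- bridge via the natural key
JOIN   banking_fact b  ON b.customer_key  = d2.customer_key;
-- Without further date-range filters this fans out across every historical
-- version of the customer, which is exactly why such queries perform poorly.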

Consider Indexing Carefully


Indexes are used to speed up data retrieval from a relational database. As a part of physical data modeling (PDM design), a data modeler should make a considered effort to carefully evaluate and suggest an initial indexing scheme for the data warehouse. It is not possible to fix a rigid indexing scheme in such an early phase of the project, but it is important to at least come up with a basic scheme while doing the data model; later on the same scheme can be fine-tuned by adding or dropping a few indexes.

Databases come with different indexing methodologies which are quite varied in nature. In Oracle, for example, b-tree indexes are the most common. It is a good idea to put b-tree indexes on the columns that are used for joining (e.g. the key columns of the table). There is a special index type called bitmap which is useful for low-cardinality columns (i.e. columns having a lot of duplicate values) used in joining. Then of course there are other special indexes such as function-based indexes etc. You may read my other article if you want to know more about indexing.

Speaking of performance, bitmap indexes can make data retrieval from a fact table significantly faster if a special technique called star schema transformation with bitmap indexes is utilized (see the sketch at the end of this section). In this method, you create one bitmap index on each foreign key column of the fact table (and you set the Oracle initialization parameter STAR_TRANSFORMATION_ENABLED to true). When a data warehouse satisfies these conditions, the majority of the star queries running in the data warehouse will use a query execution strategy known as the star transformation, which provides very efficient query performance for star queries. One caveat: a bitmap index may cause your data loading to fail due to locking if you are trying to load data in parallel.

In other databases, e.g. SQL Server, you have clustered and non-clustered indexing schemes. A clustered index is much faster since it reorganizes the physical orientation of the data in the table in line with the index keys, but this may cause a lot of fragmentation in the data and may require frequent rebuilding. Non-clustered indexes can be used where you already have one clustered index on the table, or for general-purpose uses.

The problem with all indexes is the same: they make your data loading (INSERT operations) slower, besides taking a huge amount of space. So don't fill up your data warehouse with lots of them. Use them prudently and monitor them often; if you see that one particular index is not being used much, you may remove it.
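A minimal sketch of the bitmap/star-transformation setup in Oracle (the SALES fact table and its key columns are assumptions for illustration):

-- One bitmap index per foreign key column of the fact table
CREATE BITMAP INDEX sales_cust_bix ON sales (customer_key);
CREATE BITMAP INDEX sales_prod_bix ON sales (product_key);
CREATE BITMAP INDEX sales_time_bix ON sales (time_key);

-- Allow the optimizer to use the star transformation
ALTER SESSION SET star_transformation_enabled = TRUE;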

Consider Partitioning
If you have not realized it already, let me tell you: you're going to love it. Partitioning is a great thing and you aren't going to get anywhere without its help. So as a data modeler, it's your job to come up with an appropriate partitioning scheme for your tables while you do the physical data modeling.

The concept of partitioning is not restricted to the database. In fact, you can implement partitioning completely independently of the database. Consider vertical partitioning. Let's say you have a table called Product which has 5 attributes, and two more tables called Retail_Product and Healthcare_Product, each having an additional 50 attributes that are only applicable to either the retail or the healthcare segment of products. The three tables share the same set of keys but the other columns differ, and you use only the table you need instead of a single table with all the attributes. This concept of vertical partitioning can be extended to horizontal partitioning as well.

When it comes to partitioning, the most popular choice is obviously database-level horizontal partitioning. In this type of partitioning, the database creates separate partitions based on a list or range of values in one column. A preferred choice of column is the TIME or DAY key in your fact table. You can easily range-partition on this column to segregate the data of each year or each month into separate partitions. Since most business queries on this table will contain a question pertaining to a specific date, month or year, the database will access only the specific partition containing the data for that query instead of the entire table (this is called partition pruning). So, for example, a SQL query like the one below will access only 1 partition (that of January 2012) instead of reading the whole table, which may contain 5 years of data:
SELECT sum(revenue) FROM Sales WHERE Month_Key = 201201; -- results for Jan 2012, using an intelligent key
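For illustration, such a range-partitioned Sales table could be declared as below (Oracle syntax; the table structure and partition names are assumptions):

CREATE TABLE sales (
  month_key    NUMBER(6),     -- intelligent key in YYYYMM format
  customer_key NUMBER(10),
  revenue      NUMBER(12,2)
)
PARTITION BY RANGE (month_key) (
  PARTITION p_2012_01 VALUES LESS THAN (201202),  -- January 2012
  PARTITION p_2012_02 VALUES LESS THAN (201203),  -- February 2012
  PARTITION p_max     VALUES LESS THAN (MAXVALUE) -- catch-all
);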

The partitioning we discussed above is called range partitioning. There is one more fundamental type of partitioning called list partitioning, wherein data belonging to a given list of values is kept in one partition. This scheme is good when the partitioned column can only take values from a fixed list. Let's say you are building a data model where your sales fact table contains data from 4 divisions - north, east, west and south. You can then use list partitioning on your division column.
SELECT sum(revenue) FROM Sales WHERE Division = 'north';
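The corresponding list-partitioned declaration might look like this (again Oracle syntax, with illustrative names):

CREATE TABLE sales_by_division (
  division VARCHAR2(10),
  revenue  NUMBER(12,2)
)
PARTITION BY LIST (division) (
  PARTITION p_north VALUES ('north'),
  PARTITION p_east  VALUES ('east'),
  PARTITION p_west  VALUES ('west'),
  PARTITION p_south VALUES ('south')
);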

You may use one more advantage of partitioning to speed up your data loading. This is called partition exchange. Partition exchange, although not supported by all databases, helps you swap one partition of a table with another table (or with a partition of another table). If the first table is your staging table and the final table is your data warehouse table, using this method you will be able to load data into your data warehouse table very quickly just by exchanging the partitions. Partition exchange is quick since it does not need to physically move the data. Partitions not only help to improve performance, they also make maintenance a much easier affair. For example, data purging from fact tables can be done pretty quickly by dropping partitions rather than using a conventional delete operation.
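In Oracle, for example, both operations are one-line DDL statements (the table and partition names below are assumptions; for an exchange, the staging table must have the same structure as the partitioned table):

-- Swap the fully loaded staging table into the target partition; no data movement
ALTER TABLE sales EXCHANGE PARTITION p_2012_01 WITH TABLE sales_stg;

-- Purge old data by dropping its partition instead of running a bulk DELETE
ALTER TABLE sales DROP PARTITION p_2007_01;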

Avoid Fact to Fact Join


Make a conscious effort in your design to avoid fact-to-fact joins in the future. This is tricky. A lot of the time we end up creating data models where one fact table typically contains information pertaining to one specific subject area. But what if we want to do cross-subject-area reporting? Let's say you have a sales fact and a marketing fact, and you want to evaluate the effect of a certain marketing campaign on your sales. How do you do it? One solution is writing a SQL query which brings the data back after joining both fact tables. This does work, but it takes its toll on performance. A better solution is to implement a special fact table containing only the related measures from more than one subject area. So consider those situations from the very beginning and make the necessary arrangements in your data model.
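A hedged sketch of such a consolidated fact (SALES_FACT and MARKETING_FACT with conformed CUSTOMER and MONTH keys are hypothetical). Each fact is pre-aggregated to the common grain once at load time, so reports never need to join the two big fact tables directly:

CREATE TABLE sales_mktg_fact AS
SELECT s.customer_key, s.month_key, s.revenue, m.campaign_spend
FROM  (SELECT customer_key, month_key, SUM(revenue) AS revenue
       FROM   sales_fact
       GROUP BY customer_key, month_key) s
JOIN  (SELECT customer_key, month_key, SUM(spend) AS campaign_spend
       FROM   marketing_fact
       GROUP BY customer_key, month_key) m
  ON  m.customer_key = s.customer_key
  AND m.month_key    = s.month_key;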

Give some attention to your server infrastructure


Server infrastructure is mis-configured more often than you might think. But even if it is not, it's a good idea to review it as you go along. There are a lot of parameters that need to be tweaked specifically for a data warehousing environment, so if your DBA does not have prior experience with data warehouses, s/he may need some guidance. Look carefully through options such as asynchronous filesystem IO, the shared pool size of the database, cache sizes (of the database and of the IO subsystem), database block size, database-level multiblock read/write options etc.
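In Oracle, for instance, a few of the parameters mentioned above can be inspected from SQL*Plus like this (parameter names differ across databases and versions, so treat these as illustrative starting points):

SHOW PARAMETER db_block_size
SHOW PARAMETER shared_pool_size
SHOW PARAMETER filesystemio_options
SHOW PARAMETER db_file_multiblock_read_count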
