
White Paper

Supporting Data Warehouse Design


Best Practices with Kalido
Achieving return on investments in enterprise data warehousing

Cliff Longman,
Chief Technology Officer
Kalido

Supporting Data Warehouse Design Best Practices with Kalido

Executive summary
Kalido automates the implementation of 20 data warehouse best practices as defined by Ralph Kimball and Bill
Inmon, the two best-known data warehouse gurus.
This results in three major benefits:
Time to value
Kalido data warehouses are developed 50% faster than conventional data warehouses. In addition, Kalido
data warehouses are flexible, so adapting to new requirements is also much faster than with conventional
data warehouses.
Reduction in total cost
The total number of person days required to deliver and then maintain a Kalido data warehouse is 75% less
than for a conventional data warehouse.
Risk mitigation
Human error is significantly reduced, resulting in accurate business reports even through complex business
change.
This paper explains in detail how Kalido implements these best practices. It is written with the project manager,
data architect and data warehouse developer in mind as they consider how to use Kalido in their information
management projects.

Table of Contents
1. Introduction
2. Design goals
   2.1. Incremental development
   2.2. Efficiency and ease of understanding
   2.3. Graceful adaptation to change
3. Architectural structure and features
   3.1. Warehouse structure
   3.2. Meta data
   3.3. Terminology
4. Design techniques and how Kalido supports them
   4.1. Dimensional models
   4.2. Surrogate keys
   4.3. Slowly changing dimensions
   4.4. Meta data-driven data warehouse
   4.5. Roles
   4.6. Aggregates and summaries
   4.7. Time as a dimension
   4.8. Sparse facts
   4.9. Additive, semi-additive, non-additive facts/measures
   4.10. Currency and other UOM conversions
   4.11. Multiple grain transaction data
   4.12. Many-many dimensions
   4.13. Degenerate dimensions
   4.14. Junk dimensions
   4.15. Factless fact tables
   4.16. Organization and parts hierarchies
   4.17. Ragged hierarchies/unpredictably deep hierarchies
   4.18. Audit dimension
   4.19. Staging
   4.20. Building end user applications
5. Summary table
6. Conclusion
7. About Kalido


1. Introduction
We are frequently asked how Kalido supports industry best practices for building and running data warehouses.
This paper answers that question. Kalido delivers a packaged software solution for implementing tried and
tested industry best practices in a way that eliminates human error and greatly speeds the development and management of data warehouse environments.
We have developed a list of industry best practices culled directly from The Data Warehouse Lifecycle
Toolkit by Ralph Kimball, Laura Reeves, Margy Ross and Warren Thornthwaite (Wiley, 1998). We have
enhanced this list to include techniques recommended by Bill Inmon in his specification for DW 2.0
(www.inmoncif.com/registration/news/dw2.php) and also some more recent Kimball design tips from
www.ralphkimball.com.
For each technique, we have paraphrased a description and given references to The Data Warehouse Lifecycle
Toolkit book, DW 2.0 specification (page numbers in angle brackets <>) or design tip at which you can find a
full description of the best practice advice. We have then described how Kalido enforces or supports the practice
through its software products.
This paper is divided into three main sections. The first section highlights the major design goals that Kimball
and Inmon recommend as drivers for data warehouse development. The second section describes how Kalido
fits into the overall architecture of a data warehouse. The third section itemizes 20 best practices one by one and
describes how Kalido enforces or supports these best practices. Finally, the summary gives a quick reference
table of all the items discussed with a rating for Kalido's support for each.

2. Design goals
Three primary data warehouse design goals stand out from the literature. These design goals have been at the
forefront of the Kalido design team's consideration from the earliest stages of the software's initial development.
Having been developed to address the needs of the Royal Dutch Shell Corporation as it attempted to reconcile
performance data from its world-wide operations during the early 1990s, Kalido software has been developed with
the following in mind:

2.1. Incremental development


Kimball recommends satisfying business needs incrementally towards enterprise coverage by building one data
mart at a time <page 18>. Inmon recommends the spiral methodology for DW 2.0 development.
Kalido was designed to support incremental development. The most successful Kalido projects have been
developed incrementally. Kalido is designed for extreme flexibility, and this makes evolution in small steps much
easier than would be the case without it.
Behind the ability to support incremental development lies Kalido's ability to support a changing requirement
without having to modify any of the system you have already developed and without having to unload/reload
data stored by previous iterations. New sources (even at different grains) can be added, levels in hierarchies
changed, whole new marts added without having to modify previously created loads, updates, extracts, etc. Even
archive/purge and restore continue to work through changes to the model to accommodate new requirements.
Because of this agility, we frequently find that Kalido is used to elicit business requirements by having business
people develop their own output (almost invariably changing their minds once they see the results). Note that
this is almost the opposite of the idea that business requirements can be discovered in advance of building a
data warehouse. We believe that requirements are discovered as a data warehouse is developed.

2.2. Efficiency and ease of understanding


Kimball is insistent that dimensional modeling is a mandatory data warehouse design pattern <page 17> to
deliver performance and ease of understanding.
Kalido automatically delivers a dimensional model to achieve this goal. Each dimension is created as a dimensional
table in the underlying relational database by Kalido, with the (sometimes very complex) hierarchical levels
structured as columns in the table. This makes it simple for end user query tools to access hierarchies at any
level, while still being very efficient for the DBMS to process queries that aggregate to levels in the hierarchies.

2.3. Graceful adaptation to change


Adapting to change gracefully, without warehouse reorganization, is a suggested design goal by Kimball <page
10>, and Inmon introduces the concepts of semantic change and content change to achieve the same ends.
This is Kalido's number one design goal. By driving the physical structures from a logical business model that
Kalido manages directly, it is possible to change the logical model and allow Kalido to re-generate and populate
new physical structures appropriately. Because Kalido also holds master data in a highly generic schema as
well as in the dimensional tables, the dimensional tables can be regenerated and populated when a structural
change occurs with no manual intervention. This allows graceful evolution of both schema and data.
Kalido shares the three most important design goals laid down by Ralph Kimball and supported by Bill Inmon.
The Kalido difference, however, lies in the way that best practice techniques that have emerged from these goals
have been packaged into a robust product that either enforces or provides guidance and support. The scope for
manual error is significantly reduced and risk reduced as a result.

3. Architectural structure and features


This section describes how Kalido fits overall in the Kimball data warehouse architecture, and into the integrated
sector in DW 2.0.

3.1. Warehouse structure


Kimball advocates a scheme in which a number of subject-specific data marts are developed independently but
fed from a common staging area and with common data (dimensions and facts) conformed between the marts.
In this scheme, there is no one place that all data is stored. Inmon advocates an integrated store of all data he
calls the data warehouse, from which data is copied into various data marts.
Kalido offers two alternative architectures in support of the Kimball and Inmon architectures. These can be safely
mixed if desired. Kalido works on the basis of Kalido Instances. A Kalido instance is a set of tables in a database (running on one of the supported DBMS platforms). Each instance contains a body of master data, a body
of transaction data, a star schema and a number of mini-star schemas called result sets. In the case of a single
Kimball data mart, this would be implemented as a single Kalido instance. This would also support the Inmon
architecture (for a single mart). For multiple marts, there are two options:
A single Kalido instance, spinning off multiple result sets.
Multiple Kalido instances connected through Kalidos federation facilities.
It is of course possible to have a mixture (for example 10 marts supported by three Kalido instances).


The decision regarding which Kalido architecture to choose is driven by the disparity of the marts (similar marts
tend to favor a single Kalido instance with the marts supported by result sets; dissimilar marts tend to favor
multiple Kalido instances).
If the marts have to be supported by different DBMSs, then there must be at least one Kalido instance for each
DBMS to be used.

3.2. Meta data


Both Kimball and Inmon stress the importance of meta data. A meta-data-driven data warehouse is more flexible
than one that assumes a fixed data structure (though it is typically more difficult to design and build).
Kalido is heavily meta data driven. Kalido defines data in a real-world business model (i.e., a description of
real-world objects rather than descriptions of database tables). This business model is used by Kalido to automatically generate physical database structures, as well as load and query SQL statements. In this way, Kalido
data warehouses are relatively easy to change because there is relatively little traditional coding involved.
Kalido keeps track of all this meta data as well as meta data to do with the system operation (such as batch load
statistics) and makes it available through its API.

3.3. Terminology
The following table shows the Kalido terminology equivalents to the Kimball/Inmon terms.

Kimball/Inmon         Kalido                     Comment

Dimension             Dimension

Dimensional Table     Mapping table,
                      attribute table

Fact                  Transaction

Source System         Transaction data set

Fact type             Class of transaction       For example: the class of transaction
                                                 "sales" might have 2 transaction data
                                                 sets, one from an ERP and one from a
                                                 legacy sales system.

Surrogate key         OID                        Kalido automatically allocates and
                                                 manages object identity.

                      Naming scheme              Kalido allows as many alternative
                                                 identifiers as you like for any
                                                 dimensional data.

                      Class of business entity   A distinct entity type (Kalido will
                                                 automatically structure a dimensional
                                                 table to include any number of related
                                                 classes of business entity).

Note: attribute tables are an optional form of snowflaking.

4. Design techniques and how Kalido supports them


4.1. Dimensional models
Kimball <page 18> and Inmon both advocate presentation of data warehouse data to the end user query tool
in a de-normalized form. The accepted best practice for OLAP and hierarchical aggregation queries is the
star schema. In a star schema, fact data in one or more fact tables is joined to dimensional data stored in
dimensional tables. The most important feature of this design is that the dimensional tables pre-join hierarchies
that the user might wish to navigate and report on, unlike a normalized design that would separate individual
levels of each hierarchy and have the end user query tool generate the join at output time. The dimensional
approach makes the output to the user both easier to understand (the end user does not have to worry about
joins), and faster (dimensional data is pre-joined). Pre-joining speeds up aggregations and also speeds up
drill-down operations as users navigate hierarchies.
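The pre-joined dimensional approach described above can be sketched with a tiny star schema. This is an illustrative example, not Kalido's schema: the table and column names are invented, and SQLite stands in for the supported DBMS platforms. Note how the hierarchy levels (brand, category) live as columns of one dimension table, so aggregating to any level is a single join.

```python
import sqlite3

# Hypothetical mini star schema: one fact table joined to one pre-joined
# (denormalized) dimension table that collapses a product hierarchy.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    product     TEXT,
    brand       TEXT,                  -- hierarchy levels pre-joined
    category    TEXT                   -- as columns, not separate tables
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    revenue     REAL
);
""")
con.executemany("INSERT INTO dim_product VALUES (?,?,?,?)",
                [(1, "10W-40 1L", "LubriCo", "Lubricants"),
                 (2, "10W-40 5L", "LubriCo", "Lubricants"),
                 (3, "Diesel",    "FuelCo",  "Fuels")])
con.executemany("INSERT INTO fact_sales VALUES (?,?)",
                [(1, 100.0), (2, 250.0), (3, 400.0)])

# Aggregating to any hierarchy level is a single join plus GROUP BY --
# no chain of snowflake joins for the end user query tool to generate.
rows = con.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
```

A normalized design would instead need a three-table join (product, brand, category) for the same question, which is exactly the complexity the dimensional model removes from the query tool.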
Kalido enforces a strict dimensional model for the output of data through what it calls mapping tables. These
mapping tables are created and populated by Kalido without the designers having to design or write code.
(See Figure 4.1) Kalido also relieves the data warehouse designer from many of the complexities of dimensional
design by:
Automatically collapsing multiple hierarchies into a single dimensional table
Providing a dimensional loader that automatically takes data fed via staging tables or files and populates
dimensional tables (note Kalido does this via a generic store that also records the historical state of
dimensional data)
Managing change to the dimensional schema over time (if you add or modify hierarchies to the data
warehouse, Kalido will automatically re-generate and upgrade the rows in the relevant dimensional
tables making it easy to evolve even complex schemas over time)

Figure 4.1. Kalido enforces a strict dimensional model for the output of data through mapping tables, which
are created and populated without the designers having to design or write code.


4.2. Surrogate keys


Surrogate keys are system-generated unique identifiers that are used in place of real world identifiers (e.g.,
the surrogate key value 14289564 identifies the person we know as Tom Smith rather than a social security
number, or a combination of name or date of birth, etc.) Kimball strongly advocates the use of surrogate keys
<pages 191–193>.
Kalido enforces surrogate key identification for all master data elements and automatically allocates these
surrogate keys at creation time. Kalido provides utilities for loading meta data, dimensional data and fact data,
and all three automatically assign and use surrogate key values as necessary. Kalidos fact loader also provides
lookup facilities (selecting one of a number of alternative algorithms for best performance) so that as fact data is
loaded, the real-world identifiers are converted into the appropriate surrogates. Kalido replaces the surrogates
with the relevant real-world identifiers (also called natural keys) as data is output from the system so that the
end user is typically unaware of the surrogate keys.
One difference between the Kimball and Kalido surrogate keys is seen in building the dimension tables. As
explained in section 4.3, Slowly changing dimensions, Kimball considers that when there is an update on any
dimensional entity, like an attribute change or a hierarchical parent change, a new version of that entity is created,
hence this new version should be given a new surrogate key. Kalido, however, keeps the original surrogate key
and expires the old mapping table record with an end date, creating a new record with the same surrogate key and
a start date. In the Kimball design, creating a new surrogate key for each version keeps the fact table to dimension
cardinality as many-to-one, whereas in Kalido, the cardinality of the surrogate key join (alone) between fact and
mapping tables is many-to-many. For this reason, Kalido also uses effective date predicates between the fact
and mapping tables to ensure many-to-one aggregation of facts.
The Kimball method requires an ETL tool to maintain the natural-to-surrogate key lookup tables in the ETL
platform. When there is a change in dimensional data, the new surrogate keys are created in the lookup tables.
Translating the natural keys in a fact record with the new surrogate keys is also an ETL responsibility. As a result,
we can see different surrogate keys in the Kimball fact table for the same dimensional entity (e.g., a product).
In Kalido, surrogate-to-natural key mappings are kept within Kalido, which means that the ETL tool can bring the
source transactions in their local natural keys, without a need for translation. This is also useful for maintaining a
time variant mapping to surrogate keys.
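The natural-to-surrogate key translation described in this section can be sketched as follows. This is a minimal illustration of the general technique, not Kalido's loader API: the lookup table, `surrogate_for` and `load_fact` are invented names, and a real fact loader would choose among lookup algorithms for performance as the paper notes.

```python
# Sketch of natural-to-surrogate key translation during a fact load.
from itertools import count

_next_oid = count(1001)   # system-generated identifiers
_lookup = {}              # natural key -> surrogate key (OID)

def surrogate_for(natural_key):
    """Return the existing OID for a natural key, allocating one on first sight."""
    if natural_key not in _lookup:
        _lookup[natural_key] = next(_next_oid)
    return _lookup[natural_key]

def load_fact(record):
    """Replace the real-world customer identifier with its surrogate as the
    fact is loaded, so facts join to dimensions on OIDs only."""
    rec = dict(record)
    rec["customer_key"] = surrogate_for(rec.pop("customer_id"))
    return rec

f1 = load_fact({"customer_id": "ACME-GB", "revenue": 120.0})
f2 = load_fact({"customer_id": "ACME-GB", "revenue": 80.0})
```

Because the mapping lives with the warehouse rather than the ETL platform, both loads above resolve to the same surrogate even though the source supplied only natural keys.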

4.3. Slowly changing dimensions


Slowly changing dimensions refers to a scheme for recording changes to master data over time so that history
can be correctly queried for output. Kimball invented the term and describes three alternative schemes (type 1, type
2, type 3) for implementing slowly changing dimensions, each with different accuracy, performance and ease
of implementation characteristics <page 180>. Inmon describes discrete and continuous time variance with
special treatment for transaction snapshots, e.g., end of period balances.
Kalido automatically implements and enforces advanced time variance for all categories of master data. In a sense
what Kalido does is beyond type 3 slowly changing dimensions, combining the benefits of types 2 and 3, because
every attribute value is separately identifiable in Kalido's generic store (giving simple traceability of every attribute
and relationship at the atomic level). In other words, Kalido's time variance is at the individual field level, whereas
it is at the record level in Third Normal Form or in De-normalized form schemas (like Kimball dimension tables).
Kalido also maintains a replica of the master data in a dimensional form with what it calls mapping tables.
These mapping tables are in type 3 form. Kalido automatically manages change to the mapping tables so if a
hierarchy is rearranged, the correctly time-stamped rows are created in the mapping table by the system.

Kalido also ensures that other features of the system as a whole are kept intact alongside slowly changing
dimensions, for example automated summaries, queries that run across periods of change, currency and other
unit of measure conversions, etc. All these other features are executed in conjunction with the slowly changing
dimensions so that the business user gets the right information when Kalido produces output.
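The expire-and-replace behavior described in section 4.2 — a stable surrogate key with start/end-dated versions, queried through an effective-date predicate — can be sketched like this. The rows and dates are invented for illustration; they are not Kalido data structures.

```python
from datetime import date

# Illustrative mapping-table rows in the style the paper describes: the
# surrogate key (OID) is stable, and versions are delimited by start/end dates.
mapping = [
    # (oid, product,     category,     start,             end)
    (1001, "10W-40 1L", "Consumer",   date(2005, 1, 1), date(2005, 12, 31)),
    (1001, "10W-40 1L", "Industrial", date(2006, 1, 1), date(9999, 12, 31)),
]

def category_as_at(oid, as_at):
    """Effective-date predicate: pick the version in force on a given day,
    keeping the surrogate-key join many-to-one for aggregation."""
    for row_oid, _, category, start, end in mapping:
        if row_oid == oid and start <= as_at <= end:
            return category
    return None

before = category_as_at(1001, date(2005, 6, 15))
after = category_as_at(1001, date(2006, 6, 15))
```

A query run "as at" mid-2005 sees the old hierarchy placement; the same query run a year later sees the new one, with no change to the fact rows themselves.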

4.4. Meta data-driven data warehouse


Inmon and Kimball strongly suggest driving as much data warehouse behavior as possible from meta data so
that the system's behavior changes when the meta data is changed <pages 331–332>.
Kalido is heavily meta data driven. Kalido offers the user what is called a business model as a configuration
tool. Classes, attributes, associations between classes, the relationships between activities (business transactions)
and their business context and KPIs can be created and maintained by business users. The necessary database
schema, loader functionality and mart structure and content are all generated by Kalido automatically, driven by
the business model. In addition, change management is handled automatically, so a change to the business
model can result in radically different schemas and SQL the next time the system is used. The business model
also defines most of the analytical business rules.
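The general idea of driving physical structures from a business model can be sketched as follows. This is a toy illustration of metadata-driven schema generation, not Kalido's business model format: the class names, attribute map and `generate_ddl` function are all invented.

```python
# Minimal sketch of metadata-driven schema generation: DDL is derived
# from a business-model description rather than hand-coded, so changing
# the model changes the generated schema.
business_model = {
    "Customer": {"attributes": {"name": "TEXT", "region": "TEXT"}},
    "Product":  {"attributes": {"name": "TEXT", "unit_cost": "REAL"}},
}

def generate_ddl(model):
    """Emit one CREATE TABLE per class, with a surrogate-key column added
    automatically to every generated table."""
    statements = []
    for cls, spec in model.items():
        cols = ["oid INTEGER PRIMARY KEY"]
        cols += [f"{name} {sqltype}" for name, sqltype in spec["attributes"].items()]
        statements.append(f"CREATE TABLE {cls.lower()} ({', '.join(cols)});")
    return statements

ddl = generate_ddl(business_model)
```

Adding a class or attribute to `business_model` and re-running the generator yields the new schema with no hand-written DDL, which is the flexibility argument the section makes.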

4.5. Roles
Roles describe a situation in which the same class is used for two different purposes. An example would be the
Company class being used in the role of Customer and also Supplier (i.e., we sell to companies; we also
buy from companies). Kimball describes roles <pages 223226> focusing especially on examples of dates/times
as a class with many roles.
Kalido has roles built in. Part of the business model involves specifying the roles that a class can play, and also
the role that is being played when the class is referred to by a fact (e.g., the customer and the supplier for a
purchase fact). These roles apply to relationships defined in the business model.
Kalido also implements subtypes which assist with validation (sometimes confused with roles). For example, a
customer may have a mandatory credit-rating attribute, whereas a supplier would not but may have a maximum
lead time attribute. This is achieved through class-of-business-entity subtypes with rich cardinality rules (for
example, a super-type entity can be zero, one or many of its sub-types).
Kalido also validates the multiple role fields in each transaction record, checking whether that company is
allowed to play a particular role (e.g., supplier).

4.6. Aggregates and summaries


When data warehouses get large, particularly in terms of the number of facts and the number of users accessing
them, it becomes inefficient to continually calculate the same summaries over and over again. It is common practice
(Kimball <pages 543–567>) to calculate a summary data set (e.g., the sum of total order value by sales person,
by customer, by month). These summaries are stored physically, so that if an answer can be derived from a summary without going back to the raw data, it is, saving significant re-calculation.
Kalido automates summary implementation and management. All DBAs have to do is indicate which summaries
they want created (by selecting the pieces of the business model) and when they should be calculated following
new data loads, and Kalido does the rest. (See Figure 4.6) When Kalido generates a query that it knows can be
satisfied in whole or in part from a summary, then it will use the summary as the source. Kalido is summary aware.
In addition, Kalido addresses the management of change so that summaries invalidated after a business model
change are not used incorrectly. Kalido also provides a scheme for compressing summaries following incremental
loads (aggregating multiple summary rows with the same values for the dimensions to be summarized).


Kalido also allows result sets (data generated as output) to be physically stored and labeled as part of the
business model (by creating a new transaction dataset). This allows complex algorithmically defined summaries
(such as allocations) to be calculated using custom algorithms and then stored under the scope of the Kalido
business model, affording them the ability to be selected and evolved under business model control.
Time variance needs to be taken into account when creating summaries. Each summarization process requires
a choice of aggregation using current parents or with parents at the time that each transaction took place. Kalido
automates this for you.
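Being "summary aware" amounts to a routing decision at query time. The sketch below illustrates the general technique under invented names (the summary registry and `choose_source` are not Kalido constructs): a request is answered from a stored summary whenever the summary's grain covers it, and falls back to raw facts otherwise.

```python
# Sketch of a summary-aware query planner: if a requested aggregation can
# be answered from a stored summary's grain, use it instead of raw facts.
summaries = {
    # summary name -> set of dimensions it is grouped by
    "sales_by_product_month": {"product", "month"},
}

def choose_source(group_by):
    """Return the name of a summary whose grain covers the request,
    falling back to the raw fact table when none does."""
    for name, grain in summaries.items():
        if set(group_by) <= grain:   # request is derivable from this summary
            return name
    return "fact_sales"

hit = choose_source(["product"])    # coarser than the summary: derivable
miss = choose_source(["customer"])  # not in the summary grain: raw facts
```

A real planner must also invalidate summaries after a model change, as the section notes, so that a stale summary is never chosen.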
Figure 4.6. Kalido automates summary implementation and management: DBAs indicate which summaries they
want created (by selecting the pieces of the business model) and when they should be calculated following new
data loads, and Kalido does the rest.

4.7. Time as a dimension


Kimball specifically advocates the use of time as a dimension <page 175>.
Kalido also requires time as a dimension and embellishes it with a number of complex time features such as
corresponding time periods (this month last year) or relative time periods (year-to-date) that can be extended
by business users in business terms. (See Figure 4.7) Kalido generates the (often scary and highly complex)
SQL needed to produce the right output. Kalido also allows multiple calendars that are customizable (so if your
accounting month called January actually starts Jan 3rd in 2006, this is catered for).
Kalido comes with default calendar data and also has built-in functionality for absolute, relative and typical periods
of time, allowing complex calendar and time functions to be customized for each Kalido implementation.
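The two relative-period features named above can be sketched directly. This is a simplified illustration using the standard calendar only; as the section notes, a real implementation must consult customizable business calendars (e.g., an accounting January starting on Jan 3rd), which this sketch ignores.

```python
from datetime import date

# Sketch of two relative time periods, each returned as a (start, end) pair.
def corresponding_month_last_year(d):
    """'This month last year': the same month a year earlier, up to the
    corresponding day."""
    return date(d.year - 1, d.month, 1), date(d.year - 1, d.month, d.day)

def year_to_date(d):
    """Calendar year-to-date ending on the given day."""
    return date(d.year, 1, 1), d

cmly = corresponding_month_last_year(date(2006, 3, 15))
ytd = year_to_date(date(2006, 3, 15))
```

In a warehouse these period definitions become date predicates in the generated SQL, which is where the "often scary and highly complex" statements the section mentions come from.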
Figure 4.7. Kalido requires time as a dimension and embellishes it with a number of complex time features
such as corresponding time periods (this month last year) or relative time periods (year-to-date) that can be
extended by business users in business terms.

4.8. Sparse facts


Sparse facts are facts in which there are a large number of measures, many of which have no value in each fact
record <Kimball Design Tip #20: Sparse Facts And Facts With Short Lifetimes>.
Kalido provides two alternative ways to support the sparse fact. The first is the way that Kimball suggests in his
design tip that is, to model a generic fact type (as a transaction data set in Kalido) and a special dimension
whose members represent the type of the fact. Kalido then automatically allows filtering on the relevant fact
types when a result set is generated. In this case, Kalido provides a concept called an aggregated measure,
which is a definition that re-establishes the specific business name for the measure (e.g., lubricant usage), but
refers to the generic underlying sparse fact implementation (e.g., value where value type is lubricant usage).
Aggregated measures are stored predicates such as cost where cost type = labor cost. This way, the technical
implementation of sparse fact is a manageable one, and business users get to use their familiar terms.
There is another alternative with Kalido, which is to model each measure with its own transaction dataset
(note they can all share the same physical table if it is a Kalido generic warehouse section). This will result in
a larger number of transaction datasets, which may add complexity to the load processing but may be an
acceptable trade-off.
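The "aggregated measure" idea — a business-named measure defined as a stored predicate over a generic sparse fact — can be sketched like this. The fact rows, value types and function names are invented for illustration.

```python
# Sketch of an 'aggregated measure': a business-named measure defined as a
# stored predicate over a generic sparse fact implementation.
facts = [
    {"value_type": "lubricant usage", "value": 12.0},
    {"value_type": "labor cost",      "value": 300.0},
    {"value_type": "lubricant usage", "value": 8.0},
]

def aggregated_measure(value_type):
    """Build a measure that sums the generic 'value' where the stored
    predicate on value_type matches, re-establishing the business name
    over the generic sparse-fact storage."""
    return lambda rows: sum(r["value"] for r in rows if r["value_type"] == value_type)

# Business users see 'lubricant usage', not 'value where value type = ...'.
lubricant_usage = aggregated_measure("lubricant usage")
total = lubricant_usage(facts)
```

The generic table stays narrow (no columns full of nulls), while each business measure remains addressable by its familiar name.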

4.9. Additive, semi-additive, non-additive facts/measures


Some fact types (e.g., sales volumes) can be summarized against any dimension. Others (such as inventory
levels) can be summed against any dimension except time. Yet others (e.g., temperature of a liquid in a
manufacturing process) cannot be summed against any dimension. Kimball describes these properties of
facts at <page 193>.
Kalido supports all the above and includes an additional option which is summable against only a single dimension
(useful for financial summaries against account). Note that the definition in Kalido is done at the business model
level and the necessary technical implementation is generated (in this case Kalido will only generate valid SQL
statements to calculate the sum depending on the measures chosen). Kalido will never generate SQL that sums
facts inappropriately, even if the sum is buried in a complex calculated measure, a comforting piece of risk
reduction for business and technical users alike.

4.10. Currency and other UOM conversions


Kimball describes currency conversion and multi-currency data warehouses at <page 248>.
Kalido has built-in currency recording and conversion facilities. Kalido identifies special transaction data sets
as exchange rate data sets, and these are used to provide (multiple alternative) exchange rates with which to
convert currency units. This is useful if there are standard accounting exchange rates used for a financial year,
true current exchange rates, agreed exchange rates for a contract and so on. Exchange rate data is loaded
as any other transaction data set would be. Currency conversions can also be defined as meta data definitions
called converted measures. When transactions are loaded in any currency, a converted measure (e.g., cost in
Euros) can then be included in result set definitions without re-defining the currency conversion each time.
Kalido extends the general conversion principle to other units of measure (not just currencies) which can be
satisfied by a flat conversion rate (e.g., volume to weight at standard temperature and pressure).
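The "converted measure" idea can be sketched as follows. The rate sets and figures are invented for illustration: exchange rates live in their own data set (here a dict keyed by currency and rate set), and the conversion is a metadata-style definition applied at output time rather than re-defined per query.

```python
# Sketch of a 'converted measure': exchange rates are held as their own
# data set and applied when output is produced (rates are made up).
rates = {
    # (from_currency, rate_set) -> rate into EUR
    ("USD", "standard-2006"): 0.75,   # standard accounting rate for the year
    ("USD", "spot"):          0.5,    # a (fictional) current spot rate
    ("GBP", "standard-2006"): 1.5,
}

def cost_in_euros(amount, currency, rate_set="standard-2006"):
    """Convert a transaction amount to EUR using the chosen rate set,
    so alternative rate sets (standard, spot, contract) coexist."""
    if currency == "EUR":
        return amount
    return amount * rates[(currency, rate_set)]

std = cost_in_euros(100.0, "USD")
spot = cost_in_euros(100.0, "USD", rate_set="spot")
```

The same shape works for any flat-rate unit-of-measure conversion (e.g., volume to weight at standard temperature and pressure), as the final paragraph notes.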

4.11. Multiple grain transaction data


Kimball explains that when data is available at a higher grain than the grain required for analysis, a form of
allocation is required <page 240>. An example would be when sales revenues are available for each packed
product on the invoice lines and product cost is available at product group and monthly levels, and we need


margin by individual sales. Note that the algorithms and schemes for allocating to a lower grain are specific to
each company and sometimes have multiple alternatives within a single company.
Kalido supports allocation by defining a new measure and transaction data set for each allocation method (so,
for example, there may be a measure called costs allocated by headcount and a TDS called daily headcount
allocated costs). A script, program or ETL tool is used to implement the allocation algorithm depending on the
complexity of the allocation algorithm. With Kalido, implementers should bear in mind that models may change,
so the most flexible (but most difficult) allocation algorithms are driven from the Kalido meta data rather than
being hard coded to expect certain classes, attributes and associations in a defined configuration. Implementers
should also be careful to re-run an allocation should any of the relevant master data change historically (e.g., if
an employee entity is updated on 1/1/2006 to be in a different department historically, such as between 1/1/2005
and 6/1/2005).
Sometimes, the same type of transaction data may be available from multiple sources with different granularity.
For example, we may have a legacy sales order processing system capturing the revenue and volume by product,
customer and day and another ERP system with sales transactions capturing revenue, volume, discount and
distribution cost by product, customer-shipped-to (as a child of customer-sold-to), sales representative and day.
Having some different measures and referring to different columns in the dimension tables would make the
aggregation at common levels more difficult, resulting in complex SQL unions. Kalido handles such multi-granular
transactions automatically, resolving the common granularity from its meta data and adding up measure values
correctly, while totals at the more granular levels remain available down to each source's granularity.
Kalido also provides a construct known as the coding structure, which allows facts to be recorded against any
one of a number of levels in a hierarchy (e.g., a budget that may be for a team, a department, a division or a
company). Kalido automatically generates the right SQL to aggregate facts recorded against a coding structure.
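Aggregating facts recorded against different levels of one hierarchy can be sketched like this. The hierarchy, budget figures and function names are invented; the point is only the technique: resolve each fact's position in the hierarchy before summing, so facts at any level roll up correctly.

```python
# Sketch of a 'coding structure': budget facts may be recorded against any
# level of an organization hierarchy, and aggregation resolves each fact's
# level before summing (hierarchy and figures are invented).
parent = {"team-a": "dept-1", "team-b": "dept-1", "dept-1": "division-x",
          "division-x": "company"}

def rolls_up_to(node, ancestor):
    """True if `node` is `ancestor` or sits anywhere beneath it."""
    while node is not None:
        if node == ancestor:
            return True
        node = parent.get(node)
    return False

# Facts recorded at three different hierarchy levels.
budgets = [("team-a", 100.0), ("dept-1", 40.0), ("division-x", 25.0)]

def total_budget(ancestor):
    """Sum facts recorded at any level at or below the given node."""
    return sum(v for node, v in budgets if rolls_up_to(node, ancestor))

dept_total = total_budget("dept-1")          # team-a + dept-1 itself
division_total = total_budget("division-x")  # all three facts
```

In SQL terms this becomes a join condition covering every level column of the dimension table, which is the "right SQL" the section says Kalido generates automatically.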

4.12. Many-many dimensions


This occurs where a fact effectively has multiple entries for a single dimension <page 218>. If the facts are
aggregated using the dimension for which there are multiple entries, they will be double counted (i.e., the grand
total reported will be greater than the sum of all the facts). What is more frequently required is a query in which
the aggregate is created from a subset of facts limited by one or more values in the multi-valued dimension.
Kimball uses the example of a billable patient encounter fact for which there is a diagnosis dimension, and
there can be many diagnoses on each encounter. If one aggregated the billable value by diagnosis, the sum
of the sub-totals would be greater than the total billable value. What is more likely is a query that aggregates
billable value for all encounters that include a specific diagnosis.
Kalido does not support double counted many-many dimensions. What Kalido does have is a many-many association
capability, so that a diagnosis group can be defined (as in Kimballs solution to the problem) and a many-many
association created between the group and the diagnosis class. Kalido then allows a filter to be applied (when
building a query definition) by dragging individual diagnoses into the query constraints.
When a dimensional entity has just a few possible multiple parents, Kalido also provides a construct called a
dimension set, which dissolves the multi-valued relationship into multiple binary parents. For example, if a road
accident can be attributed to many causes, including road condition, car failure, weather conditions and driver
error, each possibility would be automatically modeled as a separate binary parent (e.g., "car failure: yes/no?")
with a many-to-one relationship. Kalido allows the data to be loaded as a simple list, as though it were many-to-many,
but reported on as though it were a set of many-to-one relationships.
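The dissolution into binary parents can be sketched as follows (an illustrative assumption of how such a transformation might look, not Kalido's implementation):

```python
# The known set of possible causes, each becoming a binary parent.
CAUSES = ["road condition", "car failure", "weather conditions", "driver error"]

def to_binary_parents(cause_list):
    """Dissolve a multi-valued cause list into yes/no parent attributes,
    one many-to-one relationship per possible cause."""
    return {cause: ("yes" if cause in cause_list else "no") for cause in CAUSES}

# Loaded as a simple list, reported as a set of binary attributes.
accident = to_binary_parents(["car failure", "driver error"])
print(accident["car failure"], accident["weather conditions"])  # yes no
```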


4.13. Degenerate dimensions


Kimball describes the degenerate dimension at <page 188>. This occurs when a field such as an order number
is left over because the only data we are interested in for our fact is the order line data. These are generally
used to link a chain of different types of fact to each other.
Kalido allows any additional fields to be added to the measure and dimension data in a transaction. In this way,
the invoice number for each invoice line transaction can be stored and referenced in GROUP BY statements
generated by a BI tool.
There are also certain fields to which Kalido ascribes special behavior, such as a transaction identifier (for
example, a point-of-sale system transaction ID), for which Kalido ensures uniqueness. This helps prevent accidentally
loading the same transaction twice and can help track transactions back to the source system if need be.
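A degenerate dimension can be sketched with generic SQL (hypothetical table and column names; a minimal illustration of the pattern, not Kalido's schema):

```python
import sqlite3

# The invoice number is kept as a plain field on the line-level fact,
# with no dimension table of its own: a degenerate dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE invoice_line_fact (
    invoice_no TEXT, line_no INTEGER, amount REAL,
    PRIMARY KEY (invoice_no, line_no)  -- uniqueness guards against double loads
);
INSERT INTO invoice_line_fact VALUES
    ('INV-1', 1, 40.0), ('INV-1', 2, 60.0), ('INV-2', 1, 25.0);
""")

# The degenerate field is referenced directly in GROUP BY statements,
# linking the lines of each invoice back together.
totals = dict(con.execute(
    "SELECT invoice_no, SUM(amount) FROM invoice_line_fact GROUP BY invoice_no"
))
print(totals)  # {'INV-1': 100.0, 'INV-2': 25.0}
```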

4.14. Junk dimensions


Kimball describes junk dimensions as dimensions used to accumulate various status and switch flags (such as
yes/no flags) <page 189>.
With Kalido, junk dimensions are simply modeled as dimensions and classes of business entities in the normal
way (as they would be in a Kimball implementation).

4.15. Factless fact tables


Some facts simply record business events (such as a person attending a training class) in which there is no
measure. Kimball describes these as factless facts <page 212>. There are also some facts that represent a
useful combination of dimensional items, which Kimball calls coverage facts (an example would be that
a certain product is sold through a certain channel in a certain country; this is a way of recording the
valid combinations of members from a number of dimensions).
With Kalido, factless facts can be implemented as a Transaction data set with a count measure that always has
the value 1 (much as Kimball advises).
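The count-of-1 technique can be sketched with generic SQL (an illustrative example with hypothetical names, following Kimball's advice rather than depicting Kalido's internals):

```python
import sqlite3

# A factless fact: attendance events carry no natural measure, so a
# constant count column of 1 makes the events summable.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE attendance_fact (student TEXT, course TEXT, cnt INTEGER DEFAULT 1);
INSERT INTO attendance_fact (student, course) VALUES
    ('ann', 'sql'), ('bob', 'sql'), ('ann', 'etl');
""")

# Summing the count measure answers "how many attendances per course?".
per_course = dict(con.execute(
    "SELECT course, SUM(cnt) FROM attendance_fact GROUP BY course"))
print(per_course)  # {'sql': 2, 'etl': 1}
```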
Kalido also has a concept called mapped business entities, which allow valid combinations of business entities
to be collected and treated as a single unit (for example, you might collect a product, a channel and a country as
an allowed combination). This approach is only used for relatively small (up to 100,000) combinations such as
market segmentations.
Kalido's mapped business entity feature is also used for modeling business rules as a function of multiple dimensions.
For example: if the product class is "services" and the state is "New Hampshire", then the tax attribute is 5%. The
rule could be more complex, such as: if the product class is "services", except for the product sub-class "education",
and the state is "New Hampshire", then tax is 5%.
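Such a rule, keyed on a combination of dimension values with a sub-class exception, can be sketched as follows (an illustrative assumption of the rule semantics, not Kalido's actual mapped business entity API; the rates shown are invented):

```python
# Rules ordered most-specific first; None acts as a wildcard.
RULES = [
    # (product_class, product_subclass, state) -> tax rate
    (("services", "education", "New Hampshire"), 0.0),   # the exception
    (("services", None, "New Hampshire"), 0.05),         # the general rule
]

def tax_rate(product_class, product_subclass, state):
    """Return the rate of the first rule whose pattern matches."""
    for (pc, psc, st), rate in RULES:
        if pc == product_class and st == state and psc in (None, product_subclass):
            return rate
    return 0.0  # default: no tax rule applies

print(tax_rate("services", "consulting", "New Hampshire"))  # 0.05
print(tax_rate("services", "education", "New Hampshire"))   # 0.0
```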

4.16. Organization and parts hierarchies


Kimball describes organization and other any-level parts hierarchies <page 226> and how SQL provides
inadequate means of aggregating such networks. Kimball suggests implementing a helper table to
flatten the hierarchy so that facts can be aggregated to any level with a single join.
Kalido also flattens any-level dimensional hierarchies, but does so in the same way that it deals with fixed-level
hierarchies: by generating columns in the mapping tables. The result is a mapping table that has one column for
each level in the hierarchy. Kalido of course generates the correct SQL to aggregate to the level requested by the user
(the user can also optionally name the levels, as is often done for organization structures).
This results in efficient query-time processing, as no recursive joins are required. What it does mean, however,


is that when the master data changes, more (or fewer) columns may appear in the mapping table (based on the
number of levels in the data). Kalido automatically handles this upgrade to both table and data, but it does mean
that table structure may change in a production environment, unless this type of change is handled through the
development/test/QA/release environments (as it typically is in most non-Kalido systems). It is worth noting that
with Kalido this is simply modeled as an involution on the business model, and although it never results in
a SQL CONNECT BY statement (to Kimball's point), it simply works by providing the right results for the
user, and continues to do so despite changes in the hierarchical structure such as re-organizations
(including aggregations that span a reorganization).
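The join-free aggregation over a flattened hierarchy can be sketched with generic SQL (hypothetical names and a two-level example; not the mapping tables Kalido actually generates):

```python
import sqlite3

# A flattened mapping table: one column per hierarchy level, so
# aggregating to any level is a GROUP BY on that level's column,
# with no recursive traversal of the hierarchy.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE org_mapping (unit TEXT PRIMARY KEY, level1 TEXT, level2 TEXT);
INSERT INTO org_mapping VALUES
    ('team_a', 'company', 'sales'),
    ('team_b', 'company', 'sales'),
    ('team_c', 'company', 'support');
CREATE TABLE cost_fact (unit TEXT, cost REAL);
INSERT INTO cost_fact VALUES ('team_a', 10), ('team_b', 20), ('team_c', 5);
""")

# Aggregate to the second level of the organization in a single join.
by_division = dict(con.execute("""
    SELECT m.level2, SUM(f.cost)
    FROM cost_fact f JOIN org_mapping m USING (unit)
    GROUP BY m.level2
"""))
print(by_division)  # {'sales': 30.0, 'support': 5.0}
```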

4.17. Ragged hierarchies/unpredictably deep hierarchies


Some hierarchies do not have a fixed number of levels: some branches may be 3 levels deep, some
8, some 2. Kimball gives examples at <page 231>.
Kalido handles these in the same way that it would a fixed-level hierarchy (i.e., the number of columns in the
mapping table equals the largest number of levels in the whole hierarchy). Levels that are not appropriate
are populated with a user-defined phrase such as "Not available at this level", so the user always sees the
correct data and queries always return the right number of rows.
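The padding of ragged paths to a fixed column count can be sketched as follows (the filler phrase is the user-defined one mentioned above; the helper itself is an illustrative assumption):

```python
# User-defined phrase used to pad levels that do not exist on a branch.
FILLER = "Not available at this level"

def flatten(path, depth):
    """Pad a node's ancestor path so every row has `depth` level columns."""
    return list(path) + [FILLER] * (depth - len(path))

paths = [["company", "emea", "uk", "london"], ["company", "americas"]]
depth = max(len(p) for p in paths)  # the widest branch sets the column count
rows = [flatten(p, depth) for p in paths]
# rows[1] -> ['company', 'americas',
#             'Not available at this level', 'Not available at this level']
```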

4.18. Audit dimension


Kimball suggests an optional audit table to be populated with useful execution information each time a batch of
fact data is loaded <page 237>.
Kalido automatically provides this data as part of the transaction loader, logging the data against a system object
called a batch. In cases where a batch is broken into separate chunks (e.g., erroneous records being corrected
and reloaded), Kalido generates sub-batches.

4.19. Staging
Kimball describes staging in great depth <pages 609-663>. The description deals with approaches to surrogate
key management, database versus flat file staging, incremental loading, final data loading, error handling/flagging,
dimensional change logic, historic data loads, and data quality and cleansing. In the Kimball model, the staging area
is a significant part of the data warehouse architecture. Kalido provides either automation or support for each of
the items mentioned above, and as a result, the staging area in a Kalido implementation tends to be somewhat
simpler than in a Kimball implementation. That is not to trivialize what can be a very complex
and difficult part of a data warehouse project (poor quality data in particular still needs to be sorted out); it is
just that Kalido automates or obviates the need for a significant proportion of standard Kimball-style staging
activity. Here are the main points:
Database versus flat file staging
Kalido supports both. Files can be on the server running the load utility or accessed via an FTP connection.
Incremental loading
Kalido will do delta detection for dimensional data if required (so a complete dump of a dimension can be
loaded each period, allowing Kalido to do the incremental update as necessary). If source system or ETL
tools are to do the change data detection, Kalido will accept a file of deltas.


Final data loading
Kalido does this automatically; the final load can only be performed with the Kalido loaders. For fact data,
Kalido can optionally use the DBMS native loader.
Error handling/flagging
Kalido has an error management subsystem for presenting errors in fact data loads. This allows the user to
examine the error and either correct for reload or reject. For master data loads, errors are rejected but may
be managed as part of the data cleansing process (see below).
Dimensional change logic
Kalido automates this for Kalido-controlled dimensions. No code is necessary. Kalido also offers an option to
manage dimensions manually if there is some specialist processing required. If this option is taken for a
dimension, the dimensional change logic must be handled in ETL or other processing external to Kalido.
Historic data loads
Kalido treats these as normal loads, though there are special options on master data loaders allowing
correction of history if an anomaly in historical data is discovered.
Data quality and cleansing
Kalido provides no automated data cleansing or standardization but does provide a complete application
to support data governance, allowing business users to develop authorized golden copies where
manual decisions are required (such as missing classifications) of master data before it is used in the data
warehouse itself.
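The incremental-loading point above, deriving deltas from a complete dump of a dimension, can be sketched as follows (a minimal illustration of delta detection in general, not Kalido's loader logic):

```python
# Compare the previous snapshot of a dimension with a fresh full dump
# and derive the incremental changes to apply.
def detect_deltas(previous, current):
    """Both arguments map a business key to its attribute dict."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

old = {"P1": {"name": "Widget"}, "P2": {"name": "Gadget"}}
new = {"P1": {"name": "Widget Pro"}, "P3": {"name": "Gizmo"}}
ins, upd, dele = detect_deltas(old, new)
print(sorted(ins), sorted(upd), sorted(dele))  # ['P3'] ['P1'] ['P2']
```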

4.20. Building end user applications


Kimball describes a number of techniques for building end-user applications <page 655>.
Kalido does not replace or reduce the need for these techniques but does provide a means of automating
the production of the configuration for certain BI tools (and manages the incremental update of these configurations
when the business model changes). Support is currently available for BusinessObjects, Excel and Analysis
Services (offline cubes).


5. Summary table
The table below summarizes Kalido's support for the industry best practices for developing data warehouses.
For each feature we indicate whether Kalido supports the best practice and, if it does, the level of support
Kalido offers.

In the Kalido support column:


Enforced (E) means Kalido provides explicit support for this feature, and the use of the feature is not optional.
Supported (S) means Kalido provides explicit support for the feature, but use of the feature is optional.
Allowed (A) means that Kalido provides no explicit support for the feature, but the same results can be achieved using other Kalido features.
Prevented (P) means that Kalido will not allow or cannot support the feature.

In the level of support column:


Exceeds (E) means that Kalido incorporates the best practice but goes further, offering additional functionality.
Supports (S) means that Kalido automates the approach suggested by industry best practices.
No Support (N) means that Kalido does not support the approach.


Warehouse Feature                                      Kalido support    Level of Support
Dimensional models
Surrogate keys
Slowly changing dimensions
Meta data-driven data warehouse
Roles
Aggregates and summaries
Time as a dimension
Sparse facts
Additive, semi-additive, non-additive facts/measures
Currency and other UOM conversions
Multiple grain transaction data
Many-many dimensions
Degenerate dimensions
Junk dimensions
Factless fact tables
Organization and parts hierarchies
Ragged hierarchies/unpredictably deep hierarchies
Audit dimension
Staging
Building end-user applications

6. Conclusion
While additional data warehousing best practices are advocated by both Ralph Kimball and Bill Inmon, as well
as by many other practitioners, this paper has focused on the 20 forming the bulk of the published literature for
data warehouse design as defined by Kimball and Inmon. From its inception, Kalido has kept the most critical of
these considerations firmly in mind while developing and improving its software products: incremental development,
efficiency and ease of understanding, and graceful adaptation to change.
By using Kalido to build and manage their data warehouse, organizations can rest assured that it conforms to the
best practices advocated by the industry's best-known experts. More importantly, a Kalido data warehouse will
deliver the accurate, consistent, accessible information your company needs to manage and run your business
over time, as it changes.
For more information about Kalido software, please visit our web site at http://www.kalido.com. To learn more
about how Kalido supports industry best practices, please contact us at info@kalido.com.

7. About Kalido
Kalido delivers active information management for business. With Kalido's unique business-model-driven
technology, decisions are fueled by accurate, accessible and consistent information, delivered in real time, to
dramatically improve corporate performance. Kalido software can be deployed at a fraction of the time and cost
of traditional information management methods.
Kalido software is installed at over 250 locations in more than 100 countries with market leading companies.
Headquartered in Burlington, Massachusetts, Kalido is backed by Atlas Venture, Benchmark Capital and Matrix
Partners. More information about Kalido can be found at: http://www.kalido.com.

Copyright © 2007 Kalido. All rights reserved. Kalido, the Kalido logo and Kalido's product names are trademarks of
Kalido. References to other companies and their products use trademarks owned by the respective companies and
are for reference purposes only. No portion hereof may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording or information storage and retrieval systems, for any
purpose other than the recipient's personal use, without the express, written permission of Kalido. The information
contained herein is subject to change without notice. Kalido shall not be liable for errors contained herein or
consequential damages in connection with the furnishing, performance or use hereof. Any Kalido software described
herein is licensed exclusively subject to the conditions set forth in the Kalido license agreement.


Contact Information
US Tel: +1 781 202 3200
Eur Tel: +44 (0)845 224 1236
Email: info@kalido.com
or visit our website at www.kalido.com

WP-DWBP0307