
Data warehouse

Index
Data warehouse
Strengths and weaknesses of implementing a Data warehouse
Data Warehouse Architecture
Data Warehouse Schema
Data mining
What data mining is used for
Data Mining Techniques

Data warehouse
A data warehouse, in its simplest perception, is no more than a collection of the key pieces of information used to manage and direct the business toward the most profitable outcome. This could be anything from deciding the level of stock on a supermarket shelf, through focused targeting of customers for loyalty schemes and direct mail promotions, to strategic decision making on major market segments and company profitability. In other words, a data warehouse is the data (meta/fact/dimension/aggregation) and the process managers (load/warehouse/query) that make information available, enabling people to make informed decisions. In computing, a data warehouse or enterprise data warehouse is a database used for reporting and data analysis. It is a central repository of data, created by integrating data from multiple disparate sources. Data warehouses store current as well as historical data and are used to create trending reports for senior management, such as annual and quarterly comparisons. The data stored in the warehouse are uploaded from the operational systems (such as marketing, sales, etc.).

The data may pass through an operational data store for additional operations before they are used in the data warehouse for reporting.

The typical ETL-based data warehouse uses staging, integration, and access layers to house its key functions. The staging layer, or staging database, stores raw data extracted from each of the disparate source systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data are arranged into hierarchical groups called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.

A data warehouse constructed from integrated data source systems does not require ETL, staging databases, or operational data store databases. The integrated data source systems may be considered part of a distributed operational data store layer. Data federation or data virtualization methods may be used to access the distributed integrated source systems and to consolidate and aggregate data directly into the data warehouse database tables. Unlike the ETL-based data warehouse, the integrated source systems and the data warehouse are fully integrated, since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports drilling down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems. Data warehouses can be subdivided into data marts. Data marts store subsets of data from a warehouse.
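As a rough illustration of the staging, integration, and star-schema ideas above, here is a minimal pandas sketch. It is not a full ETL implementation; every table and column name (staged, dim_product, fact_sales, and so on) and all the sample rows are invented for the example.

```python
import pandas as pd

# Staging layer: raw rows as extracted from a hypothetical source system.
staged = pd.DataFrame({
    "order_id":   [1, 2, 3, 4],
    "product":    ["Tea", "Coffee", "Tea", "Milk"],
    "store":      ["North", "North", "South", "South"],
    "qty":        [2, 1, 5, 3],
    "unit_price": [3.0, 4.5, 3.0, 1.2],
})

# Integration layer: a light transformation (derive a revenue measure).
staged["revenue"] = staged["qty"] * staged["unit_price"]

# Warehouse layer: split into dimension tables and a fact table (star schema).
dim_product = (staged[["product"]].drop_duplicates()
               .reset_index(drop=True).rename_axis("product_key").reset_index())
dim_store = (staged[["store"]].drop_duplicates()
             .reset_index(drop=True).rename_axis("store_key").reset_index())

fact_sales = (staged
              .merge(dim_product, on="product")
              .merge(dim_store, on="store")
              [["order_id", "product_key", "store_key", "qty", "revenue"]])

# Aggregate fact: revenue per store, the kind of summary reports are built on.
agg_revenue = (fact_sales.merge(dim_store, on="store_key")
               .groupby("store")["revenue"].sum())
print(agg_revenue)
```

The point of the sketch is the separation of concerns: the raw extract is kept untouched, the transformation happens in one place, and the query-friendly star schema (facts plus dimensions) is what analysts actually touch.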

Strengths and weaknesses of implementing a Data warehouse


Strengths
A Data Warehouse Delivers Enhanced Business Intelligence
By providing data from various sources, managers and executives no longer need to make business decisions based on limited data or on gut feeling. In addition, data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales.

A Data Warehouse Saves Time
Since business users can quickly access critical data from a number of sources, all in one place, they can rapidly make informed decisions on key initiatives.

They won't waste precious time retrieving data from multiple sources. Not only that, but business executives can query the data themselves with little or no support from IT, saving more time and more money. That means business users won't have to wait until IT gets around to generating the reports, and the hardworking folks in IT can do what they do best: keep the business running.

A Data Warehouse Enhances Data Quality and Consistency
A data warehouse implementation includes the conversion of data from numerous source systems into a common format. Since the data from the various departments are standardized, each department will produce results that are in line with all the other departments. So you can have more confidence in the accuracy of your data, and accurate data is the basis for strong business decisions.

A Data Warehouse Provides Historical Intelligence
A data warehouse stores large amounts of historical data, so you can analyze different time periods and trends in order to make future predictions. Such data typically cannot be stored in a transactional database or used to generate reports from a transactional system.

A Data Warehouse Generates a High ROI
Finally, the pièce de résistance: return on investment. Companies that have implemented data warehouses and complementary BI systems have generated more

revenue and saved more money than companies that haven't invested in BI systems and data warehouses.

Weaknesses
Data warehouses are not the optimal environment for unstructured data.
Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
Over their life, data warehouses can have high costs.
Data warehouses can get outdated relatively quickly.
There is a cost of delivering suboptimal information to the organization.

Data Warehouse Architecture


Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture than to discuss the specifics of any one system. In general, all data warehouse systems have the following layers:

Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer

Data Source Layer
This represents the different data sources that feed data into the data warehouse. The data source can be of any format: a plain text file, a relational database, another type of database, an Excel file, and so on can all act as a data source. Many different types of data can serve as a data source:

Operations data, such as sales, HR, product, inventory, marketing, and systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census, demographic, or survey data.

All these data sources together form the Data Source Layer.

Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely to be any major data transformation.

Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes subsequent data processing and integration easier.

ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this layer.

Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality, three types of entities can be found here: the data warehouse, the data mart, and the operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

Data Presentation Layer
This refers to the information that reaches the users. It can take the form of a tabular or graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool is used in this layer.

Metadata Layer
This is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something that sits in the metadata layer. A metadata tool is often used to manage metadata.

System Operations Layer
This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
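The layers above can be thought of as a chain of small, separate responsibilities. The toy Python sketch below walks one batch of rows through a few of them end to end; every function name, rule, and data value in it is invented purely to show where each layer's responsibility would sit, not how a real warehouse is built.

```python
# A toy walk through some of the layers; data, rules and names are made up.
def extract(source_rows):
    # Data Extraction Layer: pull rows with only minimal cleansing (drop blanks).
    return [r.strip() for r in source_rows if r.strip()]

def stage(rows):
    # Staging Area: park the raw rows before any real transformation.
    return [r.split(",") for r in rows]

def transform(staged):
    # ETL Layer: apply business logic (cast amounts, flag large orders).
    out = []
    for customer, amount in staged:
        amount = float(amount)
        out.append({"customer": customer, "amount": amount,
                    "large_order": amount > 100})
    return out

def store(records):
    # Data Storage Layer: keep the cleansed records (here, just a list).
    return list(records)

def present(warehouse):
    # Data Presentation Layer: a simple "report" for the end user.
    for rec in warehouse:
        flag = " (LARGE)" if rec["large_order"] else ""
        print(f'{rec["customer"]}: {rec["amount"]:.2f}{flag}')

source = ["alice,120.0", "bob,35.5", "  ", "carol,99.0"]
present(store(transform(stage(extract(source)))))
```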

Data Warehouse Schema


Star Schema
The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables, and the points of the star are the dimension tables.

Snowflake Schema
The snowflake schema is an extension of the star schema, in which each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimension table, whereas in a snowflake schema that dimension table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.
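To make the star/snowflake distinction concrete, here is a small sketch contrasting a single denormalized dimension table with its snowflaked (normalized) form. The product, category, and department values are invented for illustration.

```python
import pandas as pd

# Star schema: one denormalized dimension table for the product dimension.
dim_product_star = pd.DataFrame({
    "product_key": [1, 2, 3],
    "product":     ["Tea", "Coffee", "Milk"],
    "category":    ["Beverage", "Beverage", "Dairy"],
    "department":  ["Grocery", "Grocery", "Grocery"],
})

# Snowflake schema: the same dimension normalized into lookup tables,
# one per level of the product -> category -> department hierarchy.
lookup_department = pd.DataFrame({"department_key": [1],
                                  "department": ["Grocery"]})
lookup_category = pd.DataFrame({"category_key": [1, 2],
                                "category": ["Beverage", "Dairy"],
                                "department_key": [1, 1]})
dim_product_snow = pd.DataFrame({"product_key": [1, 2, 3],
                                 "product": ["Tea", "Coffee", "Milk"],
                                 "category_key": [1, 1, 2]})

# Joining the lookups back together reproduces the star-style dimension.
rebuilt = (dim_product_snow
           .merge(lookup_category, on="category_key")
           .merge(lookup_department, on="department_key")
           [["product_key", "product", "category", "department"]])
print(rebuilt)
```

The trade-off shown here is the usual one: the snowflake form removes redundancy, while the star form keeps queries simpler because fewer joins are needed.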

Data mining
Data mining is a process of statistical analysis. Analysts use technical tools to query and sort through terabytes of data looking for patterns. Usually, the analyst will develop a hypothesis, such as "customers who buy product X usually buy product Y within six months." Running a query on the relevant data to prove or disprove this hypothesis is data mining. Businesses then use this information to make better business decisions based on their understanding of their customers' and suppliers' behavior.
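Taking the hypothesis above as an example, a minimal check in pandas might look like the sketch below. The transaction table, product names, and dates are all invented; a real analysis would run the same idea over the warehouse's transaction history.

```python
import pandas as pd

# Hypothetical purchase history: one row per (customer, product, date).
tx = pd.DataFrame({
    "customer": [1, 1, 2, 2, 3, 3],
    "product":  ["X", "Y", "X", "Y", "X", "Z"],
    "date": pd.to_datetime(["2023-01-05", "2023-03-10", "2023-02-01",
                            "2023-09-15", "2023-04-01", "2023-05-20"]),
})

x_buys = tx[tx["product"] == "X"][["customer", "date"]].rename(columns={"date": "x_date"})
y_buys = tx[tx["product"] == "Y"][["customer", "date"]].rename(columns={"date": "y_date"})

# For each customer who bought X, did a purchase of Y follow within ~6 months?
pairs = x_buys.merge(y_buys, on="customer", how="left")
pairs["y_within_6m"] = ((pairs["y_date"] > pairs["x_date"]) &
                        (pairs["y_date"] <= pairs["x_date"] + pd.DateOffset(months=6)))

rate = pairs.groupby("customer")["y_within_6m"].any().mean()
print(f"Share of X buyers who also bought Y within 6 months: {rate:.0%}")
```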

What data mining is used for


Collection

All companies with a large customer base have a number of customers who do not pay their dues on time. Collecting these payments requires a great deal of resources, and a large proportion of this work is wasted on debts that are difficult or impossible to recover. By applying predictive data mining to historical customer debt data, the collection procedure can be optimized by identifying the debtors most likely to pay and finding the most effective contact methods or legal actions for each debtor.
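A minimal sketch of that idea, assuming a historical table of debtors with a known "paid after contact" outcome. The column names, figures, and model choice (a plain logistic regression from scikit-learn) are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical collection data: features plus the known outcome.
history = pd.DataFrame({
    "days_overdue":      [10, 45, 90, 20, 120, 60, 15, 200],
    "amount_due":        [100, 500, 250, 80, 900, 300, 60, 1200],
    "previous_defaults": [0, 1, 2, 0, 3, 1, 0, 4],
    "paid_after_contact":[1, 1, 0, 1, 0, 1, 1, 0],   # 1 = eventually paid
})

features = ["days_overdue", "amount_due", "previous_defaults"]
model = LogisticRegression().fit(history[features], history["paid_after_contact"])

# Score current debtors and work the most promising ones first.
current = pd.DataFrame({"days_overdue": [30, 150],
                        "amount_due": [200, 1000],
                        "previous_defaults": [0, 3]})
current["p_pay"] = model.predict_proba(current[features])[:, 1]
print(current.sort_values("p_pay", ascending=False))
```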
Scientific applications

In pharmaceutical companies, chemistry is one of the most resource-intensive areas within research and development (R&D). The whole purpose of a pharmaceutical company's R&D is to produce new chemical entities (NCEs) that will make it all the way through clinical trials to the market as new drugs. The search for new chemical compounds is in essence a trial-and-error process. The job of the R&D chemist is to synthesise (produce) new compounds for testing in the laboratory, and a chemist may spend up to three weeks making just one such compound. Despite the expensive production of new compounds, pharmaceutical companies test very many compounds in their R&D activities and store all the results in large databases (there is also a fairly large industry selling chemical compounds of great diversity). By applying predictive data mining to such historical laboratory test data, it is possible to reasonably predict the outcome of the laboratory tests without having to synthesise the compounds. This means that the chemist can find out the most likely properties of alternative compounds and choose to work on the most promising ones before spending the next three weeks in the laboratory, thereby increasing the quality of the resulting compounds.
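A toy version of that idea, with made-up molecular descriptors as inputs and a measured lab result as the target. None of the numbers below are real chemistry data, and the descriptor names are only placeholders; the point is simply that a model trained on past lab results can rank untested candidates.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Invented historical lab results: descriptor columns plus a measured activity.
lab = pd.DataFrame({
    "mol_weight": [180.2, 250.3, 310.4, 150.1, 420.5, 275.0],
    "logp":       [1.2, 2.8, 3.5, 0.9, 4.1, 2.2],
    "h_donors":   [2, 1, 0, 3, 1, 2],
    "activity":   [0.65, 0.80, 0.40, 0.55, 0.30, 0.75],  # measured in the lab
})

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(lab[["mol_weight", "logp", "h_donors"]], lab["activity"])

# Predict the likely activity of candidate compounds before synthesising them.
candidates = pd.DataFrame({"mol_weight": [200.0, 390.0],
                           "logp": [1.5, 3.9],
                           "h_donors": [2, 1]})
print(model.predict(candidates))
```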
Recommendation systems

All companies want to sell more to existing customers; this is often the most effective way of increasing profitability. For companies that sell many different products to large customer bases and keep records of sales transactions, it is possible to apply predictive data mining to identify sales opportunities: products likely to appeal to a particular customer who has not yet bought them. This type of application is also commonly referred to as cross-selling, and one of the most notable examples is Flipkart.com, where relevant books are recommended to you.
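A very small sketch of the cross-selling idea: count how often products are bought together across all customers, then recommend to a given customer the products most often co-purchased with what they already own. The baskets and product names are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Invented purchase histories: customer -> set of products already bought.
baskets = {
    "c1": {"book_a", "book_b"},
    "c2": {"book_a", "book_b", "book_c"},
    "c3": {"book_b", "book_c"},
    "c4": {"book_a", "book_c"},
}

# Count co-purchases across all customers.
co_counts = Counter()
for items in baskets.values():
    for a, b in combinations(sorted(items), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(owned, top_n=2):
    # Score every product the customer does not own yet by how often it
    # co-occurs with products they do own.
    scores = Counter()
    for have in owned:
        for (a, b), n in co_counts.items():
            if a == have and b not in owned:
                scores[b] += n
    return [p for p, _ in scores.most_common(top_n)]

print(recommend({"book_a"}))   # e.g. ['book_b', 'book_c']
```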

Other applications

Other applications of predictive data mining include fraud detection (e.g. in credit card transactions, taxation, telephony, and the insurance industry) and risk management (e.g. for determining insurance policy rates or managing credit applications).

Data Mining Techniques


Association
Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between a particular item and other items in the same transaction. For example, the association technique is used in market basket analysis to identify which products customers frequently purchase together. Based on this data, businesses can run corresponding marketing campaigns to sell more products and make more profit.

Classification
Classification is a classic data mining technique based on machine learning. Basically, classification is used to classify each item in a set of data into one of a predefined set of classes or groups. Classification makes use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we build software that can learn how to classify data items into groups. For example, we can apply classification in an application that, given all past records of employees who left the company, predicts which current employees are likely to leave in the future. In this case, we divide the employee records into two groups, "leave" and "stay", and then ask our data mining software to classify the employees into each group (see the short sketch after the clustering paragraph below).

Clustering
Clustering is a data mining technique that makes meaningful or useful clusters of objects with similar characteristics using an automatic technique. Unlike classification, where objects are assigned to predefined classes, the clustering technique itself defines the classes and puts objects into them. To make the concept clearer, we can take a library as an example. In a library, books cover a wide range of topics. The challenge is how to shelve those books so that readers can pick up several books on a specific topic without hassle. By using a clustering technique, we can keep books that have some kind of similarity in one cluster, or one shelf, and label it with a meaningful name. If readers want books on a topic, they only need to go to that shelf instead of searching the whole library.
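A compact sketch of the classification and clustering ideas above, using scikit-learn. The employee records (for the leave/stay example), the customer features, and the number of clusters are all invented for illustration; they are not a recipe for a real deployment.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# --- Classification: learn "leave" vs "stay" from past employee records. ---
past = pd.DataFrame({
    "years_at_company": [1, 7, 2, 10, 3, 8, 1, 6],
    "salary_band":      [1, 3, 1, 4, 2, 3, 1, 4],
    "left_company":     [1, 0, 1, 0, 1, 0, 1, 0],   # 1 = left, 0 = stayed
})
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(past[["years_at_company", "salary_band"]], past["left_company"])

current = pd.DataFrame({"years_at_company": [2, 9], "salary_band": [1, 4]})
print("predicted to leave:", clf.predict(current))   # predefined classes

# --- Clustering: group customers with no predefined classes at all. ---
customers = pd.DataFrame({"visits_per_month": [1, 2, 2, 10, 12, 11],
                          "avg_spend":        [15, 20, 18, 90, 110, 95]})
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print("cluster labels:", labels)
```

Note the difference the section describes: the classifier is told the classes in advance ("leave" / "stay"), while the clustering step discovers its own groups from the data.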

Prediction
Prediction, as its name implies, is a data mining technique that discovers the relationship between independent variables and the relationship between dependent and independent variables. For instance, the prediction technique can be used in sales to predict future profit: if we consider sales an independent variable, profit could be a dependent variable. Then, based on historical sales and profit data, we can fit a regression curve that is used for profit prediction.

Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover similar patterns in transaction data over a business period. The uncovered patterns are used for further business analysis to recognize relationships among the data.
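Following the sales-to-profit example above, a minimal prediction sketch fits a regression line to invented historical figures and extrapolates profit for a new sales level; the numbers and units are placeholders, not real business data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented historical data: monthly sales (independent) and profit (dependent).
sales  = np.array([[100], [150], [200], [250], [300]])   # e.g. units sold
profit = np.array([12.0, 19.0, 24.5, 31.0, 37.5])        # e.g. in thousands

model = LinearRegression().fit(sales, profit)

# Predict profit for a future sales level of 400 units.
print(model.predict(np.array([[400]])))   # the fitted line, extrapolated forward
```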

