
Q1. Differentiate between Data Mining and Data Warehousing. Ans.

Data mining is the process of finding patterns in a given data set. These patterns can often provide meaningful and insightful information to whoever is interested in that data. Data mining is used today in a wide variety of contexts: in fraud detection, as an aid in marketing campaigns, and even by supermarkets to study their consumers. Data warehousing, on the other hand, can be said to be the process of centralizing or aggregating data from multiple sources into one common repository. The two terms are often confused by both business and technical staff. The entire field of data management has experienced phenomenal growth with the spread of data collection software and the decreasing cost of computer storage. The primary purpose behind both functions is to provide the tools and methodologies to explore the patterns and meaning in large amounts of data. The primary differences between data mining and data warehousing are the system designs, the methodologies used, and the purpose. Data mining uses pattern recognition logic to identify trends within a sample data set and extrapolates this information to the larger body of data, whereas data warehousing is the process of extracting and storing data to allow easier reporting. Data mining is a general term used to describe a range of business processes that derive patterns from data; typically, a statistical analysis software package is used to identify specific patterns, based on the data set and the queries generated by the end user. Typical uses of data mining are to create targeted marketing programs, identify financial fraud, and flag unusual patterns in behavior as part of a security review.

Data mining is essentially the analysis of data. It is the computer-assisted process of digging through and analyzing enormous sets of data that have either been compiled by the computer or been fed into it. In data mining, the computer analyzes the data and extracts meaning from it. It also looks for hidden patterns within the data and tries to predict future behavior. Data mining is mainly used to find and show relationships among the data. The purpose of data mining, also known as knowledge discovery, is to allow businesses to view these behaviors, trends and relationships and to factor them into their decisions. This allows businesses to make proactive, knowledge-driven decisions. The term data mining comes from the fact that the process of searching for relationships between data is similar to mining for precious materials. Data mining tools use artificial intelligence, machine learning, statistics, and database systems to find correlations in the data. These tools can help answer business questions that were traditionally too time-consuming to resolve. Data mining involves several steps, including the raw analysis step, database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, postprocessing of discovered structures, visualization, and online updating.
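As a small illustration of the kind of correlation-finding mentioned above, the following sketch computes the correlation between two fields of a small data set with NumPy. The data and the field names (advertising spend, sales) are invented purely for illustration.

import numpy as np

# Hypothetical data: monthly advertising spend and monthly sales (invented numbers)
ad_spend = np.array([10, 12, 15, 11, 18, 20, 22, 19, 25, 28], dtype=float)
sales = np.array([110, 118, 140, 115, 160, 175, 190, 170, 210, 230], dtype=float)

# Pearson correlation coefficient between the two fields
r = np.corrcoef(ad_spend, sales)[0, 1]
print(f"Correlation between ad spend and sales: {r:.3f}")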

In contrast, data warehousing is completely different, although the two are interrelated. Data warehousing is the process of compiling information or data into a data warehouse. A data warehouse is a database used to store data; it is a central repository in which data from various sources is stored. This data warehouse is then used for reporting and data analysis. It can be used to create trend reports for senior management, such as annual and quarterly comparisons. The purpose of a data warehouse is to give the user flexible access to the data. Data warehousing generally refers to the combination of many different databases across an entire enterprise. The main difference between data warehousing and data mining is that data warehousing is the process of compiling and organizing data into one common database, whereas data mining is the process of extracting meaningful data from that database. Data mining can only be done once data warehousing is complete.

Q3. Explain the concepts of Data Integration and Transformation.

Ans. Data Integration Overview

Data integration is responsible for moving, cleansing and transforming set-based data, often very large data sets, from source(s) into the Production data area and then into the Consumption data area, as shown in Figure 3-1.

Figure 3-1: The role of data integration in a data warehouse project

The requirements for the data integration component include:

Trust: Business consumers must be able to trust the results obtained from the data warehouse.
One version of the truth: Consolidating disparate sources into an integrated view supports business consumers' need for an enterprise-level view of data.
Current and historical views of data: The ability to provide both a historical view of data and a recent view supports key business consumer activities such as trend analysis and predictive analysis.
Availability: Data integration processes must not interfere with business consumers' ability to get results from the data warehouse.

The challenges for the data integration team in support of these requirements include:

Data quality: The data integration team must treat data quality as a first-class citizen.
Transparency and auditability: Even high-quality results will be questioned by business consumers. Providing complete transparency into how the data results were produced is necessary to allay business consumers' concerns about data quality.

Tracking history: The ability to correctly report results at a particular point in time is an ongoing challenge, particularly when there are adjustments to historical data.
Reducing processing times: Efficiently processing very large volumes of data within ever-shortening processing windows is an ongoing challenge for the data integration team.

This section introduces the following key data integration concepts, which we will look at in more detail as we present best practices later in this chapter:

Consolidation, normalization, and standardization
Data integration paradigms (ETL and ELT)
ETL processing categories
Incremental loads
Detecting net changes (a minimal sketch of watermark-based change detection follows this list)
Data integration management concepts
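The following is a minimal, illustrative sketch of one common way to implement incremental loads by detecting net changes with a watermark column. The table and column names (source_orders, last_modified) are invented for illustration, and real ETL tooling such as SSIS provides its own mechanisms for this.

import sqlite3
from datetime import datetime, timedelta

# In-memory database standing in for a source system (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER, amount REAL, last_modified TEXT)")
now = datetime(2024, 1, 10, 12, 0, 0)
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?, ?)",
    [(1, 100.0, (now - timedelta(days=5)).isoformat()),
     (2, 250.0, (now - timedelta(days=1)).isoformat()),
     (3, 75.0, now.isoformat())],
)

# Watermark: the highest last_modified value processed by the previous load.
previous_watermark = (now - timedelta(days=2)).isoformat()

# Extract only the net changes since the previous load.
changed_rows = conn.execute(
    "SELECT id, amount, last_modified FROM source_orders WHERE last_modified > ?",
    (previous_watermark,),
).fetchall()

# Advance the watermark so the next run picks up only newer changes.
new_watermark = max(row[2] for row in changed_rows) if changed_rows else previous_watermark
print(changed_rows)    # rows 2 and 3 only
print(new_watermark)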

Consolidation, Normalization, and Standardization

Data integration processes typically have a long shelf life; it is not uncommon for an ETL process to be in production for more than 10 years. These processes undergo many revisions over time, and the number of data processes grows over time as well. In addition, different development teams often work on these processes without coordination or communication. The result is duplication of effort and multiple ETL processes moving the same source data to different databases. Each developer often uses different approaches for common data integration patterns, error handling, and exception handling. Worse yet, the lack of error and exception handling can make diagnosing errors and data exceptions very expensive. The absence of consistent development patterns and standards results in longer development cycles and increases the likelihood that the ETL code will contain bugs. Longer development times, inconsistent error and exception handling, and buggy code all contribute to increasing data integration TCO. Well-run data integration shops have recognized that time spent up-front on ETL consistency is well worth the effort, reducing both maintenance and development costs. ETL consistency is achieved through three practices: consolidation, normalization, and standardization.

Consolidation is the practice of managing the breadth of processes and servers that handle ETL operations. This includes both the operations that perform ETL, such as SSIS packages, and the databases and file stores that support the ETL, such as Data In and Production databases. If your environment has dozens of databases and servers that do not generate data but merely copy and transform data (and often the same data!), you are not alone. However, you likely spend a lot of time managing these duplicate efforts.

Normalization involves being consistent in your ETL processing approach. You can develop an ETL package in many different ways to get to the same result. However, not all approaches are efficient, and using different approaches to accomplish similar ETL scenarios makes managing the solutions difficult. Normalization is about being consistent in how you tackle data processing, taking a normal or routine implementation approach to similar tasks.

Standardization requires implementing code and detailed processes in a uniform pattern. If normalization is about processes, standardization is about the environment and code practices. Standardization in ETL can involve naming conventions, file management practices, server configurations, and so on.

Data integration standards, like any standards, need to be defined up-front and then enforced. ETL developers and architects should implement the standards during development; you should never agree to implement standards later. Let's look at each of these practices more closely.

Consolidation

Suppose you work for a moderately large organization, such as an insurance company. The core systems involve policy management, claims, underwriting, CRM, accounting, and agency support. As with most insurance companies, your organization has multiple systems performing similar operations, either due to industry consolidation and acquisitions or to support the various insurance products offered. The supporting LOB systems and department applications far outnumber the main systems because of the data-centric nature of insurance. However, many of the systems require data inputs from the core systems, making the web of information sharing very complicated. Figure 3-4 shows the conceptual data layout and the connections between the systems.

Figure 3-4: System dependency scenario for an insurance company

Each line in Figure 3-4 involves ETL of some nature. In some cases, the ETL is merely an import and export of raw data. Other cases involve more complicated transformation or cleansing logic, or even the integration of third-party data for underwriting or marketing. If each line in the diagram were a separate, uncoordinated ETL process, the management and IT support costs of this scenario would be overwhelming. The fact is that many of the processes involve the same data, making consolidation of the ETL greatly beneficial. Normalizing consistent data processes (such as the summarization of claims data) would help stabilize the diversity of operations that perform an aggregation. In addition, the sheer number of ETL operations between systems would benefit from a consolidation of the servers handling the ETL, as well as from the normalization of raw file management and the standardization of supporting database names and even ETL package naming conventions.

Normalization

Because normalization is about being consistent in the approach to processes, ETL has several layers of normalization. In fact, a large part of this chapter is dedicated to normalization, first as we look at common patterns found in ETL solutions and then later as we cover best practices. Normalization in ETL includes, but is not limited to:

Common data extraction practices across varying source systems
Consistent approaches to data lineage and metadata reporting
Uniform practices for data cleansing routines
Defined patterns for handling versioning and data changes (a small illustrative sketch of one such pattern follows this list)
Best-practice approaches for efficient data loading
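As one hedged illustration of a defined pattern for handling versioning and data changes, the sketch below applies a Type 2 slowly changing dimension style of history keeping, where a changed record is expired and a new current version is inserted. This pattern is not taken from the chapter itself, and the record layout and field names (customer_id, valid_from, valid_to, is_current) are invented for illustration.

from datetime import date

# Existing dimension rows: one current version per customer (illustrative layout).
dimension = [
    {"customer_id": 42, "city": "Austin", "valid_from": date(2020, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_change(dimension, customer_id, new_city, change_date):
    """Expire the current row for the customer and insert a new current version."""
    for row in dimension:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no net change, nothing to version
            row["valid_to"] = change_date    # close out the old version
            row["is_current"] = False
    dimension.append({"customer_id": customer_id, "city": new_city,
                      "valid_from": change_date, "valid_to": None,
                      "is_current": True})

# A source system reports that customer 42 moved.
apply_change(dimension, 42, "Denver", date(2024, 6, 1))
for row in dimension:
    print(row)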

Standardization

Although it sounds basic, the first step toward standardization is implementing consistent naming conventions for SSIS packages. Consider the screen shot in Figure 3-5, which represents a small slice of the ETL packages on a single server. For someone trying to track down an issue or identify the packages that affect a certain system, the confusion caused by the variety of naming styles creates huge inefficiencies. It is hard enough for an experienced developer or support engineer to remember all the names and processes; add a new developer or IT support person to the mix, and the challenges increase.

b. Data Transformation

In statistics, data transformation refers to the application of a deterministic mathematical function to each point in a data set; that is, each data point z_i is replaced with the transformed value y_i = f(z_i), where f is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs. Nearly always, the function used to transform the data is invertible, and generally it is continuous. The transformation is usually applied to a collection of comparable measurements. For example, if we are working with data on people's incomes in some currency unit, it would be common to transform each person's income value by the logarithm function.

Guidance for how data should be transformed, or whether a transform should be applied at all, should come from the particular statistical analysis to be performed. For example, a simple way to construct an approximate 95% confidence interval for the population mean is to take the sample mean plus or minus two standard error units. However, the constant factor 2 used here is particular to the normal distribution and is only applicable if the sample mean varies approximately normally. The central limit theorem states that in many situations the sample mean does vary normally if the sample size is reasonably large. However, if the population is substantially skewed and the sample size is at most moderate, the approximation provided by the central limit theorem can be poor, and the resulting confidence interval will likely have the wrong coverage probability. Thus, when there is evidence of substantial skew in the data, it is common to transform the data to a symmetric distribution before constructing a confidence interval. If desired, the confidence interval can then be transformed back to the original scale using the inverse of the transformation that was applied to the data.

Data can also be transformed to make them easier to visualize. For example, suppose we have a scatter plot in which the points are the countries of the world, and the data values being plotted are the land area and population of each country. If the plot is made using untransformed data (e.g. square kilometers for area and the number of people for population), most of the countries would be plotted in a tight cluster of points in the lower left corner of the graph. The few countries with very large areas and/or populations would be spread thinly around most of the graph's area. Simply rescaling units (e.g. to thousands of square kilometers, or to millions of people) will not change this. However, following logarithmic transformations of area and population, the points will be spread more uniformly in the graph.
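The following is a minimal sketch of the confidence-interval idea described above, assuming right-skewed, positive-valued data (the numbers are simulated, not real incomes): the data are log-transformed, an approximate 95% interval for the mean is formed on the log scale, and the endpoints are mapped back with the inverse transform.

import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=1, size=200)    # simulated skewed, positive data

log_incomes = np.log(incomes)                           # transform toward symmetry
n = len(log_incomes)
mean_log = log_incomes.mean()
se_log = log_incomes.std(ddof=1) / np.sqrt(n)           # standard error on the log scale

# Approximate 95% interval on the log scale: mean plus or minus two standard errors
lo_log, hi_log = mean_log - 2 * se_log, mean_log + 2 * se_log

# Map the endpoints back to the original scale with the inverse transform (exp)
print("interval on original scale:", np.exp(lo_log), "to", np.exp(hi_log))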
A final reason that data can be transformed is to improve interpretability, even if no formal statistical analysis or visualization is to be performed. For example, suppose we are comparing cars in terms of their fuel economy. These data are usually presented as "kilometers per liter" or "miles per gallon." However, if the goal is to assess how much additional fuel a person would use in one year when driving one car compared to another, it is more natural to work with the data transformed by the reciprocal function, yielding liters per kilometer, or gallons per mile.
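As a small worked illustration of this reciprocal transformation (the fuel-economy ratings and annual mileage below are invented for the example):

# Two hypothetical cars rated 20 mpg and 25 mpg, each driven 12,000 miles per year.
mpg_a, mpg_b = 20.0, 25.0
miles_per_year = 12_000

# Reciprocal transform: gallons per mile, which is directly proportional to fuel used.
gpm_a, gpm_b = 1 / mpg_a, 1 / mpg_b

gallons_a = gpm_a * miles_per_year    # 600 gallons
gallons_b = gpm_b * miles_per_year    # 480 gallons
print("extra fuel per year:", gallons_a - gallons_b, "gallons")    # 120 gallons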

The logarithm and square root transformations are commonly used for positive data, and the multiplicative inverse (reciprocal) transformation can be used for non-zero data. The power transform is a family of transformations parameterized by a non-negative value λ that includes the logarithm, square root, and multiplicative inverse as special cases. To approach data transformation systematically, it is possible to use statistical estimation techniques to estimate the parameter λ in the power transform, thereby identifying the transform that is approximately the most appropriate in a given setting. Since the power transform family also includes the identity transform, this approach can also indicate whether it would be best to analyze the data without a transformation. In regression analysis, this approach is known as the Box-Cox technique. The reciprocal and some power transformations can be meaningfully applied to data that include both positive and negative values (the power transform is invertible over all real numbers if λ is an odd integer). However, when both negative and positive values are observed, it is more common to begin by adding a constant to all values, producing a set of non-negative data to which any power transform can be applied.

A common situation where a data transformation is applied is when a value of interest ranges over several orders of magnitude. Many physical and social phenomena exhibit such behavior: incomes, species populations, galaxy sizes, and rainfall volumes, to name a few. Power transforms, and in particular the logarithm, can often be used to induce symmetry in such data. The logarithm is often favored because it is easy to interpret its result in terms of "fold changes." The logarithm also has a useful effect on ratios. If we are comparing positive quantities X and Y using the ratio X / Y, then if X < Y the ratio is in the unit interval (0,1), whereas if X > Y the ratio is in the half-line (1,∞), and a ratio of 1 corresponds to equality. In an analysis where X and Y are treated symmetrically, the log-ratio log(X / Y) is zero in the case of equality, and it has the property that if X is K times greater than Y, the log-ratio is equidistant from zero as in the situation where Y is K times greater than X (the log-ratios are log(K) and −log(K) in these two situations).

Q4. Differentiate between database management systems (DBMS) and data mining.

Ans. A DBMS (Database Management System) is a complete system used for managing digital databases that allows storage of database content, creation and maintenance of data, search, and other functionalities. Data mining, on the other hand, is a field in computer science which deals with the extraction of previously unknown and interesting information from raw data. Usually, the data used as the input for the data mining process is stored in databases. Users who are inclined toward statistics use data mining; they utilize statistical models to look for hidden patterns in data. Data miners are interested in finding useful relationships between different data elements, which is ultimately profitable for businesses.

DBMS

A DBMS, sometimes just called a database manager, is a collection of computer programs dedicated to the management (i.e. organization, storage and retrieval) of all databases that are installed in a system (i.e. on a hard drive or network).
There are different types of Database Management Systems in existence, and some of them are designed for the proper management of databases configured for specific purposes. The most popular commercial Database Management Systems are Oracle, DB2 and Microsoft Access. All these products provide means of allocating different levels of privileges to different users, making it possible for a DBMS to be controlled centrally by a single administrator or to be administered by several different people.

There are four important elements in any Database Management System: the modeling language, data structures, the query language, and the mechanism for transactions. The modeling language defines the language of each database hosted in the DBMS; currently several popular approaches, such as the hierarchical, network, relational and object models, are in practice. Data structures help organize the data, such as individual records, files, fields and their definitions, and objects such as visual media. The data query language maintains the security of the database by monitoring login data, access rights of different users, and protocols for adding data to the system; SQL is a popular query language used in Relational Database Management Systems. Finally, the mechanism that allows for transactions helps with concurrency and multiplicity; it ensures that the same record will not be modified by multiple users at the same time, thus keeping the data integrity intact. Additionally, DBMSs provide backup and other facilities as well.

Data Mining

Data mining is also known as Knowledge Discovery in Data (KDD). As mentioned above, it is a field of computer science which deals with the extraction of previously unknown and interesting information from raw data. Due to the exponential growth of data, especially in areas such as business, data mining has become a very important tool for converting this large wealth of data into business intelligence, as manual extraction of patterns has become seemingly impossible in the past few decades. For example, it is currently used for various applications such as social network analysis, fraud detection and marketing.

Data mining usually deals with the following four tasks: clustering, classification, regression, and association. Clustering is identifying similar groups from unstructured data. Classification is learning rules that can be applied to new data and typically includes the following steps: preprocessing of data, designing the model, learning/feature selection, and evaluation/validation. Regression is finding functions with minimal error to model data. And association is looking for relationships between variables. Data mining is usually used to answer questions like "what are the main products that might help to obtain high profit next year in Wal-Mart?"

What is the difference between DBMS and data mining? A DBMS is a full-fledged system for housing and managing a set of digital databases, whereas data mining is a technique or concept in computer science which deals with extracting useful and previously unknown information from raw data. Most of the time, these raw data are stored in very large databases. Therefore, data miners use the existing functionalities of a DBMS to handle, manage and even preprocess raw data before and during the data mining process. However, a DBMS alone cannot be used to analyze data, although some present-day DBMSs have built-in data analysis tools or capabilities.

Q5. Differentiate between K-means and Hierarchical clustering.

Ans. k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms.
Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

Given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k <= n), S = {S_1, S_2, ..., S_k}, so as to minimize the within-cluster sum of squares (WCSS):

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2

where \mu_i is the mean of the points in S_i.

Standard algorithm

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community. Given an initial set of k means m_1^{(1)}, ..., m_k^{(1)} (see below), the algorithm proceeds by alternating between two steps: [7]

Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.)

S_i^{(t)} = \{ x_p : \| x_p - m_i^{(t)} \|^2 \le \| x_p - m_j^{(t)} \|^2 \ \forall j, 1 \le j \le k \}

where each x_p is assigned to exactly one S^{(t)}, even if it could be assigned to two or more of them.

Update step: Calculate the new means to be the centroids of the observations in the new clusters:

m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j
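A minimal NumPy sketch of these two alternating steps follows. It is not production code: it assumes random initialization from the data points, and real libraries such as scikit-learn's KMeans add smarter initialization and multiple restarts.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps until convergence."""
    rng = np.random.default_rng(seed)
    # Initialize the k means by picking k distinct observations at random.
    means = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Assignment step: each point goes to the cluster with the nearest mean.
        distances = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments no longer change, so the algorithm has converged
        labels = new_labels
        # Update step: each mean becomes the centroid of its assigned points.
        for i in range(k):
            if np.any(labels == i):
                means[i] = X[labels == i].mean(axis=0)
    return labels, means

# Tiny synthetic example: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(5, 0.5, size=(20, 2))])
labels, means = kmeans(X, k=2)
print(means)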

Since the arithmetic mean is a least-squares estimator, the update step also minimizes the within-cluster sum of squares (WCSS) objective. The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective, and there exist only a finite number of such partitionings, the algorithm must converge to a (local) optimum. There is no guarantee that the global optimum is found using this algorithm.

The algorithm is often presented as assigning objects to the nearest cluster by distance. This is slightly inaccurate: the algorithm aims at minimizing the WCSS objective, and thus assigns by "least sum of squares". Using a distance function other than (squared) Euclidean distance may stop the algorithm from converging. It is correct that the smallest Euclidean distance yields the smallest squared Euclidean distance and thus also yields the smallest sum of squares. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow the use of other distance measures.

In data mining, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: a "bottom up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a "top down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is O(n^3), which makes it too slow for large data sets. Divisive clustering with an exhaustive search is O(2^n), which is even worse. However, for some special cases, optimal efficient agglomerative methods (of complexity O(n^2)) are known: SLINK[1] for single-linkage and CLINK for complete-linkage clustering.

In order to decide which clusters should be combined (for agglomerative clustering), or where a cluster should be split (for divisive clustering), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (1,0) and the origin (0,0) is always 1 according to the usual norms, but the distance between the point (1,1) and the origin (0,0) can be 2 under the Manhattan distance, √2 under the Euclidean distance, or 1 under the maximum distance. Some commonly used metrics for hierarchical clustering are:

Names and formulas:

Euclidean distance: \| a - b \|_2 = \sqrt{\sum_i (a_i - b_i)^2}
Squared Euclidean distance: \| a - b \|_2^2 = \sum_i (a_i - b_i)^2
Manhattan distance: \| a - b \|_1 = \sum_i |a_i - b_i|
Maximum distance: \| a - b \|_\infty = \max_i |a_i - b_i|
Mahalanobis distance: \sqrt{(a - b)^T S^{-1} (a - b)}, where S is the covariance matrix
Cosine similarity: \frac{a \cdot b}{\| a \| \, \| b \|}

For text or other non-numeric data, metrics such as the Hamming distance or Levenshtein distance are often used.

A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
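A short illustrative sketch of agglomerative clustering with SciPy follows, using the Euclidean metric and average linkage on a small synthetic data set; the data and the parameter choices are arbitrary and for illustration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Small synthetic data set: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)), rng.normal(3, 0.3, size=(10, 2))])

# Pairwise Euclidean distances, then agglomerative merging with average linkage.
distances = pdist(X, metric="euclidean")
Z = linkage(distances, method="average")

# Cut the resulting hierarchy into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# Z encodes the merge hierarchy; scipy.cluster.hierarchy.dendrogram(Z) would plot it.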

Q6. Differentiate between Web content mining and Web usage mining.

Ans. Web usage mining is the third category in web mining. This type of web mining allows for the collection of Web access information for Web pages. This usage data provides the paths leading to accessed Web pages. This information is often gathered automatically into access logs via the Web server. CGI scripts offer other useful information such as referrer logs, user subscription information and survey logs. This category is important to the overall use of data mining for companies and their internet/intranet based applications and information access.

Usage mining allows companies to produce productive information pertaining to the future of their business functionality. Some of this information can be derived from the collective information of lifetime user value, product cross-marketing strategies and promotional campaign effectiveness. The usage data that is gathered provides companies with the ability to produce results that are more effective for their businesses and increase sales. Usage data can also be useful for developing marketing approaches that will out-sell competitors and promote the company's services or products on a higher level.

Usage mining is valuable not only to businesses using online marketing, but also to e-businesses whose business is based solely on the traffic provided through search engines. The use of this type of web mining helps to gather the important information from customers visiting the site. This enables an in-depth log for complete analysis of a company's productivity flow. E-businesses depend on this information to direct the company to the most effective Web server for promotion of their product or service. This web mining also enables Web-based businesses to provide the best access routes to services or other advertisements. When a company advertises services provided by other companies, the usage mining data allows for the most effective access paths to these portals.

In addition, there are typically three main uses for mining in this fashion. The first is usage processing, used to complete pattern discovery. This is also the most difficult use because only bits of information, such as IP addresses, user information and site clicks, are available; with this minimal amount of information it is harder to track the user through a site, since it does not follow the user throughout the pages of the site. The second use is content processing, consisting of the conversion of Web information such as text, images, scripts and others into useful forms. This helps with the clustering and categorization of Web page information based on the titles, specific content and images available. Finally, the third use is structure processing. This consists of analysis of the structure of each page contained in a Web site, which can prove difficult if a new structure analysis has to be performed for each page.

Analysis of this usage data will provide companies with the information needed to provide an effective presence to their customers. This collection of information may include user registration, access logs and information leading to better Web site structure, proving to be most valuable to company online marketing. These present some of the benefits for external marketing of the company's products, services and overall management.
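As a minimal illustration of how usage data in a Web server access log might be turned into page-view counts, the sketch below parses a few fabricated log lines. The regular expression assumes the common Apache/NCSA combined log format, and real usage-mining tools do far more, such as sessionization and path analysis.

import re
from collections import Counter

# Fabricated access-log lines in the common Apache/NCSA combined format.
log_lines = [
    '10.0.0.1 - - [10/Jun/2024:10:01:02 +0000] "GET /products HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [10/Jun/2024:10:01:05 +0000] "GET /products/widget HTTP/1.1" 200 734 "/products" "Mozilla/5.0"',
    '10.0.0.1 - - [10/Jun/2024:10:02:10 +0000] "GET /cart HTTP/1.1" 200 128 "/products/widget" "Mozilla/5.0"',
]

# Pull out the requested path from each request line.
pattern = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

page_counts = Counter()
for line in log_lines:
    match = pattern.search(line)
    if match:
        page_counts[match.group(1)] += 1

# Most frequently accessed paths, a first step toward discovering usage patterns.
print(page_counts.most_common())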

Internally, usage mining effectively provides information for improving communication through intranet channels. Developing strategies through this type of mining allows intranet-based company databases to be more effective through the provision of easier access paths. The projection of these paths helps to log user registration information and brings commonly used paths to the forefront. Therefore, it is easily determined that usage mining has valuable uses for the marketing of businesses and a direct impact on the success of their promotional strategies and internet traffic. This information is gathered on a daily basis and analyzed consistently. Analysis of this pertinent information will help companies to develop more effective promotions, internet accessibility, inter-company communication and structure, and productive marketing approaches through web usage mining.
