MC0088

WINTER 2013 ASSIGNMENT
MC0088- DATA WAREHOUSING & DATA MINING
1. Differentiate between Data Mining and Data Warehousing.
Ans. Data mining is the analysis of data. It is the computer-assisted process of digging through and analyzing enormous sets of data that have either been compiled by the computer or entered into it. In data mining, the computer analyzes the data and extracts meaning from it. It also looks for hidden patterns within the data and tries to predict future behavior. Data mining is mainly used to find and show relationships among the data.

The purpose of data mining, also known as knowledge discovery, is to let businesses see these behaviors, trends and/or relationships and factor them into their decisions. This allows businesses to make proactive, knowledge-driven decisions.

The term "data mining" comes from the fact that the process of data mining, i.e. searching for relationships between data, is similar to mining and searching for precious materials. Data mining tools use artificial intelligence, machine learning, statistics, and database systems to find correlations in the data. These tools can help answer business questions that traditionally were too time-consuming to resolve.

Data mining includes various steps: the raw analysis step, database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Data warehousing, in contrast, is a different activity, although the two are interrelated. Data warehousing is the process of compiling information or data into a data warehouse. A data warehouse is a database used to store data: a central repository in which data from various sources is stored. The data warehouse is then used for reporting and data analysis. It can be used to create trend reports for senior management, such as annual and quarterly comparisons. The purpose of a data warehouse is to give the user flexible access to the data. Data warehousing generally refers to the combination of many different databases across an entire enterprise.
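As a rough illustration of the warehousing side, the sketch below compiles rows from two sources into one central table and produces a quarterly comparison. It is only a minimal sketch: the source names, the sample figures and the helper functions `build_warehouse` and `quarterly_totals` are assumptions made for this example, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical extracts (sale_date, amount) from two departmental systems.
store_sales = [("2012-01-15", 120.0), ("2012-04-02", 80.0), ("2013-01-20", 150.0)]
online_sales = [("2012-02-10", 60.0), ("2013-02-05", 90.0), ("2013-05-11", 40.0)]


def build_warehouse(sources):
    """Compile rows from several sources into one central table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (source TEXT, sale_date TEXT, amount REAL)")
    for name, rows in sources.items():
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)",
            [(name, sale_date, amount) for sale_date, amount in rows],
        )
    return conn


def quarterly_totals(conn):
    """Return {(year, quarter): total} for a quarterly comparison report."""
    query = """
        SELECT CAST(strftime('%Y', sale_date) AS INTEGER) AS yr,
               (CAST(strftime('%m', sale_date) AS INTEGER) + 2) / 3 AS qtr,
               SUM(amount)
        FROM sales
        GROUP BY yr, qtr
        ORDER BY yr, qtr
    """
    return {(yr, qtr): total for yr, qtr, total in conn.execute(query)}


warehouse = build_warehouse({"store": store_sales, "online": online_sales})
report = quarterly_totals(warehouse)
for (yr, qtr), total in report.items():
    print(f"{yr} Q{qtr}: {total:.2f}")
```

The report only summarizes what is already stored. Looking for patterns that explain or predict the figures would be the data mining step.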
The main difference between data warehousing and data mining is that data warehousing is the process of compiling and organizing data into one common database, whereas data mining is the process of extracting meaningful information from that database. Data mining is therefore normally done after the data has been warehoused.

2. Describe the key features of a Data Warehouse.
Ans. Data warehousing combines technology, methodologies, tools and techniques, user management systems and data manipulation systems. In the dictionary sense, warehousing means gathering data from many different kinds of sources and interrelating the compiled information. Many leading organizations maintain their own warehouse for collecting and maintaining data. Data warehousing currently holds an important position in the market for the organizations that practice it, and it accounts for a large area of the present economy.

The best-known people in the foundation and origination of data warehousing are Ralph Kimball and Bill Inmon, who are collectively regarded as its pioneers. Before data warehousing arrived, there was no concept of storing and synchronizing data according to how the data was needed. Many research papers were published in 2002, the year in which it was founded, but the core concept was developed by Bill Inmon in 1990.

Features of data warehousing: Data warehousing has several unique and significant properties. Some of the major ones are as follows.

Decision-making support: Warehousing gives strong support to the entire decision-making process, because its core components involve all the major plans, methodologies and techniques that will be implemented to achieve the goal. Conceptualized and compiled data helps in taking quick and accurate decisions.

Subject orientation: Another important characteristic of warehousing is that it is subject-oriented. The data is gathered from different sources, each with its own background and application-specific peculiarities. This helps smooth the company's regular operations, because grounded knowledge about everything required is available through the warehouse.
Integration: Another important and fundamental characteristic of the warehouse is the integration of data. The data is gathered from different sources, compiled, and merged into a single database that can be applied dynamically and in diverse ways.

Time flexibility: All data stored in the warehouse is identified by a specific time period, according to the need for the data.

Non-volatile form of data: Before warehousing arrived, secondary storage was regarded as the best way to save information, but warehousing also supports integration, cohesiveness and multidimensional application of the data. Warehousing is one of the best ways to preserve knowledge for effective use in the future. The data stored in the warehouse remains stable and safe, which makes it more reliable.

Bulk storage: Data can be stored in large volumes, according to the size of the warehouse. It is up to the organization what kind and amount of data it needs to store for future use.

Accurate and grounded: Another property of the data stored in the warehouse is that it is accurate and grounded, containing all the practically possible theories and techniques. The essence of the related field is, in effect, stored in the warehouse. A number of technologies are involved in preserving the data, which makes it discrete, effective and multidimensional.

Future perception: Warehousing was officially introduced in 2002 and is becoming better known day by day. At present many organizations, especially larger ones, have their own warehouses for preserving different types of data. Engineers and programmers are working to set up online warehousing systems that make access to data more efficient and quicker. Warehousing is one of the most effective techniques for saving large and dynamic amounts of data.

3. Differentiate between Data Integration and Transformation.
Ans. Data integration involves combining data residing in different sources and providing users
with a unified view of these data. This process becomes significant in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example). Data integration appears with increasing frequency as the volume of existing data, and the need to share it, explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. In management circles, people frequently refer to data integration as "Enterprise Information Integration" (EII).

The theory of data integration forms a subset of database theory and formalizes the underlying concepts of the problem in first-order logic. Applying the theories gives indications as to the feasibility and difficulty of data integration. While its definitions may appear abstract, they have sufficient generality to accommodate all manner of integration systems.
Definitions: Data integration systems are formally defined as a triple (G, S, M), where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G, and the mapping then asserts connections between the elements in the global schema and the source schemas.

A database over a schema is defined as a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S comprises the set of sets of tuples for each of the heterogeneous data sources and is called the source database. Note that this single source database may actually represent a collection of disconnected databases. The database corresponding to the virtual mediated schema G is called the global database. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence exist: Global as View (GAV) and Local as View (LAV).
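The Global as View idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relation names (`crm_customers`, `billing_accounts`, `customer`), the sample tuples and the helper functions are all hypothetical, and Python functions stand in for the logical queries of the formal definition.

```python
# Source database: one set of tuples per source relation (hypothetical data).
source_db = {
    "crm_customers": {("Asha", "Pune"), ("Ravi", "Delhi")},
    "billing_accounts": {("Ravi", "Delhi", 250.0), ("Meera", "Kochi", 0.0)},
}


def customer_view(src):
    """GAV assertion: the global relation customer(name, city) is defined
    as a query over the source relations."""
    from_crm = set(src["crm_customers"])
    from_billing = {(name, city) for name, city, _balance in src["billing_accounts"]}
    return from_crm | from_billing


# The mapping M: one defining query per relation of the global schema G.
mapping = {"customer": customer_view}


def global_relation(name, src):
    """Answer a query over G by unfolding it into queries over the sources."""
    return mapping[name](src)


def customers_in(city, src):
    """A user query posed against the global schema only."""
    return sorted(name for name, c in global_relation("customer", src) if c == city)


print(customers_in("Delhi", source_db))
```

The user query never mentions the source relations. Under LAV the direction would be reversed, with each source described as a view over the global schema.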
Data transformation is the process of converting data from one format (e.g. a database file, XML document, or Excel sheet) to another. Because data often resides in different locations and formats across the enterprise, data transformation is necessary to ensure that data from one application or database is intelligible to other applications and databases, a critical feature for applications integration.

In a typical scenario where information needs to be shared, data is extracted from the source application or data warehouse, transformed into another format, and then loaded into the target location. Extraction, transformation, and loading (together known as ETL) are the central processes of data integration. Depending on the nature of the integration scenario, data may need to be merged, aggregated, enriched, summarized, or filtered.

The first step of data transformation is data mapping. Data mapping determines the relationship between the data elements of two applications and establishes instructions for how the data from the source application is transformed before it is loaded into the target application. In other words, data mapping produces the critical metadata that is needed before the actual data conversion takes place. For instance, in field mapping, the information in one application might be rendered in lowercase letters while another application stores information in uppercase letters. This means the data from the source application needs to be converted to uppercase letters before being loaded into the corresponding fields in the target application.
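The lowercase-to-uppercase field mapping described above can be sketched as a tiny extract, transform, load pipeline. The field names, the sample CSV text and the `FIELD_MAP` structure are assumptions chosen for illustration, not part of any particular ETL tool.

```python
import csv
import io

# Hypothetical source extract: the source application stores lowercase text.
SOURCE_CSV = "cust_name,city\nasha rao,pune\nravi kumar,delhi\n"

# Data mapping metadata: source field -> (target field, conversion function).
FIELD_MAP = {
    "cust_name": ("CUSTOMER_NAME", str.upper),
    "city": ("CITY", str.upper),
}


def extract(text):
    """Read rows from the source format (CSV here) into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, field_map):
    """Rename and convert each field according to the mapping."""
    return [
        {target: convert(row[source]) for source, (target, convert) in field_map.items()}
        for row in rows
    ]


def load(rows, target):
    """Append the converted rows to the target store (a list in this sketch)."""
    target.extend(rows)


target_table = []
load(transform(extract(SOURCE_CSV), FIELD_MAP), target_table)
print(target_table)
```

Keeping the mapping as data, separate from the conversion code, reflects the point that data mapping produces metadata before any conversion takes place.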
The structure of stored data may also vary between applications, requiring semantic mapping prior to the transformation process. For instance, two applications might store the same customer credit card information using slightly different structures.

4. Differentiate between database management systems (DBMS) and data mining.
Ans. Database Management System (DBMS): A DBMS is the software that manages data on physical storage devices.
Data Mining: Data mining is the process of discovering relationships among data in the database.
Area | DBMS | Data mining
Task | Extraction of detailed and summary data | Knowledge discovery of hidden patterns and insights
Type of result | Information | Insight and prediction
Method | Deduction (ask the question, verify the data) | Induction (build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years? | Who will buy a mutual fund in the next 6 months and why?
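The contrast between deduction and induction in the table can be sketched as follows. The table layout, the sample rows, the income figures and the single-threshold "model" are all assumptions made for this illustration. Real data mining tools use far richer models.

```python
import sqlite3

# Deduction (DBMS): ask a question and read the answer from stored facts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("Asha", "mutual fund", 2012), ("Ravi", "mutual fund", 2008), ("Meera", "insurance", 2013)],
)


def recent_fund_buyers(conn, current_year, years=3):
    """Who purchased mutual funds in the last `years` years?"""
    rows = conn.execute(
        "SELECT DISTINCT customer FROM purchases "
        "WHERE product = 'mutual fund' AND year > ? ORDER BY customer",
        (current_year - years,),
    )
    return [row[0] for row in rows]


# Induction (data mining): build a model from examples, apply it to new data.
# Hypothetical training examples: (income in thousands, bought a fund?).
history = [(20, False), (35, False), (50, True), (80, True)]


def fit_threshold(examples):
    """Pick the income cut-off that misclassifies the fewest training examples."""
    def errors(cutoff):
        return sum((income >= cutoff) != bought for income, bought in examples)
    return min(sorted(income for income, _ in examples), key=errors)


threshold = fit_threshold(history)


def likely_buyer(income):
    """Predict for a new customer; the learned cut-off is the 'why'."""
    return income >= threshold


print(recent_fund_buyers(conn, current_year=2013))
print(threshold, likely_buyer(60), likely_buyer(30))
```

The query can only return customers already in the table, while the fitted threshold gives an answer for customers who have never been seen before.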
Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not so obvious but useful information from large databases. The aim of data mining is to extract implicit, previously unknown and potentially useful (or actionable) patterns from data. Data mining consists of many up-to-date techniques such as classification (decision trees, naive Bayes classifier, k-nearest neighbor, and neural networks), clustering (k-means, hierarchical clustering, and density-based clustering), and association (one-dimensional, multidimensional, multilevel association, constraint-based association).
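As one small example of the association technique named above, the sketch below computes the support and confidence of a one-dimensional rule such as "bread implies milk". The basket data is invented for illustration, and `support` and `confidence` are helper names chosen here, not a library API.

```python
# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst.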
Data warehousing is defined as a process of centralized data management and retrieval. A data warehouse is a relational database system designed to support very large databases (VLDB) at a significantly higher level of performance and manageability. A data warehouse is an environment, not a product. It is an architectural construct that provides information that is hard to access or present in traditional operational data stores.

5. Differentiate between K-means and Hierarchical clustering.
Ans. Hierarchical clustering is the sort that you might apply when there is a "tree" structure to the data. Think of the classification of living things. At the top are all of them, then they split into plants, animals and other things such as fungi. Once you are on the animal branch, this splits into mammals, reptiles, etc., and you can keep going until you get down to individual species. AT NO TIME, when things have been split off from the rest of the data onto one of the branches, do subsets ever move to other branches. You might think about whether this is appropriate for your data. Once you have split your data into two sets this split is final, and the process only subdivides further: nothing from set one ever moves back into set two.

K-means clustering does not assume a tree structure. In its pure form you might ask the computer to split these data values into three groups or four groups, but you can't guarantee that merging two groups from the four-group solution will produce the same as the three-group solution.

If you have only two or three dimensions (or can sensibly reduce your data by factor analysis) you can plot it and see what sort of relationships you have. Are you looking for nice spherical clusters, or are long chains more suitable? You might consider that your data values were generated from multivariate normal random variables from groups with different means, and you might consider how best to identify these groups and their means.
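The point about three-group and four-group solutions can be checked on a toy example. The sketch below is a simplified illustration on one-dimensional data, with assumptions of its own: the k-means start points are spread evenly rather than chosen at random, and the hierarchical method merges the two clusters whose means are closest. All function names are hypothetical.

```python
def kmeans_1d(points, k, iterations=100):
    """Lloyd's k-means on 1-D data with a deterministic, evenly spread start."""
    data = sorted(points)
    n = len(data)
    centers = [float(data[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return [frozenset(c) for c in clusters if c]


def agglomerative_1d(points, k):
    """Bottom-up hierarchical clustering: merge the closest pair until k remain."""
    clusters = [[x] for x in sorted(points)]

    def mean(cluster):
        return sum(cluster) / len(cluster)

    while len(clusters) > k:
        # In sorted 1-D data the closest pair of cluster means is always adjacent.
        i = min(
            range(len(clusters) - 1),
            key=lambda j: mean(clusters[j + 1]) - mean(clusters[j]),
        )
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [frozenset(c) for c in clusters]


def is_nested(fine, coarse):
    """True if every cluster of the finer partition lies inside one coarser cluster."""
    return all(any(f <= c for c in coarse) for f in fine)


values = list(range(12))
hier_nested = is_nested(agglomerative_1d(values, 4), agglomerative_1d(values, 3))
kmeans_nested = is_nested(kmeans_1d(values, 4), kmeans_1d(values, 3))
print("hierarchical nested:", hier_nested)
print("k-means nested:", kmeans_nested)
```

On these evenly spaced values the hierarchical partitions are nested by construction, while the four-group k-means solution contains a group that straddles two groups of the three-group solution.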
Sometimes data values fall into such clear groups that almost all clustering methods will find the same clusters. Where the boundaries are fuzzy, the solutions may be very different.

I'll end with a little parable. Suppose I have a very willing idiot working for me, and I ask him to arrange my books nicely. He might do this by author or by subject, or by the colour of the cover, or the size of the book, or by weight, or by date of publication. If I simply ask for a "nice arrangement" I ought not to complain about any of these, and I might find one or more useful. If you just ask SPSS to use cluster analysis to produce a "nice arrangement" then, according to the method chosen, the order of the data and a possible random element, you might get one of many rather different nice arrangements, and the "best" of these depends on what you want the clustering for.

6. Differentiate between Web content mining and Web usage mining.
Ans. Web Content Mining
Web content mining targets knowledge discovery in which the main objects are collections of multimedia documents such as images, video, and audio, which are embedded in or linked to web pages. It differs from data mining because Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining can be viewed from two points of view: the agent-based approach and the database approach. The first aims at improving information finding and filtering. The second aims at modeling the data on the Web in a more structured form, so that standard database querying mechanisms and data mining applications can be applied to analyze it.

Web Usage Mining
Web usage mining focuses on techniques that can predict the behavior of users while they are interacting with the WWW. It discovers user navigation patterns from web data, trying to find useful information in the secondary data derived from users' interactions while surfing the Web. Web usage mining collects data from Web log records to discover user access patterns of web pages. Several research projects and commercial tools analyze those patterns for different purposes. The resulting knowledge can be used in personalization, system improvement, site modification, business intelligence and usage characterization.

The only information left behind by many users visiting a Web site is the path through the pages they have accessed. Most Web information retrieval tools use only the textual information and ignore the link information, which could be very valuable. In general, four kinds of data mining techniques are applied in the web mining domain to discover user navigation patterns:
Association rule mining
Sequential pattern mining
Clustering
Classification
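A first step toward the navigation patterns mentioned above can be sketched by rebuilding each visitor's path from log records and counting page-to-page transitions. The log format used here (visitor, timestamp, page) and the sample lines are assumptions for illustration. Real server logs carry more fields and need session splitting.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified log records: "visitor timestamp page".
LOG_LINES = [
    "10.0.0.1 1 /home",
    "10.0.0.2 2 /home",
    "10.0.0.1 3 /products",
    "10.0.0.2 4 /products",
    "10.0.0.1 5 /cart",
    "10.0.0.2 6 /contact",
]


def paths_by_visitor(lines):
    """Rebuild each visitor's path through the site in time order."""
    visits = defaultdict(list)
    for line in lines:
        visitor, timestamp, page = line.split()
        visits[visitor].append((int(timestamp), page))
    return {v: [page for _, page in sorted(hits)] for v, hits in visits.items()}


def transition_counts(paths):
    """Count how often one page is followed directly by another."""
    counts = Counter()
    for pages in paths.values():
        counts.update(zip(pages, pages[1:]))
    return counts


paths = paths_by_visitor(LOG_LINES)
transitions = transition_counts(paths)
print(transitions.most_common(1))
```

Frequent transitions like these are the raw material for sequential pattern mining, and the same per-visitor paths could feed clustering or classification of visitors.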
