
MC0088- Data Mining

(Book ID: B1009)

Q1. What is operational intelligence?

Ans: Operational intelligence (OI) is a form of real-time, dynamic business analytics that delivers visibility and insight into business operations. The purpose of OI is to monitor business activities and to identify and detect situations relating to inefficiencies, opportunities, and threats. OI helps to quantify the following:
- Efficiency of the business activities
- Impact of IT infrastructure and unexpected events on the business activities
- Execution of the business activities contributing to revenue gains or losses

Features

Different operational intelligence solutions may use many different technologies and be implemented in different ways. The common features of an operational intelligence solution are:
- Real-time monitoring
- Real-time situation detection
- Real-time dashboards for different user roles

- Correlation of events
- Industry-specific dashboards
- Multidimensional analysis
  - Root cause analysis
  - Time series and trending analysis

Comparison

OI is often linked to or compared with business intelligence (BI) or real-time business intelligence, in the sense that both help make sense out of large amounts of information. But there are some basic differences: OI is primarily activity-centric, whereas BI is primarily data-centric. (As with most technologies, each of these could be suboptimally coerced to perform the other's task.) OI is, by definition, real-time, unlike BI, which is traditionally an after-the-fact, report-based approach to identifying patterns, and unlike real-time BI, which relies on a database as the sole source of events.
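To make the real-time monitoring and situation-detection features listed above concrete, the following is a minimal sketch (not part of the original answer) of how an OI tool might watch a stream of operational events and raise an alert when a metric crosses a threshold. The event format, the threshold value and the alert handling are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Event:
    # Hypothetical operational event: which business activity produced it and how long it took
    activity: str
    duration_ms: float

def monitor(events: Iterable[Event], threshold_ms: float = 500.0) -> list[str]:
    """Scan an event stream and flag activities whose latency exceeds the threshold."""
    alerts = []
    for e in events:
        if e.duration_ms > threshold_ms:          # real-time situation detection
            alerts.append(f"ALERT: {e.activity} took {e.duration_ms:.0f} ms")
    return alerts

if __name__ == "__main__":
    stream = [Event("order-checkout", 230), Event("payment-gateway", 1240), Event("order-checkout", 180)]
    for alert in monitor(stream):
        print(alert)
```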

Q2. What is Business Intelligence? Explain the components of BI architecture.

Ans: Business intelligence is an environment in which business users receive data that is reliable, consistent, understandable, easily manipulated and timely. With this data, business users are able to conduct analyses that yield an overall understanding of where the business has been, where it is now and where it will be in the near future. Business intelligence serves two main purposes: it monitors the financial and operational health of the organization (reports, alerts, alarms, analysis tools, key performance indicators and dashboards), and it regulates the operation of the organization by providing two-way integration with operational systems and information feedback analysis. There are various definitions given by experts; some of them are given below:
- Converting data into knowledge and making it available throughout the organization are the jobs of the processes and applications known as Business Intelligence.
- BI is a term that encompasses a broad range of analytical software and solutions for gathering, consolidating, analyzing and providing access to information in a way that is supposed to let the users of an enterprise make better business decisions.

Business Intelligence Infrastructure

Business organizations can gain a competitive advantage with a well-designed business intelligence (BI) infrastructure. Think of the BI infrastructure as a set of layers that begin with the operational systems' information and metadata and end in the delivery of business intelligence to various business user communities. Based on the overall requirements of business intelligence, the data integration layer is required to extract, cleanse and transform data into load files for the information warehouse. This layer begins with transaction-level operational data and metadata about these operational systems. Typically this data integration is done using a relational staging database and flat-file extracts from source systems. The product of a good data-staging layer is high-quality data, a reusable infrastructure and metadata supporting both business and technical users.
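As an illustration of the extract-cleanse-transform step described above, here is a minimal staging-job sketch in Python (not part of the original answer): it reads raw transaction rows from a source extract, rejects or normalizes defective records, and writes a load file for the information warehouse. The file names, column names and cleansing rules are invented for the example.

```python
import csv

def cleanse(row: dict) -> dict | None:
    """Drop rows without a customer id and normalise the amount and region fields."""
    if not row.get("customer_id"):
        return None                               # reject incomplete records
    row["amount"] = round(float(row["amount"]), 2)
    row["region"] = row["region"].strip().upper()
    return row

def stage(source_csv: str, load_file: str) -> None:
    """Extract from a source file, cleanse each row, and write the warehouse load file."""
    with open(source_csv, newline="") as src, open(load_file, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["customer_id", "region", "amount"])
        writer.writeheader()
        for raw in reader:
            clean = cleanse(raw)
            if clean is not None:
                writer.writerow({k: clean[k] for k in ("customer_id", "region", "amount")})

# stage("sales_extract.csv", "warehouse_load.csv")   # hypothetical file names
```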

The information warehouse is usually developed incrementally over time and is architected to include key business variables and business metrics in a structure that meets all business analysis questions required by the business groups.
1. The information warehouse layer consists of relational and/or OLAP cube services that allow business users to gain insight into their areas of responsibility in the organization.
2. Customer intelligence relates to customer, service, sales and marketing information viewed along time periods, location/geography, product and customer variables.
3. Business decisions that can be supported with customer intelligence range from pricing, forecasting, promotion strategy and competitive analysis to up-sell strategy and customer service resource allocation.
4. Operational intelligence relates to finance, operations, manufacturing, distribution, logistics and human resource information viewed along time periods, location/geography, product, project, supplier, carrier and employee.


5. The most visible layer of the business intelligence infrastructure is the applications layer, which delivers the information to business users.
6. Business intelligence requirements include scheduled report generation and distribution, query and analysis capabilities to pursue special investigations, and graphical analysis permitting trend identification. This layer should enable business users to interact with the information to gain new insight into the underlying business variables to support business decisions.
7. Presenting business intelligence on the Web through a portal is gaining considerable momentum. Portals are usually organized by communities of users, such as suppliers, customers, employees and partners.
8. Portals can reduce the overall infrastructure costs of an organization as well as deliver great self-service and information access capabilities.
9. Web-based portals are becoming commonplace as a single personalized point of access for key business information.
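Items 2, 4 and 6 above describe information viewed along time, geography and product dimensions and the query capabilities that expose it. Below is a minimal sketch of such a multidimensional query, using pandas as a stand-in for the relational/OLAP cube services; the fact table and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical fact table from the information warehouse
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024],
    "region":  ["EMEA", "APAC", "EMEA", "EMEA", "APAC"],
    "product": ["fund-a", "fund-a", "fund-a", "fund-b", "fund-b"],
    "revenue": [120.0, 80.0, 150.0, 60.0, 90.0],
})

# "Slice and dice": total revenue along the time x geography x product dimensions
cube = sales.groupby(["year", "region", "product"])["revenue"].sum()
print(cube)

# Roll up one dimension: revenue by year and region only
print(cube.groupby(level=["year", "region"]).sum())
```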

Q3. Differentiate between database management systems (DBMS) and data mining.

Ans: A Database Management System (DBMS) is the software that manages data on physical storage devices. Data mining is the process of discovering relationships among data in the database. The two areas can be contrasted as follows:

DBMS
- Task: extraction of detailed and summary data
- Type of result: information
- Method: deduction (ask the question, verify against the data)
- Example question: Who purchased mutual funds in the last 3 years?

Data mining
- Task: knowledge discovery of hidden patterns and insights
- Type of result: insight and prediction
- Method: induction (build the model, apply it to new data, get the result)
- Example question: Who will buy a mutual fund in the next 6 months and why?
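To make the deduction/induction contrast concrete, here is a small sketch (not from the original answer): the DBMS side answers a precise question with a SQL query over stored facts, while the data mining side builds a model from past data and applies it to new customers. The table, columns and training data are invented, and the decision tree is just a stand-in for any mining technique.

```python
import sqlite3
from sklearn.tree import DecisionTreeClassifier

# --- DBMS / deduction: ask a precise question, the answer is verified stored data ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (customer TEXT, product TEXT, year INTEGER)")
con.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                [("alice", "mutual_fund", 2023), ("bob", "bond", 2022), ("carol", "mutual_fund", 2021)])
buyers = con.execute(
    "SELECT customer FROM purchases WHERE product = 'mutual_fund' AND year >= 2022"
).fetchall()
print("Bought a mutual fund in the last 3 years:", [b[0] for b in buyers])

# --- Data mining / induction: build a model on past data, apply it to new data ---
# Features: [age, income_band]; label: 1 = bought a mutual fund, 0 = did not
X_train = [[34, 2], [51, 3], [23, 1], [45, 3], [29, 1]]
y_train = [1, 1, 0, 1, 0]
model = DecisionTreeClassifier().fit(X_train, y_train)
print("Predicted future buyers:", model.predict([[40, 3], [25, 1]]))
```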

Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not-so-obvious but useful information from large databases. The aim of data mining is to extract implicit, previously unknown and potentially useful (or actionable) patterns from data. Data mining uses many up-to-date techniques, such as classification (decision trees, the naive Bayes classifier, k-nearest neighbour and neural networks), clustering (k-means, hierarchical clustering and density-based clustering) and association (one-dimensional, multidimensional, multilevel and constraint-based association).

Data warehousing is defined as a process of centralized data management and retrieval. A data warehouse is a relational database system designed to support very large databases (VLDB) at a significantly higher level of performance and manageability. A data warehouse is an environment, not a product. It is an architectural construct of information that is hard to access or present in traditional operational data stores.

Q4. What is a Neural Network? Explain in detail.

Ans: An Artificial Neural Network (ANN) is an information-processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information-processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. Neural networks are made up of many artificial neurons; an artificial neuron is an electronically modelled biological neuron. The number of neurons used depends on the task at hand: it could be as few as three or as many as several thousand. There are many different ways of connecting artificial neurons together to create a neural network, and there are different types of neural networks, each of which has strengths particular to its applications. The abilities of different networks can be related to their structure, dynamics and learning methods.
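As an illustration of the "many simple interconnected neurons" idea described above, here is a minimal sketch (not part of the original answer) of a single artificial neuron and a tiny two-layer feed-forward network using NumPy. The weights, bias values, activation function and input are arbitrary; a real network would learn its weights from data.

```python
import numpy as np

def sigmoid(x):
    """Common activation function: squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of its inputs followed by an activation."""
    return sigmoid(np.dot(inputs, weights) + bias)

# A tiny feed-forward network: 3 inputs -> 2 hidden neurons -> 1 output neuron
x = np.array([0.5, -1.2, 0.3])
W_hidden = np.array([[0.4, -0.6], [0.1, 0.8], [-0.5, 0.2]])   # 3x2 weight matrix
b_hidden = np.array([0.0, 0.1])
W_out = np.array([0.7, -0.3])
b_out = 0.05

hidden = sigmoid(x @ W_hidden + b_hidden)   # outputs of the two hidden neurons
output = neuron(hidden, W_out, b_out)       # final prediction in (0, 1)
print(output)
```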

Q5. What is the partition algorithm? Explain with the help of a suitable example.

Ans:


The partition algorithm is based on the observation that the frequent sets are normally very few in number compared to the set of all itemsets. As a result, if we partition the set of transactions into smaller segments such that each segment can be accommodated in main memory, then we can compute the set of frequent sets of each of these partitions. It is assumed that these sets (the sets of local frequent sets) contain a reasonably small number of itemsets. Hence, we can read the whole database (the unsegmented one) once to count the support of the set of all local frequent sets.

The partition algorithm uses two scans of the database to discover all frequent sets. In one scan, it generates the set of all potentially frequent itemsets. This set is a superset of all frequent itemsets, i.e., it may contain false positives, but no false negatives are reported. During the second scan, counters for each of these itemsets are set up and their actual support is measured in one scan of the database.

The algorithm executes in two phases. In the first phase, the partition algorithm logically divides the database into a number of non-overlapping partitions. The partitions are considered one at a time and all frequent itemsets for that partition are generated. Thus, if there are n partitions, Phase I of the algorithm takes n iterations. At the end of Phase I, these frequent itemsets are merged to generate a set of all potential frequent itemsets: the local frequent itemsets of the same length from all n partitions are combined to generate the global candidate itemsets. In Phase II, the actual support for these itemsets is counted and the frequent itemsets are identified. The algorithm reads the entire database once during Phase I and once during Phase II. The partition sizes are chosen such that each partition can be accommodated in main memory, so that the partitions are read only once in each phase.

A partition of the database is any subset of the transactions contained in the database, and any two partitions are non-overlapping. The local support of an itemset in a partition is the fraction of the transactions in that partition containing the itemset. A local frequent itemset is an itemset whose local support in a partition is at least the user-defined minimum support σ; a local frequent itemset may or may not be frequent in the context of the entire database.

Partition Algorithm

P = partition_database(T); n = number of partitions
// Phase I
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    Li = all frequent itemsets of Ti, generated with the Apriori method in main memory
end
// Merge phase
for (k = 2; L_k^i ≠ ∅ for some i = 1, 2, ..., n; k++) do begin
    C_k^G = L_k^1 ∪ L_k^2 ∪ ... ∪ L_k^n
end
// Phase II
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    for all candidates c ∈ C^G compute s(c)_Ti
end
L^G = {c ∈ C^G | s(c)_T ≥ σ}
Answer = L^G

The partition algorithm is based on the premise that the size of the global candidate set is considerably smaller than the set of all possible itemsets. The intuition behind this is that the size of the global candidate set is bounded by n times the size of the largest of the sets of locally frequent sets. For sufficiently large partition sizes, the number of local frequent itemsets is likely to be comparable to the number of frequent itemsets generated for the entire database. If the data characteristics are uniform across partitions, then large numbers of the itemsets generated for individual partitions may be common.
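The following is a minimal Python sketch of the two-phase idea (not from the original text): each partition is mined for its local frequent itemsets, with a brute-force subset counter standing in for the Apriori step; the locally frequent itemsets are merged into the global candidate set; and a second pass over the whole database counts the candidates' true support. The sample database, partition count and minimum support are illustrative.

```python
from itertools import combinations
from collections import Counter

def local_frequent(partition, min_support):
    """Phase I on one partition: count all itemsets by brute force (stand-in for Apriori)."""
    counts = Counter()
    for transaction in partition:
        for k in range(1, len(transaction) + 1):
            for itemset in combinations(sorted(transaction), k):
                counts[itemset] += 1
    needed = min_support * len(partition)
    return {itemset for itemset, c in counts.items() if c >= needed}

def partition_algorithm(transactions, n_partitions, min_support):
    size = -(-len(transactions) // n_partitions)          # ceiling division
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase I + merge: the union of the local frequent sets is the global candidate set
    candidates = set()
    for p in partitions:
        candidates |= local_frequent(p, min_support)

    # Phase II: one more scan of the whole database to count the candidates' real support
    global_counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        for c in candidates:
            if set(c) <= items:
                global_counts[c] += 1
    needed = min_support * len(transactions)
    return {c for c in candidates if global_counts[c] >= needed}

# Tiny illustrative database: 6 transactions, 2 partitions, 30% minimum support
T = [{1, 2, 3}, {2, 4}, {1, 2}, {2, 3}, {1, 3, 4}, {2, 3, 4}]
print(partition_algorithm(T, n_partitions=2, min_support=0.3))
```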

Example

Let us take the same database T given in Example 6.2 and the same minimum support σ. For the sake of illustration, let us partition T into three partitions T1, T2 and T3, each containing 5 transactions. The first partition T1 contains transactions 1 to 5, T2 contains transactions 6 to 10 and, similarly, T3 contains transactions 11 to 15. We fix the local supports equal to the given support of 20%, i.e., σ1 = σ2 = σ3 = σ = 20%. With 5 transactions per partition, any itemset that appears in just one of the transactions of a partition is already a local frequent set in that partition.

The local frequent sets of the partition T1 are the itemsets X such that s(X)_T1 ≥ σ1:

L1 := {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {1,5}, {1,6}, {1,8}, {2,3}, {2,4}, {2,8}, {4,5}, {4,7}, {4,8}, {5,6}, {5,8}, {5,7}, {6,7}, {6,8}, {1,6,8}, {1,5,6}, {1,5,8}, {2,4,8}, {4,5,7}, {5,6,8}, {5,6,7}, {1,5,6,8}}

Similarly,

L2 := {{2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {2,3}, {2,4}, {2,6}, {2,7}, {2,9}, {3,4}, {3,5}, {3,7}, {5,7}, {6,7}, {6,9}, {7,9}, {2,3,4}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {2,6,7,9}}

L3 := {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,7}, {2,3}, {2,4}, {2,6}, {2,7}, {2,9}, {3,5}, {3,7}, {3,9}, {4,6}, {4,7}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {1,3,5}, {1,3,7}, {1,5,7}, {2,3,9}, {2,4,6}, {2,4,7}, {3,5,7}, {4,6,7}, {5,6,8}, {1,3,5,7}, {2,4,6,7}}

In Phase II we have the candidate set C := L1 ∪ L2 ∪ L3:

C := {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,6}, {1,7}, {1,8}, {2,3}, {2,4}, {2,6}, {2,7}, {2,8}, {2,9}, {3,4}, {3,5}, {3,7}, {3,9}, {4,5}, {4,6}, {4,7}, {4,8}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {6,9}, {7,9}, {1,3,5}, {1,3,7}, {1,5,6}, {1,5,7}, {1,5,8}, {1,6,8}, {2,3,4}, {2,3,9}, {2,4,6}, {2,4,7}, {2,4,8}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {4,5,7}, {4,6,7}, {5,6,8}, {5,6,7}, {1,5,6,8}, {2,6,7,9}, {1,3,5,7}, {2,4,6,7}}

The database is then read once to compute the global support of the sets in C and obtain the final set of frequent sets.

Q6. Describe the following with respect to Web Mining:
a. Categories of Web Mining (5)
b. Applications of Web Mining (5)

Ans:

a. Categories of Web Mining

Web mining is broadly defined as the discovery and analysis of useful information from the World Wide Web. Web mining is divided into three categories:
1. Web content mining
2. Web structure mining
3. Web usage mining


All three categories focus on the process of knowledge discovery of implicit, previously unknown and potentially useful information from the web; each of them focuses on different mining objects of the web. Content mining is used to search, collate and examine data by search engine algorithms (this is done using web robots). Structure mining is used to examine the structure of a particular website and to collate and analyze related data. Usage mining is used to examine data related to the client end, such as the profiles of the visitors of the website, the browser used, the specific time and period that the site was being surfed, the specific areas of interest of the visitors, and related data from the form data submitted during web transactions and feedback.

Web Content Mining

Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents, such as images, video and audio, which are embedded in or linked to web pages. It differs from data mining because web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining can be approached from two points of view: the agent-based approach or the database approach. The first approach aims at improving information finding and filtering. The second approach aims at modelling the data on the Web into a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

Web Structure Mining

Web structure mining focuses on analysis of the link structure of the web, and one of its purposes is to identify the more preferable documents. The different objects are linked in some way, and the intuition is that a hyperlink from document A to document B implies that the author of document A thinks document B contains worthwhile information. Web structure mining helps in discovering similarities between web sites, discovering important sites for a particular topic or discipline, or discovering web communities. Appropriate handling of the links can reveal correlations and thereby improve the predictive accuracy of the learned models. The goal of web structure mining is to generate a structural summary of the web site and web page. Technically, web content mining mainly focuses on the structure of the inner document, while web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, web structure mining categorizes web pages and generates information such as the similarity and relationship between different web sites. Web structure mining can also take another direction: discovering the structure of the web document itself. This type of structure mining can be used to reveal the structure (schema) of web pages, which is useful for navigation and makes it possible to compare and integrate web page schemas. It also facilitates introducing database techniques for accessing information in web pages by providing a reference schema.
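As an illustration of link-structure analysis, here is a minimal sketch (not part of the original answer) that scores pages from their hyperlink graph with a simplified PageRank-style iteration, so that pages linked to by important pages become important themselves. The tiny graph, the damping factor and the iteration count are invented for the example.

```python
# Hypothetical hyperlink graph: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def rank_pages(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scoring: a page is important if important pages link to it."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing) if outgoing else 0.0
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

for page, score in sorted(rank_pages(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```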


Web Usage Mining

Web usage mining focuses on techniques that can predict the behaviour of users while they are interacting with the WWW. Web usage mining discovers user navigation patterns from web data; it tries to extract useful information from the secondary data derived from the interactions of users while surfing the Web. Web usage mining collects data from web log records to discover the user access patterns of web pages (a minimal log-mining sketch is given after the applications list below). Several research projects and commercial tools analyze those patterns for different purposes, and the resulting knowledge can be utilized in personalization, system improvement, site modification, business intelligence and usage characterization. The only information left behind by many users visiting a web site is the path through the pages they have accessed. Most web information retrieval tools use only the textual information and ignore the link information, which can be very valuable. In general, four kinds of data mining techniques are applied in the web mining domain to discover user navigation patterns:
- Association rule mining
- Sequential pattern mining
- Clustering
- Classification

b. Applications of Web Mining

With the rapid growth of the World Wide Web, web mining has become a very hot and popular topic in web research. E-commerce and e-services are claimed to be the killer applications for web mining, and web mining now also plays an important role for e-commerce websites and e-services in understanding how their websites and services are used and in providing better services for their customers and users. A few applications are:
- E-commerce customer behaviour analysis
- E-commerce transaction analysis
- E-commerce website design
- E-banking
- M-commerce
- Web advertisement
- Search engines
- Online auctions
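Below is the log-mining sketch referred to above: a minimal example (not from the original text) that takes simplified web log records, groups them into per-visitor sessions and counts the most common page-to-page transitions, a first step toward discovering navigation patterns. The log format and page names are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical simplified log records: (visitor_id, page), already ordered by time
log = [
    ("u1", "/home"), ("u1", "/products"), ("u1", "/cart"),
    ("u2", "/home"), ("u2", "/products"), ("u2", "/products/fund-a"),
    ("u3", "/home"), ("u3", "/products"), ("u3", "/cart"),
]

# Group page requests into per-visitor sessions
sessions = defaultdict(list)
for visitor, page in log:
    sessions[visitor].append(page)

# Count page-to-page transitions across all sessions (a simple navigation pattern)
transitions = Counter()
for pages in sessions.values():
    for a, b in zip(pages, pages[1:]):
        transitions[(a, b)] += 1

for (a, b), count in transitions.most_common(3):
    print(f"{a} -> {b}: {count} visitors")
```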

c. Web Mining Software

Open-source software for web mining includes RapidMiner, which provides modules for text clustering, text categorization, information extraction, named entity recognition and sentiment analysis. RapidMiner is used, for example, in applications like automated news filtering for personalized news surveys. It is also used in automated content-based document and e-mail routing, and in sentiment analysis of web blogs and product reviews in internet discussion groups. Information extraction from web pages also utilizes RapidMiner to create mash-ups that combine information from various web services and web pages, and to perform web log mining and web usage mining.

SAS Data Quality Solution provides an enterprise solution for profiling, cleansing, augmenting and integrating data to create consistent, reliable information. With SAS Data Quality Solution you can automatically incorporate data quality into data integration and business intelligence projects to dramatically improve the returns on your organization's strategic initiatives.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also well suited for developing new machine learning schemes. Weka is open-source software issued under the GNU General Public License.


