Q: Discuss the k-means clustering algorithm in detail.
Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data
points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data
points that are in the same group should have similar properties and/or features, while data points in
different groups should have highly dissimilar properties and/or features. Clustering is a method of
unsupervised learning and is a common technique for statistical data analysis used in many fields.
In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing
what groups the data points fall into when we apply a clustering algorithm.
Let’s understand this with an example. Suppose you are the head of a rental store and wish to
understand the preferences of your customers in order to scale up your business. Is it possible for you
to look at the details of each customer and devise a unique business strategy for each one of them?
Definitely not. What you can do is cluster all of your customers into, say, 10 groups based on their
purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is
what we call clustering.
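The customer grouping described above can be sketched with k-means, which repeatedly assigns each point to its nearest centroid and then moves each centroid to the mean of its assigned points. This is a minimal sketch, not a production implementation: the function name kmeans, the two-feature customer tuples (visits per month, average spend), and the choice of 2 groups instead of 10 are all assumptions made for illustration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random (an assumed init scheme).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: give each point the index of its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = kmeans(customers, k=2)
```

With this toy data the first three customers end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialisation.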
An association rule consists of an antecedent and a consequent, both of which are lists of items. Note
that the implication here is co-occurrence, not causality. For a given rule, the itemset is the list of all
the items in the antecedent and the consequent.
Support
Confidence
Lift
Others: Affinity, Leverage
For a rule with antecedent A and consequent B, support measures how much of the historical data
supports the rule, and confidence measures how confident we are that the rule holds.
Support is the fraction of rows containing both A and B, that is, the joint probability of A and B.
Confidence is the fraction of rows containing A that also contain B, that is, the conditional probability
of B given A.
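The definitions above translate directly into counting. The sketch below computes support and confidence for a rule A → B over a small, made-up list of transactions. Lift is listed above but not defined in the text; the sketch assumes the usual definition, confidence divided by the support of B alone. The function name rule_metrics and the sample baskets are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_b = sum(1 for t in transactions if b <= set(t))
    count_ab = sum(1 for t in transactions if (a | b) <= set(t))
    # Support: fraction of rows containing both A and B.
    support = count_ab / n
    # Confidence: among rows containing A, the fraction that also contain B.
    confidence = count_ab / count_a if count_a else 0.0
    # Lift (assumed definition): confidence relative to how common B is overall.
    lift = confidence / (count_b / n) if count_b else 0.0
    return support, confidence, lift


# Made-up shopping baskets for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence, lift = rule_metrics(transactions, {"bread"}, {"milk"})
```

Here bread and milk appear together in 3 of 5 baskets (support 0.6), and milk appears in 3 of the 4 baskets containing bread (confidence 0.75). A lift below 1, as in this data, suggests the two items co-occur slightly less often than they would if independent.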
Q: What do you understand by the term multidimensional
analysis? Discuss OLAP models in detail. Discuss essential
differences between MOLAP and ROLAP.
Multidimensional analysis is the analysis of dimension objects organized in meaningful hierarchies.
Multidimensional analysis allows users to observe data from various viewpoints. This enables them to
spot trends or exceptions in the data.
In Web Intelligence you can use drill up or down to perform multi-dimensional analysis.
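Drilling up and down amounts to aggregating the same measure at coarser or finer levels of a dimension hierarchy. The following sketch is independent of Web Intelligence or any OLAP product; the sales rows, the year/quarter hierarchy, and the helper name aggregate are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions (year, quarter, region) and one measure.
sales = [
    {"year": 2023, "quarter": "Q1", "region": "East", "amount": 100},
    {"year": 2023, "quarter": "Q1", "region": "West", "amount": 150},
    {"year": 2023, "quarter": "Q2", "region": "East", "amount": 120},
    {"year": 2024, "quarter": "Q1", "region": "East", "amount": 130},
]


def aggregate(rows, dimensions, measure="amount"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Drill up: view totals at the coarse "year" level.
by_year = aggregate(sales, ["year"])
# Drill down: split each year into its quarters.
by_year_quarter = aggregate(sales, ["year", "quarter"])
```

Changing the list of dimensions is what gives the user a different viewpoint on the same data.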
A data mart is a condensed version of a data warehouse and is designed for use by a specific department,
unit or set of users in an organization, e.g., marketing, sales, HR or finance. It is often controlled by a
single department in the organization.
A data mart usually draws data from only a few sources compared to a data warehouse. Data marts are
small in size and are more flexible than a data warehouse.
There are three types of data mart:
1. Dependent: A dependent data mart is created by drawing data from an existing central data
warehouse.
2. Independent: An independent data mart is created without the use of a central data warehouse,
drawing data directly from operational sources, external sources or both.
3. Hybrid: A hybrid data mart can take data from both data warehouses and operational systems.
Data mining is a multi-disciplinary skill that uses machine learning, statistics, AI and database technology.
The insights derived via data mining can be used for marketing, fraud detection, scientific discovery,
etc.
Data mining is also called knowledge discovery, knowledge extraction, data/pattern analysis,
information harvesting, etc.
Types of Data
Relational databases
Data warehouses
Advanced DB and information repositories
Object-oriented and object-relational databases
Transactional and Spatial databases
Heterogeneous and legacy databases
Multimedia and streaming database
Text databases
Text mining and Web mining
Data Mining Techniques
1. Classification:
This analysis is used to retrieve important and relevant information about data, and metadata. This data
mining method helps to classify data in different classes.
2. Clustering:
Clustering analysis is a data mining technique for identifying data items that are similar to each other.
This process helps to understand the differences and similarities between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship between
variables. It is used to identify the likelihood of a specific variable, given the presence of other variables.
4. Association Rules:
This data mining technique helps to find the association between two or more Items. It discovers a
hidden pattern in the data set.
5. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset which do not
match an expected pattern or expected behavior. This technique can be used in a variety of domains,
such as intrusion detection, fraud detection or fault detection. Outlier detection is also called outlier
analysis or outlier mining.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data
over a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trends, sequential patterns,
clustering and classification. It analyzes past events or instances in the right sequence to predict a
future event.
Example
A bank wants to find new ways to increase revenues from its credit card operations. It wants to
check whether usage would double if fees were halved.
The bank has multiple years of records on average credit card balances, payment amounts, credit limit
usage, and other key parameters. It creates a model to check the impact of the proposed new business
policy. The results show that cutting fees in half for a targeted customer base could increase
revenues by $10 million.
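Of the techniques listed above, outlier detection is one of the easiest to illustrate. The sketch below flags values that lie far from the mean, measured in standard deviations (a z-score test). This is only one of several possible approaches and is not prescribed by the text; the function name find_outliers, the threshold of 2.0, and the transaction amounts are assumptions.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the (assumed) threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing deviates from the pattern
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Made-up card transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = find_outliers(amounts)
```

In a fraud-detection setting the flagged transaction would be passed on for review rather than rejected automatically.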
Q: Classification and Clustering
Classification is the process of learning a model that describes different predetermined classes
of data. It is a two-step process, comprising a learning step and a classification step. In the
learning step, a classification model is constructed; in the classification step, the constructed
model is used to predict the class labels for given data.
For example, in a banking application, a customer who applies for a loan may be classified as
safe or risky according to his/her age and salary. This type of activity is also called supervised
learning. The constructed model can be used to classify new data. The learning step is
accomplished using an already defined training set of data. Each record in the training data is
associated with an attribute referred to as a class label, which signifies the class the record
belongs to. The produced model could be in the form of a decision tree or a set of rules.
A decision tree is a graphical depiction of the classification rules that lead to each class.
Regression is closely related to classification. Regression is used when the value of a numeric
variable is to be predicted from the tuple, rather than mapping the tuple of data from a relation
to a definite class. Some common classification algorithms are decision trees, neural networks,
logistic regression, etc.
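The two steps can be sketched with the loan example. The learning step below fits a one-level decision tree (a single salary threshold) to labelled training records, and the classification step applies it to a new applicant. This is a deliberately simplified sketch: it uses salary only and ignores age, and the function names, the training records and the salary figures are all hypothetical.

```python
def learn_stump(records):
    """Learning step: choose the salary threshold with the fewest training errors."""
    best_threshold, best_errors = None, len(records) + 1
    for candidate in sorted({salary for _, salary, _ in records}):
        errors = sum(
            1
            for _, salary, label in records
            if ("safe" if salary >= candidate else "risky") != label
        )
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


def classify(threshold, salary):
    """Classification step: apply the learned rule to a new applicant."""
    return "safe" if salary >= threshold else "risky"


# Hypothetical training set: (age, salary, class label).
training = [
    (25, 20000, "risky"),
    (40, 30000, "risky"),
    (35, 60000, "safe"),
    (50, 80000, "safe"),
]
threshold = learn_stump(training)
prediction = classify(threshold, 70000)
```

The learned threshold plays the role of the model: it is built once from the labelled training set and then reused for every new record.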
Clustering is a technique for organizing a set of data into clusters, where the objects that
reside inside a cluster have high similarity to each other and the objects of two different clusters
are dissimilar. Here the two clusters can be considered disjoint. The main goal of
clustering is to divide the whole data set into multiple clusters. Unlike in classification, here
the class labels of objects are not known beforehand, so clustering is a form of unsupervised
learning.
In clustering, the similarity between two objects is measured by a similarity function, typically
based on the distance between the two objects. The shorter the distance, the higher the similarity;
conversely, the longer the distance, the higher the dissimilarity.
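A common choice of distance is the Euclidean distance, although the text does not name a specific measure. The short sketch below, with made-up two-dimensional objects, shows how a smaller distance corresponds to higher similarity.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical objects described by two numeric features.
a, b, c = (1, 2), (2, 3), (8, 9)
d_ab = euclidean_distance(a, b)  # a and b are close, so they are similar
d_ac = euclidean_distance(a, c)  # a and c are far apart, so they are dissimilar
```

Other measures, such as Manhattan distance or cosine similarity, can be substituted depending on the type of data.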
Classification and clustering are both methods used in data mining to analyze data sets
and divide them on the basis of particular classification rules or the association between
objects. Classification categorizes the data with the help of provided training data. Clustering,
on the other hand, uses similarity measures to group the data.
Nodes
Each individual server in a Teradata system is referred to as a node. Each node has its own operating
system, CPU, memory, copy of the Teradata RDBMS software and disk space.
There are several types of data analysis techniques that exist based on business and technology. The
major types of data analysis are:
Text Analysis
Statistical Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
Text Analysis
Text Analysis is also referred to as Data Mining. It is a method of discovering patterns in large data sets
using databases or data mining tools. It is used to transform raw data into business information. Business
Intelligence tools available in the market are used to make strategic business decisions. Overall, it
offers a way to extract and examine data, derive patterns and finally interpret the data.
Statistical Analysis
Statistical Analysis shows "What happened?" by using past data in the form of dashboards. Statistical
Analysis includes the collection, analysis, interpretation, presentation, and modeling of data. It analyses
a set of data or a sample of data. There are two categories of this type of analysis: Descriptive Analysis
and Inferential Analysis.
Descriptive Analysis
analyses complete data or a sample of summarized numerical data. It shows mean and deviation for
continuous data whereas percentage and frequency for categorical data.
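The summaries mentioned above can be computed with Python's standard library. In this minimal sketch the customer ages and subscription plans are made-up sample data, and the population standard deviation is assumed as the measure of deviation.

```python
import statistics
from collections import Counter

# Continuous data: summarised by mean and standard deviation.
ages = [23, 35, 31, 46, 35]
mean_age = statistics.mean(ages)
std_age = statistics.pstdev(ages)  # population standard deviation (assumed)

# Categorical data: summarised by frequency and percentage.
plans = ["basic", "premium", "basic", "basic", "premium"]
frequency = Counter(plans)
percentage = {plan: 100 * count / len(plans) for plan, count in frequency.items()}
```

Here the mean age is 34, and the plans split into 3 basic (60%) and 2 premium (40%).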
Inferential Analysis
analyses a sample drawn from the complete data. In this type of analysis, different samples can lead to
different conclusions from the same data.
Diagnostic Analysis
Diagnostic Analysis shows "Why did it happen?" by finding the cause behind the insights found in
Statistical Analysis. This analysis is useful for identifying behavior patterns in data. If a new problem
arises in your business process, you can look into this analysis to find similar patterns for that problem,
and it may be possible to apply similar prescriptions to the new problem.
Predictive Analysis
Predictive Analysis shows "what is likely to happen" by using previous data.
Prescriptive Analysis
Prescriptive Analysis combines the insights from all the previous analyses to determine which action to
take on a current problem or decision. Most data-driven companies use Prescriptive Analysis because
predictive and descriptive analysis alone are not enough to improve performance. Based on current
situations and problems, they analyze the data and make decisions.