
Q: How does the clustering technique work? Discuss the k-means
clustering algorithm in detail.
Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data
points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data
points that are in the same group should have similar properties and/or features, while data points in
different groups should have highly dissimilar properties and/or features. Clustering is a method of
unsupervised learning and is a common technique for statistical data analysis used in many fields.

In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing
what groups the data points fall into when we apply a clustering algorithm.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to
understand the preferences of your customers to scale up your business. Is it possible for you to look at
the details of each customer and devise a unique business strategy for each one of them? Definitely not.
But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing
habits and use a separate strategy for the customers in each of these 10 groups. And this is what we call
clustering.
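Turning to k-means specifically: the algorithm partitions the data into k clusters by first picking k initial centroids, then repeatedly assigning each point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it, until the assignments stop changing. Below is a minimal sketch using scikit-learn (the library choice is an assumption; the text names none), with made-up customer features (annual spend, visits per month) and an illustrative choice of k = 3.

    # Minimal k-means sketch (illustrative data; k=3 and scikit-learn are assumptions).
    import numpy as np
    from sklearn.cluster import KMeans

    # Toy customer data: [annual_spend, visits_per_month]
    X = np.array([
        [200,  2], [220,  3], [250,  2],    # low spend, infrequent visitors
        [800, 10], [850, 12], [900, 11],    # high spend, frequent visitors
        [500,  6], [520,  5], [480,  7],    # mid-range customers
    ])

    # Fit k-means: pick k centroids, assign each point to the nearest centroid,
    # recompute centroids as cluster means, and repeat until convergence.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

    print("Cluster labels:", kmeans.labels_)        # group assigned to each customer
    print("Centroids:\n", kmeans.cluster_centers_)  # average profile of each group

Each of the three resulting groups could then be targeted with its own business strategy, as described above.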

Q: What do you mean by association mining?


Association rules are one of the important concepts of machine learning, used in market basket
analysis. In a store, all vegetables are placed in the same aisle, all dairy items are placed together, and
cosmetics form another such group. Investing time and resources in deliberate product
placements like this not only reduces a customer’s shopping time, but also reminds the customer of
relevant items (s)he might be interested in buying, thus helping stores cross-sell in the process.
Association rules help uncover all such relationships between items in huge databases.

An association rule consists of an antecedent and a consequent, both of which are lists of items. Note that
the implication here is co-occurrence and not causality. For a given rule, the itemset is the list of all the items in
the antecedent and the consequent.

The measures of effectiveness of a rule are as follows:

 Support
 Confidence
 Lift
 Others: Affinity, Leverage

Support indicates how much of the historical data supports the rule, and Confidence indicates how confident
we are that the rule holds.

Support is calculated as the fraction of rows containing both A and B, i.e., the joint probability of A and B.

Confidence is the fraction of rows containing B among the rows containing A, i.e., the conditional probability
of B given A.
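Lift, listed above, compares the rule's confidence with the baseline frequency of the consequent; a lift above 1 means A and B co-occur more often than would be expected by chance. The sketch below computes support, confidence, and lift for a single rule over a handful of made-up transactions; the item names and transactions are invented for the example.

    # Support, confidence, and lift for the rule {bread} -> {butter}
    # (toy transactions; item names are illustrative).
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
        {"bread", "butter", "jam"},
    ]

    antecedent, consequent = {"bread"}, {"butter"}
    n = len(transactions)

    count_a = sum(antecedent <= t for t in transactions)                   # rows containing A
    count_b = sum(consequent <= t for t in transactions)                   # rows containing B
    count_ab = sum((antecedent | consequent) <= t for t in transactions)   # rows containing both

    support = count_ab / n                    # joint probability P(A and B)
    confidence = count_ab / count_a           # conditional probability P(B | A)
    lift = confidence / (count_b / n)         # P(B | A) divided by P(B)

    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")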
Q: What do you understand by the term multidimensional
analysis? Discuss OLAP models in detail. Discuss essential
differences between MOLAP and ROLAP.
Multidimensional analysis is the analysis of dimension objects organized in meaningful hierarchies.

Multidimensional analysis allows users to observe data from various viewpoints. This enables them to
spot trends or exceptions in the data.

A hierarchy is an ordered series of related dimensions. An example of a hierarchy is Geography, which
may group dimensions such as Country, Region, and City.

In Web Intelligence you can drill up or drill down to perform multidimensional analysis.

There are three main types of OLAP servers, as follows:


1. Relational OLAP (ROLAP) – Star Schema based –
ROLAP is based on the premise that data need not be stored multidimensionally in order
to be viewed multidimensionally, and that it is possible to exploit well-proven relational
database technology to handle the multidimensionality of data. In ROLAP, data is stored in a
relational database. In essence, each action of slicing and dicing is equivalent to adding a
“WHERE” clause to the SQL statement (see the sketch after this list). ROLAP can handle large amounts
of data and can leverage functionalities inherent in the relational database.

2. Multidimensional OLAP (MOLAP) – Cube based –

MOLAP stores data on disk in a specialized multidimensional array structure. OLAP is performed
on it relying on the random-access capability of the arrays. Array elements are determined by
dimension instances, and the fact data or measured value associated with each cell is usually
stored in the corresponding array element. In MOLAP, the multidimensional array is usually
stored in a linear allocation according to a nested traversal of the axes in some predetermined
order.
But unlike ROLAP, where only records with non-zero facts are stored, all array elements are
defined in MOLAP and, as a result, the arrays tend to be sparse, with empty elements
occupying a greater part of them. Since both storage and retrieval costs are important when
assessing online performance efficiency, MOLAP systems typically include provisions such as
advanced indexing and hashing to locate data when performing queries over sparse
arrays. MOLAP cubes offer fast data retrieval, are optimal for slicing and dicing, and can perform
complex calculations. All calculations are pre-computed when the cube is created.

3. Hybrid OLAP (HOLAP) –

HOLAP is a combination of ROLAP and MOLAP. HOLAP servers allow storing large volumes
of detail data. On the one hand, HOLAP leverages the greater scalability of ROLAP. On
the other hand, HOLAP leverages cube technology for faster performance and for summary-
type information. Cubes are smaller than in MOLAP since detail data is kept in the relational
database. The databases are used to store data in the most functional way possible.

Some other types of OLAP:

4. Web OLAP (WOLAP) –

WOLAP is a Web-browser-based technology. In traditional OLAP the application is accessed
through a client/server setup, whereas here the application is accessed through a web browser. It is a three-tier
architecture consisting of client, middleware, and database server. The most appealing
features of this style of OLAP were (past tense intended, since few products categorize
themselves this way) the considerably lower investment involved on the client side (“all that’s
needed is a browser”) and enhanced accessibility to connect to the data. A Web-based
application requires no deployment on the client machine. All that is required is a Web browser
and a network connection to the intranet or Internet.
5. Desktop OLAP (DOLAP) –
DOLAP stands for desktop online analytical processing. Here the user can download the data from the
source and work with the dataset on their desktop. Functionality is limited compared to other
OLAP applications, and the cost is lower.
6. Mobile OLAP (MOLAP) –
Mobile OLAP brings OLAP functionality to wireless and mobile devices. Users work with and access the data
through their mobile devices.
7. Spatial OLAP (SOLAP) –
SOLAP emerged by merging the capabilities of both Geographic Information Systems (GIS) and OLAP into a
single user interface. SOLAP was created because data comes in alphanumeric,
image, and vector forms. It provides easy and quick exploration of data that resides in a spatial
database.
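To make the ROLAP point above concrete, the sketch below shows how a slice (fixing one dimension value) and a dice (restricting several dimensions) translate into WHERE clauses against a relational fact table. The table and column names (sales, year, region, product) are assumptions made for the example, not part of any real schema.

    # Slice and dice over a relational fact table, expressed as WHERE clauses.
    # Table and column names (sales, year, region, product) are illustrative only.
    def slice_query(dimension, value):
        # A slice fixes a single dimension to one value.
        return f"SELECT * FROM sales WHERE {dimension} = '{value}'"

    def dice_query(filters):
        # A dice restricts several dimensions to subsets of their values.
        conditions = " AND ".join(
            f"{dim} IN ({', '.join(repr(v) for v in values)})"
            for dim, values in filters.items()
        )
        return f"SELECT * FROM sales WHERE {conditions}"

    print(slice_query("year", "2020"))
    print(dice_query({"region": ["North", "West"], "product": ["Dairy"]}))

Each user action against the ROLAP front end simply generates a query of this shape against the relational database, which is why ROLAP scales with the underlying database technology.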

Q: Write a detailed note on Data mart.


What is Data Mart?
A data mart is focused on a single functional area of an organization and contains a subset of data
stored in a Data Warehouse.

A data mart is a condensed version of Data Warehouse and is designed for use by a specific department,
unit or set of users in an organization. E.g., Marketing, Sales, HR or finance. It is often controlled by a
single department in an organization.

Data Mart usually draws data from only a few sources compared to a Data warehouse. Data marts are
small in size and are more flexible compared to a Datawarehouse.

Why do we need Data Mart?


 A data mart helps improve user response time due to a reduction in the volume of data.
 It provides easy access to frequently requested data.
 Data marts are simpler to implement when compared to a corporate data warehouse. At the same
time, the cost of implementing a data mart is certainly lower than implementing a full
data warehouse.
 Compared to a data warehouse, a data mart is agile. In case of a change in the model, a data mart can be
rebuilt more quickly due to its smaller size.
 A data mart is defined by a single Subject Matter Expert. In contrast, a data warehouse is
defined by interdisciplinary SMEs from a variety of domains. Hence, a data mart is more open to
change than a data warehouse.
 Data is partitioned and allows very granular access control privileges.
 Data can be segmented and stored on different hardware/software platforms.

Types of Data Mart


There are three main types of data marts:

1. Dependent: Dependent data marts draw their data from an existing central data warehouse.
2. Independent: An independent data mart is a standalone system created without a central data
warehouse, drawing data directly from operational sources, external sources, or both.
3. Hybrid: This type of data mart can take data both from data warehouses and from operational systems.

Q: What is Data Mining?


Data mining is looking for hidden, valid, and potentially useful patterns in huge data sets. Data mining is
all about discovering unsuspected or previously unknown relationships among the data.

It is a multi-disciplinary skill that uses machine learning, statistics, AI and database technology.

The insights derived via Data Mining can be used for marketing, fraud detection, and scientific discovery,
etc.

Data mining is also called knowledge discovery, knowledge extraction, data/pattern analysis,
information harvesting, etc.

Types of Data

Data mining can be performed on the following types of data:

 Relational databases
 Data warehouses
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Transactional and Spatial databases
 Heterogeneous and legacy databases
 Multimedia and streaming database
 Text databases
 Text mining and Web mining
Data Mining Techniques

1. Classification:

This analysis is used to retrieve important and relevant information about data and metadata. This data
mining method helps to classify data into different classes.

2. Clustering:

Clustering analysis is a data mining technique for identifying data points that are similar to each other. This
process helps in understanding the differences and similarities within the data.

3. Regression:

Regression analysis is the data mining method of identifying and analyzing the relationship between
variables. It is used to predict the value of a particular variable, given the values of other variables.

4. Association Rules:

This data mining technique helps to find associations between two or more items. It discovers
hidden patterns in the data set.

5. Outlier detection:

This type of data mining technique refers to the observation of data items in the dataset that do not
match an expected pattern or expected behavior. This technique can be used in a variety of domains,
such as intrusion detection, fraud or fault detection, etc. (a minimal sketch appears after this list).
Outlier detection is also called outlier analysis or outlier mining.

6. Sequential Patterns:

This data mining technique helps to discover or identify similar patterns or trends in transaction data over a
certain period.

7. Prediction:

Prediction uses a combination of the other data mining techniques, such as trend analysis, sequential patterns,
clustering, and classification. It analyzes past events or instances in the right sequence to predict a
future event.
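As a minimal sketch of the outlier detection technique listed above: one simple approach is to flag values whose z-score (distance from the mean measured in standard deviations) exceeds a chosen threshold. The transaction amounts and the threshold of 2.5 below are assumptions made for the example, not values from the text.

    # Simple z-score outlier detection (illustrative data and threshold).
    import numpy as np

    amounts = np.array([52, 48, 50, 51, 47, 49, 53, 50, 500, 46])  # one unusual transaction

    # z-score: how many standard deviations each value lies from the mean.
    z_scores = (amounts - amounts.mean()) / amounts.std()
    outliers = amounts[np.abs(z_scores) > 2.5]

    print("Outliers:", outliers)  # flags the 500 transaction

Real systems such as fraud or intrusion detection use more robust methods, but the underlying idea is the same: isolate observations that do not match the expected pattern.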

Benefits of Data Mining:


 Data mining techniques help companies obtain knowledge-based information.
 Data mining helps organizations make profitable adjustments in operations and
production.
 Data mining is a cost-effective and efficient solution compared to other statistical data
applications.
 Data mining helps with the decision-making process.
 It facilitates automated prediction of trends and behaviors as well as automated discovery of
hidden patterns.
 It can be implemented in new systems as well as on existing platforms.
 It is a speedy process, which makes it easy for users to analyze huge amounts of data in less
time.

Disadvantages of Data Mining


 There is a chance that companies may sell useful information about their customers to other
companies for money. For example, American Express has sold credit card purchase data of its
customers to other companies.
 Much data mining analytics software is difficult to operate and requires advanced training to
work with.
 Different data mining tools work in different ways due to the different algorithms employed in
their design. Therefore, selecting the right data mining tool is a very difficult task.
 Data mining techniques are not perfectly accurate, and so they can cause serious consequences in
certain conditions.

Applications of data mining


 Helps in identifying trends.
 Vendor analysis.
 Customer segmentation.
 Understanding and identifying prospects.
 Identifying customers.
 Identifying the demand being made.
 Customer clustering.
 Segmentation based on the type of area.
 Exchange schemes.
 Financial statement analysis.

Example

A bank wants to find new ways to increase revenue from its credit card operations. It wants to
check whether usage would double if fees were halved.

The bank has multiple years of records on average credit card balances, payment amounts, credit limit usage,
and other key parameters. It creates a model to check the impact of the proposed new business
policy. The data results show that cutting fees in half for a targeted customer base could increase
revenues by $10 million.
Q: Classification and Clustering
Classification is the process of learning a model that elucidates different predetermined classes
of data. It is a two-step process, comprising a learning step and a classification step. In the
learning step a classification model is constructed, and in the classification step the constructed
model is used to predict the class labels for given data.
For example, in a banking application, a customer who applies for a loan may be classified as
safe or risky according to his/her age and salary. This type of activity is also called supervised
learning. The constructed model can be used to classify new data. The learning step can be
accomplished by using an already defined training set of data. Each record in the training data is
associated with an attribute referred to as a class label, which signifies which class the record
belongs to. The produced model could be in the form of a decision tree or a set of rules.
A decision tree is a graphical depiction of the interpretation of each class, or of the classification rules.
Regression is a special application of classification rules: it is useful when the value
of a variable is predicted from the rest of the tuple, rather than mapping a tuple of data from a relation
to a definite class. Some common classification algorithms are decision trees, neural networks,
logistic regression, etc.

Clustering is a technique of organizing a group of data into classes and clusters where the
objects residing inside a cluster have high similarity and the objects of two different clusters are
dissimilar to each other. Here the two clusters can be considered disjoint. The main target of
clustering is to divide the whole data into multiple clusters. Unlike the classification process, here
the class labels of objects are not known beforehand, and clustering pertains to unsupervised
learning.
In clustering, the similarity between two objects is measured by a similarity function, typically by
measuring the distance between the two objects. The shorter the distance, the higher the similarity;
conversely, the longer the distance, the higher the dissimilarity.

Classification and clustering are methods used in data mining for analyzing data sets
and dividing them on the basis of particular classification rules or the associations between
objects. Classification categorizes the data with the help of provided training data. Clustering, on the
other hand, uses similarity measures to categorize the data.
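The following is a minimal sketch of the two-step classification process described above, using a decision tree from scikit-learn on the loan example. The (age, salary) training records and their safe/risky labels are invented for illustration, and scikit-learn is an assumed tool choice; a real model would be trained on actual historical loan data.

    # Learning step + classification step, sketched with a decision tree.
    # (age, salary) records and their safe/risky labels are illustrative only.
    from sklearn.tree import DecisionTreeClassifier

    # Training data: each record carries a class label (supervised learning).
    X_train = [[25, 25000], [45, 90000], [35, 60000], [22, 18000], [50, 120000], [30, 30000]]
    y_train = ["risky", "safe", "safe", "risky", "safe", "risky"]

    # Learning step: construct the classification model from the training set.
    model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

    # Classification step: use the constructed model to predict labels for new applicants.
    new_applicants = [[28, 75000], [23, 20000]]
    print(model.predict(new_applicants))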

Q: Teradata Architecture – Components of Teradata


The biggest strength of the Teradata and Netezza data warehouse appliances (RDBMS) is parallel processing.
Just like the Netezza architecture, the Teradata architecture is based on a Massively Parallel Processing (MPP)
architecture. Teradata is made up of the Parsing Engine, the BYNET, Access Module Processors (AMPs),
and other components such as nodes. Teradata is an inexpensive, high-quality system that exceeds the
performance of conventional relational database management systems.

Teradata Architecture Diagram


The following diagram shows the high-level architecture of a Teradata node.

Teradata Architecture – Components

Following are the key components of Teradata:

 Parsing Engine (PE)


The Parsing Engine (PE) is responsible for receiving SQL queries from the client. Whenever you
connect to Teradata, you are actually connecting to a PE.
The Parsing Engine creates an execution plan for the submitted queries. The other responsibilities of the PE
are to receive the SQL query from the client, check the query for syntax errors, check the user's privileges to
read data from the tables, pass the efficient execution plan to the BYNET, receive results from the AMPs, and
send them back to the client application.
The PE uses statistics such as the number of AMPs connected to the Teradata system and the number of rows
in the tables to create an efficient execution plan.

 Access Module Processor (AMP)


Access Module Processors are virtual processors (vprocs), and these processors actually store and
retrieve the data.
An AMP receives data and the execution plan from the Parsing Engine (PE). It performs tasks such as filtering,
aggregation, and grouping on the data and stores the results back to its associated disks. Each AMP is
associated with a set of disks, and only that AMP can access those disks. Data is evenly distributed across the
AMPs based on the primary index column; if none is specified, Teradata will use the first column
for data distribution across all AMPs by creating a non-unique primary index on it. (A simplified sketch of this
hash distribution appears after this list.)

 BYNET (Message Passing Layer)


The middle layer, the message passing layer, is called the BYNET. This layer is also the communication layer
between the AMPs and nodes. It receives the execution plan from the PE and sends it to the appropriate AMPs.
It also receives the processed data from the AMPs and sends it to the PE.
To maintain high availability, there are two BYNETs (BYNET 0 and BYNET 1) in Teradata systems. The
other BYNET takes over in case the primary BYNET fails.

 Nodes
Each individual server in a Teradata system is referred to as a node. Each node has its own operating system,
CPU, memory, its own copy of the Teradata RDBMS software, and disk space.
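The toy sketch below illustrates the row-distribution idea mentioned under the AMP component: a hash of the primary index value selects which AMP (and hence which set of disks) owns each row. This is a deliberately simplified model, not Teradata's actual hashing algorithm; the row data and the choice of four AMPs are assumptions made for the example.

    # Simplified illustration of hash-based row distribution across AMPs.
    # Not Teradata's real hashing algorithm; AMP count and rows are made up.
    from collections import defaultdict
    from zlib import crc32

    NUM_AMPS = 4
    amps = defaultdict(list)  # amp_id -> rows owned by that AMP

    rows = [
        {"customer_id": 101, "name": "Alice"},
        {"customer_id": 102, "name": "Bob"},
        {"customer_id": 103, "name": "Carol"},
        {"customer_id": 104, "name": "Dave"},
        {"customer_id": 105, "name": "Eve"},
    ]

    for row in rows:
        # Hash the primary index column to decide which AMP stores the row.
        amp_id = crc32(str(row["customer_id"]).encode()) % NUM_AMPS
        amps[amp_id].append(row)

    for amp_id in sorted(amps):
        print(f"AMP {amp_id}: {[r['name'] for r in amps[amp_id]]}")

Because every AMP owns only its own rows, each AMP can filter, aggregate, and group its share of the data in parallel, which is where the MPP architecture gets its performance.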

Q: What is Data Analysis?


Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful
information for business decision-making. Whenever we take a decision in our day-to-day life, we do so by
thinking about what happened last time or what will happen if we choose that particular option. This is
nothing but analyzing our past or future and making decisions based on it.

Types of Data Analysis: Techniques and Methods

There are several types of data analysis techniques, depending on the business and the technology. The
major types of data analysis are:

 Text Analysis
 Statistical Analysis
 Diagnostic Analysis
 Predictive Analysis
 Prescriptive Analysis

Text Analysis
Text Analysis is also referred to as Data Mining. It is a method to discover a pattern in large data sets
using databases or data mining tools. It used to transform raw data into business information. Business
Intelligence tools are present in the market which is used to take strategic business decisions. Overall it
offers a way to extract and examine data and deriving patterns and finally interpretation of the data.

Statistical Analysis
Statistical Analysis shows "What happen?" by using past data in the form of dashboards. Statistical
Analysis includes collection, Analysis, interpretation, presentation, and modeling of data. It analyses a
set of data or a sample of data. There are two categories of this type of Analysis - Descriptive Analysis
and Inferential Analysis.
Descriptive Analysis
analyses complete data or a sample of summarized numerical data. It shows mean and deviation for
continuous data whereas percentage and frequency for categorical data.

Inferential Analysis
analyses sample from complete data. In this type of Analysis, you can find different conclusions from the
same data by selecting different samples.
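A small sketch of the descriptive analysis idea above: computing the mean and standard deviation for a continuous column, and frequencies and percentages for a categorical one. The example data (order amounts and payment methods) is invented for illustration, and pandas is an assumed tool choice.

    # Descriptive analysis sketch: mean/deviation for continuous data,
    # frequency/percentage for categorical data. Data is illustrative.
    import pandas as pd

    df = pd.DataFrame({
        "order_amount": [120.0, 80.5, 95.0, 210.0, 60.0, 150.0],
        "payment_method": ["card", "cash", "card", "card", "cash", "wallet"],
    })

    # Continuous column: mean and standard deviation.
    print("mean:", df["order_amount"].mean())
    print("std: ", df["order_amount"].std())

    # Categorical column: frequency counts and percentages.
    print(df["payment_method"].value_counts())
    print(df["payment_method"].value_counts(normalize=True) * 100)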

Diagnostic Analysis
Diagnostic Analysis shows "Why did it happen?" by finding the cause from the insight found in Statistical
Analysis. This Analysis is useful to identify behavior patterns of data. If a new problem arrives in your
business process, then you can look into this Analysis to find similar patterns of that problem. And it
may have chances to use similar prescriptions for the new problems.

Predictive Analysis
Predictive Analysis shows "what is likely to happen" by using previous data.

Prescriptive Analysis
Prescriptive Analysis combines the insight from all previous Analysis to determine which action to take
in a current problem or decision. Most data-driven companies are utilizing Prescriptive Analysis because
predictive and descriptive Analysis are not enough to improve data performance. Based on current
situations and problems, they analyze the data and make decisions.
