
Analytics Problem Solving

The CRISP-DM Framework


Introduction
Until now, you have learnt to use the tools for data analysis and machine learning. This module
marks an inflexion point where you will learn how to solve business problems using data
analysis.

Analytics problem solving involves multiple steps such as data cleaning, preparation, modelling
and model evaluation. Completing a typical analytics project may take several months, so it
is important to have a structure for it.

The structure for analytics problem solving is called the CRISP-DM framework - Cross
Industry Standard Process for Data Mining.

In this session
You will understand why a framework is extremely useful to solve business problems, how it is
used and the common pitfalls people face in the process. It is a 6-step process which starts with
thoroughly understanding the business problem and ends at model building, evaluation and
deployment.

Guidelines for in-module questions


The in-video and in-content questions for this module are not graded.

People you will hear from in this session:


Subject Matter Expert:

Chandrashekhar Ramanathan

Associate Professor & Associate Dean (Academics), IIIT-B

The International Institute of Information Technology, Bangalore, commonly known as IIIT
Bangalore, is a premier national graduate school in India. Founded in 1999, it offers Integrated
M.Tech., M.Tech., M.S. (Research) and PhD programs in the field of Information Technology.

Industry Experts:

Kalpana Subbaramappa
Ex-AVP, GENPACT

GENPACT is a multinational business process and information technology services company.

Define the Business Problem - Business Understanding


"I never failed once. It just happened to be a 2000-step process."

As a data analyst, you will face a multitude of challenges ranging from understanding various
business problems to choosing the best techniques to solve them. To avoid getting lost, data
scientists have developed a robust process to solve virtually any analytics problem
in any industry, appropriately called the Cross Industry Standard Process for Data Mining
(CRISP-DM) framework.

It involves a series of steps which you will soon find quite intuitive:
1. Business understanding
2. Data understanding
3. Data Preparation
4. Data Modelling
5. Model Evaluation
6. Model Deployment
In the following videos, you will understand how each step fits smoothly into the process of
generating insights from data.

We now have a framework to think about complex problems, but where do we start from? Do we
ask for data straightaway? Or do we ask some fundamental questions to understand the
problem better?

Imagine you are driving to a hill station and your car breaks down in the middle of nowhere. You
have a toolkit and you want to repair the car. To do so, you need to know exactly what has gone
wrong. Is it the engine, the battery, or did you simply run out of fuel?

For a data analyst, understanding the business and its specific problems is of utmost
importance. You need to understand the problem clearly to convert it into a well-defined
analytics problem. Only then can you lay out a sound strategy to solve it; otherwise you will be
super efficient at solving the wrong problem!

After figuring out the problem with the car, you ideally experiment with various tools and follow
a somewhat sequential procedure, like a good mechanic. Similarly, analytics projects have a
typical life-cycle with the six stages of the CRISP-DM framework.

Professor R.C. will introduce you to these 6 stages.


Understanding the business objective helps us identify the goals for data analysis, that is,
what type of analysis should be performed.

For example, suppose the objective is to decide which sector to invest in. We could compare
sectors on the following and use that analysis to make the decision:

1. Projected returns

2. Current investments (that is, invest in the sector where most investments are already present)

Many different analyses can be performed to achieve this.

Another example that shows the importance of business understanding comes from the telecom
sector: identify the high-value customers and save them from churn (churn refers to customers
who are likely to leave the network).

It is very difficult to define who the high-value customers are. In this case, business
understanding is very important.

It is important to note that the steps in the CRISP-DM framework are not necessarily sequential in
nature. For example, your data understanding can improve your understanding of the business itself.
But it all starts with business understanding. We should be able to articulate the business
objectives from the broad problem statement so that the next steps in the CRISP-DM framework
can be performed effectively.

Owning an IPL Team - Business Understanding
Let's understand the relevance of business understanding using an example we all connect with -
cricket! Imagine you are working as a data scientist in an IPL team.

One of the business problems for IPL team owners is buying the right players.
Interestingly, the right players are not always the match-winners, but rather the ones who help
team owners make money.

As with any business, IPL team owners want to make profits. Since profits = revenue – cost, the
team needs to generate more revenue than it incurs in costs. Let's look at the components of
revenue and cost.

Revenue
Media Rights: 60%
The cricket board of India, BCCI, collects revenue from broadcasters like Sony and shares a part
with the teams. This forms about 60% of the total revenue and teams always get it no matter
what players they choose.

Sponsorship: 25-30%
The logos on players’ jerseys and helmets are advertisements by sponsors. Players also
promote sponsors' products through TV ads, and most of that money goes to the team owners.
Sponsors pay more when they get higher viewership, which means star players and the ones
who entertain help teams make more money! How much money? A whole lot! The sponsorship
revenue forms about 25 to 30% of total revenue. And this is why most IPL players resemble
walking billboards or Formula 1 drivers.

Ticket Sales: 6-8%


Most of the ticket money goes to team owners, which amounts to about 6-8% of the total
revenue.

Prize Money: 5-10% (if applicable)


In any tournament, every player wants the glory of lifting the cup. Strangely enough, though,
prize money is not a large part of the overall revenue. The winning team itself gets about 15-20
crore rupees, which is somewhat low compared to the revenue from sponsors and
advertisements.

So what’s the lesson here? Firstly, the business objective is to buy players who help generate
revenue, not necessarily the ones who make the most runs. Since one of the largest components
of revenue is sponsorships, this will have huge implications on what data you collect and
analyse.

Rather than only analysing cricketing statistics like average runs and wickets, you would
also analyse the effect of players on sponsorship revenue. Maybe buy young and marketable
players for deodorant and fairness cream ads!

Let’s now look at the costs.

Costs
There are three main types of costs in buying and maintaining an IPL team:

Franchise Fee: 60%


There’s the cost of owning an IPL team, the money that the team owner pays to the BCCI for the
honour of owning a team in the IPL. This cost is fixed and amounts to about 60% of the total
cost.

Player Costs: 30-35%


This amounts to roughly 30-35% of the total cost and depends upon the choice of players.

Maintenance costs: 5-10%


This is the cost of running the day-to-day activities, staff salaries etc. This does not depend on
the choice of players.

What's the lesson here? You need to keep the player costs down. This sounds obvious, but teams
often buy expensive players who have been playing well recently. Do you now think that is a wise
decision?

To sum up, the business objective is to identify low-to-moderate-cost players who can generate
high sponsorship revenue. This understanding will significantly change the data, the analysis and
the final choice of players.
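
The objective above can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the player names, costs and sponsorship figures are invented, and ranking by
"sponsorship revenue per crore of cost" is one possible way to express the objective, not a method
given in the course.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    cost_crore: float         # hypothetical auction price, in crore rupees
    sponsorship_crore: float  # hypothetical sponsorship revenue attributed to the player


def rank_players(players, max_cost_crore):
    """Keep players within the cost cap, best sponsorship-per-cost first."""
    affordable = [p for p in players if p.cost_crore <= max_cost_crore]
    return sorted(
        affordable,
        key=lambda p: p.sponsorship_crore / p.cost_crore,
        reverse=True,
    )


# Invented example data, used only to exercise the function.
players = [
    Player("Player A", cost_crore=12.0, sponsorship_crore=6.0),
    Player("Player B", cost_crore=2.0, sponsorship_crore=3.0),
    Player("Player C", cost_crore=5.0, sponsorship_crore=4.0),
]

shortlist = [p.name for p in rank_players(players, max_cost_crore=6.0)]
print(shortlist)
```

In practice, estimating how much sponsorship revenue a single player brings in is itself an
analytics problem; the sketch simply assumes that estimate is available.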

Understanding Raw Data


"Data! Data! Data! I can’t make bricks without clay!"

After business understanding, the next step is data understanding. Once you get your hands on
the data for the first time, you would want to know its structure (number of files, rows, columns
etc.), understand how the files are related to each other and check whether something looks fishy,
like a date column having negative values. Broadly, you are interested in:
• The type of data sets that are available for analysis
• The information you can get from the datasets
• Exploring the data (by plotting graphs and observing them)
• Performing quality checks on the data sets
Let us listen to Professor R.C.'s views on Data Understanding:

2. Data Understanding

2.i Collect relevant data
2.ii Describe data
2.iii Explore data
2.iv Verify data quality

You learnt a few steps on understanding the data and its quality. To summarize, under data
understanding one should:
1. Collect relevant data
2. Describe datasets
3. Explore data by plotting graphs
4. Check data quality
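
The describe and quality-check steps can be sketched with the Python standard library. This is a
minimal sketch, assuming a small in-memory CSV; the column names (`customer_id`, `signup_date`,
`monthly_spend`) and all values are invented for illustration, and the plotting step is left out.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data: one impossible date, one missing date, one negative amount.
RAW_CSV = """customer_id,signup_date,monthly_spend
1,2018-01-15,250
2,2018-02-30,300
3,,120
4,2018-03-10,-40
"""


def describe(rows):
    """Describe the structure: number of rows and the column names."""
    return {"rows": len(rows), "columns": list(rows[0].keys()) if rows else []}


def is_valid_date(text):
    """Return True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def quality_report(rows):
    """Count missing values per column and list rows that look fishy."""
    missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
    bad_dates = [
        r["customer_id"]
        for r in rows
        if r["signup_date"] and not is_valid_date(r["signup_date"])
    ]
    negative_spend = [
        r["customer_id"]
        for r in rows
        if r["monthly_spend"] and float(r["monthly_spend"]) < 0
    ]
    return {"missing": missing, "bad_dates": bad_dates, "negative_spend": negative_spend}


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
summary = describe(rows)
report = quality_report(rows)
print(summary)
print(report)
```

Which checks matter depends on the business problem; the ones here are only examples of the
"something looks fishy" category mentioned above.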

In the next segment, you will see how the raw data is prepared for analysis.

Preparing Data for Analysis


"Give me six hours to chop down a tree and I will spend the first four sharpening the axe."

Across projects, data analysts spend around 50-80% of the time on data cleaning and
preparation, and therefore data preparation becomes one of the most crucial steps.

Data is usually spread across different files. Collating those files together and selecting the
required rows and columns based on business understanding is a major step in data preparation.
After collating the data set we address missing values and outliers. It is considered the most
crucial step because the model will be built on the data sets created here.

If the data set is erroneous, the solution we get after building a model will be erroneous too,
no matter how the model is built. Let’s hear more about Data Preparation.

3. Data Preparation. This is a very intuitive process:

i. Select relevant data: the problem drives the type of data we select, and not vice versa.
ii. Integrate data files: for example, joining tables and establishing master-detail relationships.
iii. Clean the data: identify and fix all discrepancies in the dataset.
iv. Construct data: derive variables or columns that are not present in the data set, e.g. the
year from a date value.
v. Format the data: e.g. the case of text, the year format etc.

To summarise, data preparation is one of the most time-consuming steps of the entire analysis.
It consists of the following steps:
1. Select relevant data
2. Integrate data
3. Clean data
4. Construct Data: Derive new features
5. Format Data
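
The five steps listed above can be sketched as follows. This is a minimal sketch using plain
Python lists of dictionaries; the `customers` and `orders` tables and their columns are
hypothetical, and dropping rows with a missing amount is just one possible cleaning policy.

```python
from datetime import datetime

# Hypothetical master table (customers) and detail table (orders).
customers = [
    {"customer_id": 1, "name": "  asha RAO ", "city": "Pune"},
    {"customer_id": 2, "name": "vikram shah", "city": "Delhi"},
]
orders = [
    {"order_id": 101, "customer_id": 1, "order_date": "2018-11-13", "amount": 500.0},
    {"order_id": 102, "customer_id": 2, "order_date": "2017-06-01", "amount": None},
    {"order_id": 103, "customer_id": 1, "order_date": "2018-01-20", "amount": 300.0},
]


def prepare(customers, orders):
    # 1. Select relevant data: only the customer name is needed here, not the city.
    names = {c["customer_id"]: c["name"] for c in customers}

    prepared = []
    for order in orders:
        # 3. Clean data: skip orders with a missing amount (an assumed policy).
        if order["amount"] is None:
            continue

        # 2. Integrate data: join each order to its customer via customer_id.
        row = {
            "order_id": order["order_id"],
            "customer": names[order["customer_id"]],
            "amount": order["amount"],
        }

        # 4. Construct data: derive the year from the date value.
        row["order_year"] = datetime.strptime(order["order_date"], "%Y-%m-%d").year

        # 5. Format data: trim extra spaces and normalise the text case.
        row["customer"] = " ".join(row["customer"].split()).title()

        prepared.append(row)
    return prepared


prepared = prepare(customers, orders)
for row in prepared:
    print(row)
```
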
In the next session, you will understand the next steps of the CRISP-DM framework - modelling,
model evaluation and deployment.

The Heart of Data Analysis: Modelling

"If you torture the data long enough, it will confess."

Modelling is the heart of data analytics. One can think of a model as a black box which takes
relevant data as input and gives an output you are interested in.

Let's see how modelling is used to solve business problems.


4. Modelling is the heart of data analysis.

Example: recall the investment problem discussed earlier. The task there is to form groups that
show common characteristics. By examining the problem, we identify it as a clustering problem.
Once the type of model is identified, we can select one of the algorithms available for it.

This was just an overview of modelling. You will learn much more interesting and complex models in
detail in the upcoming courses.
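
To make the clustering idea concrete, here is a minimal sketch of k-means in one dimension.
k-means is a commonly used clustering algorithm chosen here for illustration; the notes do not
say which algorithms the course has in mind. The "projected returns" values are invented.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters; return (centres, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted values.
    if k == 1:
        centres = [sum(values) / len(values)]
    else:
        centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]

    def nearest(value):
        # Index of the centre closest to this value.
        return min(range(k), key=lambda i: abs(value - centres[i]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v)].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        new_centres = [
            sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres

    labels = [nearest(v) for v in values]
    return centres, labels


# Hypothetical projected returns (in %) for six sectors.
projected_returns = [2.0, 2.5, 3.0, 9.0, 10.0, 11.0]
centres, labels = kmeans_1d(projected_returns, k=2)
print(centres, labels)
```

Here the model acts as the black box described above: the returns go in, and a group label for
each sector comes out.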

Model Evaluation and Deployment


"True genius resides in the capacity for evaluation of uncertain, hazardous and conflicting
information."

In data analytics, evaluation is when you put everything you have done to the test. If the
results obtained from model evaluation are not satisfactory, you iterate over the whole
process again. If the model performs well and gives you accurate results, congratulations. You can
move on to implementation of the model.

Let us listen to what our experts have to say about the final two stages of analysis: model
evaluation and deployment.

5. Evaluation.
The models built must be assessed to test their effectiveness in solving the current problem. Not
all models are correct. Only some are useful. Evaluation is an iterative process: the models are
tweaked until they are satisfactory, after which they are ready for deployment.

6. Deployment, the last stage: here the model is carried over into the business strategy.

Modelling -> Evaluation -> Deployment form a cycle.
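
The tweak-and-evaluate loop can be sketched as below. This is an illustration under stated
assumptions: the "model" is a deliberately simple rule (predict churn when monthly usage is below
a threshold), the hold-out data and the 90% accuracy target are invented, and accuracy is only one
of many possible evaluation measures.

```python
# Hypothetical hold-out data: (monthly_usage, churned)
HOLDOUT = [
    (5, True), (8, True), (12, True), (10, True),
    (15, False), (20, False), (25, False), (30, False),
]


def accuracy(threshold, data):
    """Share of customers for whom 'usage < threshold' matches actual churn."""
    correct = sum(1 for usage, churned in data if (usage < threshold) == churned)
    return correct / len(data)


def tune_until_satisfactory(candidate_thresholds, data, target):
    """Try each candidate in turn; stop at the first one that meets the target."""
    for threshold in candidate_thresholds:
        score = accuracy(threshold, data)
        if score >= target:
            return threshold, score  # satisfactory: ready for deployment
    return None  # nothing was good enough: revisit the earlier steps


result = tune_until_satisfactory([6, 9, 13, 18], HOLDOUT, target=0.9)
print(result)
```

Returning `None` corresponds to the case described above where the results are not satisfactory
and the earlier stages have to be revisited.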


Evaluation is necessary to ensure that your model is robust and effective. Finally,
implementation is the natural fruition of a project life-cycle.

One interesting insight is that the whole process is iterative in nature. The intelligence of a
model has to evolve continuously.

This completes the typical life cycle of a data analytics project.
