Analytics problem solving involves multiple steps like data cleaning, preparation, modelling,
model evaluation etc. Completing a typical analytics project may take several months, and thus it
is important to have a structure for it.
The structure for analytics problem solving is called the CRISP-DM framework - Cross
Industry Standard Process for Data Mining.
In this session
You will understand why a framework is extremely useful to solve business problems, how it is
used and the common pitfalls people face in the process. It is a 6-step process which starts with
thoroughly understanding the business problem and ends at model building, evaluation and
deployment.
Chandrashekhar Ramanathan
Industry Experts:
Kalpana Subbaramappa
Ex-AVP, GENPACT
As a data analyst, you will face a multitude of challenges ranging from understanding various
business problems to choosing the best techniques to solve them. To avoid getting lost, data
scientists have developed a robust process to solve virtually any analytics problem
in any industry, appropriately called the Cross Industry Standard Process for Data Mining
(CRISP-DM) framework.
It involves a series of steps which you will soon find quite intuitive:
1. Business understanding
2. Data understanding
3. Data Preparation
4. Data Modelling
5. Model Evaluation
6. Model Deployment
In the following videos, you will understand how each step fits smoothly into the process of
generating insights from data.
We now have a framework to think about complex problems, but where do we start from? Do we
ask for data straightaway? Or do we ask some fundamental questions to understand the
problem better?
Imagine you are driving to a hill station and your car breaks down in the middle of nowhere. You
have a toolkit and you want to repair the car. To do so, you need to know exactly what has gone
wrong. Is it the engine, the battery, or did you simply run out of fuel?
For a data analyst, understanding the business and its specific problems is of utmost
importance. You ought to understand the problem clearly to convert it into a well-defined
analytics problem. Only then can you lay out a brilliant strategy to solve it; otherwise you'll be super
efficient at solving the wrong problem!
After figuring out the problem with the car, you ideally experiment with various tools and follow
a somewhat sequential procedure, like a good mechanic. Similarly, analytics projects have a
typical life-cycle with the six stages of the CRISP-DM framework.
Consider an investment problem: compare projected returns across different sectors and use that
analysis to decide which sector to invest in. Two quantities are relevant:
1. Projected returns
2. Current investments
One option is to invest in the sector where most investments are already present. Many analyses
can be performed to support such a decision.
Eg: In the telecom sector, a company may want to identify its high-value customers and save them
from churn (churn refers to customers leaving the network; the analysis estimates the probability
that a customer will leave). It is very difficult to define who the high-value customers are, which
is why business understanding is so important in this case.
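To make the difficulty concrete, here is a minimal sketch of one possible working definition of a high-value customer. Everything in it is an assumption for illustration: the customer IDs, the revenue figures, the function name `high_value_customers` and the "top 20% by monthly revenue" rule are not from the course. A different business might define high value by tenure, margin or data usage instead, which is exactly why the definition must come from business understanding.

```python
# Hypothetical monthly revenue per customer (made-up numbers for illustration).
monthly_revenue = {
    "C001": 120.0,
    "C002": 950.0,
    "C003": 310.0,
    "C004": 1500.0,
    "C005": 80.0,
}


def high_value_customers(revenue_by_customer, top_fraction=0.2):
    """Return the IDs of the top `top_fraction` of customers by revenue.

    The 20% cut-off is an assumed business rule, not something given in the notes.
    """
    ranked = sorted(revenue_by_customer, key=revenue_by_customer.get, reverse=True)
    # Always keep at least one customer so the result is never empty.
    n_keep = max(1, round(len(ranked) * top_fraction))
    return ranked[:n_keep]


high_value = high_value_customers(monthly_revenue)
```

Changing `top_fraction` changes who counts as high value, so the churn analysis that follows depends directly on this business decision.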
It is important to note that the steps in the CRISP-DM framework are not necessarily sequential in
nature. For example, your data understanding can improve your understanding of the business itself.
But it all starts with business understanding. We should be able to articulate the business
objectives from the broad problem statement so that the next steps in the CRISP-DM framework
can be performed effectively.
Owning an IPL Team - Business Understanding
Let's understand the relevance of business understanding using an example we all connect with -
cricket! Imagine you are working as a data scientist in an IPL team.
One of the business problems for IPL team owners is buying the right players.
Interestingly, match-winners are not always the right players, but rather the ones who help
team owners make money.
As with any business, IPL team owners want to make profits. Since profits = revenue – cost, the
team needs to generate more revenues than costs. Let's look at the components of revenues and
costs.
Revenue
Media Rights: 60%
The cricket board of India, BCCI, collects revenue from broadcasters like Sony and shares a part
with the teams. This forms about 60% of the total revenue and teams always get it no matter
what players they choose.
Sponsorship: 25-30%
The logos on players’ jerseys and helmets are advertisements by sponsors. Players also
promote sponsors through TV ads, and most of that money goes to the team owners.
Sponsors pay more when they get higher viewership, which means star players and the ones
who entertain help teams make more money! How much money? A whole lot! The sponsorship
revenue forms about 25 to 30% of total revenue. And this is why most IPL players resemble
walking billboards or Formula 1 drivers.
So what’s the lesson here? Firstly, the business objective is to buy players who help generate
revenue, not necessarily the ones who make the most runs. Since one of the largest components
of revenue is sponsorships, this will have huge implications on what data you collect and
analyse.
Rather than analysing only cricketing statistics like average runs and wickets, you would
also analyse the effect of players on sponsorship revenue. Maybe buy young and marketable
players for deodorant and fairness cream ads!
Costs
There are three main types of costs in buying and maintaining an IPL team:
What's the lesson here? You need to keep the player costs down. This sounds obvious, but teams
often buy expensive players who have been playing well recently. Do you now think it is a wise
decision?
To sum up, the business objective is to identify low-moderate cost players who can generate
high sponsorship revenue. This understanding will significantly change the data, the analysis and
the final choice of players.
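The business objective above can be sketched as a simple ranking exercise. This is only an illustration under stated assumptions: the player names, costs, sponsorship estimates, the cost cap and the helper names `sponsorship_return` and `shortlist` are all hypothetical, and estimating how much sponsorship revenue a single player brings in would itself be a substantial analysis.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    cost: float                 # auction price, in crore rupees (hypothetical)
    sponsorship_revenue: float  # estimated sponsorship revenue from the player (hypothetical)


def sponsorship_return(player):
    """Estimated sponsorship revenue earned per unit of player cost."""
    return player.sponsorship_revenue / player.cost


def shortlist(players, max_cost):
    """Keep players within the cost cap, best revenue-per-cost first."""
    affordable = [p for p in players if p.cost <= max_cost]
    return sorted(affordable, key=sponsorship_return, reverse=True)


players = [
    Player("Star batsman", cost=15.0, sponsorship_revenue=18.0),
    Player("Young all-rounder", cost=4.0, sponsorship_revenue=10.0),
    Player("Veteran bowler", cost=8.0, sponsorship_revenue=6.0),
]
ranked = shortlist(players, max_cost=10.0)
```

Note that the ranking uses no cricketing statistics at all. That is the point of the example: the business objective, not the obvious data, decides what gets measured.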
After business understanding, the next step is data understanding. Once you get your hands on
the data for the first time, you would want to know its structure (number of files, rows, columns
etc.), understand how they are related to each other and check whether something looks fishy, such as a
date column having negative values. Broadly, you are interested in:
The type of data sets that are available for analysis
The information you can get from the datasets
Exploring the data (by plotting graphs and observing them)
Performing quality checks on the data sets
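A first look at a new data set might be sketched as follows. The file contents, column names and the helper `is_valid_date` are hypothetical; the sketch only uses Python's standard library and shows the kind of structure and quality checks described above, including a date column with a negative value.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw file contents; in practice this would be read from disk.
raw = """order_id,order_date,amount
1,2018-01-05,250
2,-20180107,120
3,2018-02-11,
4,2018-02-30,90
"""


def is_valid_date(text, fmt="%Y-%m-%d"):
    """True if `text` parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


rows = list(csv.DictReader(io.StringIO(raw)))

# Structure: how many rows are there, and which columns?
n_rows = len(rows)
columns = list(rows[0].keys())

# Quality checks: rows with an invalid date, and rows with a missing amount.
bad_dates = [r["order_id"] for r in rows if not is_valid_date(r["order_date"])]
missing_amount = [r["order_id"] for r in rows if r["amount"] == ""]
```

Here order 2 has a negative date and order 4 has a date that does not exist (30 February), so both are flagged. Order 3 is flagged for its missing amount.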
Let us listen to professor RC's views on Data Understanding:
In the next segment, you will see how the raw data is prepared for analysis.
Across projects, data analysts spend around 50-80% of the time on data cleaning and
preparation, and therefore data preparation becomes one of the most crucial steps.
Data is usually spread across different files. Collating those files together and selecting the
required rows and columns based on business understanding is a major step in data preparation.
After collating the data set we address missing values and outliers. It is considered the most
crucial step because the model will be built on the data sets created here.
If the data set is erroneous, the solution to the problem we get after building a model would be
erroneous too, no matter how the model is built. Let’s hear more about Data Preparation.
i. Select relevant data: the problem drives the type of data we select, not vice versa.
ii. Integrate data files: for example, join tables and establish master-detail relationships.
iii. Clean the data: resolve all discrepancies.
iv. Construct data: derive variables or columns not present in the data set, e.g. the year
from a date value.
v. Format the data: for example, the case of text and the year format.
To summarise, data preparation is one of the most time-consuming steps of the entire analysis.
It consists of the following steps:
1. Select relevant data
2. Integrate data
3. Clean data
4. Construct Data: Derive new features
5. Format Data
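The five steps can be sketched end to end on a toy example. The tables, column names and the function `prepare` are hypothetical, and filling missing amounts with the median is just one assumed cleaning rule among many; the right treatment of missing values depends on the business problem.

```python
from datetime import datetime
from statistics import median

# Hypothetical inputs: a customer master table and a transaction detail table.
customers = [
    {"customer_id": 1, "city": " mumbai "},
    {"customer_id": 2, "city": "DELHI"},
]
transactions = [
    {"customer_id": 1, "date": "2018-03-14", "amount": 200.0, "notes": "x"},
    {"customer_id": 2, "date": "2018-07-02", "amount": None, "notes": "y"},
    {"customer_id": 1, "date": "2017-11-30", "amount": 400.0, "notes": "z"},
]


def prepare(customers, transactions):
    # 1. Select relevant data: keep only the columns the problem needs.
    selected = [{k: t[k] for k in ("customer_id", "date", "amount")} for t in transactions]

    # 2. Integrate data: join each detail row to its master row.
    city_by_id = {c["customer_id"]: c["city"] for c in customers}
    for row in selected:
        row["city"] = city_by_id[row["customer_id"]]

    # 3. Clean data: fill missing amounts with the median of the known ones (assumed rule).
    known = [r["amount"] for r in selected if r["amount"] is not None]
    fill = median(known)
    for row in selected:
        if row["amount"] is None:
            row["amount"] = fill

    # 4. Construct data: derive the year from the date value.
    for row in selected:
        row["year"] = datetime.strptime(row["date"], "%Y-%m-%d").year

    # 5. Format data: use one consistent text case for city names.
    for row in selected:
        row["city"] = row["city"].strip().title()

    return selected


prepared = prepare(customers, transactions)
```

Each row of `prepared` now carries only the selected columns, the joined city in a consistent format, a filled-in amount and the derived `year` column.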
In the next session, you will understand the next steps of the CRISP-DM framework - modelling,
model evaluation and deployment.
The Heart of Data Analysis: Modelling
"If you torture the data long enough, it will confess."
Modelling is the heart of data analytics. One can think of a model as a black box which takes
relevant data as input and gives an output you are interested in.
Eg: Recall the investment problem discussed earlier. Grouping the sectors shows which of them
share common characteristics, and by examining the problem we identify it as a clustering problem.
Once the type of model is identified, we can select one of the available clustering algorithms.
This was just an overview of modelling. You will learn much more interesting and complex models in
detail in the upcoming courses.
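To give a flavour of what a clustering model does, here is a minimal sketch of k-means, one widely used clustering algorithm. Choosing k-means is an assumption made for illustration; the notes do not say which algorithm the course uses for the investment problem. The sector figures are made up, and starting from the first k points is a simplification that keeps the sketch deterministic.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means on 2-D points: returns (centroids, labels).

    Centroids start at the first k points, which keeps the sketch deterministic;
    real implementations usually pick starting points more carefully.
    """
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical sectors described by (projected return %, current investment).
sectors = [(5.0, 10.0), (6.0, 12.0), (20.0, 80.0), (22.0, 85.0), (5.5, 11.0)]
centroids, labels = kmeans(sectors, k=2)
```

The model is the black box described above: sector figures go in, and a group label for each sector comes out. Here the three low-return, low-investment sectors end up in one group and the two high-return, high-investment sectors in the other.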
In data analytics, evaluation is when you put everything you have done to the litmus test. If the
results obtained from model evaluation are not satisfactory, you repeat the whole
process. If the model performs well and gives you accurate results, congratulations. You can
move on to implementation of the model.
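The evaluate-and-iterate loop can be sketched as follows. This is an illustration only: the hold-out data, the 90% accuracy target, the candidate thresholds and the function names are hypothetical, and each "model" is reduced to a one-line rule so that the loop itself stays visible.

```python
def accuracy(predictions, actuals):
    """Share of predictions that match the actual outcomes."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


def evaluate_until_satisfactory(candidate_thresholds, holdout, target=0.9):
    """Try each candidate model in turn; stop at the first that meets the target.

    Each "model" here is just a rule: predict churn (1) when usage is below a threshold.
    Returns (threshold, accuracy), or (None, best accuracy seen) if none is good enough.
    """
    actuals = [label for _, label in holdout]
    best = 0.0
    for threshold in candidate_thresholds:
        predictions = [1 if usage < threshold else 0 for usage, _ in holdout]
        score = accuracy(predictions, actuals)
        best = max(best, score)
        if score >= target:
            return threshold, score
    return None, best


# Hypothetical hold-out set: (monthly usage in hours, churned? 1 = yes, 0 = no).
holdout = [(2, 1), (3, 1), (8, 0), (12, 0), (5, 1), (9, 0), (4, 0), (15, 0), (1, 1), (7, 0)]
chosen, score = evaluate_until_satisfactory([3, 6, 10], holdout)
```

The first candidate falls short of the target, so the loop moves on to the second, which meets it. If no candidate were good enough, the function would return `None`, which is the signal to go back to earlier stages of the process.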
Let us listen to what our experts have to say about the final two stages of analysis: model evaluation
and deployment.
5. Evaluation: The models built must be assessed to test their effectiveness in solving the current
problem. Not all models are correct; only some are useful. Evaluation is an iterative process, and
the models are tweaked until they are satisfactory, after which they are ready for deployment.
6. Deployment, the last stage: Here the model is translated into business strategy.
One interesting insight is that the whole process is iterative in nature. The intelligence of a
model has to evolve continuously.