Data Mining Project

Decision Mining
Introduction:
Data mining, or knowledge discovery, is the computer-assisted process of digging
through and analyzing enormous sets of data and then extracting their meaning. It is
the extraction of hidden, predictive patterns from large databases. Data mining is
especially useful nowadays, when there is a massive amount of data and identifying the
useful portions of it can be a tedious job in itself.
With data mining we can try to predict future trends rather than identify
them after they have already taken place. Data mining tools predict behaviors and future trends,
allowing businesses to make proactive, knowledge-driven decisions. XLMiner is an add-in
available for MS Excel that allows us to perform data mining on data sets.
Problem Solving:
In this project we would like to solve Southwest's problem of calculating the air fares for
the newly introduced airports and providing discounts for its customers.
Using the dataset, we wanted to explore FARE by creating correlation tables.
We wanted to explore the categorical predictors by computing the percentage of flights in
each category, using pivot tables.

We would like to compare the best models, in terms of their predictors, obtained by stepwise
regression (to reduce the number of predictors) and by exhaustive search used in its
place.
Predict the average fare on a route using exhaustive search.
Compare the predictive accuracy of the models.
Data Description:
The Airfares dataset was provided by the professor.
It contains real data collected for the third quarter of 1996, when several
new airports opened in major cities, opening the market for new routes.
The dataset has a total of 639 records; each record consists of 18 attributes or variables.
In order to price flights on these routes, a major airline collected information on 638 air
routes in the United States.

Some factors are known about these new routes. A major unknown factor is whether
Southwest or another discount airline will travel on these new routes.

Southwest's strategy (covering only major cities, using secondary airports, a
standardized fleet and low fares) has been very different from the model followed by the older
and bigger airlines.

The presence of discount airlines is therefore believed to reduce the fares greatly.
Data Code:
S_CODE      Starting airport's code
S_CITY      Starting city
E_CODE      Ending airport's code
E_CITY      Ending city
COUPON      Average number of coupons (a one-coupon flight is a non-stop flight, a
            two-coupon flight is a one-stop flight, etc.) for that route
NEW         Number of new carriers entering that route between Q3-96 and Q2-97
VACATION    Whether a vacation route (Yes) or not (No)
SW          Whether Southwest Airlines serves that route (Yes) or not (No)
HI          Herfindahl Index - measure of market concentration
S_INCOME    Starting city's average personal income
E_INCOME    Ending city's average personal income
S_POP       Starting city's population
E_POP       Ending city's population
SLOT        Whether either endpoint airport is slot controlled or not; this is a
            measure of airport congestion
GATE        Whether either endpoint airport has gate constraints or not; this is
            another measure of airport congestion
DISTANCE    Distance between two endpoint airports in miles
PAX         Number of passengers on that route during period of data collection
FARE        Average fare on that route (the response)
Excel Sheet-Airfares.xls: (Click on the below excel sheet to view complete data)
Step-1: To find out the best numerical predictor.

A scatter plot is used to examine the correlation between two characteristics, where
correlation implies that as one variable changes, the other variable also changes. A scatter plot
may indicate a cause-and-effect relationship between the two characteristics, but it does not
prove one: a third characteristic (or more) may be the cause that affects both
characteristics of interest.
If we know that there is a good correlation between two characteristics,
we can use one variable to predict the other, particularly if one characteristic is easy to measure
and the other is not. For example, if we show that weight gain in the first trimester of pregnancy
correlates well with fetal development, we can use weight gain as a predictor. The alternative would be
expensive tests to monitor the actual development of the fetus.
[Figure: Scatter Plot of Numerical Predictor (NEW) vs. Response (FARE); y-axis: FARE, $0 to $400]
[Figure: Scatter Plot of Numerical Predictor (HI) vs. Response (FARE); y-axis: FARE, $0 to $400]
[Figure: Scatter Plot of Numerical Predictor (S_INCOME) vs. Response (FARE); y-axis: FARE, $0 to $400]
[Figure: Scatter Plot of Numerical Predictor (E_INCOME) vs. Response (FARE); y-axis: FARE, $0 to $400]
[Figure: Scatter Plot of Numerical Predictor (S_POP) vs. Response (FARE); y-axis: FARE, $0 to $400]
[Figure: Scatter Plot of Numerical Predictor (E_POP) vs. Response (FARE); y-axis: FARE, $0 to $400]
[Figure: Scatter Plot of Numerical Predictor (DISTANCE) vs. Response (FARE); y-axis: FARE, $0 to $400]
[Figure: Scatter Plot of Numerical Predictor (PAX) vs. Response (FARE); y-axis: FARE, $0 to $400]
By plotting scatter plots of all numerical predictors against the response (FARE), we can say that
the numerical variable DISTANCE has the best correlation with the response. Therefore, we
consider DISTANCE the best numerical predictor.
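
The visual comparison of the scatter plots can be backed by a correlation table. The following is a minimal sketch, assuming a handful of made-up rows rather than the real Airfares.xls data; the helper name pearson and all numbers are illustrative only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up illustrative values, NOT taken from Airfares.xls.
predictors = {
    "DISTANCE": [300, 550, 900, 1200, 1800, 2400],
    "PAX":      [12000, 8000, 15000, 5000, 9000, 7000],
    "HI":       [5200, 3900, 4100, 6000, 3500, 4800],
}
fare = [80, 110, 160, 205, 270, 340]

# Correlation of every numerical predictor with the response.
correlations = {name: pearson(values, fare) for name, values in predictors.items()}

# The predictor with the largest absolute correlation is the strongest linear one.
best = max(correlations, key=lambda name: abs(correlations[name]))

for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:10s} r = {r:+.3f}")
print("Strongest linear association with FARE:", best)
```

Ranking by the absolute value of r treats a strong negative relationship as just as informative as a strong positive one.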
Step-2: To find the categorical predictor.

A pivot table is a reporting tool that sorts and sums data independently of the original data layout
in a spreadsheet, by dragging and dropping columns to different row, column, or summary
positions. It can automatically sort, count, total or average the data stored in one table
or spreadsheet. The result is obtained in the form of summarized data, which is shown in
the pivot table.
A pivot table is a flexible tool for displaying data by rearranging it in a variety of
different ways. It is easy to use and useful to any reader of a report generated with a pivot table:
the reader can decide what to look at, and from which perspective, just by dragging and dropping
fields graphically. A pivot table can also be used to switch the placement of rows and
columns, which automatically reorganizes the data.
Pivot Table: (Click on the below excel sheet to view the complete data)
Pivot Table of the Categorical Variables (Vacation, SW, Slot, Gate) and Response (Fare)
After building the pivot table by dragging and dropping, we take the variable with the
highest total between its categories as the best categorical variable. Therefore, we can say that
SLOT is the best categorical variable in response to FARE.
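
The same kind of summary can be sketched outside Excel. The example below is only an illustration under stated assumptions: the rows are made up (not taken from Airfares.xls), the Yes/No labels for SLOT and GATE are assumed, and the helper name pivot is hypothetical. For each categorical variable it computes the percentage of flights and the mean FARE in each category.

```python
from collections import defaultdict

# Made-up illustrative rows, NOT taken from Airfares.xls.
# The Yes/No labels for SLOT and GATE are an assumption for this sketch.
rows = [
    {"VACATION": "No",  "SW": "No",  "SLOT": "Yes", "GATE": "Yes", "FARE": 250.0},
    {"VACATION": "No",  "SW": "No",  "SLOT": "Yes", "GATE": "No",  "FARE": 230.0},
    {"VACATION": "Yes", "SW": "No",  "SLOT": "No",  "GATE": "No",  "FARE": 150.0},
    {"VACATION": "Yes", "SW": "Yes", "SLOT": "No",  "GATE": "No",  "FARE": 90.0},
    {"VACATION": "No",  "SW": "Yes", "SLOT": "No",  "GATE": "No",  "FARE": 100.0},
    {"VACATION": "No",  "SW": "No",  "SLOT": "No",  "GATE": "Yes", "FARE": 200.0},
    {"VACATION": "Yes", "SW": "Yes", "SLOT": "No",  "GATE": "No",  "FARE": 80.0},
    {"VACATION": "No",  "SW": "No",  "SLOT": "No",  "GATE": "No",  "FARE": 180.0},
]
categorical = ["VACATION", "SW", "SLOT", "GATE"]

def pivot(rows, column, value="FARE"):
    """Return {category: (percent_of_rows, mean_value)} for one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[value])
    total = len(rows)
    return {cat: (100.0 * len(vals) / total, sum(vals) / len(vals))
            for cat, vals in groups.items()}

summary = {col: pivot(rows, col) for col in categorical}

for col, table in summary.items():
    for cat, (pct, mean_fare) in sorted(table.items()):
        print(f"{col:9s} {cat:4s} {pct:5.1f}% of flights, mean FARE {mean_fare:7.2f}")
```

Reporting the mean fare next to the share of flights makes it visible when a large fare difference rests on only a few routes.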
Step-3: To find the best suited model

Multiple linear regression (MLR) analysis is a model used to find the linear relationship between
a quantitative dependent variable and a set of predictors. How the model is built and how its
performance is assessed depend on the goal of the analysis.
MLR Excel Sheet: (Click on the below excel sheet to view the complete MLR data)
From the above result set we can say that using predictors that are uncorrelated with the
dependent variable increases the variance of predictions. Therefore, we try to drop such
predictors, keeping in mind that dropping a predictor that is actually correlated with the
dependent variable can increase the average error of predictions.
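
To make the fitting step concrete, here is a minimal sketch of multiple linear regression by ordinary least squares, written with the Python standard library only. It is not the XLMiner procedure used in the project: the helper names (solve, fit_ols, predict) are hypothetical, and the data are generated from a made-up rule (FARE = 50 + 0.1 * DISTANCE - 40 * SW) so that the recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot_row] = m[pivot_row], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(xs, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in xs]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Predicted response for one row of predictor values."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))

# Made-up rows (DISTANCE in miles, SW as 1 = Yes / 0 = No), NOT from Airfares.xls.
xs = [(300, 1), (550, 0), (900, 1), (1200, 0), (1800, 0), (2400, 1)]
# Responses generated from the made-up rule FARE = 50 + 0.1 * DISTANCE - 40 * SW.
fare = [50 + 0.1 * d - 40 * sw for d, sw in xs]

coefs = fit_ols(xs, fare)
print("intercept, b_DISTANCE, b_SW =", [round(c, 4) for c in coefs])
print("predicted fare for a 1000-mile route served by SW:", round(predict(coefs, (1000, 1)), 2))
```

In practice a numerical library would replace the hand-written solver; the point here is only to show what the regression output represents.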
Step-4: To find result set of Exhaustive Search

Exhaustive search evaluates all subsets of predictors, which is practical for a moderate
number of predictors. R^2 cannot be used alone to compare the subsets, because it increases
artificially whenever the number of predictors increases.
Therefore, we use another criterion, Mallows' C_p, which assumes that the full model is
unbiased and yet allows the number of predictors to be reduced for the best result.
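
The search can be sketched as follows, assuming the common form C_p = SSE_p / s^2 - n + 2p, where SSE_p is the sum of squared errors of a subset model with p coefficients (including the intercept), s^2 is the error variance estimated from the full model, and n is the number of records. The data are made up (not from Airfares.xls), DISTANCE is expressed in hundreds of miles, and the helper names solve and sse are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot_row] = m[pivot_row], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def sse(columns, y):
    """Fit OLS with an intercept on the given predictor columns; return the SSE."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(coefs, r))) ** 2
               for r, t in zip(design, y))

# Made-up illustrative data, NOT taken from Airfares.xls.
data = {
    "DISTANCE": [3.0, 5.5, 9.0, 12.0, 18.0, 24.0, 7.0, 15.0, 20.0, 10.0],  # hundreds of miles
    "SW":       [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],                            # 1 = Yes, 0 = No
    "NEW":      [3, 1, 2, 0, 3, 1, 2, 0, 1, 3],
}
fare = [43, 103, 101, 166, 232, 249, 124, 157, 250, 152]

names = list(data)
n = len(fare)
full_sse = sse([data[name] for name in names], fare)
sigma2_full = full_sse / (n - (len(names) + 1))  # error variance from the full model

# Evaluate every non-empty subset of predictors.
cp = {}
for size in range(1, len(names) + 1):
    for subset in combinations(names, size):
        p = len(subset) + 1  # number of coefficients, including the intercept
        cp[subset] = sse([data[name] for name in subset], fare) / sigma2_full - n + 2 * p

best = min(cp, key=cp.get)
for subset, value in sorted(cp.items(), key=lambda kv: kv[1]):
    print(f"Cp = {value:8.2f}  {', '.join(subset)}")
print("Subset with the smallest Cp:", ", ".join(best))
```

With k predictors the search fits 2^k - 1 models, which is why it is only practical for a moderate number of predictors. By construction the full model always has C_p equal to its own number of coefficients, so the interesting comparison is among the smaller subsets.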

Excel sheet of the logistic regression (LR) model and its output: (Click on the below excel sheet to
view the complete LR data)
Step-5: Deciding the best suited model

Choosing the modeling process to be used in the analysis is a very important step: we want to
avoid both a seemingly perfect predictive model that fits the data loosely and a good explanatory
model with low predictive accuracy. A model is treated as a good one when it has a
minimum number of predictors. R^2 rises as the number of predictors increases,
which indirectly affects the performance of the model.

Comparison of Lift and Decile Charts
[Figure: Lift chart (training dataset), MLR; cumulative FARE when sorted using predicted values vs. cumulative FARE using average; x-axis: # cases]
[Figure: Lift chart (training dataset), LR; cumulative SW when sorted using predicted values vs. cumulative SW using average; x-axis: # cases]
[Figure: Decile-wise lift chart (training dataset), MLR; y-axis: Decile mean / Global mean; x-axis: Deciles]
[Figure: Decile-wise lift chart (training dataset), LR; y-axis: Decile mean / Global mean; x-axis: Deciles]
Therefore, by comparing the RMSE, average errors, lift charts and decile-wise charts of
both models on the training dataset, we conclude that the model with the minimum number of
predictors performs best.
The RMSE and average error values are smaller for the logistic regression model than
for the MLR output. We also see from the lift charts that the LR output maintains a linear
structure.
Finally, we state that the logistic regression model is the best model for calculating
airfares for Southwest at newly introduced airports.
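
The measures used in this comparison can be sketched as below. This is an illustration under stated assumptions, not the XLMiner output: the actual and predicted fares are made up, the average error is taken as the mean of (actual - predicted), and the helper names rmse, average_error and decile_lift are hypothetical.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error of the predictions."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def average_error(actual, predicted):
    """Mean of (actual - predicted); values near zero indicate little bias."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def decile_lift(actual, predicted, bins=10):
    """Mean actual value per decile (sorted by prediction, highest first) / global mean.

    Assumes there are at least as many cases as bins.
    """
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    ranked = [actual[i] for i in order]
    global_mean = sum(actual) / len(actual)
    size = len(ranked) / bins
    lifts = []
    for b in range(bins):
        chunk = ranked[round(b * size):round((b + 1) * size)]
        lifts.append(sum(chunk) / len(chunk) / global_mean)
    return lifts

# Made-up actual fares and predictions that miss by +5 or -5, NOT XLMiner output.
actual = [60, 75, 90, 100, 115, 130, 140, 155, 170, 180,
          195, 210, 220, 235, 250, 260, 275, 290, 300, 315]
errors = [5, -5] * 10
predicted = [a + e for a, e in zip(actual, errors)]

model_rmse = rmse(actual, predicted)
model_avg_error = average_error(actual, predicted)
lifts = decile_lift(actual, predicted)

print(f"RMSE = {model_rmse:.2f}, average error = {model_avg_error:.2f}")
for i, lift in enumerate(lifts, start=1):
    print(f"decile {i:2d}: decile mean / global mean = {lift:.2f}")
```

A useful model shows ratios above 1 in the first deciles and below 1 in the last ones, since the cases it ranks highest really do have the highest fares.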
Conclusion:
We conclude that data mining techniques are among the best techniques for
finding the best-suited model for a particular requirement. The software tool used is easy to use
and is available in the market for obtaining the best result set. By comparing the
result sets obtained from the two models, we conclude that logistic regression (exhaustive search) is the
best-suited model for this particular problem.
