


Airline Delay Predictions using Supervised Machine Learning

Article  in  International Journal of Pure and Applied Mathematics · February 2018


2 authors:

Prabakaran. N Rajendran Kannadasan

VIT University VIT University




All content following this page was uploaded by Prabakaran. N on 09 May 2018.


International Journal of Pure and Applied Mathematics
Volume 119 No. 7 2018, 329-337
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
Special Issue

Airline Delay Predictions using Supervised Machine Learning

Pranalli Chandraa, Prabakaran. N and Kannadasan. R, VIT University, Vellore

Abstract—The primary goal of this project is to predict airline delays caused by various factors. Flight delays have negative impacts, mainly economic, on commuters, airlines and airport authorities. Furthermore, from a sustainability standpoint, they can cause environmental harm through increased fuel consumption and gas emissions. These factors indicate how necessary and relevant it has become to predict delays, whatever the scale of the airline network. The predictive analysis draws on a range of statistical techniques from supervised machine learning and data mining that study current and historical data to predict or analyze future delays, here through regression analysis with a regularization technique in Python 3. The prediction will help give a detailed analysis of the performance of individual airlines and airports, and support well-assessed decisions. Moreover, apart from the assessment relevant to passengers, delay prediction analysis will also help in the important decision-making procedures of every pivotal player in the air transportation system.

I. INTRODUCTION
We are in a defining period of human history, in which computing has moved from mainframes to PCs to the cloud, and now to artificial intelligence. A fundamental sub-area of artificial intelligence, machine learning, enables computers to learn without being explicitly programmed. With machine learning we can apply complex mathematical computations to big data iteratively, automatically and quickly, and the approach has been gaining momentum over the last several years. Data mining, on the other hand, involves discovering and sorting data within large data sets to identify patterns and establish relationships, with the aim of solving problems through data analysis. Machine learning and data mining use the same type of approach and set of algorithms; only the kind of data pre-processing and the end prediction vary. This project combines these two core areas to predict and present the most accurate results possible.

A. Supervised Machine Learning

Supervised learning is a machine learning task in which the inputs and outputs of the dataset are known and given, and algorithms are trained using labeled examples. The dataset is divided into training and test data; the algorithm examines the training data and produces an inferred function, which is then used to map new examples. Commercial aviation is a complex, distributed transportation system. It deals with several important resources, demand fluctuations, and various kinds of stages. These stages take place at terminal boundaries, runways, airports, and distinct airspaces, and may be susceptible to different kinds of delays or cancellations. Examples of causes include weather conditions, ground delays, air traffic control and several other constraints and unforeseen circumstances that lead to delays and cancellations across the aviation industry. Hence, this is a suitable scenario for a supervised machine learning algorithm that determines and predicts the labels of unseen instances.

The supervised learning algorithm here models relationships and dependencies between the target output and the input features, so that output values for new data can be predicted from the relationships learned from the previous data set. Supervised learning problems can be further categorized into the following problems:
• Classification – a type of problem in which the output variable is a category, such as “Win” or “Lose”; the input data is classified into the category values. It is widely used for recommendation problems.
• Regression – a type of problem in which the output variable is a real value. This is the problem type most widely used for prediction analysis, and hence will be used in this project.
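As a small illustration of the supervised regression workflow described above, the sketch below splits labeled examples into training and test data, learns an inferred function on the training part, and applies it to the unseen test part. It is a minimal sketch using NumPy on synthetic numbers, not the project's code: the helper names `split_train_test` and `fit_line`, the 25% test fraction and the made-up hour/delay data are all assumptions chosen for illustration.

```python
import numpy as np

def split_train_test(x, y, test_fraction=0.25, seed=0):
    """Shuffle the labeled examples and split them into training and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(len(x) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    design = np.column_stack([np.ones_like(x), x])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs[0], coeffs[1]

# Synthetic labeled examples (made-up numbers, not BTS data):
# input feature = departure hour, output = delay in minutes.
rng = np.random.default_rng(42)
hour = rng.uniform(5, 23, size=200)
delay = 2.0 + 0.6 * hour + rng.normal(0, 1.0, size=200)

x_train, y_train, x_test, y_test = split_train_test(hour, delay)

# The "inferred function" is learned on the training data only...
b0, b1 = fit_line(x_train, y_train)

# ...and then used to map new (test) examples to predicted outputs.
predicted = b0 + b1 * x_test
test_mse = float(np.mean((predicted - y_test) ** 2))
```

Because the output here is a real value (minutes of delay), this is a regression problem; a classification variant would instead predict a category such as "delayed" or "on time".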

B. Regression Analysis Methods

The main focus of regression analysis is to model and determine the expected value of a dependent variable y in terms of the value of one or more independent variables x.
• Linear Regression
Linear regression is used to model a relationship between a dependent variable and a set of independent variables by fitting the best possible line. The best-fit line obtained is known as the regression line and is represented by the linear equation (1):

$$y = b_0 + b_1 x_1 \tag{1}$$

Logistic regression is often compared with linear regression, but its outcome (dependent variable) takes only a limited number of discrete values. Linear regression is the first best-suited method here because its outcome can take any one of an infinite range of possible values.

Fig.1: Overview and classification of Machine Learning

• Polynomial Regression
In practice, rather than performing a simple linear regression, we can improve the model by fitting a polynomial of order N (written n in equation (2)), because in many situations a linear regression model may not hold true, or, even if it does, its accuracy is reduced. Doing so requires defining the degree N that best represents the data. Polynomial regression therefore becomes the next best-suited method for the prediction analysis. It is represented by equation (2):

$$y = b_0 + b_1 x_1 + b_2 x_1^2 + \dots + b_n x_1^n \tag{2}$$

• Multiple Linear Regression
If a set of variables has a linear relationship with the dependent variable, the regression is known as multiple linear regression. It is represented by equation (3):

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \tag{3}$$

In all three equations (1), (2) and (3), b_0, b_1, …, b_n are the coefficients whose values must be determined in any model; x_1, …, x_n are the independent variables involved; and y is the dependent variable. Multiple regression is a wider class of regression that covers linear and nonlinear models with multiple explanatory variables. Because of the broad range of prediction possibilities it offers, multiple regression is used in some of the models to explain the dependent variable using more than one independent variable.

II. RELATED WORKS

Flight delay has become a common and complex phenomenon. It occurs due to problems at the origin airport, at the destination airport, ground reasons, or a combination of all these factors. Delays are also regarded as being caused by specific airlines. Even if the phenomenon is complex, it is still measurable with decent accuracy. With respect to the schedule and on-time performance of airlines, there generally exists some pattern of flight delay (Wu, 2005)[4]. The results obtained from this project, Airline Delay Predictions using Supervised Machine Learning, can help to better understand the phenomenon to a large extent.

In 2013, it was estimated that approx. 36% of flights were delayed by more than five minutes in Europe, 32% of flights were delayed by more than 15 minutes in the US, and 16% of flights were cancelled or suffered delays greater than 30-40 minutes in Brazil[1]. This indicates how important this indicator is, whatever the scale of the airline network.

Furthermore, coming to the Indian scenario, according to reports by the Directorate General of Civil Aviation (DGCA), between January and April 2017 close to 5.12 lakh domestic passengers in India faced issues due to airline companies denying boarding, as well as flight cancellations and delays [2]. Airline companies had to pay the passengers compensation of over Rs. 25 crore for various inconveniences during the first four months of this year. Hence, the prediction analysis from this project can serve as a prototype to help identify operational variables that contribute to delays in any country's scenario.

(Allan et al., 2001)[3] analysed delays at NYC airports from September '96 through August '00, with the aim of finding the major causes of delay that occurred during the first year of Integrated Terminal Weather System (ITWS) use, and the delays that occurred with ITWS in operation that were “avoidable” had weather conditions been better. The methodology considered some major causes of delay (for example, convective weather inside and outside the terminal area, and high winds) that were generally neglected in previous studies of capacity-constrained airports such as Newark International Airport (EWR). The research concluded that the usual methods of assessing delays only in terms of Instrument Meteorological Conditions (IMC), Visual Meteorological Conditions (VMC) and the respective airport capacities are far too simplified for determining the type of air traffic management investments that best reduce the “avoidable” delays.


(Hansen and Hsiao, 2005)[5] analysed the rise in flight delay in the United States domestic system by estimating an econometric model of average daily delay that combines the effects of arrival queuing, terminal weather conditions, seasonal effects, and secular effects (such as a half year). The results suggested that, even after controlling for all these factors, delays decreased gradually from 2000 through mid-2003, but the trend then reversed drastically.

(Rosen, 2002)[6] measured the rate of change in flight timings that resulted from infrastructure-constant changes in passenger demand. Results indicated that as the ratio of demand to fixed infrastructure increased, delays increased proportionately, and that average flight times decreased by approx. 7 minutes after the rapid decrease in the fall of 2001. The flight time differences between the airlines in the data sample were small, though United Airlines had lower average flight times in the winter quarter than America West, which is considered an even smaller airline.

Over the past couple of years, various analytical models and simulation methods have been used to analyze flight delay, including deterministic queuing models, neural networks, econometric models, etc. However, the analysis of delays has been carried out on either macroscopic or microscopic data over a period of a couple of days, because of the huge volume of flight data produced every day. Hence, the predictions led to less accurate results or to a relapse in the trend among the results. Here, the airline on-time performance data set is obtained from the U.S. DOT Bureau of Transportation Statistics (BTS) website, and linear and polynomial regression models used along with a regularization technique in machine learning are far better suited to identifying the delay pattern. In this project, airport delay and individual airline delay behavior are studied using a linear regression model, polynomial regression models, and regularization. The performance of the models is tested using various metrics, e.g., the CV method, MSE/RMSE scores, etc. This project meets several objectives, such as the statistical description of airlines, the temporal variability of delays, the relation of delays with the origin airports, and the geographical estimation of flights from each airport, along with the main prediction.

A. Overview of the Dataset

The dataset has been taken from a reliable government agency website that provides air traffic delay statistics for the United States. The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. BTS compiles daily data for the benefit of customers and data analysts. The dataset covers 2017 airline flight delays and cancellations.

TABLE I

Attributes                                 Descriptions of Attributes
YEAR, MONTH, DAY, DAY_OF_WEEK              dates of the flight
AIRLINES                                   IATA code identifying unique airlines
ORIGIN_AIRPORT and DESTINATION_AIRPORT     codes attributed by IATA to identify the airports
SCHEDULED_DEPARTURE and …                  scheduled times of take-off and landing
DEPARTURE_TIME and ARRIVAL_TIME            real times at which take-off and landing took place
DEPARTURE_DELAY and ARRIVAL_DELAY          difference (in minutes) between planned and real times
DISTANCE                                   distance (in miles)

Fig.2: All the airlines in the dataset associated with particular IATA carrier

B. Data Exploration

Data cleaning is the critical initial step in evaluating the dataset for final analysis. With the enormous amount of data available, databases are prone to noisy, missing and inconsistent data. The data in this project is obtained from


BTS source, which involves 31 variables of varying kinds, and may not be compatible with the format in which we need to use the data in Python. Data cleaning helps in removing noisy data and inconsistencies. Data cleaning is performed as follows:

Dates and Times:
The date is given across four variables; it is reduced to a single date-time format available in Python for ease of use.

Filling Factor:
In the data cleaning process, a missing value can be ignored, manually entered, or replaced by a constant value or a mean value. In this case, the entire data frame is organized to keep the relevant attributes and eliminate the ones that have missing values. This is done to increase readability and ease of use. The filling factor gives the percentage of entries of an attribute that are actually filled with data, from 0 to 100. Here, the retained attributes have a filling factor above 97%, which is quite satisfactory: fewer than 3% of their values are missing.

Further, we have established a statistical description of the airlines, which involves classifying airlines on the basis of their punctuality; it is done using various statistics.

Fig.3: (a) Pie chart with % of flights per company,
(b) Mean delays of airlines at origin airports

Fig. 3(a), the first pie chart, gives the percentage of flights per airline; it is evident that there is some disparity between the carriers. For example, Southwest Airlines (WN) accounts for the largest percentage of flights (~20%), equivalent to the number of flights operated by the 7 smallest airlines. However, in the second pie chart, Fig. 3(b), the differences among airlines are, on the contrary, less noticeable. Excluding Hawaiian Airlines and Alaska Airlines, which report extremely low mean delays, a value of ∼11±8 minutes would correctly represent all mean delays. This value is quite low, which means that the norm for every airline is to respect the schedule.

Fig. 4 gives a count of the delays of less than 5 min, those in the range 5 < t < 45 min, and finally delays greater than 45 minutes. We see that, independent of the airline, delays greater than 45 minutes account for only a few percent. However, the proportion of delays in these three groups depends on the airline: for example, in the case of SkyWest Airlines, delays greater than 45 minutes are only ∼30% less frequent than delays in the range 5 < t < 45 min. Things are better for Southwest Airlines, since delays greater than 45 minutes are 4 times less frequent than delays in the range 5 < t < 45 min.

Fig. 4: Comparative analysis of all the airlines with respect to their delays

Further, the normalized distributions of delays are modeled with an exponential distribution (Prabakaran, 2017):

$$F(x) = a \exp(-x/b) \tag{4}$$

The parameters a and b obtained to describe each airline are given in the upper right corner of each panel in Fig. 5(b). The normalization of the distribution implies that:

$$\int F(x)\,dx \sim 1 \tag{5}$$

The normalization here applies to the histogram, and this relation entails that the a and b coefficients will be correlated

with a ∝ 1/b (integrating equation (4) over x ≥ 0 gives ab), and hence only one of these two values is necessary to describe the distributions. Finally, according to the value of either a or b, a ranking of the airlines has been established: low values of a correspond to airlines with a large proportion of important delays and, on the contrary, airlines that stand out for their punctuality have high a values.

Fig.5: (a) Ranking of airlines based on delays
(b) Individual graphs of airlines demonstrating a & b parameters

Fig. 5(a) shows that Southwest Airlines, which represents ∼20% of the total number of flights, is ranked well and occupies the third position. According to this ranking, SkyWest Airlines is the worst carrier. Furthermore, the arrival delay has been examined: it differs from the departure delay and is found to be less pronounced. Hence, only the departure delay is considered.

C. Prediction of delays using regression

Observations across the origin airports and all the airlines show a high variability in average delays, both between different airports and between different airlines. This is significant because it implies that, in order to model the delays accurately, it is necessary to adopt a model that is specific to the company and the home airport.

After the exploration of the dataset, the final aim is to devise models for the prediction of delays. The prediction uses a three-week window to predict the delays for the following week.

Two models are developed for the prediction of delays:

Model 1: One Airport – One Airline
Here, delays are modeled by considering the airlines separately and by splitting the data according to the different home airports. This basic model is a "toy model" that helps to identify problems that may arise at the production stage. The automation of the whole process must be robust enough to ensure the quality of the fits when the whole data set is treated.
The pitfalls that may occur are insufficient statistics and extreme delays. Extreme delays are those where the delay recorded is extremely high (>10 hrs.), which may be due to an unforeseen or unpredictable circumstance (e.g. weather conditions, accidents, etc.). Such delays are rare and introduce a bias in the analysis. In conclusion, the way we handle delays impacts the modeling to a large extent.
In practice, the model provides a better fit with polynomial regression of order N. It is necessary to define the value of N that provides the best results, and while increasing N, over-fitting must be avoided: it happens when the model becomes so complex that it follows the local structure of the training data rather than the general trend. This is avoided by splitting the dataset into test and training sets. The technique is made more robust by cross-validation. This method consists of re-separating the data into test, training and validation sets. The learning is done on the training set, but to avoid over-learning the data is split into several pieces that are used alternately for training and testing. Cross-validation helps avoid bias in the estimated parameters because all of the data is used successively to drive the model.
The K-fold method helps in choosing the best polynomial degree. Fig. 6 shows that the best model (the one that generalizes best) is of order 2.

Fig. 6: Using the dataset, and applying K-fold method, the MSE values we get in Python 3.

At this stage, after confirming the order of the polynomial, as


it has been validated, the entire dataset is used to perform the fit. Fig. 7 compares the K=50 polynomial fits from the cross-validation calculation, shown as the orange curve, with the final model fit, shown as the blue line.

Fig. 7: Graph showing final fit and CV output obtained in Python 3.

The MSE (Mean Squared Error) value calculated for this model is 108.6713085. The MSE gives an idea of how close a regression fit line is to the original data points. It takes the distances of the fitted-line points from the data points (the “errors”), squares each of them, sums the squares, and finally takes the average. The RMSE value (square root of the MSE) obtained here is 10.42 min. It is the difference in minutes between the predicted delay and the actual delay, that is, between the model and the observations.

Model 2: One Airline-All Airports
In Model 1, only one airport was considered. This procedure is efficient only to some extent, because it is likely that some of the observations can be extrapolated from one airport to another. It is therefore advantageous to make a single fit that takes all the airports into account. In particular, this would allow better predictions of delays at airports for which little data is available.
Here, for testing, the carrier chosen is “AA”, that is, American Airlines, and in the data frame a label has been assigned to each airport. The correspondence between the label and the original identifier has been saved in a Python list. The next step incorporates the "One Hot Encoding" method. In machine learning, categorical data must be converted into numbers, for both input and output variables that are categorical. The method is applied here by replacing the ORIGIN_AIRPORT variable, which contained M labels, with a matrix of M columns filled with 1 and 0 depending on the correspondence with particular airports.

Linear regression is first performed on this model; extreme or large delays are underestimated and not taken into account, as explained. Fig. 10 depicts this.

Fig. 8: MSE value and quality % of linear regression on Model 2 obtained in Python

In practice, the quality of the fit is also assessed by considering the number of predictions whose difference from the data points (the real values) is greater than 15 minutes:

$$\frac{\text{No. of values} > 15\ \text{min}}{\text{No. of predictions (total)}} \times 100 \tag{6}$$

The value found here is 5.30%.

Further, polynomial regression is performed on the fit; Fig. 11 depicts it.

Fig. 9: MSE score and quality % of Polynomial fit in Model 2 obtained in Python

The MSE score found is 49.502543. The quality of the fit is again judged by the above formula, and is found to be 4.81%.

Fig 10: Linear fit on Model 2.
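The Model 2 steps described above (label each airport, one-hot encode the labels, fit a single linear model across all airports, then score it with the MSE and the percentage of equation (6)) can be sketched as follows. This is a minimal illustration with NumPy on made-up numbers, not the authors' code: the helper names `one_hot` and `pct_errors_above`, the choice of departure hour as the second feature, and the use of plain least squares without regularization are all assumptions.

```python
import numpy as np

def one_hot(labels, n_labels):
    """Build a matrix with one row per flight and one 0/1 column per airport label."""
    encoded = np.zeros((len(labels), n_labels))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def pct_errors_above(predicted, actual, threshold_min=15.0):
    """Equation (6): percentage of predictions that miss by more than the threshold."""
    misses = np.abs(np.asarray(predicted) - np.asarray(actual)) > threshold_min
    return 100.0 * float(np.mean(misses))

# Made-up sample: airport labels 0..2, departure hour, observed mean delay (min).
airport_label = np.array([0, 1, 2, 0, 1, 2, 0, 1])
departure_hour = np.array([6.0, 6.0, 6.0, 12.0, 12.0, 12.0, 18.0, 18.0])
mean_delay = np.array([4.0, 9.0, 2.0, 7.0, 12.0, 5.0, 10.0, 15.0])

# Feature matrix: M = 3 one-hot airport columns plus the departure hour.
features = np.column_stack([one_hot(airport_label, 3), departure_hour])

# Ordinary least squares; the one-hot columns act as per-airport intercepts.
coeffs, *_ = np.linalg.lstsq(features, mean_delay, rcond=None)
predicted = features @ coeffs

mse = float(np.mean((predicted - mean_delay) ** 2))
quality_pct = pct_errors_above(predicted, mean_delay)
```

The sample is constructed so that a linear model fits it exactly, which keeps the example easy to check by hand; real delay data is noisy, so the MSE and the percentage from equation (6) would be non-zero. Because the one-hot columns sum to one in every row, no separate intercept column is added.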


Fig 11: Polynomial fit on Model 2

Hence, it is evident that a polynomial fit improves the MSE score slightly, and is an efficient model.
The model is then tested against end-week data, using regularization to minimize the errors and over-fitting:

Fig.12: MSE score of the final testing and quality percent obtained in Python.

The current MSE score is calculated on the basis of all the airports served by American Airlines, whereas previously it was calculated on the data of a single airport. The current model is therefore more general and efficient.

IV. PERFORMANCE METRIC

Cross Validation Technique and K-Fold Technique
Cross-validation is a very important technique for assessing the performance of machine learning models. It tells us how a machine learning model would generalize to an independent data set.
The model dataset is divided into three sets: training, test, and validation. In the K-fold technique, one of the ways of performing cross-validation, the entire set is divided into K folds or subsets. K−1 folds are used for training, then the model's generalization is checked on the test set, which contains just the remaining fold; the process is repeated until the last fold. This method is used in the initial stages of both Model 1 and Model 2, for data splitting and increased efficiency.

The Mean Squared Error (MSE) is a measure of how close a fitted line is to the real data points. For every data point, we take the vertical distance from the real point to the corresponding y value on the fitted curve (the error) and square it. The squared errors of all data points are then summed and, in the case of a linear fit, the sum is divided by the total number of observations minus 2. The squaring prevents negative values from cancelling positive values. The quality of the model is assessed by the MSE score: the smaller the value, the closer the fit is to the real data and the more accurate the machine learning model.

$$\mathrm{MSE} = \frac{1}{n} \sum_{t=1}^{n} e_t^2 \tag{7}$$

e_t = error value (predicted value − real value)
n = total number of data points taken into account

MSE value for Model 1 is 108.6713085.
MSE value for Model 2, linear fit: shown in Fig. 8.
MSE value for Model 2, polynomial fit (shown in Fig. 12): 49.5025.

RMSE
Root Mean Squared Error (RMSE) is another quantity that we calculate to measure the accuracy of a model. It is equal to the square root of the mean squared error. It is considered one of the most easily interpreted statistics, as it has the same units as the quantity plotted on the ordinate (the y-axis).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^2} \tag{8}$$

e_t = error value (predicted value − real value)
n = total number of data points taken into account

The RMSE values are reported through a variable named Ecart for both models in Fig. 8 and Fig. 9; the lowest Ecart value shown is for Model 2, 4.81%.
As the MSE and RMSE values are lowest for the polynomial regression on Model 2, it gives the most accurate results, and its fitted line (the predicted results) is the closest to the real data points.
However, the final data set testing showed an Ecart value of 7.70 min. (Fig.12).
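The K-fold procedure and the two scores above can be sketched together as follows: each candidate polynomial degree is fitted on K−1 folds and scored by the MSE on the remaining fold, and the degree with the lowest average score is kept. This is a hedged sketch using NumPy on synthetic data, not the authors' implementation: the function names `mse`, `rmse` and `kfold_mse`, the choice of K = 5, the candidate degrees, and the made-up quadratic data are assumptions, and no regularization is applied.

```python
import numpy as np

def mse(predicted, actual):
    """Mean squared error: average of the squared errors e_t."""
    errors = np.asarray(predicted) - np.asarray(actual)
    return float(np.mean(errors ** 2))

def rmse(predicted, actual):
    """Root mean squared error, in the same units as the predicted quantity."""
    return mse(predicted, actual) ** 0.5

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average test MSE of a polynomial fit of the given degree over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on the fold that was held out.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        scores.append(mse(np.polyval(coeffs, x[test_idx]), y[test_idx]))
    return float(np.mean(scores))

# Synthetic data with a quadratic trend plus noise (made-up, not BTS data).
rng = np.random.default_rng(1)
hour = np.linspace(0.0, 10.0, 120)
delay = 5.0 + 2.0 * hour - 0.3 * hour ** 2 + rng.normal(0.0, 1.0, size=hour.size)

# Compare candidate polynomial degrees by their cross-validated MSE.
cv_scores = {degree: kfold_mse(hour, delay, degree) for degree in (1, 2, 3)}
best_degree = min(cv_scores, key=cv_scores.get)
```

Comparing the cross-validated scores rather than the training error is what guards against over-fitting here: a higher degree always lowers the training error, but not necessarily the error on the held-out fold.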


This project and its analysis are useful not only from the passengers' point of view, but for every decision maker in the aviation industry. Apart from the financial losses incurred by the industry, flight delays also harm the reputation of airlines and decrease their reliability. They cause various sustainability issues, for example increased fuel consumption and gas emissions. The analysis carried out here not only predicts delays based on previously available data, but also gives a statistical description of airlines, their rankings based on on-time performance, and delays with respect to time, showing the peak hours of delay. This project can be used as a prototype by any aviation authority; in the Indian scenario too, it can work as an efficient model or prototype for delay analysis based on the real dataset provided. This project has shown the importance of regression analysis in machine learning, of data mining concepts for efficient data cleaning, and of the cross-validation technique and regularization in ML for building proper models and their predictive analysis.

[1] ANAC. Agência Nacional de Aviação Civil. Technical report, 2017.
[2] Indian Economic Times
[3] Allan, S.S., S.G. Gaddy, and J.E. Evans (2001), Delay Causality and
Reduction at the New York City Airports Using Terminal Weather
Information, MIT, Lexington, Massachusetts.
[4] Wu, C. (2005), Inherent delays and operational reliability of airline
schedules, Journal of Air Transport Management Volume 11, Issue
[5] Hansen, M., and C. Y. Hsiao (2005), Going South? An Econometric
Analysis of US Airline Flight Delays from 2000 to 2004, Presented at
the 84th Annual Meeting of the Transportation Research Board
(TRB), Washington D.C.’05.
[6] Rosen, A. (2002), Flight Delays on US Airlines: The Impact of
Congestion Externalities in Hub and Spoke Networks, Department of
Economics, Stanford University
[7] Programming in Python 3: A Complete Introduction to the Python
Language , By Mark Summerfield.
[8] Prabakaran, N., and R. JagadeeshKannan. "Sustainable life-span of
WSN nodes using participatory devices in pervasive environment."
Microsystem Technologies 23.3 (2017): 651-657.

