
Shubhadeep Mukherjee

• Machine Learning refers to the techniques involved in dealing with vast data in the most intelligent fashion (by developing algorithms) to derive actionable insights.

• There are some basic common threads, however, and the overarching theme is best summed up by this oft-quoted statement made by Arthur Samuel way back in 1959: “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.”
1. Collecting data: Whether it is raw data from Excel, Access, text files, etc., this step (gathering past data) forms the foundation of future learning. The better the variety, density and volume of relevant data, the better the learning prospects for the machine become. A minimal sketch of this step follows below.
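A minimal sketch of gathering raw data with pandas; the file names and the customer_id join key are illustrative assumptions, not part of the original text:

import pandas as pd

# Gather raw data from typical sources (file names are hypothetical)
sales = pd.read_csv("sales_history.csv")       # plain text / CSV file
customers = pd.read_excel("customers.xlsx")    # Excel workbook

# Combine the sources into one dataset for the later learning steps
data = sales.merge(customers, on="customer_id", how="left")
print(data.shape)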

2. Preparing the data: Any analytical process thrives on the quality of the data used. One needs to spend time determining the quality of the data and then taking steps to fix issues such as missing data and the treatment of outliers. Exploratory analysis is one method to study the nuances of the data in detail, thereby enriching its information content; see the sketch below.
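A minimal preparation sketch with pandas; the tiny dataset, the income column and the clipping percentiles are illustrative assumptions:

import pandas as pd

# A tiny illustrative dataset standing in for the collected data
data = pd.DataFrame({"income": [42000, None, 39000, 51000, 980000, 45000]})

print(data.isna().sum())    # exploratory check: missing values per column
print(data.describe())      # exploratory check: basic distribution summary

# Fix missing data: fill gaps with the column median
data["income"] = data["income"].fillna(data["income"].median())

# Treat outliers: clip values outside the 5th-95th percentile range
low, high = data["income"].quantile([0.05, 0.95])
data["income"] = data["income"].clip(lower=low, upper=high)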
3. Training a model: This step involves choosing the appropriate algorithm and a representation of the data in the form of a model. The cleaned data is split into two parts – train and test (the proportion depending on the prerequisites); the first part (training data) is used for developing the model, while the second part (test data) is held back as a reference, as sketched below.
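A minimal sketch of this step with scikit-learn; the synthetic dataset and the choice of a decision tree are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data; X = features, y = target labels
X, y = make_classification(n_samples=500, random_state=42)

# Split into train and test parts (a 70/30 proportion is one common choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Choose an algorithm and develop the model on the training part only
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)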

4. Evaluating the model: To test the accuracy, the second part of the data (holdout / test data) is used. This step determines the precision of the choice of algorithm based on the outcome. A better test of a model's accuracy is to check its performance on data that was not used at all during model building, as sketched below.
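Continuing the sketch from step 3 (same assumed names), the model is scored on the held-out test data that played no part in the model build:

from sklearn.metrics import accuracy_score

# Score the model on data it never saw during training
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))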
5. Improving the performance: This step might involve choosing a different model altogether or introducing more variables to augment the efficiency. That is why a significant amount of time needs to be spent on data collection and preparation. One common tactic is sketched below.
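Continuing the same illustrative sketch, one way to improve performance is to try a different model and tune its hyper-parameters; the grid values here are arbitrary assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try a different model and search over a few hyper-parameter settings
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5)
search.fit(X_train, y_train)
print("Best settings found:", search.best_params_)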
• A predictive model, as the name suggests, is used to predict a future outcome based on historical data. Predictive models are normally given clear instructions right from the beginning as to what needs to be learnt and how it needs to be learnt. This class of learning algorithms is termed Supervised Learning.

• For example: Supervised Learning is used when a marketing company is trying to find out which customers are likely to churn. We can also use it to predict the likelihood of occurrence of perils like earthquakes, tornadoes, etc. with an aim to determine the Total Insurance Value. Some examples of algorithms used are: Nearest Neighbour, Naïve Bayes, Decision Trees, Regression, etc. A minimal churn sketch follows.
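A minimal churn-prediction sketch using Naïve Bayes (one of the algorithms listed above); the synthetic customer data is an illustrative assumption:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for customer data: X = customer attributes
# (e.g. usage, tenure), y = 1 if the customer churned, 0 otherwise
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GaussianNB()                  # Naive Bayes classifier
clf.fit(X_train, y_train)           # learn from labelled historical customers
churn_prob = clf.predict_proba(X_test)[:, 1]   # likelihood of churn per customer
print(churn_prob[:5])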
• Unsupervised Learning is used to train descriptive models where no target is set and no single feature is more important than another. A typical case of unsupervised learning: a retailer wishes to find out which combinations of products customers tend to buy together most frequently. Similarly, in the pharmaceutical industry, unsupervised learning may be used to find which diseases are likely to occur along with diabetes. An example of an algorithm used here is the K-means Clustering Algorithm, sketched below.
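A minimal K-means sketch; the synthetic customer profiles and the cluster count are illustrative assumptions:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for customer purchase profiles; no target labels exist
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Group the customers into clusters of similar buying behaviour
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # cluster index assigned to each customer
print(labels[:10])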
• Reinforcement Learning is a kind of machine learning where the machine is trained to take specific decisions based on the business requirement, with the sole aim of maximizing efficiency (performance). The idea behind reinforcement learning is that the machine/software agent trains itself on a continual basis based on the environment it is exposed to, and applies its enriched knowledge to solve business problems. This continual learning process requires less involvement of human expertise, which in turn saves a lot of time!

• The framework underlying RL is the Markov Decision Process.
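Since the text names only the framework, here is a minimal sketch of Q-learning, a standard algorithm for solving small Markov Decision Processes; the toy corridor environment and all constants are illustrative assumptions:

import numpy as np

# Toy MDP: a 5-state corridor; the agent moves left/right and is rewarded
# only on reaching the last state
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        # explore occasionally, otherwise act greedily on current knowledge
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: learn from reward plus estimated future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # greedy action per state (1 = right) after training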


Some commonly used algorithms and techniques:

• Linear Regression
• Logistic Regression
• Decision Tree
• Random Forest
• Clustering
• Association Rule (Market Basket Analysis)
• Correlation Analysis
 Banking & Financial services: ML can be used to predict the customers who are likely to
default from paying loans or credit card bills. This is of paramount importance as machine
learning would help the banks to identify the customers who can be granted loans and
credit cards.

 Healthcare: It is used to diagnose deadly diseases (e.g. cancer) based on the symptoms of
patients and tallying them with the past data of similar kind of patients.

 Retail: It is used to identify products which sell more frequently (fast moving) and the slow
moving products which help the retailers to decide what kind of products to introduce or
remove from the shelf. Also, machine learning algorithms can be used to find which two /
three or more products sell together. This is done to design customer loyalty initiatives
which in turn helps the retailers to develop and maintain loyal customers.

 Chatbots: ML based chatbots are used in various domains in B2B and B2C setting
• From the perspective of inductive learning, we are given input samples (x) and output samples (f(x)), and the problem is to estimate the function f. Specifically, the problem is to generalize from the samples so that the mapping is useful for estimating the output for new samples in the future.

• In practice it is almost always too hard to estimate the function exactly, so we look for very good approximations of the function, as in the sketch below.
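A minimal sketch of induction: we only observe samples (x, f(x)) of an unknown function and fit an approximation we can apply to new inputs. The linear "unknown" function and the noise level are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
fx = 3 * x[:, 0] + 2 + rng.normal(0, 0.5, size=100)   # the "unknown" f, plus noise

model = LinearRegression().fit(x, fx)   # generalize from the samples
print(model.predict([[4.0]]))           # estimate f(x) for a new, unseen x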

Some practical examples of induction are:

• Credit risk assessment.
  • The x is the properties of the customer.
  • The f(x) is credit approved or not.

• Disease diagnosis.
  • The x is the properties of the patient.
  • The f(x) is the disease they suffer from.

• Face recognition.
  • The x is bitmaps of people's faces.
  • The f(x) is the name to assign to the face.

• Automatic steering.
  • The x is bitmap images from a camera in front of the car.
  • The f(x) is the degree the steering wheel should be turned.
Common metrics for evaluating model performance:

• F measure
• Precision
• Recall
• Accuracy
• RMSE (Root Mean Square Error)
• MAE (Mean Absolute Error)
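A quick sketch of the four classification metrics with scikit-learn; the labels are made up purely for illustration (the two regression metrics are worked through after the table below):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Illustrative labels for a binary classifier (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F measure:", f1_score(y_true, y_pred))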
• RMSE (ranges from 0 to infinity, lower is better), also called Root Mean Square Deviation (RMSD), is a quadratic scoring rule that measures the average magnitude of the error. Technically it is produced by taking the residuals (the differences between the regression model's predictions and the actual data), squaring them, averaging all the results, and then taking the square root of the average. Because of this, the result will always be a positive number.
• MAE (ranges from 0 to infinity, lower is better) is much like RMSE, but instead of squaring the residuals and taking the square root of the averaged result, it simply averages the absolute values of the residuals. This also produces only positive numbers; it is less reactive to large errors but can show nuance a bit better. It has also fallen out of favor over time.
For example, given these actual and predicted values:

Actual   Predicted
   175         177
   146         145
    14          23
   290         311
    34          47
    12           9
   654         647
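Both metrics can be computed directly from the table above; a short NumPy sketch:

import numpy as np

actual    = np.array([175, 146, 14, 290, 34, 12, 654])
predicted = np.array([177, 145, 23, 311, 47, 9, 647])

residuals = actual - predicted
rmse = np.sqrt(np.mean(residuals ** 2))   # square, average, then square root
mae = np.mean(np.abs(residuals))          # average of the absolute differences

print(f"RMSE = {rmse:.2f}")   # ≈ 10.38
print(f"MAE  = {mae:.2f}")    # = 8.00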
• Training Dataset: The sample of data used to fit the model.

• Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyper-parameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

• Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
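A minimal sketch of carving one dataset into the three samples; the synthetic data and the 60/20/20 proportions are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First carve off the training set, then split the rest into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200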
• Cross-validation is a statistical method used to estimate the skill of machine learning models.

• Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, it uses a limited sample to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

• It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
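A minimal k-fold cross-validation sketch with scikit-learn; the synthetic dataset, the decision tree and the choice of 5 folds are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# 5-fold cross-validation: each fold serves once as the held-out data,
# so every sample contributes to the skill estimate
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean skill estimate:", scores.mean())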
