Machine Learning refers to the techniques involved in dealing with vast amounts of data in the
most intelligent fashion (by developing algorithms) to derive actionable insights.
There are some basic common threads, however, and the overarching theme is best
summed up by this oft-quoted statement made by Arthur Samuel way back in
1959: “[Machine Learning is the] field of study that gives computers the ability to learn
without being explicitly programmed.”
1. Collecting data: Be it raw data from Excel, Access, text files, etc., this step
(gathering past data) forms the foundation of future learning. The better the
variety, density and volume of relevant data, the better the machine's learning
prospects become.
2. Preparing the data: Any analytical process thrives on the quality of the data used.
One needs to spend time assessing the quality of the data and then taking steps to
fix issues such as missing values and outliers. Exploratory analysis is one way to
study the nuances of the data in detail and thereby enrich its informational
content.
3. Training a model: This step involves choosing an appropriate algorithm and a
representation of the data in the form of a model. The cleaned data is split into two
parts, train and test (the proportion depending on the requirements); the first part
(training data) is used to develop the model, while the second part (test data) is
held back as a reference for evaluation.
4. Evaluating the model: To test the accuracy, the second part of the data (holdout /
test data) is used. This step determines the precision of the choice of algorithm
based on the outcome. A better test of a model's accuracy is its performance on
data that was not used at all during model building.
5. Improving the performance: This step might involve choosing a different model
altogether or introducing more variables to improve the results. That is why a
significant amount of time needs to be spent on data collection and preparation.
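The five steps above can be sketched end to end in code. A minimal illustration, assuming scikit-learn is available; the synthetic dataset, the linear-regression model and the 70/30 split are illustrative choices, not prescriptions:

```python
# End-to-end sketch: prepare data, split it, train a model,
# then evaluate it on the held-out test part.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Steps 1-2: "collect" and prepare a small synthetic dataset.
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# Step 3: split into train and test parts (70/30 here; the proportion is a choice)
# and develop the model on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Step 4: evaluate on the held-out test part.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {mae:.2f}")
```

Step 5 would then mean trying a different model, more features, or better data if the test error is unsatisfactory.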
A predictive model, as the name suggests, is used to predict a future outcome based
on historical data. Predictive models are normally given clear instructions right
from the beginning as to what needs to be learnt and how it needs to be learnt.
This class of learning algorithms is termed Supervised Learning.
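In supervised learning, the "clear instructions" are labelled examples: each input comes paired with the output the model should learn to reproduce. A minimal sketch, assuming scikit-learn; the tiny made-up dataset and the k-nearest-neighbours classifier are illustrative choices only:

```python
from sklearn.neighbors import KNeighborsClassifier

# Labelled training data: inputs (here [height_cm, weight_kg]) paired
# with the outputs the model should learn to reproduce.
X = [[25, 4], [30, 5], [24, 4], [60, 25], [55, 22], [65, 30]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# The trained model predicts labels for unseen inputs.
print(model.predict([[28, 5], [58, 24]]))  # expected: cat, dog
```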
Healthcare: It is used to diagnose deadly diseases (e.g. cancer) based on patients'
symptoms, matching them against past data from similar patients.
Retail: It is used to identify products that sell frequently (fast-moving) and slow-moving
products, which helps retailers decide what kind of products to introduce or
remove from the shelf. Machine learning algorithms can also be used to find which two,
three or more products sell together. This is used to design customer loyalty initiatives,
which in turn help retailers develop and maintain loyal customers.
Chatbots: ML-based chatbots are used in various domains in B2B and B2C settings.
From the perspective of inductive learning, we are given input samples (x) and output samples (f(x)), and the
problem is to estimate the function (f). Specifically, the problem is to generalize from the samples and the
mapping so that it is useful for estimating the output for new samples in the future.
In practice it is almost always too hard to estimate the function, so we are looking for very good
approximations of the function.
Disease diagnosis.
The x are the properties of the patient.
The f(x) is the disease they suffer from.
Face recognition.
The x are bitmaps of people's faces.
The f(x) is to assign a name to the face.
Automatic steering.
The x are bitmap images from a camera in front of the car.
The f(x) is the degree the steering wheel should be turned.
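Estimating f from (x, f(x)) samples can be illustrated with a one-dimensional fit. A minimal sketch using NumPy; the hidden "true" function and the choice of a linear model are assumptions made purely for the illustration:

```python
import numpy as np

# Samples of an unknown function f (here secretly f(x) = 2x + 1 plus noise).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
fx = 2 * x + 1 + rng.normal(0, 0.5, size=x.shape)

# Estimate f with a degree-1 polynomial (a linear approximation).
slope, intercept = np.polyfit(x, fx, deg=1)

# The fitted coefficients approximate the true ones (2 and 1), and the
# approximation generalizes to new inputs the model never saw, e.g. x = 20.
estimate_at_20 = slope * 20 + intercept
print(slope, intercept, estimate_at_20)
```

As the text notes, we rarely recover f exactly; the fitted line is a good approximation whose quality depends on the noise and the number of samples.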
F-measure
Precision
Recall
Accuracy
RMSE (Root Mean Square Error)
MAE (Mean Absolute Error)
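The first four metrics apply to classification and can all be derived from counts of true/false positives and negatives. A minimal from-scratch sketch; the actual and predicted labels are invented for illustration:

```python
# Compute accuracy, precision, recall and F-measure from raw counts.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives

accuracy  = (tp + tn) / len(actual)   # fraction of all predictions that are right
precision = tp / (tp + fp)            # of predicted positives, how many are real
recall    = tp / (tp + fn)            # of real positives, how many were found
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
print(accuracy, precision, recall, f_measure)
```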
RMSE (ranges from 0 to infinity, lower is better), also called Root Mean Square
Deviation (RMSD), is a quadratic-based rule to measure the average magnitude of
the error. Technically it is produced by taking the residuals (the differences
between the model's predictions and the actual data), squaring them, averaging all
the results and then taking the square root of the average. Because of this the
result will always be a positive number.
MAE (ranges from 0 to infinity, lower is better) is much like RMSE, but instead of
squaring the residuals and taking the square root of the result, it just averages
the absolute values of the residuals. This also produces only positive numbers; it
is less reactive to large errors but can show nuance a bit better. It has also
somewhat fallen out of favor over time.
Actual   Predicted
175      177
146      145
14       23
290      311
34       47
12       9
654      647
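The actual/predicted pairs above are enough to compute both metrics by hand. A minimal sketch in plain Python following the definitions just given:

```python
import math

# Actual vs. predicted values from the table above.
actual    = [175, 146, 14, 290, 34, 12, 654]
predicted = [177, 145, 23, 311, 47, 9, 647]

residuals = [a - p for a, p in zip(actual, predicted)]

# MAE: average the absolute values of the residuals.
mae = sum(abs(r) for r in residuals) / len(residuals)

# RMSE: square the residuals, average, then take the square root.
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")  # MAE = 8.00, RMSE = 10.38
```

Note that RMSE exceeds MAE here because squaring weights the large residuals (e.g. 290 vs. 311) more heavily.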
Training Dataset: The sample of data used to fit the model.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final
model fit on the training dataset.
Cross-validation is a statistical method used to estimate the skill of machine
learning models.
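Cross-validation repeatedly re-splits the data so that every sample serves in a test fold exactly once. A minimal sketch, assuming scikit-learn; the iris dataset and logistic-regression model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds, and the model is
# trained 5 times, each time holding out a different fold as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# The mean of the per-fold scores estimates the model's skill more robustly
# than a single train/test split would.
print(scores, scores.mean())
```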