
My questions for interviewer

Whose job is it to make sure you have data?


Who gets fired if all your insights aren’t used for anything?
Who picks the tools you use and makes sure they play nice with all the other infrastructure?
How often will my work be evaluated, and by whom?

Introduction – how to become a better data scientist


To be above the noise
1. Replicate papers. This is especially true if you’re a deep learning buff. People don’t do this
because it’s harder than grabbing a dataset and using a simple ANN or XGBoost to do
cookie-cutter classification. Find the most interesting paper (ideally a relatively recent
one) relevant to your field on the arXiv, and read it. Understand it. Then, replicate it,
potentially on a new dataset. Write a blog post about it.
2. Don’t get comfortable in your comfort zone. If you start a new project, it had better be to
learn some new frameworks/libraries/tools. If you’re building your 6th Jupyter notebook
that starts with df = pd.read_csv(filename) and ends with f1 =
f1_score(y_true, y_pred) , it’s time to change your strategy.

3. Learn boring things. Other people aren’t doing this because no one likes boring things. But
learning a proper Git flow, how to use Docker, how to build an app using Flask, and how to
deploy models on AWS or Google Cloud, are skills that companies desperately want
applicants to have, but that are under-appreciated by a solid majority of applicants.
4. Do annoying things. 1) Offer to present a paper at a local data science meetup. Or, at the
very least, attend the local data science meetup. 2) Send cold messages to people on
LinkedIn. Try to offer value upfront (“I just noticed a typo on your website”). DO NOT ASK
THEM FOR A JOB RIGHT AWAY. Make your ask as specific as possible (“I’d love to get your
feedback on my blog post”). You’re trying to build a relationship and expand your
network, and that takes patience. 3) Attend conferences and network. 4) Start a study
group.
5. Do things that seem crazy. Everyone goes to the UCI repository, or uses some stock
dataset (yawn) to build their project. Don’t do that. Learn how to use a web scraping
library, or some under-appreciated API to build your own, custom dataset. Data is hard to
come by, and companies often need to rely on their engineers to get it for them. Your
goal should be to come across as the kind of data science-obsessed lunatic who will build
your own goddamn dataset if that’s what it takes to get the job done.
Python-for-data-science skills
To force yourself to improve your data science theory and implementation game, use these in a few
projects, if you haven’t already:

 Data exploration. You should have pandas functions like .corr(),


scatter_matrix() , .hist() and .bar() on the tip of your tongue. You should always be
looking for opportunities to visualize your data using PCA or t-SNE, using sklearn's PCA and
TSNE functions.
 Feature selection. 90% of the time, your dataset will have way more features than you need
(which leads to excessive training time, and a heightened risk of overfitting). Get familiar with
basic filter methods (look up scikit-learn’s VarianceThreshold and SelectKBest
functions), and more sophisticated model-based feature selection methods (look up
SelectFromModel).
 Hyperparameter search for model optimization. You definitely should know what
GridSearchCV does and how it works. Likewise for RandomizedSearchCV. To really stand out,
try experimenting with skopt's BayesSearchCV to learn how you can apply Bayesian
optimization to your hyperparameter search.
 Pipelines. Use sklearn's pipeline library to wrap their preprocessing, feature selection and
modeling steps together (see the sketch just after this list). Discomfort with pipeline is a huge tell that a data scientist needs
to get more familiar with their modeling toolkit.
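To make these points concrete, here is a minimal sketch, assuming sklearn's built-in breast cancer dataset and an arbitrary parameter grid (both are illustrative choices, not requirements), that wires scaling, filter-based feature selection and a classifier into one Pipeline tuned with GridSearchCV:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),          # preprocessing
    ("select", SelectKBest(f_classif)),   # filter-based feature selection
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "select__k": [5, 10, 20],   # illustrative values
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))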

probability and statistics knowledge


 Bayes’s theorem. It’s a foundational pillar of probability theory, and it comes up all the time in
interviews. You should practice doing some basic Bayes theorem whiteboarding problems,
and read the first chapter of this famous book to get a rock-solid understanding of the origin
and meaning of the rule (bonus: it’s actually a fun read!).
 Basic probability. You should be able to answer questions like these.
 Model evaluation. In classification problems, for example, most n00bs default to using model
accuracy as their metric, which is usually a terrible choice. Get comfortable with sklearn's
precision_score, recall_score, f1_score , and roc_auc_score functions, and the
theory behind them. For regression tasks, understanding why you would use
mean_squared_error rather than mean_absolute_error (and vice-versa) is also crucial.
It’s really worth taking the time to check out all the model evaluation metrics listed in
sklearn's official documentation (a short sketch of the metrics named here follows this list).
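For reference, a hedged sketch of the classification and regression metrics mentioned above; the y_true, y_pred and y_score arrays are made-up toy values:

from sklearn.metrics import (f1_score, mean_absolute_error, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)

# Toy binary classification outputs (illustrative values only).
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.9, 0.4, 0.2, 0.8, 0.6, 0.7, 0.95]  # predicted probabilities

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # ranking quality across all thresholds

# Toy regression outputs: MSE punishes large errors more heavily than MAE does.
r_true = [3.0, 5.0, 2.0, 7.0]
r_pred = [2.5, 5.0, 4.0, 8.0]
print(mean_squared_error(r_true, r_pred))
print(mean_absolute_error(r_true, r_pred))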

software engineering know-how


 Version control. You should know how to use git , and interact with your remote GitHub
repos using the command line. If you don’t, I suggest starting with this tutorial.
 Web development. Some companies like their data scientists to be comfortable accessing data
that’s stored on their web app, or via an API. Getting comfortable with the basics of web
development is important, and the best way to do that is to learn a bit of Flask.
 Web scraping. Sort of related to web development: sometimes, you’ll need to automate data
collection by scraping data from live websites. Two great tools to consider for this are
BeautifulSoup and scrapy.
 Clean code. Learn how to use docstrings. Don’t overuse inline comments. Break your functions
up into smaller functions. Way smaller. There shouldn’t be functions in your code longer than
10 lines of code. Give your functions good, descriptive names ( function_1 is not a good
name). Follow pythonic convention and name your variables with underscores like_this
and not LikeThis or likeThis . Don’t write python modules ( .py files) with more than
400 lines of code. Each module should have a clear purpose (e.g. data_processing.py,
predict.py). Learn what an if __name__ == '__main__': code block does and why it’s
important. Use list comprehensions. Don’t over-use for loops. Add a README file to your
project (a short sketch of these conventions follows below).
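A short sketch of what these conventions can look like in practice; the module layout, function names and prices.csv input file are hypothetical:

"""Example module layout: small, well-named functions plus a main guard."""

def load_prices(csv_path):
    """Read raw price data from a one-column CSV file into a list of floats."""
    with open(csv_path) as handle:
        return [float(line.strip()) for line in handle if line.strip()]

def average_price(prices):
    """Return the arithmetic mean of a non-empty list of prices."""
    return sum(prices) / len(prices)

def main():
    prices = load_prices("prices.csv")  # hypothetical input file
    print(f"Average price: {average_price(prices):.2f}")

if __name__ == "__main__":
    # Only runs when the file is executed directly, not when it is imported.
    main()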
Dimensionality reduction
1. What are dimensions?
Overloaded term having any of the following definitions:

 The number of levels of coordinates in a Tensor. For example:

o A scalar has zero dimensions; for example, ["Hello"].

o A vector has one dimension; for example, [3, 5, 7, 11].

o A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]].

You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need
two coordinates to uniquely specify a particular cell in a two-dimensional matrix.

 The number of entries in a feature vector.

 The number of elements in an embedding layer.

1. What is dimension reduction?


Decreasing the number of dimensions used to represent a particular feature in a feature vector,
typically by converting to an embedding.

1. Explain dimensionality reduction, where it’s used, and its benefits?


Dimensionality reduction is the process of reducing the number of feature variables under
consideration by obtaining a set of principal variables, which are essentially the important features.
How important a feature is depends on how much it contributes to the information representation of
the data, and on which technique you decide to use. Deciding which technique to use comes down to
trial-and-error and preference. It’s common to start with a linear technique and move to non-linear
techniques when results suggest an inadequate fit. Benefits of dimensionality reduction for a data set may be:

(1) Reduced storage space.

(2) Faster computation (for example in machine learning algorithms): fewer dimensions mean less computing, and fewer dimensions can also allow the use of algorithms unfit for a large number of dimensions.

(3) Removal of redundant features; for example, there is no point in storing a terrain’s size in both square meters and square miles (maybe the data gathering was flawed).

(4) Reducing the data’s dimension to 2D or 3D may allow us to plot and visualize it, perhaps observe patterns, and gain insights.

(5) Too many features or too complex a model can lead to overfitting.

2. What is the curse of dimensionality?


Prasad Pore answers:

"As the number of features or dimensions grows, the amount of data we need to generalize accurately
grows exponentially."

- Charles Isbell, Professor and Senior Associate Dean, School of Interactive Computing, Georgia Tech

Let’s take the example below. Fig. 1 (a) shows 10 data points in one dimension, i.e. there is only one
feature in the data set. They can easily be represented on a line with only 10 values, x = 1, 2, 3, ..., 10.

But if we add one more feature, the same data will be represented in 2 dimensions (Fig. 1 (b)), causing
the dimension space to increase to 10*10 = 100. And if we add a 3rd feature, the dimension space will
increase to 10*10*10 = 1000. As the number of dimensions grows, the dimension space increases exponentially.
10^1 = 10

10^2 = 100

10^3 = 1000 and so on...

This exponential growth in data causes high sparsity in the data set and unnecessarily increases
storage space and processing time for the modelling algorithm. Think of an image recognition
problem with high-resolution images: 1280 × 720 = 921,600 pixels, i.e. 921,600 dimensions. OMG. And
that’s why it’s called the Curse of Dimensionality. The value added by each additional dimension is much
smaller compared to the overhead it adds to the algorithm.

The bottom line is that data which can be represented using 10 space units in its one true dimension needs
1000 space units after adding 2 more dimensions, just because we observed those dimensions during
the experiment. The true dimension means the dimension which accurately generalizes the data, and
observed dimensions mean whatever other dimensions we consider in the dataset, which may or may not
contribute to accurately generalizing the data.

2. How do you combat the curse of dimensionality?


3. What is the advantage of performing dimensionality reduction before fitting an
SVM?
Support Vector Machine Learning Algorithm performs better in the reduced space. It is beneficial to
perform dimensionality reduction before fitting an SVM if the number of features is large when
compared to the number of observations.

4. Principal Component Analysis (PCA)


Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated variables (entities each of which takes on various
numerical values) into a set of values of linearly uncorrelated variables called principal components. If
there are n observations with p variables, then the number of distinct principal components is min(n − 1, p).
This transformation is defined in such a way that the first principal component has the largest possible
variance (that is, accounts for as much of the variability in the data as possible), and each succeeding
component in turn has the highest variance possible under the constraint that it is orthogonal to the
preceding components. The resulting vectors (each being a linear combination of the variables and
containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative
scaling of the original variables.

PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is often
used to visualize genetic distance and relatedness between populations. PCA can be done by
eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition
of a data matrix, usually after a normalization step of the initial data. The normalization of each
attribute consists of mean centering – subtracting each data value from its variable's measured mean
so that its empirical mean (average) is zero – and, possibly, normalizing each variable's variance to
make it equal to 1; see Z-scores.[4] The results of a PCA are usually discussed in terms of component
scores, sometimes called factor scores (the transformed variable values corresponding to a particular
data point), and loadings (the weight by which each standardized original variable should be multiplied
to get the component score).[5] If component scores are standardized to unit variance, loadings must
contain the data variance in them (and that is the magnitude of eigenvalues). If component scores are
not standardized (therefore they contain the data variance) then loadings must be unit-scaled,
("normalized") and these weights are called eigenvectors; they are the cosines of orthogonal rotation
of variables into principal components or back.

PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid
represents a principal component. If some axis of the ellipsoid is small, then the variance along that
axis is also small, and by omitting that axis and its corresponding principal component from our
representation of the dataset, we lose only a commensurately small amount of information.

To find the axes of the ellipsoid, we must first subtract the mean of each variable from the dataset to
center the data around the origin. Then, we compute the covariance matrix of the data and calculate
the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize
each of the orthogonal eigenvectors to become unit vectors. Once this is done, each of the mutually
orthogonal, unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This
choice of basis will transform our covariance matrix into a diagonalised form with the diagonal
elements representing the variance of each axis. The proportion of the variance that each eigenvector
represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum
of all eigenvalues.

This procedure is sensitive to the scaling of the data, and there is no consensus as to how to best scale
the data to obtain optimal results.
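A minimal PCA sketch with sklearn, assuming the iris dataset and two components purely for illustration; note the standardization step, since PCA is sensitive to scale:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to scale, so standardize (mean 0, variance 1) first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
print(X_reduced.shape)  # (150, 2)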
ICA
Partial Least Squares Regression (PLSR)
Sammon Mapping
Multidimensional Scaling (MDS)
Projection Pursuit
Principal Component Regression (PCR)
Partial Least Squares Discriminant Analysis
Mixture Discriminant Analysis (MDA)
Quadratic Discriminant Analysis (QDA)
Regularized Discriminant Analysis (RDA)
Flexible Discriminant Analysis (FDA)
Linear Discriminant Analysis (LDA)

Classification
1.1 What is classification model?
A type of machine learning model for distinguishing among two or more discrete classes. For example,
a natural language processing classification model could determine whether an input sentence was in
French, Spanish, or Italian. Compare with regression model.

1.2 What is class? Negative class and Positive class?


One of a set of enumerated target values for a label. For example, in a binary classification model that
detects spam, the two classes are spam and not spam. In a multi-class classification model that
identifies dog breeds, the classes would be poodle, beagle, pug, and so on.

Multi-class classification (multinomial classification)

Classification problems that distinguish among more than two classes. For example, there are
approximately 128 species of maple trees, so a model that categorized maple tree species would be
multi-class. Conversely, a model that divided emails into only two categories (spam and not spam)
would be a binary classification model.

Negative class: in binary classification, one class is termed positive and the other is termed negative.
The positive class is the thing we're looking for and the negative class is the other possibility. For
example, the negative class in a medical test might be "not tumor." The negative class in an email
classifier might be "not spam." See also positive class.

Positive class: in binary classification, the two possible classes are labeled as positive and negative. The
positive outcome is the thing we're testing for. (Admittedly, we're simultaneously testing for both
outcomes, but play along.) For example, the positive class in a medical test might be "tumor." The
positive class in an email classifier might be "spam."

1.3 What is binary classification?


A type of classification task that outputs one of two mutually exclusive classes. For example, a
machine learning model that evaluates email messages and outputs either "spam" or "not spam" is a
binary classifier.
1.4 What is decision boundary?
The separator between classes learned by a model in binary-class or multi-class classification
problems. For example, in the following image representing a binary classification problem, the
decision boundary is the frontier between the orange class and the blue class:

1.5 What is classification threshold (decision threshold)?


A scalar-value criterion that is applied to a model's predicted score in order to separate the positive
class from the negative class. Used when mapping logistic regression results to binary classification. For
example, consider a logistic regression model that determines the probability of a given email
message being spam. If the classification threshold is 0.9, then logistic regression values above 0.9 are
classified as spam and those below 0.9 are classified as not spam.
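A hedged sketch of applying a classification threshold to a logistic regression's predicted probabilities; the 0.9 threshold mirrors the example above, while the dataset is just a convenient sklearn toy set standing in for the spam example:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the class labeled 1.
proba = model.predict_proba(X_test)[:, 1]

threshold = 0.9  # stricter than the 0.5 default used by .predict()
y_pred = (proba >= threshold).astype(int)
print(y_pred[:10])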

2.1 What is one-vs.-all?


Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate
binary classifiers—one binary classifier for each possible outcome. For example, given a model that
classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following
three separate binary classifiers:

 animal vs. not animal

 vegetable vs. not vegetable

 mineral vs. not mineral

Decision Tree
3.1 What is decision tree?
A model represented as a sequence of branching statements. For example, the following over-
simplified decision tree branches a few times to predict the price of a house (in thousands of USD).
According to this decision tree, a house larger than 160 square meters, having more than three
bedrooms, and built less than 10 years ago would have a predicted price of 510 thousand USD.
Machine learning can generate deep decision trees.

3.1 Explain the steps in making a decision tree.


1. Take the entire data set as input.
2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
3. Apply the split to the input data (divide step).
4. Re-apply steps 1 to 2 to the divided data.
5. Stop when you meet some stopping criteria.
6. Clean up the tree if you went too far doing splits. This step is called pruning.
3.1 How do you work towards a random forest?
The underlying principle of this technique is that several weak learners combined provide a
strong learner. The steps involved are as follows (a short sklearn sketch follows the list):

1. Build several decision trees on bootstrapped training samples of the data.
2. On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors.
3. Rule of thumb: at each split, use m = √p.
4. Predictions: take the majority vote across the trees.
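A short random forest sketch with sklearn, assuming the wine dataset for illustration; max_features="sqrt" mirrors the m = √p rule of thumb above:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# 200 bootstrapped trees; each split considers roughly sqrt(p) candidate features.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print(cross_val_score(forest, X, y, cv=5).mean())

forest.fit(X, y)
print(forest.feature_importances_)  # per-feature importance, a side benefit of forests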
For data scientists, the work isn’t easy, but it’s rewarding and there are plenty of available
positions out there. Be sure to prepare yourself for the rigors of interviewing and stay sharp
with the nuts-and-bolts of data science.
Classification and Regression Tree (CART)
Iterative Dichotomiser 3 (ID3)
C4.5
C5.0
Chi-squared Automatic Interaction Detection (CHAID)
Decision Stump
Conditional Decision Trees
M5
Bayes
4.1 Naïve Bayes
4.2 Is Naïve Bayes bad? If yes, under what aspects.
4.3 What is prior belief?
What you believe about the data before you begin training on it. For example, L2 regularization relies
on a prior belief that weights should be small and normally distributed around zero.

4.4 What do you understand by conjugate-prior with respect to Naïve Bayes?


5.1 What is the difference between Bayesian Estimate and Maximum Likelihood Estimation
(MLE)?
In a Bayesian estimate we have some knowledge about the data/problem (a prior). There may be several
values of the parameters which explain the data, and hence we can look for multiple parameters, like 5
gammas and 5 lambdas, that do this. As a result of a Bayesian estimate, we get multiple models for
making multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new
example needs to be predicted, computing the weighted sum of these predictions serves the purpose.

Maximum likelihood does not take the prior into consideration (it ignores the prior), so it is like being a
Bayesian while using some kind of flat prior.

6.1 What is Bayesian Neural Network (BN)?


A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural
network regression model typically predicts a scalar value; for example, a model predicts a house price
of 853,000. By contrast, a Bayesian neural network predicts a distribution of values; for example, a
model predicts a house price of 853,000 with a standard deviation of 67,200. A Bayesian neural
network relies on Bayes' Theorem to calculate uncertainties in weights and predictions. A Bayesian
neural network can be useful when it is important to quantify uncertainty, such as in models related to
pharmaceuticals. Bayesian neural networks can also help prevent overfitting.

Averaged One-Dependence Estimators (AODE)


Bayesian Belief Network (BBN)
Gaussian Naïve Bayes
Multinomial Naïve Bayes
Logistic Regression
7.1 What is logistic regression? What log loss is for?
A model that generates a probability for each possible discrete label value in classification problems by
applying a sigmoid function to a linear prediction. Although logistic regression is often used in binary
classification problems, it can also be used in multi-class classification problems (where it is called
multi-class logistic regression or multinomial regression).

Log loss is the loss function used in binary logistic regression.

7.2 What is cross-entropy?


A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the
difference between two probability distributions. See also perplexity.

Instance based
k-Nearest Neighbour (kNN)
Learning Vector Quantization (LVQ)
Support Vector Machines
What is Kernel Support Vector Machines (KSVMs)?
A classification algorithm that seeks to maximize the margin between positive and negative classes by
mapping input data vectors to a higher dimensional space. For example, consider a classification
problem in which the input dataset has a hundred features. To maximize the margin between positive
and negative classes, a KSVM could internally map those features into a million-dimension space.
KSVMs use a loss function called hinge loss.

Give some situations where you will use an SVM over a RandomForest Machine Learning
algorithm and vice-versa.
SVM and Random Forest are both used in classification problems.

a) If you are sure that your data is outlier-free and clean, then go for SVM. If, on the other hand, your
data might contain outliers, then Random Forest would be the better choice.

b) Generally, SVM consumes more computational power than Random Forest, so if you are
constrained on memory, go for the Random Forest machine learning algorithm.

c) Random Forest gives you a very good idea of variable importance in your data, so if you want to
have variable importance then choose Random Forest machine learning algorithm.

d) Random Forest machine learning algorithms are preferred for multiclass problems.

e) SVM is preferred for high-dimensional problem sets, like text classification.

But as a good data scientist, you should experiment with both of them and test for accuracy, or better
yet, use an ensemble of many machine learning techniques.

Clustering
1 Explain this clustering algorithm?
I wrote a popular article, The 5 Clustering Algorithms Data Scientists Need to Know, explaining all
of them in detail with some great visualizations.

1 What is clustering?
Grouping related examples, particularly during unsupervised learning. Once all the examples are
grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on
their proximity to a centroid, as in the following diagram:
A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees"
and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center
point, illustrated as follows:

2 How will you define the number of clusters in a clustering algorithm?


Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-
Means clustering where “K” defines the number of clusters. The objective of clustering is to group
similar entities in a way that the entities within a group are similar to each other but the groups are
different from each other.

For example, the following image shows three different groups.


The Within Sum of Squares (WSS) is generally used to measure the homogeneity within a cluster. If you plot WSS
for a range of numbers of clusters, you will get the plot shown below, generally known as the
Elbow Curve.

The red-circled point in the graph above, i.e. Number of Clusters = 6, is the point after which you don’t see any
significant decrease in WSS. This point is known as the bending point and is taken as K in K-Means.

This is the most widely used approach, but some data scientists also use hierarchical clustering first to create
dendrograms and identify the distinct groups from there.

2 In unsupervised learning, if a ground truth about a dataset is unknown, how can we


determine the most useful number of clusters to be?
Matthew Mayo answers:

With supervised learning, the number of classes in a particular set of data is known outright, since
each data instance is labeled as a member of a particular existent class. In the worst case, we can scan
the class attribute and count up the number of unique entries which exist.

With unsupervised learning, the idea of class attributes and explicit class membership does not exist;
in fact, one of the dominant forms of unsupervised learning -- data clustering -- aims to approximate
class membership by minimizing interclass instance similarity and maximizing intraclass similarity. A
major drawback with clustering can be the requirement to provide the number of classes which exist
in the unlabeled dataset at the onset, in some form or another. If we are lucky, we may know the
data’s ground truth -- the actual number of classes -- beforehand. However, this is not always the case,
for numerous reasons, one of which being that there may actually be no defined number of classes
(and hence, clusters) in the data, with the whole point of the unsupervised learning task being to
survey the data and attempt to impose some meaningful structure of optimal cluster and class
numbers upon it.

Without knowing the ground truth of a dataset, then, how do we know what the optimal number of
data clusters are? As one may expect, there are actually numerous methods to go about answering
this question. We will have a look at 2 particular popular methods for attempting to answer this
question: the elbow method and the silhouette method.

The Elbow Method

The elbow method is often the best place to start, and is especially useful due to its ease of
explanation and verification via visualization. The elbow method is interested in explaining variance as
a function of cluster numbers (the k in k-means). By plotting the percentage of variance explained
against k, the first N clusters should add significant information, explaining variance; yet, some
eventual value of k will result in a much less significant gain in information, and it is at this point that
the graph will show a noticeable angle. This angle indicates the optimal number of clusters, from the
perspective of the elbow method.

It should be self-evident that, in order to plot this variance against varying numbers of clusters, varying
numbers of clusters must be tested. Successive complete iterations of the clustering method must be
undertaken, after which the results can be plotted and compared.
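A hedged sketch of the elbow method using sklearn's KMeans, where the inertia_ attribute plays the role of WSS; the blob data and the range of k are illustrative:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(ks, wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()  # look for the 'elbow' where the curve flattens out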


The Silhouette Method


The silhouette method measures the similarity of an object to its own cluster -- called cohesion --
when compared to other clusters -- called separation. The silhouette value is the means for this
comparison, and is a value in the range [-1, 1]; a value close to 1 indicates a close relationship with
objects in its own cluster, while a value close to -1 indicates the opposite. A clustered set of data in a
model producing mostly high silhouette values is likely an acceptable and appropriate model.


Read more on the silhouette method here. Find the specifics on computing a silhouette value here.
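A companion sketch for the silhouette method using sklearn's silhouette_score on the same kind of illustrative blob data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # closer to 1 = tighter, better-separated clusters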

What is the difference between Cluster and Systematic Sampling?


Cluster sampling is a technique used when it becomes difficult to study the target population spread
across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability
sample where each sampling unit is a collection, or cluster, of elements. Systematic sampling is a
statistical technique where elements are selected from an ordered sampling frame. In systematic
sampling, the list is progressed in a circular manner, so once you reach the end of the list, it is
progressed from the top again. The best example of systematic sampling is the equal-probability method, e.g. selecting every k-th element of the ordered frame.

What is the difference between Supervised Learning and Unsupervised Learning?


If an algorithm learns something from the training data so that the knowledge can be applied to the
test data, then it is referred to as Supervised Learning. Classification is an example of Supervised
Learning. If the algorithm does not learn anything beforehand because there is no response variable
or any training data, then it is referred to as Unsupervised Learning. Clustering is an example of
Unsupervised Learning.

What is similarity measure?


In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.
K-means

3 What is K-means? How can you select K for K-means?


A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm
basically does the following:

 Iteratively determines the best k center points (known as centroids).

 Assigns each example to the closest centroid. Those examples nearest the same centroid
belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances
from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest
centroid, yielding three groups:
Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters
for dogs. The three centroids identify the mean height and mean width of each dog in that cluster. So,
the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid
of a cluster is typically not an example in the cluster.

The preceding illustrations show k-means for examples with only two features (height and width).
Note that k-means can group examples across many features.

3 How will you find the right K for K-means?


3 What is centroid? What is centroid-based clustering?
The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then
the k-means or k-median algorithm finds 3 centroids.

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the
most widely used centroid-based clustering algorithm.

Contrast with hierarchical clustering algorithms.

K-Medians
What is k-median?
A clustering algorithm closely related to k-means. The practical difference between the two is as
follows:

 In k-means, centroids are determined by minimizing the sum of the squares of the distance
between a centroid candidate and each of its examples.

 In k-median, centroids are determined by minimizing the sum of the distance between a
centroid candidate and each of its examples.

Note that the definitions of distance are also different:

 k-means relies on the Euclidean distance from the centroid to an example. (In two
dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the
hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:

√((2 − 5)² + (2 − (−2))²) = √(9 + 16) = 5
 k-median relies on the Manhattan distance from the centroid to an example. This distance is
the sum of the absolute deltas in each dimension. For example, the k-median distance
between (2,2) and (5,-2) would be:

|2 − 5| + |2 − (−2)| = 3 + 4 = 7
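The two distances worked out in plain Python, just to make the arithmetic concrete:

import math

a = (2, 2)
b = (5, -2)

# k-means: Euclidean distance (Pythagorean theorem).
euclidean = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# k-median: Manhattan distance (sum of absolute deltas per dimension).
manhattan = abs(a[0] - b[0]) + abs(a[1] - b[1])

print(euclidean)  # 5.0
print(manhattan)  # 7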

4 Mean-Shift

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)


Expectation Maximization (EM) using Gaussian Mixture Models (GMM)

5 Hierarchical Clustering (Agglomerative clustering and Divisive clustering)

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited
to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering
algorithms:

 Agglomerative clustering first assigns every example to its own cluster, and iteratively merges
the closest clusters to create a hierarchical tree.

 Divisive clustering first groups all examples into one cluster and then iteratively divides the
cluster into a hierarchical tree.

Contrast with centroid-based clustering.

Regression
1 What is regression?
A type of model that outputs continuous (typically, floating-point) values. Compare with classification
models, which output discrete values, such as "day lily" or "tiger lily."
2 What is linear regression?
A type of regression model that outputs a continuous value from a linear combination of input
features.

2 What is Linear Regression?


Linear regression is a statistical technique where the score of a variable Y is predicted from the score
of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

2 What are the basic assumptions to be made for linear regression?


Normality of error distribution, statistical independence of errors, linearity and additivity.

2 What are the assumptions required for linear regression?


2 What is multicollinearity and how you can overcome it?
2 What are the drawbacks of the linear model?
Some drawbacks of the linear model are:

• The assumption of linearity of the errors.
• It can’t be used for count outcomes or binary outcomes.
• There are overfitting problems that it can’t solve.

3 How will you explain logistic regression to an economist, physics scientist and
biologist?
3 What is logistic regression? Or State an example when you have used logistic
regression recently.
Logistic Regression, often referred to as the logit model, is a technique to predict a binary outcome from a
linear combination of predictor variables. For example, suppose you want to predict whether a particular
political leader will win the election or not. In this case, the outcome of the prediction is binary, i.e. 0 or 1
(Win/Lose). The predictor variables here would be the amount of money spent on election
campaigning for a particular candidate, the amount of time spent campaigning, etc.

3 What are logits?
The vector of raw (non-normalized) predictions that a classification model generates, which is
ordinarily then passed to a normalization function. If the model is solving a multi-class classification
problem, logits typically become an input to the softmax function. The softmax function then
generates a vector of (normalized) probabilities with one value for each possible class.

In addition, logits sometimes refer to the element-wise inverse of the sigmoid function. For more
information, see tf.nn.sigmoid_cross_entropy_with_logits.

Is it possible to perform logistic regression with Microsoft Excel?


It is possible to perform logistic regression with Microsoft Excel. There are two ways to do it using
Excel.

a) One is to use Add-ins provided by many websites which we can use.

b) Second is to use fundamentals of logistic regression and use Excel’s computational power to
build a logistic regression
But when this question is asked in an interview, the interviewer is not looking for the name of an Add-in,
but rather for a method that uses base Excel functionality.

Let’s use a sample data to learn about logistic regression using Excel. (Example assumes that you are
familiar with basic concepts of logistic regression)

Data shown above consists of three variables where X1 and X2 are independent variables and Y is a
class variable. We have kept only 2 categories for our purpose of binary logistic regression classifier.

Next we have to create a logit function using independent variables, i.e.

Logit = L = β0 + β1*X1 + β2*X2


We have kept the initial values of beta 1 and beta 2 as 0.1 for now, and we will use Excel Solver to optimize
the beta values in order to maximize our log likelihood estimate.

Assuming that you are aware of logistic regression basics, we calculate probability values from Logit
using following formula:

Probability = e^Logit / (1 + e^Logit)

where e is the base of the natural logarithm, i.e. e ≈ 2.71828183.

Let’s put this into an Excel formula to calculate the probability values for each observation.

The conditional probability is the probability of the predicted Y, given the set of independent variables X, and it can be calculated as:

P(X)^Yactual * [1 − P(X)]^(1 − Yactual)

Then we take the natural log of the above function:

ln[ P(X)^Yactual * [1 − P(X)]^(1 − Yactual) ]

which turns out to be:

Yactual * ln[P(X)] + (1 − Yactual) * ln[1 − P(X)]

The log likelihood function LL is the sum of the above expression over all the observations. The log likelihood LL will be the sum of column G, which we just calculated.
The objective is to maximize the Log Likelihood i.e. cell H2 in this example. We have to maximize H2 by
optimizing B0, B1, and B2.

We’ll use Excel’s solver add-in to achieve the same.

Excel comes with this Add-in pre-installed, and you should see it under the Data tab in Excel as shown below.

If you don’t see it there, make sure you have loaded it. To load an add-in in Excel:

Go to File >> Options >> Add-Ins and check whether the checkbox in front of the required add-in is ticked.
Make sure to tick it to load the add-in into Excel.

If you don’t see the Solver Add-in there, go to the bottom of the screen (Manage Add-Ins) and click OK.
Next you will see a popup window which should list the Solver add-in. Check the checkbox
in front of the add-in name. If you don’t see it there either, click Browse and point it to the
folder which contains the Solver Add-in.

Once you have Solver loaded, click on the Solver icon under the Data tab and a new window will pop up.

Put H2 in Set Objective, select Max, and fill in cells E2 to E4 in the next form field.

By doing this we have told Solver to maximize H2 by changing the values in cells E2 to E4.

Now click the Solve button at the bottom.

You will see a popup like the one below.

This shows that Solver has found a local maximum, but we need the global maximum.
Keep clicking Continue until it shows the popup below.

It shows that Solver was able to find a solution and converge. In case it is not able to converge, it will
throw an error. Select “Keep Solver Solution” and click OK to accept the solution provided by
Solver.

Now you can see that the values of the beta coefficients B0, B1, and B2 have changed and our log likelihood
function has been maximized.
Using these values of the betas, you can calculate the probabilities, and hence the response variable, by deciding
on a probability cut-off.
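For readers who prefer code to spreadsheets, here is a hedged sketch of the same idea in Python: maximize the log likelihood with scipy's optimizer instead of Excel Solver. The toy X1/X2/Y values and the 0.1 starting betas are made up to stand in for the spreadsheet columns:

import numpy as np
from scipy.optimize import minimize

# Toy values standing in for the X1, X2 and Y columns of the spreadsheet.
X = np.array([[2.0, 1.0], [4.0, 3.0], [3.0, 5.0], [3.0, 2.0], [5.0, 4.0], [6.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def neg_log_likelihood(betas):
    b0, b1, b2 = betas
    logit = b0 + b1 * X[:, 0] + b2 * X[:, 1]      # L = B0 + B1*X1 + B2*X2
    p = 1.0 / (1.0 + np.exp(-logit))              # e^L / (1 + e^L)
    eps = 1e-12                                   # guard against log(0)
    ll = y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)
    return -ll.sum()                              # minimizing the negative maximizes LL

# Start, as in the spreadsheet, from small initial beta values.
result = minimize(neg_log_likelihood, x0=[0.1, 0.1, 0.1])
print(result.x)      # fitted B0, B1, B2
print(-result.fun)   # maximized log likelihood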

4 How would you validate a model you created to generate a predictive model of a
quantitative outcome variable using multiple regression?
Answer by Matthew Mayo.

Proposed methods for model validation:

If the values predicted by the model are far outside of the response variable range, this would
immediately indicate poor estimation or model inaccuracy.

If the values seem to be reasonable, examine the parameters; any of the following would indicate
poor estimation or multi-collinearity: opposite signs of expectations, unusually large or small values, or
observed inconsistency when the model is fed new data.

Use the model for prediction by feeding it new data, and use the coefficient of determination (R
squared) as a model validity measure.

Use data splitting to form a separate dataset for estimating model parameters, and another for
validating predictions.

Use jackknife resampling if the dataset contains a small number of instances, and measure validity
with R squared and mean squared error (MSE).

Mean Squared Error (MSE) is the average squared loss per example. MSE is calculated by dividing the
sum of squared losses by the number of examples. The values that TensorFlow Playground displays for "Training
loss" and "Test loss" are MSE.
4 You created a predictive model of a quantitative outcome variable using multiple
regressions. What are the steps you would follow to validate the model?
Since the question asked is about the post-model-building exercise, we will assume that you have already
tested for the null hypothesis, multicollinearity, and the standard errors of the coefficients.

Once you have built the model, you should check for following –

· Global F-test to see the significance of group of independent variables on dependent variable

· R^2

· Adjusted R^2

· RMSE, MAPE

In addition to above mentioned quantitative metrics you should also check for-

· Residual plot

· Assumptions of linear regression

4 How do you decide whether your linear regression model fits the data?
4 How can you assess a good logistic model?
There are various methods to assess the results of a logistic regression analysis:

• Using the classification (confusion) matrix to look at the true negatives and false positives.

• Concordance that helps identify the ability of the logistic model to differentiate between the
event happening and not happening.

• Lift helps assess the logistic model by comparing it with random selection.

What are various steps involved in an analytics project?

• Understand the business problem

• Explore the data and become familiar with it.

• Prepare the data for modelling by detecting outliers, treating missing values, transforming
variables, etc.

• After data preparation, start running the model, analyse the result and tweak the approach.
This is an iterative step till the best possible outcome is achieved.

• Validate the model using a new data set.

• Start implementing the model and track the result to analyse the performance of the model
over the period of time.

Explain tradeoffs between different types of regression models and different types of
classification models.
How can you make data normal using the Box-Cox transformation?
Explain the Box-Cox transformation in regression models.
For some reason or another, the response variable in a regression analysis might not satisfy one or
more assumptions of an ordinary least squares regression. The residuals could either curve as the
prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the
response variable so that the data meets the required assumptions. A Box-Cox transformation is a
statistical technique to transform a non-normal dependent variable into a normal shape. Most
statistical techniques assume normality, so if the given data is not normal, applying a Box-Cox
transformation means that you can run a broader range of tests.
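A minimal Box-Cox sketch with scipy, assuming a strictly positive, right-skewed variable; the simulated lognormal data is illustrative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive, right-skewed

# boxcox requires positive values; it returns the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(skewed)

print(fitted_lambda)                                 # lambda near 0 behaves like a log transform
print(stats.skew(skewed), stats.skew(transformed))   # skewness should drop towards 0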

What is generalized linear model?


A generalization of least squares regression models, which are based on Gaussian noise, to other types
of models based on other types of noise, such as Poisson noise or categorical noise. Examples of
generalized linear models include:

 logistic regression

 multi-class regression

 least squares regression

The parameters of a generalized linear model can be found through convex optimization.

Generalized linear models exhibit the following properties:

 The average prediction of the optimal least squares regression model is equal to the average
label on the training data.

 The average probability predicted by the optimal logistic regression model is equal to the
average label on the training data.

The power of a generalized linear model is limited by its features. Unlike a deep model, a generalized
linear model cannot "learn new features."

Linear Regression
Ordinary Least Squares Regression (OLSR)
Least squares regression
A linear regression model trained by minimizing L2 Loss.

Stepwise Regression
Multivariate Adaptive Regression Splines (MARS)
Locally Estimated Scatterplot Smoothing (LOESS)

Deep Learning
What is Tower?
A component of a deep neural network that is itself a deep neural network without an output layer.
Typically, each tower reads from an independent data source. Towers are independent until their
output is combined in a final layer.

1 Activation functions
What is activation function?
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the
previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
Linear
What is Rectified Linear Unit (ReLU)?
An activation function with the following rules:

 If input is negative or zero, output is 0.

 If input is positive, output is equal to input.

Why is ReLU better and more often used than Sigmoid in Neural Networks?
Step function
Threshold logic
What is Sigmoid Function?
A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a
value between 0 and 1. The sigmoid function has the following formula:

y = 1 / (1 + e^(−z))

where z in logistic regression problems is simply the linear prediction:

z = b + w1*x1 + w2*x2 + ... + wn*xn

In other words, the sigmoid function converts z into a probability between 0 and 1.

In some neural networks, the sigmoid function acts as the activation function.

What is log-odds?
The logarithm of the odds of some event.

If the event refers to a binary probability, then odds refers to the ratio of the probability of success (p)
to the probability of failure (1 − p). For example, suppose that a given event has a 90% probability of
success and a 10% probability of failure. In this case, the odds are calculated as follows:

odds = p / (1 − p) = 0.9 / 0.1 = 9

The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to the natural
logarithm, but the logarithm could actually be any base greater than 1. Sticking to convention, the log-odds
of our example is therefore:

log-odds = ln(9) ≈ 2.2

The log-odds are the inverse of the sigmoid function.
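A small sketch verifying that the sigmoid and the log-odds functions invert each other, using the 0.9 probability from the example above:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_odds(p):
    return math.log(p / (1.0 - p))

p = 0.9
x = log_odds(p)      # ln(0.9 / 0.1) = ln(9), about 2.197
print(x)
print(sigmoid(x))    # recovers 0.9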


2 Optimization techniques
What is optimizer?
A specific implementation of the gradient descent algorithm. TensorFlow's base class for optimizers is
tf.train.Optimizer. Different optimizers may leverage one or more of the following concepts to
enhance the effectiveness of gradient descent on a given training set:

 momentum (Momentum)

 update frequency (AdaGrad = ADAptive GRADient descent; Adam = ADAptive with Momentum; RMSProp)

 sparsity/regularization (Ftrl)

 more complex math (Proximal, and others)

You might even imagine an NN-driven optimizer.

Stochastic Gradient Descent


A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single
example chosen uniformly at random from a dataset to calculate an estimate of the gradient at each
step.

What is mini-batch stochastic gradient descent (SGD)?


A gradient descent algorithm that uses mini-batches. In other words, mini-batch SGD estimates the
gradient based on a small subset of the training data. Vanilla SGD uses a mini-batch of size 1.

Stochastic Gradient Descent (SGD) with momentum


Momentum is a sophisticated gradient descent algorithm in which a learning step depends not only on
the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded
it. Momentum involves computing an exponentially weighted moving average of the gradients over
time, analogous to momentum in physics. Momentum sometimes prevents learning from getting
stuck in local minima.

Adam
RMSprop
Adadelta
Gradient descent
What is loss surface? How does gradient descent work?
Loss surface is a graph of weight(s) vs. loss. Gradient descent aims to find the weight(s) for which the
loss surface is at a local minimum.

What is gradient and gradient descent?


Gradient is the vector of partial derivatives with respect to all of the independent variables. In machine
learning, the gradient is the vector of partial derivatives of the model function. The gradient points in
the direction of steepest ascent.

Gradient descent is a technique to minimize loss by computing the gradients of loss with respect to
the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts
parameters, gradually finding the best combination of weights and bias to minimize loss.
Partial derivative is a derivative in which all but one of the variables is considered a constant. For
example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function
of x alone (that is, keeping y constant). The partial derivative of f with respect to x focuses only on how
x is changing and ignores all other variables in the equation.
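A hedged sketch of plain gradient descent on a one-parameter least-squares problem, just to make the update rule concrete; the toy data, learning rate and step count are arbitrary choices:

import numpy as np

# Toy data from y = 3x plus noise; gradient descent should recover the slope.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0              # initial weight
learning_rate = 0.1  # arbitrary choice

for step in range(200):
    y_pred = w * x
    # Gradient of the loss 0.5 * mean((y_pred - y)^2) with respect to w.
    grad = np.mean((y_pred - y) * x)
    w -= learning_rate * grad   # step against the gradient (steepest descent)

print(w)  # should end up close to 3.0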

What is exploding gradient problem and vanishing gradient problem?


The exploding gradient problem is the tendency for gradients in deep neural networks (especially recurrent neural
networks) to become surprisingly steep (high). Steep gradients result in very large updates to the weights of each
node in a deep neural network.

Models suffering from the exploding gradient problem become difficult or impossible to train.
Gradient clipping can mitigate this problem.

Gradient clipping is a commonly used mechanism to mitigate the exploding gradient problem by
artificially limiting (clipping) the maximum value of gradients when using gradient descent to train a
model.
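A minimal sketch of clipping a gradient by its L2 norm before the update is applied; the threshold is arbitrary, and deep learning frameworks ship built-in equivalents:

import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploding_grad = np.array([250.0, -1300.0, 40.0])   # an unusually steep gradient
print(clip_by_norm(exploding_grad, max_norm=5.0))   # same direction, norm capped at 5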

Compare to vanishing gradient problem.

vanishing gradient problem is the tendency for the gradients of early hidden layers of some deep
neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly
smaller changes to the weights on nodes in a deep neural network, leading to little or no learning.
Models suffering from the vanishing gradient problem become difficult or impossible to train. Long
Short-Term Memory cells address this issue.

What is convex optimization and convex set?


The process of using mathematical techniques such as gradient descent to find the minimum of a
convex function. A great deal of research in machine learning has focused on formulating various
problems as convex optimization problems and in solving those problems more efficiently.

For complete details, see Boyd and Vandenberghe, Convex Optimization.

A subset of Euclidean space such that a line drawn between any two points in the subset remains
completely within the subset. For instance, the following two shapes are convex sets:

By contrast, the following two shapes are not convex sets:


What is AdaGrad?
A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively
giving each parameter an independent learning rate. For a full explanation, see this paper.

Do gradient descent methods always converge to same point?


No, they do not, because in some cases they reach a local minimum or a local optimum point rather
than the global optimum. It depends on the data and the starting conditions.

3 Neural Networks
1. What is Neural network?
A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden)
consisting of simple connected units or neurons followed by nonlinearities.

Neuron is a node in a neural network, typically taking in multiple input values and generating one
output value. The neuron calculates the output value by applying an activation function (nonlinear
transformation) to a weighted sum of input values.

1. What is deep model and deep neural network? How do I build a deep neural network?
A type of neural network containing multiple hidden layers.

Contrast with wide model.

Deep neural network is a synonym for deep model.

Interpretability is the degree to which a model's predictions can be readily explained. Deep models are
often non-interpretable; that is, a deep model's different layers can be hard to decipher. By contrast,
linear regression models and wide models are typically far more interpretable.

Wide model - a linear model that typically has many sparse input features. We refer to it as "wide"
since such a model is a special type of neural network with a large number of inputs that connect
directly to the output node. Wide models are often easier to debug and inspect than deep models.
Although wide models cannot express nonlinearities through hidden layers, they can use
transformations such as feature crossing and bucketization to model nonlinearities in different ways.

Contrast with deep model.

1. What is perceptron?
A system (either hardware or software) that takes in one or more input values, runs a function on the
weighted sum of the inputs, and computes a single output value. In machine learning, the function is
typically nonlinear, such as ReLU, sigmoid, or tanh. For example, the following perceptron relies on the
sigmoid function to process three input values:

In the following illustration, the perceptron takes three inputs, each of which is itself modified by a
weight before entering the perceptron:
Perceptrons are the nodes in deep neural networks. That is, a deep neural network consists of
multiple connected perceptrons, plus a backpropagation algorithm to introduce feedback.

1. What is layer?
A set of neurons in a neural network that process a set of input features, or the output of those
neurons.

Also, an abstraction in TensorFlow. Layers are Python functions that take Tensors and configuration
options as input and produce other tensors as output. Once the necessary Tensors have been
composed, the user can convert the result into an Estimator via a model function.

1. What is input layer, dense layer (fully connected layer) and output layer? What is depth and
width?
Input layer is the first layer (the one that receives the input data) in a neural network.

Dense layer is a hidden layer in which each node is connected to every node in the subsequent hidden
layer.

A fully connected layer is also known as a dense layer.

Output layer is the "final" layer of a neural network. The layer containing the answer(s).

Depth is the number of layers (including any embedding layers) in a neural network that learn weights.
For example, a neural network with 5 hidden layers and 1 output layer has a depth of 6.

Width is the number of neurons in a particular layer of a neural network.

1. What is calibration layer?


A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and
probabilities should match the distribution of an observed set of labels.

What is active learning?


A training approach in which the algorithm chooses some of the data it learns from. Active learning is
particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly
seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the
particular range of examples it needs for learning.

2. Feedforward Neural Networks (FFN)


A neural network without cyclic or recursive connections. For example, traditional deep neural
networks are feedforward neural networks. Contrast with recurrent neural networks, which are cyclic.

2. What is backpropagation?
The primary algorithm for performing gradient descent on neural networks. First, the output values of
each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with
respect to each parameter is calculated in a backward pass through the graph.
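A compact numpy sketch of a forward pass followed by a hand-derived backward pass for a one-hidden-layer network on a toy regression target; it illustrates the idea rather than any production implementation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                  # 64 examples, 3 features
y = X.sum(axis=1, keepdims=True)              # toy target the network can learn

W1 = rng.normal(scale=0.5, size=(3, 8))       # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))       # hidden -> output weights
lr = 0.05

for step in range(500):
    # Forward pass: compute and cache intermediate values for the backward pass.
    h_pre = X @ W1
    h = np.maximum(h_pre, 0.0)                # ReLU activation
    y_pred = h @ W2
    loss = np.mean((y_pred - y) ** 2)

    # Backward pass: chain rule, from the loss back to each weight matrix.
    d_y_pred = 2.0 * (y_pred - y) / len(X)
    d_W2 = h.T @ d_y_pred
    d_h = d_y_pred @ W2.T
    d_h_pre = d_h * (h_pre > 0)               # derivative of ReLU
    d_W1 = X.T @ d_h_pre

    W1 -= lr * d_W1
    W2 -= lr * d_W2

print(loss)  # should have dropped substantially from its starting value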

3. What is epoch?
A full training pass over the entire dataset such that each example has been seen once. Thus, an
epoch represents N/batch size training iterations, where N is the total number of examples.

3. What is learning rate?


A scalar used to train a model via gradient descent. During each iteration (a single update of a model's weights, computed from the gradients of the parameters with respect to the loss on a single batch of data), the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.

Learning rate is a key hyperparameter.

3. Learning Rate Decay
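The source leaves this entry blank. As a hedged sketch: learning rate decay gradually shrinks the learning rate as training progresses, which often stabilizes the later stages of training. Two common schedules, with purely illustrative hyperparameter values:

initial_lr = 0.1

def exponential_decay(step, decay_rate=0.96, decay_steps=100):
    # lr = initial_lr * decay_rate ** (step / decay_steps)
    return initial_lr * decay_rate ** (step / decay_steps)

def step_decay(epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs.
    return initial_lr * drop ** (epoch // epochs_per_drop)

for step in (0, 100, 500, 1000):
    print(step, round(exponential_decay(step), 4))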


3. What is co-adaptation?
When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network's behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.

3. Dropout
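The Dropout entry is left blank in the source. A minimal sketch of inverted dropout, assuming NumPy (the rate and activation values are illustrative): randomly zeroing a fraction of activations during training prevents the co-adaptation described above, because no neuron can rely on any specific other neuron always being present.

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    # Inverted dropout: zero a random fraction of activations during training
    # and rescale the survivors, so nothing needs to change at inference time.
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.array([0.2, 1.5, -0.7, 0.9, 0.3])
print(dropout(h, rate=0.4))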
3. Pruning
3. What is batch and batch size?
The set of examples used in one iteration (that is, one gradient update) of model training.

Batch size is the number of examples in a batch. For example, the batch size of SGD is 1, while the
batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training
and inference; however, TensorFlow does permit dynamic batch sizes.

Mini-batch is a small, randomly selected subset of the entire batch of examples run together in a
single iteration of training or inference. The batch size of a mini-batch is usually between 10 and 1,000.
It is much more efficient to calculate the loss on a mini-batch than on the full training data.

3. What is Batch Normalization?


Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can
provide the following benefits:

 Make neural networks more stable by protecting against outlier weights.

 Enable higher learning rates.

 Reduce overfitting.

3. What is batch normalization and why does it work?


Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs
changes during training, as the parameters of the previous layers change. The idea is then to
normalize the inputs of each layer in such a way that they have a mean output activation of zero and
standard deviation of one. This is done for each individual mini-batch at each layer, i.e., compute the
mean and variance of that mini-batch alone, then normalize. This is analogous to how the inputs to
networks are standardized. How does this help? We know that normalizing the inputs to a network
helps it learn. But a network is just a series of layers, where the output of one layer becomes the input
to the next. That means we can think of any layer in a neural network as the first layer of a smaller
subsequent network. Thought of as a series of neural networks feeding into each other, we normalize
the output of one layer before applying the activation function, and then feed it into the following
layer (sub-network).
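A minimal NumPy sketch of the batch-norm forward pass described above; gamma and beta stand for the learned scale and shift, and the input values are illustrative:

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, num_features). Normalize each feature over
    # the mini-batch, then apply the learned scale (gamma) and shift (beta).
    mu = x.mean(axis=0)               # per-feature mean of this mini-batch
    var = x.var(axis=0)               # per-feature variance of this mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 180.0]])
print(batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2)))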

4. What is Long Short-Term Memory?


A type of cell in a recurrent neural network used to process sequences of data in applications such as
handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing
gradient problem that occurs when training RNNs due to long data sequences by maintaining history in
an internal memory state based on new input and context from previous cells in the RNN.
1. It has control on deciding when to let the input enter the neuron.
2. It has control on deciding when to remember what was computed in the previous time
step.
3. It has control on deciding when to let the output pass on to the next time step.

4. What is forget gate?


The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell.
Forget gates maintain context by deciding which information to discard from the cell state.

Skip-gram
5. Transfer Learning
Radial Basis Function Network (RBFN)
Hopfield Network
Artificial Neural Network (ANN)
Self-Organizing Map (SOM)
A self-organizing map (SOM) is a type of artificial neural network (ANN) that is trained using
unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized
representation of the input space of the training samples, called a map, and is therefore a method to
do dimensionality reduction. Self-organizing maps differ from other artificial neural networks as they
apply competitive learning as opposed to error-correction learning (such as backpropagation with
gradient descent), and in the sense that they use a neighbourhood function to preserve the
topological properties of the input space.

Convolutional Neural Network (CNN)


6. What is Convolution?
In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution
mixes the convolutional filter and the input matrix in order to train weights.

The term "convolution" in machine learning is often a shorthand way of referring to either
convolutional operation or convolutional layer.

Without convolutions, a machine learning algorithm would have to learn a separate weight for every
cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be
forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to
find weights for every cell in the convolutional filter, dramatically reducing the memory needed to
train the model. When the convolutional filter is applied, it is simply replicated across cells such that
each is multiplied by the filter.

6. What is Convolutional Neural Network?


A neural network in which at least one layer is a convolutional layer. A typical convolutional neural
network consists of some combination of the following layers:

 convolutional layers

 pooling layers

 dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image
recognition.
6. What is convolutional filter and convolutional layer? How convolutional operation works?
One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A
convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For
example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are typically set to a constant
pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random
numbers and then the network trains the ideal values.

Convolutional layer is a layer of a deep neural network in which a convolutional filter passes along an
input matrix. For example, consider the following 3x3 convolutional filter:

The following animation shows a convolutional layer consisting of 9 convolutional operations involving
the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the
input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional
operations:

The following two-step mathematical operation:

1. Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice
of the input matrix has the same rank and size as the convolutional filter.)

2. Summation of all the values in the resulting product matrix.

For example, consider the following 5x5 input matrix:


Now imagine the following 2x2 convolutional filter:

Each convolutional operation involves a single 2x2 slice of the input matrix. For instance, suppose we
use the 2x2 slice at the top-left of the input matrix. So, the convolution operation on this slice looks as
follows:

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of
the input matrix.
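A minimal NumPy sketch of the two-step operation just described (element-wise multiplication of the filter with a slice, then summation), assuming stride 1 and no padding; the input and filter values are illustrative:

import numpy as np

def conv2d_valid(input_matrix, kernel):
    # 'Valid' 2-D convolution: multiply each slice of the input element-wise
    # by the filter, then sum the products into one output cell.
    h, w = input_matrix.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = input_matrix[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5x5 input and a 3x3 filter give a 3x3 output (9 convolutional operations),
# matching the walkthrough above.
x = np.arange(25, dtype=float).reshape(5, 5)
k = np.array([[0., 1., 0.],
              [1., 1., 1.],
              [0., 1., 0.]])
print(conv2d_valid(x, k))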

6. Why would you use many small convolutional kernels such as 3x3 rather than a few large
ones?
This is very well explained in the VGGNet paper. There are 2 reasons: First, you can use several smaller kernels rather than a few large ones to get the same receptive field and capture more spatial context, but with the smaller kernels you are using fewer parameters and computations. Secondly, because with smaller kernels you will be using more filters, you'll be able to use more activation functions and thus have a more discriminative mapping function being learned by your CNN.
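A quick back-of-the-envelope check of the parameter savings, assuming the channel count C is the same for every layer (the numbers are illustrative):

# Weights in a conv layer (ignoring biases): kernel_h * kernel_w * C_in * C_out.
# One 5x5 layer vs. two stacked 3x3 layers (same 5x5 receptive field).
C = 64
one_5x5 = 5 * 5 * C * C            # 102,400 weights
two_3x3 = 2 * (3 * 3 * C * C)      #  73,728 weights
print(one_5x5, two_3x3, two_3x3 / one_5x5)   # the stacked 3x3 pair uses ~28% fewer weights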

6. Why do we use convolutions for images rather than just FC layers?


This one was pretty interesting since it's not something companies usually ask. As you would expect, I got this question from a company focused on Computer Vision. This answer has 2 parts to it. Firstly, convolutions preserve, encode, and actually use the spatial information from the image. If we used only FC layers we would have no relative spatial information. Secondly, Convolutional Neural Networks (CNNs) have a partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.

6. What makes CNNs translation invariant?


As explained above, each convolution kernel acts as its own filter/feature detector. So let’s say you’re
doing object detection, it doesn’t matter where in the image the object is since we’re going to apply
the convolution in a sliding window fashion across the entire image anyways.

How do CNNs use shared weights across space as an extension of a standard neural network?
7. What is pooling?
Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling
usually involves taking either the maximum or average value across the pooled area. For example,
suppose we have the following 3x3 matrix:

A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides the pooling operation across them by strides. For example, suppose the pooling operation divides the
convolutional matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling
operations take place. Imagine that each pooling operation picks the maximum value of the four in
that slice:

Pooling helps enforce translational invariance in the input matrix.

Pooling for vision applications is known more formally as spatial pooling. Time-series applications
usually refer to pooling as temporal pooling. Less formally, pooling is often called subsampling or
downsampling.
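A minimal NumPy sketch of the pooling walkthrough above (2x2 windows, a 1x1 stride, keeping the maximum value in each window); the matrix values are illustrative:

import numpy as np

def max_pool(x, pool_size=2, stride=1):
    # Slide a pool_size x pool_size window across the matrix with the given
    # stride and keep the maximum value inside each window.
    h, w = x.shape
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + pool_size,
                       j * stride:j * stride + pool_size]
            out[i, j] = window.max()
    return out

m = np.array([[5., 3., 1.],
              [8., 2., 5.],
              [9., 4., 3.]])
print(max_pool(m))   # four pooling operations -> [[8. 5.] [9. 5.]]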
7. Max Pooling
7. Why do we have max-pooling in classification CNNs?
Again as you would expect this is for a role in Computer Vision. Max-pooling in a CNN allows you to
reduce computation since your feature maps are smaller after the pooling. You don’t lose too much
semantic information since you're taking the maximum activation. There's also a theory that max-pooling contributes a bit to giving CNNs more translation invariance. Check out this great video from Andrew Ng on the benefits of max-pooling.

Why do segmentation CNNs typically have an encoder-decoder style / structure?


The encoder CNN can basically be thought of as a feature extraction network, while the decoder uses
that information to predict the image segments by “decoding” the features and upscaling to the
original image size.

What is the significance of Residual Networks?


The main thing that residual connections did was allow for direct feature access from previous layers.
This makes information propagation throughout the network much easier. One very interesting paper
about this shows how using local skip connections gives the network a type of ensemble multi-path
structure, giving features multiple paths to propagate throughout the network.
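A minimal sketch of a residual (skip) connection, with small dense layers standing in for the convolutional layers used in real ResNets; the weights and input below are illustrative:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    # The block's output is F(x) + x: the identity shortcut gives features
    # (and gradients) from earlier layers a direct path forward.
    f = relu(w2 @ relu(w1 @ x))   # F(x): two small dense layers
    return relu(f + x)            # add the shortcut, then activate

x = np.array([1.0, -0.5, 2.0])
w1 = np.eye(3) * 0.1
w2 = np.eye(3) * 0.1
print(residual_block(x, w1, w2))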

What is a depthwise separable convolutional neural network (sepCNN)?


A convolutional neural network architecture based on Inception, but where Inception modules are
replaced with depthwise separable convolutions. Also known as Xception.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D
convolution into two separate convolution operations that are more computationally efficient: first, a
depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with
length and width of 1 (1 ✕ 1 ✕ n).

To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.

What is rotational invariance, translational invariance and size invariance?


Rotational invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the
orientation of the image changes. For example, the algorithm can still identify a tennis racket whether
it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for
example, an upside-down 9 should not be classified as a 9.

Translational invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the
position of objects within the image changes. For example, the algorithm can still identify a dog,
whether it is in the center of the frame or at the left end of the frame.

Size invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the
size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M
pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits
on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image
consuming only 20 pixels.

Recurrent Neural Network (RNN)


What is Recurrent Neural Network and timestep?
A neural network that is intentionally run multiple times, where parts of each run feed into the next
run. Specifically, hidden layers from the previous run provide part of the input to the same hidden
layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so
that the hidden layers can learn from previous runs of the neural network on earlier parts of the
sequence.

For example, the following figure shows a recurrent neural network that runs four times. Notice that
the values learned in the hidden layers from the first run become part of the input to the same hidden
layers in the second run. Similarly, the values learned in the hidden layer on the second run become
part of the input to the same hidden layer in the third run. In this way, the recurrent neural network
gradually trains and predicts the meaning of the entire sequence rather than just the meaning of
individual words.

Timestep

One "unrolled" cell within a recurrent neural network. For example, the following figure shows three
timesteps (labeled with the subscripts t-1, t, and t+1):
What is special about RNNs that makes them good at recognizing sequences in time (speech signals, texts)?
How does short-term memory work in an RNN?
Recursive Neural Network
Generative Adversarial Networks
What is Generative Adversarial Networks (GAN)?
A system to create new data in which a generator creates data and a discriminator determines
whether that created data is valid or invalid.

Generator is the subsystem within a generative adversarial network that creates new examples.

Minimax loss is a loss function for generative adversarial networks, based on the cross-entropy
between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

What is Wasserstein loss?


One of the loss functions commonly used in generative adversarial networks, based on the earth-
mover's distance between the distribution of generated data and real data.

Wasserstein Loss is the default loss function in TF-GAN.

What is discriminator?
A system that determines whether examples are real or fake.

The subsystem within a generative adversarial network that determines whether the examples created
by the generator are real or fake.

Deep Boltzmann Machine (DBM)


Deep Belief Networks (DBN)
Stacked Auto-Encoders
Reinforcement
What is reinforcement learning?
A machine learning approach to maximize an ultimate reward through feedback (rewards and
punishments) after a sequence of actions. For example, the ultimate reward of most games is victory.
Reinforcement learning systems can become expert at playing complex games by evaluating
sequences of previous game moves that ultimately led to wins and sequences that ultimately led to
losses.

What is candidate sampling? Full softmax, softmax


A training-time optimization in which a probability is calculated for all the positive labels, using, for
example, softmax, but only for a random sample of negative labels. For example, if we have an
example labeled beagle and dog, candidate sampling computes the predicted probabilities and
corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the
remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less
frequent negative reinforcement as long as positive classes always get proper positive reinforcement,
and this is indeed observed empirically. The motivation for candidate sampling is a computational
efficiency win from not computing predictions for all negatives.

A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat 0.08, and a horse 0.02. (Also called full softmax.)

Contrast with candidate sampling.
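A minimal NumPy sketch of full softmax, using illustrative logits for the dog/cat/horse example above:

import numpy as np

def softmax(logits):
    # Exponentiate each logit and normalize so the class probabilities sum
    # to 1. Subtracting the max first is only for numerical stability.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([4.0, 1.5, 0.1])))   # roughly [0.91, 0.07, 0.02]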

Markov Decision Processes

Recommender algorithms
1 What are Recommender Systems?
A subclass of information filtering systems that are meant to predict the preferences or ratings that a
user would give to a product. Recommender systems are widely used in movies, news, research
articles, products, social tags, music, etc.

A system that selects for each user a relatively small set of desirable items from a large corpus. For
example, a video recommendation system might recommend two videos from a corpus of 100,000
videos, selecting Casablanca and The Philadelphia Story for one user, and Wonder Woman and Black
Panther for another. A video recommendation system might base its recommendations on factors
such as:

 Movies that similar users have rated or watched.

 Genre, directors, actors, target demographic...

1 What is a recommendation engine? How does it work?


Answer by Gregory Piatetsky:

We are all familiar now with recommendations from Netflix - "Other Movies you might enjoy" - or from Amazon - "Customers who bought X also bought Y". Such systems are called recommendation engines, or more broadly, recommender systems.

They typically produce recommendations in one of two ways: using collaborative or content-based
filtering.

Collaborative filtering methods build a model based on users' past behavior (items previously
purchased, movies viewed and rated, etc) and use decisions made by current and other users. This
model is then used to predict items (or ratings for items) that the user may be interested in.
Content-based filtering methods use features of an item to recommend additional items with similar
properties. These approaches are often combined in Hybrid Recommender Systems.

Here is a comparison of these 2 approaches used in two popular music recommender systems -
Last.fm and Pandora Radio. (example from Recommender System entry)

Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the
user has listened to on a regular basis and comparing those against the listening behavior of other
users. Last.fm will play tracks that do not appear in the user's library, but are often played by other
users with similar interests. As this approach leverages the behavior of users, it is an example of a
collaborative filtering technique.

Pandora uses the properties of a song or artist (a subset of the 400 attributes provided by the Music
Genome Project) in order to seed a "station" that plays music with similar properties. User feedback is
used to refine the station's results, deemphasizing certain attributes when a user "dislikes" a particular
song and emphasizing other attributes when a user "likes" a song. This is an example of a content-
based approach.

Here is a good Introduction to Recommendation Engines by Dataconomy and an overview of building a Collaborative Filtering Recommendation Engine by Toptal. For the latest research on recommender systems, check the ACM RecSys conference.

1 What is Collaborative filtering?


The process of filtering used by most of the recommender systems to find patterns or information by
collaborating viewpoints, various data sources and multiple agents.

Making predictions about the interests of one user based on the interests of many other users.
Collaborative filtering is often used in recommendation systems.

2 What is candidate generation, scoring and re-ranking?


The initial set of recommendations chosen by a recommendation system. For example, consider a
bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of
suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a
user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-
ranking) whittle down those 500 to a much smaller, more useful set of recommendations.

Scoring is a part of a recommendation system that provides a value or ranking for each item produced
by the candidate generation phase.

Re-ranking is the final stage of a recommendation system, during which scored items may be re-
graded according to some other (typically, non-ML) algorithm. Re-ranking evaluates the list of items
generated by the scoring phase, taking actions such as:

 Eliminating items that the user has already purchased.

 Boosting the score of fresher items.


3 What are items, item matrix and user matrix?
Items in a recommendation system, the entities that a system recommends. For example, videos are
the items that a video store recommends, while books are the items that a bookstore recommends.

Item matrix in recommendation systems, a matrix of embeddings generated by matrix factorization that holds latent signals about each item. Each row of the item matrix holds the value of a single latent feature for all items. For example, consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among genre, stars, movie age, or other factors.

The item matrix has the same number of columns as the target matrix that is being factorized. For
example, given a movie recommendation system that evaluates 10,000 movie titles, the item matrix
will have 10,000 columns.

User matrix in recommendation systems, an embedding generated by matrix factorization that holds
latent signals about user preferences. Each row of the user matrix holds information about the relative
strength of various latent signals for a single user. For example, consider a movie recommendation
system. In this system, the latent signals in the user matrix might represent each user's interest in
particular genres, or might be harder-to-interpret signals that involve complex interactions across
multiple factors.

The user matrix has a column for each latent feature and a row for each user. That is, the user matrix
has the same number of rows as the target matrix that is being factorized. For example, given a movie
recommendation system for 1,000,000 users, the user matrix will have 1,000,000 rows.

3 What is matrix factorization?


In math, a mechanism for finding the matrices whose dot product approximates a target matrix.

In recommendation systems, the target matrix often holds users' ratings on items. For example, the
target matrix for a movie recommendation system might look something like the following, where the
positive integers are user ratings and 0 means that the user didn't rate the movie:

            Casablanca   The Philadelphia Story   Black Panther   Wonder Woman   Pulp Fiction
User 1         5.0                3.0                  0.0             2.0            0.0
User 2         4.0                0.0                  0.0             1.0            5.0
User 3         3.0                1.0                  4.0             5.0            0.0

The movie recommendation system aims to predict user ratings for unrated movies. For example, will
User 1 like Black Panther?

One approach for recommendation systems is to use matrix factorization to generate the following
two matrices:

 A user matrix, shaped as the number of users X the number of embedding dimensions.

 An item matrix, shaped as the number of embedding dimensions X the number of items.

For example, using matrix factorization on our three users and five items could yield the following user
matrix and item matrix:
User Matrix (3 users x 2 latent dimensions):

1.1   2.3
0.6   2.0
2.5   0.5

Item Matrix (2 latent dimensions x 5 movies):

0.9   0.2   1.4    2.0   1.2
1.7   1.2   1.2   -0.1   2.1

The dot product of the user matrix and item matrix yields a recommendation matrix that contains not
only the original user ratings but also predictions for the movies that each user hasn't seen. For
example, consider User 1's rating of Casablanca, which was 5.0. The dot product corresponding to
that cell in the recommendation matrix should hopefully be around 5.0, and it is:

(1.1 * 0.9) + (2.3 * 1.7) = 4.9

More importantly, will User 1 like Black Panther? Taking the dot product corresponding to the first
row and the third column yields a predicted rating of 4.3:

(1.1 * 1.4) + (2.3 * 1.2) = 4.3

Matrix factorization typically yields a user matrix and item matrix that, together, are significantly more
compact than the target matrix.
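A minimal NumPy check of the example above, using the reconstructed user and item matrices; the dot product reproduces the 4.9 and 4.3 values quoted in the text:

import numpy as np

# User matrix (3 users x 2 latent dimensions) and item matrix
# (2 latent dimensions x 5 movies) from the example above.
user_matrix = np.array([[1.1, 2.3],
                        [0.6, 2.0],
                        [2.5, 0.5]])
item_matrix = np.array([[0.9, 0.2, 1.4, 2.0, 1.2],
                        [1.7, 1.2, 1.2, -0.1, 2.1]])

# The dot product reconstructs known ratings and predicts missing ones.
recommendations = user_matrix @ item_matrix
print(round(recommendations[0, 0], 1))   # User 1 / Casablanca    -> 4.9
print(round(recommendations[0, 2], 1))   # User 1 / Black Panther -> 4.3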

What is Weighted Alternating Least Squares (WALS)?


An algorithm for minimizing the objective function during matrix factorization in recommendation
systems, which allows a downweighting of the missing examples. WALS minimizes the weighted
squared error between the original matrix and the reconstruction by alternating between fixing the
row factorization and column factorization. Each of these optimizations can be solved by least squares
convex optimization. For details, see the Recommendation Systems course

NLP text processing


1 What is Natural language understanding?
Determining a user's intentions based on what the user typed or said. For example, a search engine
uses natural language understanding to determine what the user is searching for based on what the
user typed or said.

2 Continuous Bag Of Words


2 What is bag of words?
A representation of the words in a phrase or passage, irrespective of order (unordered sets of words).
For example, bag of words represents the following three phrases identically:

 the dog jumps

 jumps the dog

 dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in
the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero
values at the three indices corresponding to the words the, dog, and jumps. The non-zero value can be
any of the following:

 A 1 to indicate the presence of a word.


 A count of the number of times a word appears in the bag. For example, if the phrase were
the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as
2, while the other words would be represented as 1.

 Some other value, such as the logarithm of the count of the number of times a word appears
in the bag.
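A minimal sketch of the bag-of-words representation using scikit-learn's CountVectorizer (get_feature_names_out assumes scikit-learn >= 1.0); the three phrases from the example above all map to the same count vector:

from sklearn.feature_extraction.text import CountVectorizer

phrases = ["the dog jumps", "jumps the dog", "dog jumps the"]

# Each word gets an index in the vocabulary; each phrase becomes a vector of
# word counts, so word order is discarded.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(phrases)
print(vectorizer.get_feature_names_out())   # ['dog' 'jumps' 'the']
print(bow.toarray())                        # three identical rows of counts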

2 What is N-gram and bigram?


An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant,
madly truly is a different 2-gram than truly madly.

N    Name(s) for this kind of N-gram    Examples

2    bigram or 2-gram                   to go, go to, eat lunch, eat dinner

3    trigram or 3-gram                  ate too much, three blind mice, the bell tolls

4    4-gram                             walk in the park, dust in the wind, the boy ate lentils

Many natural language understanding models rely on N-grams to predict the next word that the user
will type or say. For example, suppose a user typed three blind. An NLU model based on trigrams
would likely predict that the user will next type mice.

Contrast N-grams with bag of words, which are unordered sets of words.

Bigram is an N-gram in which N=2.

3 What are embeddings?


A categorical feature represented as a continuous-valued feature. Typically, an embedding is a
translation of a high-dimensional vector into a low-dimensional space. For example, you can represent
the words in an English sentence in either of the following two ways:

 As a million-element (high-dimensional) sparse vector in which all elements are integers. Each
cell in the vector represents a separate English word; the value in a cell represents the
number of times that word appears in a sentence. Since a single English sentence is unlikely to
contain more than 50 words, nearly every cell in the vector will contain a 0. The few cells that
aren't 0 will contain a low integer (usually 1) representing the number of times that word
appeared in the sentence.

 As a several-hundred-element (low-dimensional) dense vector in which each element holds a floating-point value between 0 and 1. This is an embedding.

In TensorFlow, embeddings are trained by backpropagating loss just like any other parameter in a
neural network.

3 What is embedding space?


The d-dimensional vector space that features from a higher-dimensional vector space are mapped to.
Ideally, the embedding space contains a structure that yields meaningful mathematical results; for
example, in an ideal embedding space, addition and subtraction of embeddings can solve word
analogy tasks.

The dot product of two embeddings is a measure of their similarity.
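A minimal sketch of dot-product similarity between embeddings; the 4-dimensional vectors below are made up purely for illustration:

import numpy as np

# Illustrative embeddings: semantically close words should have close vectors.
king = np.array([0.8, 0.1, 0.7, 0.2])
queen = np.array([0.7, 0.2, 0.8, 0.1])
apple = np.array([0.1, 0.9, 0.0, 0.6])

print(np.dot(king, queen))   # higher dot product: more similar
print(np.dot(king, apple))   # lower dot product: less similar
# Dividing by the vector norms would give cosine similarity instead.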


4 What is sentiment analysis?
Using statistical or machine learning algorithms to determine a group's overall attitude—positive or
negative—toward a service, product, organization, or topic. For example, using natural language
understanding, an algorithm could perform sentiment analysis on the textual feedback from a
university course to determine the degree to which students generally liked or disliked the course.

What is crash blossom?


A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in
natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash
blossom because an NLU model could interpret the headline literally or figuratively.

Statistics
Principles of statistical hypothesis testing:

1. formulate the hypotheses,
2. choose the significance level α, the acceptable probability of a Type I error,
3. then compute the value of the test statistic from the sample data,
4. and compare it with the critical values read from tables of the appropriate theoretical distribution.

The form of the test statistic used depends on the following factors:

 whether the hypothesis concerns one, two, or many proportions,

 the size of the sample(s) involved in the problem,
 in the case of two or more samples, whether the samples are independent or dependent (paired).

How would you use either the extreme value theory, Monte Carlo simulations or mathematical
statistics (or anything else) to correctly estimate the chance of a very rare event?
Answer by Matthew Mayo.

Extreme value theory (EVT) focuses on rare events and extremes, as opposed to classical approaches to statistics which concentrate on average behaviors. EVT states that there are 3 types of distributions needed to model the extreme data points of a collection of random observations from some distribution: the Gumbel, Frechet, and Weibull distributions, also known as the Extreme Value Distributions (EVD) 1, 2, and 3, respectively.

The EVT states that, if you were to generate N data sets from a given distribution, and then create a new dataset containing only the maximum values of these N data sets, this new dataset would only be accurately described by one of the EVD distributions: Gumbel, Frechet, or Weibull. The Generalized Extreme Value Distribution (GEV) is, then, a model combining the three EVD families.

Knowing the models to use for modeling our data, we can then use them to fit our data and evaluate the fit. Once the best fitting model is found, analysis can be performed, including calculating the probabilities of rare events.
Explain the use of Combinatorics in data science.
What is the Law of Large Numbers?
It is a theorem that describes the result of performing the same experiment a large number of
times. This theorem forms the basis of frequency-style thinking. It says that the sample mean,
the sample variance and the sample standard deviation converge to what they are trying to
estimate.
What is the Pearson correlation coefficient? How do you calculate it given two lists, regression lines, etc.?
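The source gives no answer here. As a sketch: Pearson's r is the covariance of the two variables divided by the product of their standard deviations, r = cov(X, Y) / (std(X) * std(Y)). A minimal computation with NumPy and SciPy on illustrative lists:

import numpy as np
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

# Manual: r = cov(x, y) / (std(x) * std(y)), using sample (ddof=1) estimates.
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# SciPy also returns a two-sided p-value for the null hypothesis r = 0.
r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4))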
What does P-value signify about the statistical data?
P-value is used to determine the significance of results after a hypothesis test in statistics. P-value
helps the readers to draw conclusions and is always between 0 and 1.

• P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.

• P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

• P-value = 0.05 is the marginal value, indicating it is possible to go either way.

Are expected value and mean value different?


They are not different, but the terms are used in different contexts. Mean is generally referred to when talking about a probability distribution or sample population, whereas expected value is generally referred to in a random variable context.

For Sampling Data

Mean value is the only value that comes from the sampling data.

Expected Value is the mean of all the means i.e. the value that is built from multiple samples. Expected
value is the population mean.

For Distributions

Mean value and expected value are the same irrespective of the distribution, under the condition that the distribution is in the same population.

Explain what resampling methods are and why they are useful. Also explain their limitations.
How are confidence intervals constructed and how will you interpret them?
Parametric Tests
Mean Tests
Variance Tests
Population proportion
Non-parametric tests
Properties tests

 chi-square goodness-of-fit test


 Kolmogorov λ goodness-of-fit test
 Shapiro-Wilk normality test
 runs test

Comparison tests

 Kolmogorov-Smirnov test
 chi-square homogeneity test
 median test
 runs test
 sign test

Distributions
What is the difference between skewed and uniform distribution?
When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution. There are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are said to be skewed left, and distributions with fewer observations on the right (towards higher values) are said to be skewed right.

What do you understand by the term Normal Distribution?


Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, there are chances that data is distributed around a central value without any bias to the left or right and reaches normal distribution in the form of a bell-shaped curve. The random variables are distributed in the form of a symmetrical bell-shaped curve.



Correlations
Parametric
Non-parametric
Approaches
Bayesian
Frequentist
Likelihood
A/B tests
How will you explain an A/B test to an engineer who does not know statistics?
A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival.
A/B testing aims to determine not only which technique performs better but also to understand
whether the difference is statistically significant. A/B testing usually considers only two techniques
using one measurement, but it can be applied to any finite number of techniques and measures.
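As a hedged sketch of how the "statistically significant" part is often checked for conversion rates, here is a two-proportion z-test via statsmodels; the counts below are illustrative:

from statsmodels.stats.proportion import proportions_ztest

# Illustrative numbers: conversions and visitors for variants A and B.
conversions = [210, 250]
visitors = [4000, 4050]

# Two-proportion z-test: is the difference in conversion rate significant?
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(round(z_stat, 3), round(p_value, 3))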

What is the goal of A/B Testing?


It is a statistical hypothesis test for randomized experiments with two variables, A and B. The goal of A/B testing is to detect any changes to a web page that maximize or increase the outcome of interest. An example could be identifying the click-through rate for a banner ad.

How can you prove that one improvement you've brought to an algorithm is really an
improvement over not doing anything?
Answer by Anmol Rajpurohit.

Often it is observed that in the pursuit of rapid innovation (aka "quick fame"), the principles of
scientific methodology are violated leading to misleading innovations, i.e. appealing insights that are
confirmed without rigorous validation. One such scenario is the case that, given the task of improving an algorithm to yield better results, you might come up with several ideas with potential for improvement.

An obvious human urge is to announce these ideas ASAP and ask for their implementation. When
asked for supporting data, often limited results are shared, which are very likely to be impacted by
selection bias (known or unknown) or a misleading global minima (due to lack of appropriate variety in
test data).

Data scientists do not let their human emotions overrun their logical reasoning. While the exact
approach to prove that one improvement you've brought to an algorithm is really an improvement
over not doing anything would depend on the actual case at hand, there are a few common
guidelines:

Ensure that there is no selection bias in test data used for performance comparison
Ensure that the test data has sufficient variety in order to be symbolic of real-life data (helps avoid
overfitting)

Ensure that "controlled experiment" principles are followed i.e. while comparing performance, the
test environment (hardware, etc.) must be exactly the same while running original algorithm and new
algorithm

Ensure that the results are repeatable with near similar results

Examine whether the results reflect local maxima/minima or global maxima/minima

One common way to achieve the above guidelines is through A/B testing, where both the versions of
algorithm are kept running on similar environment for a considerably long time and real-life input data
is randomly split between the two. This approach is particularly common in Web Analytics.
In an A/B test, how can we ensure that assignment to the various buckets is truly random?
Matthew Mayo answers:

First, let’s consider how we can best ensure comparability between buckets prior to bucket
assignment, without knowledge of any distribution of attributes in the population.

The answer here is simple: random selection and bucket assignment. Random selection and
assignment to buckets without regard to any attribute of the population is a statistically sound
approach, given a large enough population to draw from.

For example, let’s say you are testing a change to a website feature and are interested in response
from only a particular region, the US. By first splitting into 2 groups (control and treatment) without
regard to user region (and given a large enough population size), assignment of US visitors should be
split between these groups. From these 2 buckets, visitor attributes can then be inspected for the
purposes of testing, such as:

if (region == "US" && bucket == "treatment"):

# do something treatment-related here

else:

if (region == "US" && bucket == "control"):

# do something control-related here

else:

# catch-all for non-US (and not relevant to testing scenario)


Bear in mind that, even after performing a round of random bucket assignment, statistical testing can
be utilized to inspect/verify random distribution of bucket member attributes (e.g. ensure that
significantly more US visitors did not get assigned to bucket A). If not, a new random assignment can
be attempted (with a similar inspection/verification process), or -- if it is determined that the
population does not conform to a cooperative distribution -- an approach such as the following can be
pursued.
If we happen to know of some uneven population attribute distribution prior to bucket assignment,
stratified random sampling may be helpful in ensuring more evenly distributed sampling. Such a
strategy can help eliminate selection bias, which is the archenemy of A/B testing.

References:

Detecting and avoiding bucket imbalance in A/B tests

What are the methods to ensure that the population split for A/B test is random?

A/B Testing

How would you conduct an A/B test on an opt-in feature?


Matthew Mayo answers:

This seems to be a somewhat ambiguous question with a variety of interpretable meanings (an idea
supported by this post). Let's first look at the different possible interpretations of this question and
go from there.

How would you conduct an A/B test on an opt-in version of a feature to a non-opt-in-version?
This would not allow for a fair or meaningful A/B test, since one bucket would be filled from the entire
site's users, while the other would be filled from the group which has already opted in. Such a test
would be akin to comparing some apples to all oranges, and thus ill-advised.

How would you conduct an A/B test on the adoption (or use) of an opt-in feature (i.e. test the actual
opting-in)?
This would be testing the actual opting in -- such as the testing between 2 versions of a "click here to
sign up" feature -- and as such is just a regular A/B test (see the above question for some insight).

How would you conduct an A/B test on different versions of an opt-in feature (i.e. for those having
already opted in)?
This could, again, be construed as one of a few meanings, but I intend to approach it as a complex
scenario of the chaining together of events, expanded upon below.

Let's flesh out #3 from the list above. Let's first look at a simple chaining of events which can be
tested, and then generalize. Suppose you are performing an A/B test on an email campaign. Let's say
the variable will be subject line, and that content remains constant between the 2. Suppose the
subject lines are as follows:

We have something for you


The greatest online data science courses are free this weekend! Try now, no commitment!

Contrived, to be sure. All else aside, intuition would say that subject #2 would get more action.

But beyond that, there is psychology at play. Even though the content which follows after clicking
either of the subjects is the same, the individual clicking the second subject could reasonably be
assumed to have a higher level of excitement and anticipation of what is to follow. This difference in
expectations and level of commitment between the groups may lead to a higher percentage of click-
throughs for those in the bucket with subject line #2 -- again, even with the same content.

Pivoting slightly... How would you conduct an A/B test on different versions of an opt-in feature (i.e. for
those having already opted in)?

If my interpretation of evaluating a series of chained events is correct, such an A/B test could commence with different feeder locations to the same opt-in -- of the same content -- and move to different follow-up landing spots after opt-in, with the intent of measuring what users do on the resulting landing page.

Do different originating locations to the same opt-in procedure result in different follow-up behavior?
Sure, it's still an A/B test, with the same goals, setup, and evaluation; however, the exact user
psychology being measured is different.

What does this have to do with an interview question? Beyond being able to identify the basic ideas of
A/B testing, being able to walk through imprecise questions is an asset to people working in analytics
and data science.

Model selection and Validation


Bias
What is bias (ethics/fairness)?
1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These
biases can affect collection and interpretation of data, the design of a system, and how users interact
with a system. Forms of this type of bias include:

 automation bias

 confirmation bias

 experimenter’s bias

 group attribution bias

 implicit bias

 in-group bias

 out-group homogeneity bias

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias
include:

 coverage bias

 non-response bias
 participation bias

 reporting bias

 sampling bias

 selection bias

Not to be confused with the bias term in machine learning models or prediction bias.

What is reporting bias?


The fact that the frequency with which people write about actions, outcomes, or properties is not a
reflection of their real-world frequencies or the degree to which a property is characteristic of a class
of individuals. Reporting bias can influence the composition of data that ML systems learn from.

For example, in books, the word laughed is more prevalent than breathed. An ML model that
estimates the relative frequency of laughing and breathing from a book corpus would probably
determine that laughing is more common than breathing.

What is prediction bias?


A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.

What is confirmation bias (experimenter’s bias)?


The tendency to search for, interpret, favor, and recall information in a way that confirms one's
preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data
in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of
implicit bias.

Experimenter's bias is a form of confirmation bias in which an experimenter continues training models
until a preexisting hypothesis is confirmed.

What is bias (math)?


An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models. For example, bias is the b in the following formula:

y' = b + w1x1 + w2x2 + ... + wnxn

Not to be confused with bias in ethics and fairness or prediction bias.

What is group attribution bias, out-group homogeneity bias and in-group bias?
Group attribution bias

Assuming that what is true for an individual is also true for everyone in that group. The effects of
group attribution bias can be exacerbated if a convenience sampling is used for data collection. In a
non-representative sample, attributions may be made that do not reflect reality.

Out-group homogeneity bias

The tendency to see out-group members as more alike than in-group members when comparing
attitudes, values, personality traits, and other characteristics. In-group refers to people you interact
with regularly; out-group refers to people you do not interact with regularly. If you create a dataset by
asking people to provide attributes about out-groups, those attributes may be less nuanced and more
stereotyped than attributes that participants list for people in their in-group.

For example, Lilliputians might describe the houses of other Lilliputians in great detail, citing small
differences in architectural styles, windows, doors, and sizes. However, the same Lilliputians might
simply declare that Brobdingnagians all live in identical houses.

In-group bias

Showing partiality to one's own group or own characteristics. If testers or raters consist of the
machine learning developer's friends, family, or colleagues, then in-group bias may invalidate product
testing or the dataset.

What is Automation bias?


When a human decision maker favors recommendations made by an automated decision-making
system over information made without automation, even when the automated decision-making
system makes errors.

What are bias and variance, and what are their relation to modeling data?
Matthew Mayo answers:

Bias is how far removed a model's predictions are from correctness, while variance is the degree to
which these predictions vary between model iterations.

As an example, consider a simple flawed Presidential election survey: errors in the survey can be explained through the twin lenses of bias and variance. Selecting survey participants from a phonebook is a source of bias; a small sample size is a source of variance.

Minimizing total model error relies on the balancing of bias and variance errors. Ideally, models are
the result of a collection of unbiased data of low variance. Unfortunately, however, the more complex
a model becomes, its tendency is toward less bias but greater variance; therefore an optimal model
would need to consider a balance between these 2 properties.

The statistical evaluation method of cross-validation is useful in both demonstrating the importance of
this balance, as well as actually searching it out. The number of data folds to use -- the value of k in k-
fold cross-validation -- is an important decision; the lower the value, the higher the bias in the error
estimates and the less variance.
Conversely, when k is set equal to the number of instances, the error estimate is then very low in bias but has the possibility of high variance.

The most important takeaways are that bias and variance are two sides of an important trade-off
when building models, and that even the most routine of statistical evaluation methods are directly
reliant upon such a trade-off.
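A minimal scikit-learn sketch of the k-fold evaluation discussed above, comparing two values of k; the model and dataset are an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each k gives a different bias/variance trade-off in the error estimate:
# smaller k tends toward more bias, larger k toward more variance.
for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(k, round(scores.mean(), 3), round(scores.std(), 3))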

On next page, we answer

What’s the trade-off between bias and variance?


What are the types of biases that can occur during sampling?
Selection bias
Under coverage bias
Survivorship bias
Explain survivorship bias.

It is the logical error of focusing on the aspects that support surviving some process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous different ways.
Explain selective bias.
Selection bias, in general, is a problematic situation in which error is introduced due to a non-
random population sample.
What is selection bias?
Errors in conclusions drawn from sampled data due to a selection process that generates systematic
differences between samples observed in the data and those not observed. The following forms of
selection bias exist:

 coverage bias: The population represented in the dataset does not match the population that
the ML model is making predictions about.

 sampling bias: Data is not collected randomly from the target group.

 non-response bias (also called participation bias): Users from certain groups opt-out of surveys
at different rates than users from other groups.

For example, suppose you are creating an ML model that predicts people's enjoyment of a movie. To
collect training data, you hand out a survey to everyone in the front row of a theater showing the
movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data
collection may introduce the following forms of selection bias:

 coverage bias: By sampling from a population who chose to see the movie, your model's
predictions may not generalize to people who did not already express that level of interest in
the movie.

 sampling bias: Rather than randomly sampling from the intended population (all the people at
the movie), you sampled only the people in the front row. It is possible that the people sitting
in the front row were more interested in the movie than those in other rows.

 non-response bias: In general, people with strong opinions tend to respond to optional
surveys more frequently than people with mild opinions. Since the movie survey is optional,
the responses are more likely to form a bimodal distribution than a normal (bell-shaped)
distribution.

What is the importance of having a selection bias?


Selection bias occurs when there is no appropriate randomization achieved while selecting individuals, groups, or data to be analysed. Selection bias implies that the obtained sample does not exactly represent the population that was actually intended to be analysed. Typical forms of selection bias include sampling bias, time interval bias, data bias, and attribute bias.

What is selection bias, why is it important and how can you avoid it?
Answer by Matthew Mayo.

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random
population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split
of 4 classes which actually occurred in relatively equal numbers in the population, then a given model
may make the false assumption that probability could be the determining predictive factor. Avoiding
non-random samples is the best way to deal with bias; however, when this is impractical, techniques
such as resampling, boosting, and weighting are strategies which can be introduced to help deal with
the situation.

How do data management procedures like missing data handling make selection bias worse?
Missing value treatment is one of the primary tasks which a data scientist is supposed to do before starting data analysis. There are multiple methods for missing value treatment. If not done properly, it could potentially result in selection bias. Let's see a few missing value treatment examples and their impact on selection bias:

Complete Case Treatment: Complete case treatment is when you remove the entire row in the data even if one value is missing. You could introduce a selection bias if your values are not missing at random and they have some pattern. Assume you are conducting a survey and a few people didn't specify their gender. Would you remove all those people? Wouldn't that tell a different story?

Available case analysis: Let's say you are trying to calculate the correlation matrix for the data, so you remove the missing values from the variables which are needed for that particular correlation coefficient. In this case your values will not be fully correct, as each coefficient is computed from a different subset of the data.

Mean Substitution: In this method missing values are replaced with the mean of the other available values. This might bias your distribution, e.g., standard deviation, correlation, and regression are mostly dependent on the mean value of the variables.

Hence, various data management procedures might introduce selection bias into your data if not chosen correctly.

What is the difference between bias and underfitting? And, analogously, what is the difference
between variance and overfitting? Do the terms of each pair mean the same thing? If not, what
is the difference?
They do not exactly mean the same thing, but they are correlated in the following manner:

Overfitting occurs when the model captures the noise and the outliers in the data along with the underlying pattern. These models usually have high variance and low bias. They are usually complex models, like decision trees, SVMs, or neural networks, which are prone to overfitting.

Underfitting occurs when the model is unable to capture the underlying pattern of the data. These models usually have low variance and high bias. They are usually simple models, like linear and logistic regression, which are unable to capture the complex patterns in the data.

Error
What is convergence?
Informally, often refers to a state reached during training in which training loss and validation loss
change very little or not at all with each iteration after a certain number of iterations. In other words,
a model reaches convergence when additional training on the current data will not improve the
model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before
finally descending, temporarily producing a false sense of convergence.

See also early stopping.

See also Boyd and Vandenberghe, Convex Optimization.


What is the difference between squared error and absolute error?
What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (logistic regression, SVM, random forest, etc.). Sensitivity is nothing but "predicted TRUE events / total actual events". True events here are the events which were true and the model also predicted them as true.

Calculation of sensitivity is pretty straightforward:

Sensitivity = True Positives / Total Actual Positives = TP / (TP + FN)

where True Positives are positive events which are correctly classified as positives.

What is accuracy?
The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as follows:

Accuracy = Correct Predictions / Total Number of Examples

In binary classification, accuracy has the following definition:

Accuracy = (True Positives + True Negatives) / Total Number of Examples

See true positive and true negative.
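A minimal scikit-learn sketch computing both accuracy and sensitivity (recall) from illustrative labels, checked against the confusion-matrix formulas above:

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Illustrative binary labels: 1 = positive event, 0 = negative event.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   :", accuracy_score(y_true, y_pred), (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", recall_score(y_true, y_pred), tp / (tp + fn))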

What is statistical power?


Answer by Gregory Piatetsky:

Wikipedia defines the statistical power (or sensitivity) of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true.

To put in another way, Statistical power is the likelihood that a study will detect an effect when the
effect is present. The higher the statistical power, the less likely you are to make a Type II error
(concluding there is no effect when, in fact, there is).

Here are some tools to calculate statistical power.
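
For example, a hedged sketch using statsmodels to ask how many observations per group a two-sample t-test needs (the effect size, alpha and power values are arbitrary choices for illustration):

from statsmodels.stats.power import TTestIndPower

# Solve for the sample size per group given effect size d = 0.5,
# significance level 0.05 and desired power 0.8
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 observations per group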

Can you cite some examples where a false negative is more important than a false positive?
Assume there is an airport 'A' that has received high security threats, and based on certain
characteristics it identifies whether a particular passenger is a threat or not. Due to a shortage of staff,
they decide to scan only those passengers predicted as high-risk by their model.

What will happen if a true threat passenger is flagged as a non-threat by the airport's model?

Another example is the judicial system. What if a jury or judge lets a guilty criminal go free?

What if you declined to marry a very good person based on your predictive model, only to meet them a few
years later and realize that you had a false negative?
Can you cite some examples where a false positive is more important than a false negative?
Before we start, let us understand what are false positives and what are false negatives.

False positives are the cases where you wrongly classify a non-event as an event, a.k.a. a Type I error.

False negatives are the cases where you wrongly classify an event as a non-event, a.k.a. a Type II error.

In the medical field, assume you have to give chemotherapy to patients. Your lab tests patients for certain
vital markers, and based on those results the hospital decides whether to give radiation therapy to a patient.

Assume a patient comes to that hospital and is tested positive for cancer based on the lab prediction, even
though he does not actually have cancer. What will happen to him? (Assuming sensitivity is 1.)

Another example comes from marketing. Let's say an e-commerce company decides to give a $1000 gift
voucher to customers who they expect to purchase at least $5000 worth of items. They send the voucher
directly to 100 customers without any minimum purchase condition because they expect to make at least a
20% profit on items sold above $5000.

Now what happens if they have sent it to false positive cases?


Can you cite some examples where both false positive and false negatives are equally
important?
In the banking industry, giving loans is the primary source of revenue, but if your repayment rate is not
good you will not make any profit; rather, you will risk huge losses.

Banks don't want to lose good customers, and at the same time they don't want to acquire bad customers.
In this scenario both false positives and false negatives become very important to measure.

These days we hear of many cases of players using steroids in sports competitions. Every player has to
go through a steroid test before the game starts. A false positive can ruin the career of a great
sportsman, while a false negative can make the game unfair.
Explain what a false positive and a false negative are. Why is it important to differentiate
these from each other?
Answer by Gregory Piatetsky:

In binary classification (or medical testing), False positive is when an algorithm (or test) indicates
presence of a condition, when in reality it is absent. A false negative is when an algorithm (or test)
indicates absence of a condition, when in reality it is present.

In statistical hypothesis testing false positive is also called type I error and false negative - type II error.

It is obviously very important to distinguish and treat false positives and false negatives differently
because the costs of such errors can be hugely different.

For example, if a test for a serious disease is a false positive (the test says disease, but the person is
healthy), then an extra test can be made to determine the correct diagnosis. However, if a test is a false
negative (the test says healthy, but the person has the disease), then treatment may be withheld and the
person may die as a result.

Explain what precision and recall are. How do they relate to the ROC curve?
Answer by Gregory Piatetsky:

Here is the answer from KDnuggets FAQ: Precision and Recall:

Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases among
10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance
of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get
the actual results you sum up how many times you were right or wrong. There are four ways of being
right or wrong:

TN / True Negative: case was negative and predicted negative


TP / True Positive: case was positive and predicted positive

FN / False Negative: case was positive but predicted negative

FP / False Positive: case was negative but predicted positive

Makes sense so far? Now you count how many of the 10,000 cases fall in each bucket, say:

                 Predicted Negative    Predicted Positive

Negative Cases   TN: 9,760             FP: 140

Positive Cases   FN: 40                TP: 60

Now, your boss asks you three questions:

What percent of your predictions were correct?


You answer: the "accuracy" was (9,760+60) out of 10,000 = 98.2%

What percent of the positive cases did you catch?


You answer: the "recall" was 60 out of 100 = 60%

What percent of positive predictions were correct?


You answer: the "precision" was 60 out of 200 = 30%

See also a very good explanation of Precision and recall in Wikipedia.


Fig 4: Precision and Recall.

The ROC curve represents the relation between sensitivity (recall) and specificity (not precision) and is
commonly used to measure the performance of binary classifiers. However, when dealing with highly
skewed datasets, Precision-Recall (PR) curves give a more representative picture of performance. See
also this Quora answer: What is the difference between a ROC curve and a precision-recall curve?

What is AUC (Area under the ROC Curve)?


The Receiver Operating Characteristic (ROC) curve is a curve of true positive rate vs. false positive rate at
different classification thresholds.

AUC is an evaluation metric that considers all possible classification thresholds.

The false positive rate is the x-axis of an ROC curve and is defined as follows:

False Positive Rate = False Positives / (False Positives + True Negatives)

The area under the ROC curve is the probability that a classifier will be more confident that a
randomly chosen positive example is actually positive than that a randomly chosen negative example
is positive.
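
A short scikit-learn sketch, using made-up labels and scores:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # actual labels (hypothetical)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # predicted probabilities (hypothetical)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))              # area under that curve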
Is it better to have too many false negatives or too many false positives?
False negative is an example in which the model mistakenly predicted the negative class. For example,
the model inferred that a particular email message was not spam (the negative class), but that email
message actually was spam.

False positive is an example in which the model mistakenly predicted the positive class. For example,
the model inferred that a particular email message was spam (the positive class), but that email
message was actually not spam.

Is it better to have too many false positives, or too many false negatives? Explain.
Answer by Devendra Desale.

It depends on the question as well as on the domain for which we are trying to solve the question.

In medical testing, false negatives may provide a falsely reassuring message to patients and physicians
that disease is absent when it is actually present. This sometimes leads to inappropriate or
inadequate treatment of both the patient and their disease. So in this case it is preferable to tolerate
more false positives.

For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly
classify a legitimate email message as spam and, as a result, interfere with its delivery. While most
anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating
significant false positives is a much more demanding task. So here we prefer too many false negatives
over too many false positives.

What do you understand by Recall and Precision?


Recall measures "Of all the actual true samples how many did we classify as true?"

Precision measures "Of all the samples we classified as true how many are actually true?"

We will explain this with a simple example for better understanding -

Imagine that your wife has given you a surprise every year on your anniversary for the last 12 years. One
day, all of a sudden, your wife asks, "Darling, do you remember all the anniversary surprises from me?"

This simple question puts your life in danger. To save yourself, you need to recall all 12 anniversary
surprises from memory. Thus, recall (R) is the ratio of the number of events you correctly recall to the
number of all correct events. If you recall all 12 surprises correctly then your recall ratio is 1 (100%); if
you recall only 10 of the 12 then your recall ratio is 0.83 (83.3%).

However, you might also name some surprises that never happened. For instance, suppose you give 15
answers: 10 of the surprises you name are correct and 5 are wrong. Your recall is then 10/12 = 83.3%, but
your precision is 10/15 = 66.67%.

Precision is the ratio of the number of events you correctly recall to the total number of events you recall
(correct and wrong recalls combined).

Precision is a metric for classification models. Precision identifies the frequency with which a model was
correct when predicting the positive class. That is:

Precision = True Positives / (True Positives + False Positives)

Recall is a metric for classification models that answers the following question: out of all the possible
positive labels, how many did the model correctly identify? That is:

Recall = True Positives / (True Positives + False Negatives)

What error metric would you use to evaluate how good a binary classifier is?
What method do you use to determine whether the statistics published in an article (or
appeared in a newspaper or other media) are either wrong or presented to support the
author's point of view, rather than correct, comprehensive factual information on a specific
subject?
A simple rule, suggested by Zack Lipton, is
if some statistics are published in a newspaper, then they are wrong.

Here is a more serious answer by Anmol Rajpurohit.

Every media organization has a target audience. This choice impacts a lot of decisions such as which
article to publish, how to phrase an article, what part of an article to highlight, how to tell a given
story, etc.

In determining the validity of statistics published in any article, one of the first steps will be to examine
the publishing agency and its target audience. Even if it is the same news story involving statistics, you
will notice that it will be published very differently across Fox News vs. WSJ vs. ACM/IEEE journals. So,
data scientists are smart about where to get the news from (and how much to rely on the stories
based on sources!).
Fig 14a: Example of a very misleading bar chart that appeared on Fox News
Fig 14b: how the same data should be presented objectively, from 5 Ways to Avoid Being Fooled By
Statistics

Often the authors try to hide the inadequacy of their research through canny storytelling, omitting
important details and jumping to enticingly presented false insights. Thus, a rule of thumb for identifying
articles with misleading statistical inferences is to examine whether the article includes details on the
research methodology followed and any perceived limitations of the methodological choices made. Look
for words such as "sample size", "margin of error", etc. While there are no perfect answers as to what
sample size or margin of error is appropriate, these attributes must certainly be kept in mind while
reading the end results.

Another common case of erratic reporting is when journalists with poor data literacy pick up an insight
from one or two paragraphs of a published research paper, while ignoring the rest of the paper, just to
make their point. Here is how you can avoid being fooled by such articles. First, a reliable article must not
have any unsubstantiated claims: all assertions must be backed with references to past research, or
otherwise be clearly marked as an "opinion" rather than an assertion. Second, just because an article refers
to renowned research papers does not mean that it uses the insights from those papers appropriately; this
can be validated by reading the referenced papers in their entirety and independently judging their
relevance to the article at hand. Lastly, though the end results might naturally seem like the most
interesting part, it is often a fatal mistake to skip the details about research methodology (and thus miss
errors, bias, etc.).

Ideally, I wish that all such articles publish their underlying research data as well as the approach. That
way, the articles can achieve genuine trust as everyone is free to analyze the data and apply the
research approach to see the results for themselves.

A test has a true positive rate of 100% and false positive rate of 5%. There is a population with
a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is
the probability of having that condition?
Let's suppose you are being tested for a disease. If you have the illness, the test will always say you
have the illness (true positive rate of 100%). However, if you don't have the illness, 5% of the time the
test will still say you have it, and 95% of the time it will correctly say you don't; that is, there is a 5%
false positive rate.
Out of 1000 people, the 1 person who has the disease will get a true positive result.
Out of the remaining 999 people, about 5% - roughly 50 people - will get a false positive result.
This means that out of 1000 people, about 51 will test positive for the disease even though only one
person actually has it. So, given a positive test, the probability that you actually have the disease is
only about 1/51, roughly 2%.
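
The same calculation written out with Bayes' theorem in Python:

# P(disease | positive test) via Bayes' theorem
p_disease = 1 / 1000        # prior prevalence
p_pos_given_disease = 1.0   # true positive rate (sensitivity)
p_pos_given_healthy = 0.05  # false positive rate

p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))
print(p_pos_given_disease * p_disease / p_positive)  # ~0.0196, i.e. roughly 2%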

Under- and Overfitting


Explain overfitting and underfitting and how to combat them?
Overfitting - creating a model that matches the training data so closely that the model fails to make
correct predictions on new data.

Underfitting - producing a model with poor predictive ability because the model hasn't captured the
complexity of the training data. Many problems can cause underfitting, including:

 Training on the wrong set of features.

 Training for too few epochs or at too low a learning rate.

 Training with too high a regularization rate.

 Providing too few hidden layers in a deep neural network.


How can you overcome Overfitting?
Why might it be preferable to include fewer predictors over many?
Anmol Rajpurohit answers:

Here are a few reasons why it might be a better idea to have fewer predictor variables rather than
having many of them:

Redundancy/Irrelevance:

If you are dealing with many predictor variables, then the chances are high that there are hidden
relationships between some of them, leading to redundancy. Unless you identify and handle this
redundancy (by selecting only the non-redundant predictor variables) in the early phase of data
analysis, it can be a huge drag on your succeeding steps.

It is also likely that not all predictor variables have a considerable impact on the dependent
variable(s). You should make sure that the set of predictor variables you select to work with does not
include any irrelevant ones – even if you know that the data model will take care of them by giving them
lower significance.

Note: Redundancy and irrelevance are two different notions – a relevant feature can be redundant due
to the presence of other relevant feature(s).

Overfitting:

Even when you have a large number of predictor variables with no relationships between any of them,
it is still preferable to work with fewer predictors. Data models with a large number of predictors (also
referred to as complex models) often suffer from the problem of overfitting, in which case the model
performs great on training data but poorly on test data.

Productivity:

Let's say you have a project where there are a large number of predictors and all of them are relevant
(i.e. they have a measurable impact on the dependent variable). You would obviously want to work with
all of them in order to have a data model with a very high success rate. While this approach may sound
very enticing, practical considerations (such as the amount of data available, storage and compute
resources, time taken for completion, etc.) make it nearly impossible.

Thus, even when you have a large number of relevant predictor variables, it is a good idea to work
with fewer predictors (shortlisted through feature selection or derived through feature extraction).
This is essentially similar to the Pareto principle, which states that for many events roughly 80% of the
effects come from 20% of the causes.

Focusing on those 20% most significant predictor variables will be of great help in building data
models with a considerable success rate in a reasonable time, without needing impractical amounts of
data or other resources.
Training error & test error vs model complexity (Source: Posted on Quora by Sergul Aydore)

Understandability:

Models with fewer predictors are far easier to understand and explain. Since the data science steps will
be performed by humans and the results presented (and hopefully used) by humans, it is important to
consider the comprehension limits of the human brain. This is basically a trade-off – you give up some
potential benefit to your data model's success rate, while simultaneously making the model easier to
understand and optimize.

This factor is particularly important if at the end of your project you need to present your results to
someone who is interested not just in a high success rate, but also in understanding what is happening
"under the hood".
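
A brief scikit-learn sketch of shortlisting predictors via univariate feature selection (the dataset and the choice of k are illustrative only):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)  # 30 features

# Keep the 10 features with the strongest univariate relationship to the label
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, X_reduced.shape)  # (569, 30) -> (569, 10)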

What is overfitting and how to avoid it?


Gregory Piatetsky answers:

(Note: this is a revised version of the answer given in 21 Must-Know Data Science Interview Questions
and Answers, part 2)

Overfitting is when you build a predictive model that fits the data "too closely", so that it captures the
random noise in the data rather than true patterns. As a result, the model predictions will be wrong
when applied to new data.

We frequently hear about studies that report unusual results (especially if you listen to Wait Wait
Don't Tell Me) , or see findings like "an orange used car is least likely to be a lemon", or learn that
studies overturn previous established findings (eggs are no longer bad for you).

Many such studies produce questionable results that cannot be repeated.

This is a big problem, especially in social sciences or medicine, when researchers frequently commit
the cardinal sin of Data Science - Overfitting the data.

The researchers test too many hypotheses without proper statistical control, until they happen to find
something interesting. Then they report it. Not surprisingly, next time the effect (which was partly
due to chance) will be much smaller or absent.
These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark
paper Why Most Published Research Findings Are False (PLoS Medicine, 2005). Ioannidis found that
very often either the results were exaggerated or the findings could not be replicated. In his paper, he
presented statistical evidence that indeed most claimed research findings are false!

Ioannidis noted that in order for a research finding to be reliable, it should have:

Large sample size and with large effects

Greater number of and lesser selection of tested relationship

Greater flexibility in designs, definitions, outcomes, and analytical modes

Minimal bias due to financial and other factors (including popularity of that scientific field)

Unfortunately, too often these rules were violated, producing spurious results, such as S&P 500 index
strongly correlated to production of butter in Bangladesh, or US spending on science, space and
technology correlated with suicides by hanging, strangulation, and suffocation (from
http://tylervigen.com/spurious-correlations)

(Source: Tylervigen.com)

See more strange and spurious findings at Spurious correlations by Tyler Vigen or discover them by
yourself using tools such as Google correlate.

Several methods can be used to avoid "overfitting" the data:

Try to find the simplest possible hypothesis

Regularization (adding a penalty for complexity)

Randomization Testing (randomize the class variable, try your method on this data - if it finds the same
strong results, something is wrong)

Nested cross-validation (do feature selection on one level, then run the entire method in cross-validation
on an outer level - see the sketch after this list)

Adjusting the False Discovery Rate

Using the reusable holdout method - a breakthrough approach proposed in 2015
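
Referring to the nested cross-validation item above, here is a hedged sketch with scikit-learn (the estimator and parameter grid are arbitrary choices for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate
inner = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3,
)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())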


Good data science is on the leading edge of scientific understanding of the world, and it is data
scientists' responsibility to avoid overfitting data and to educate the public and the media on the dangers
of bad data analysis.

See also:

4 Reasons Your Machine Learning Model is Wrong (and How to Fix It)

When Good Advice Goes Bad

The Cardinal Sin of Data Mining and Data Science: Overfitting

Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis

Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis

11 Clever Methods of Overfitting and how to avoid them

Explain what is overfitting and how would you control for it


Answer by Gregory Piatetsky.

Overfitting is finding spurious results that are due to chance and cannot be reproduced by subsequent
studies.

We frequently see newspaper reports about studies that overturn previous findings, like eggs are
no longer bad for your health, or saturated fat is not linked to heart disease. The problem, in our
opinion, is that many researchers, especially in social sciences or medicine, too frequently commit the
cardinal sin of Data Mining - overfitting the data.

The researchers test too many hypotheses without proper statistical control, until they happen to find
something interesting and report it. Not surprisingly, next time the effect, which was (at least partly)
due to chance, will be much smaller or absent.

These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark
paper Why Most Published Research Findings Are False (PLoS Medicine, 2005). Ioannidis found that
very often either the results were exaggerated or the findings could not be replicated. In his paper, he
presented statistical evidence that indeed most claimed research findings are false.

Ioannidis noted that in order for a research finding to be reliable, it should have:

Large sample size and with large effects

Greater number of and lesser selection of tested relationship

Greater flexibility in designs, definitions, outcomes, and analytical modes

Minimal bias due to financial and other factors (including popularity of that scientific field)
Unfortunately, too often these rules were violated, producing irreproducible results. For example, the S&P
500 index was found to be strongly related to the production of butter in Bangladesh (from 1989 to
1993) (here is the PDF).

See more interesting (and totally spurious) findings which you can discover yourself using tools such as
Google correlate or Spurious correlations by Tyler Vigen.

Several methods can be used to avoid "overfitting" the data

Try to find the simplest possible hypothesis

Regularization (adding a penalty for complexity)

Randomization Testing (randomize the class variable, try your method on this data - if it finds the same
strong results, something is wrong)

Nested cross-validation (do feature selection on one level, then run entire method in cross-validation
on outer level)

Adjusting the False Discovery Rate

Using the reusable holdout method - a breakthrough approach proposed in 2015


Good data science is on the leading edge of scientific understanding of the world, and it is data
scientists' responsibility to avoid overfitting data and to educate the public and the media on the dangers
of bad data analysis.

See also

The Cardinal Sin of Data Mining and Data Science: Overfitting

Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis

Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis

11 Clever Methods of Overfitting and how to avoid them


Validation
What is validation?
A process used, as part of training, to evaluate the quality of a machine learning model using the
validation set. Because the validation set is disjoint from the training set, validation helps ensure that
the model’s performance generalizes beyond the training set.

Contrast with test set.


Validation set - A subset of the dataset—disjoint from the training set—used in validation.

Contrast with training set and test set.

How do I determine whether my model is effective?


What is perplexity?
One measure of how well a model is accomplishing its task. For example, suppose your task is to read
the first few letters of a word a user is typing on a smartphone keyboard, and to offer a list of possible
completion words. Perplexity, P, for this task is approximately the number of guesses you need to
offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy as follows: perplexity = 2^(cross-entropy), when the cross-entropy is measured in bits.

What is convenience sampling?


Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to
switch to a scientifically gathered dataset.

Discuss various numerical optimization techniques. Show understanding of training, testing, and validation of results.
Can you explain the difference between a Test Set and a Validation Set?
The validation set can be considered part of the training process, as it is used for parameter selection and
to avoid overfitting of the model being built. The test set, on the other hand, is used for testing or
evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as-


Training Set is to fit the parameters i.e. weights.

Test Set is to assess the performance of the model i.e. evaluating the predictive power and
generalization.

Generalization refers to your model's ability to make correct predictions on new, previously unseen
data as opposed to the data used to train the model.

Validation Set is to tune the hyperparameters (model settings that are not learned directly from the training data).
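
A small sketch of carving out the three sets with scikit-learn (the split proportions are an arbitrary illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First hold out a test set, then split the remainder into training and validation sets
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 of 80% = 20% of the data

print(len(X_train), len(X_val), len(X_test))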

Loss curve is a graph of loss as a function of training iterations. For example:

The loss curve can help you determine when your model is converging, overfitting, or underfitting.

Generalization curve is a loss curve showing both the training set and the validation set. A
generalization curve can help you detect possible overfitting. For example, the following generalization
curve suggests overfitting because loss for the validation set ultimately becomes significantly higher
than for the training set.

Explain cross-validation.
It is a model validation technique for evaluating how the results of a statistical analysis will
generalize to an independent data set. It is mainly used in settings where the objective is prediction and
one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to
set aside a portion of the data to test the model during the training phase (i.e. a validation data set) in
order to limit problems like overfitting and to gain insight into how the model will generalize to an
independent data set.
What is cross-validation?
A mechanism for estimating how well a model will generalize to new data by testing the model against
one or more non-overlapping data subsets withheld from the training set.
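
For instance, a minimal k-fold cross-validation sketch with scikit-learn (the model choice is illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: each fold is held out once while the model trains on the rest
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())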

Why is resampling done?


Resampling is done in any of these cases:

 Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points

 Substituting labels on data points when performing significance tests

 Validating models by using random subsets (bootstrapping, cross-validation)
Explain what resampling methods are and why they are useful. Also explain their limitations.
Answer by Gregory Piatetsky:

Classical statistical parametric tests compare observed statistics to theoretical sampling distributions.
Resampling is a data-driven, rather than theory-driven, methodology based upon repeated sampling
within the same sample.

Resampling refers to methods for doing one of the following:

Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of
available data (jackknifing) or drawing randomly with replacement from a set of data points
(bootstrapping)

Exchanging labels on data points when performing significance tests (permutation tests, also called
exact tests, randomization tests, or re-randomization tests)

Validating models by using random subsets (bootstrapping, cross validation)

See more in Wikipedia about bootstrapping, jackknifing.

See also How to Check Hypotheses with Bootstrap and Apache Spark
Here is a good overview of Resampling Statistics.
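
A small bootstrap sketch in NumPy, estimating a confidence interval for the median of made-up data:

import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=10, size=200)  # hypothetical skewed sample

# Bootstrap: resample with replacement many times and recompute the statistic each time
medians = [np.median(rng.choice(data, size=len(data), replace=True))
           for _ in range(5000)]
low, high = np.percentile(medians, [2.5, 97.5])
print(f"95% bootstrap CI for the median: ({low:.2f}, {high:.2f})")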

Metrics
What is a metric?
A metric is a number that you care about. It may or may not be directly optimized in a machine-learning
system. A metric that your system tries to optimize is called an objective.

 Entropy
 Gini index
 Information gain
 Variance reduction
 Classification error

Selection:

 Mallow's Cp
 Akaike Information Criterion
 Bayesian Information Criterion
Can you write the formula to calculate R-squared?
R-squared can be calculated using the formula below:

R^2 = 1 - (Residual Sum of Squares / Total Sum of Squares)
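
A quick check of this formula against scikit-learn's r2_score, with made-up values:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4])

rss = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
print(1 - rss / tss, r2_score(y_true, y_pred))  # both give the same value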


Adjusted R^2
Cross-Validation

Optimization of data and model


What is model, model capacity and model function?
A model is the representation of what an ML system has learned from the training data. Within
TensorFlow, model is an overloaded term, which can have either of the following two related
meanings:

 The TensorFlow graph that expresses the structure of how a prediction will be computed.

 The particular weights and biases of that TensorFlow graph, which are determined by training.

Model capacity is the complexity of problems that a model can learn. The more complex the problems
that a model can learn, the higher the model’s capacity. A model’s capacity typically increases with the
number of model parameters. For a formal definition of classifier capacity, see VC dimension.

Model function is the function within an Estimator that implements ML training, evaluation, and
inference. For example, the training portion of a model function might handle tasks such as defining
the topology of a deep neural network and identifying its optimizer function. When using premade
Estimators, someone has already written the model function for you. When using custom Estimators,
you must write the model function yourself.

For details about writing a model function, see the Creating Custom Estimators chapter in the
TensorFlow Programmers Guide.

What is loss (cost), and how do I measure it? What is objective function?
The mathematical formula or metric that a model aims to optimize. For example, the objective
function for linear regression is usually squared loss. Therefore, when training a linear regression
model, the goal is to minimize squared loss.
In some cases, the goal is to maximize the objective function. For example, if the objective function is
accuracy, the goal is to maximize accuracy.

See also loss.

What is empirical risk minimization (ERM) and Structural risk minimization (SRM)?
Empirical risk minimization (ERM) means choosing the function that minimizes loss on the training set.
Contrast with structural risk minimization.

Structural risk minimization (SRM) is an algorithm that balances two goals:

 The desire to build the most predictive model (for example, lowest loss).

 The desire to keep the model as simple as possible (for example, strong regularization).

For example, a function that minimizes loss+regularization on the training set is a structural risk
minimization algorithm.

For more information, see http://www.svms.org/srm/.

Contrast with empirical risk minimization.

Regularization
What is regularization and regularization rate?
The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of
regularization include:

 L1 regularization

 L2 regularization

 dropout regularization

 early stopping (this is not a formal regularization method, but can effectively limit overfitting)

The regularization rate is a scalar value, represented as lambda (λ), specifying the relative importance of
the regularization function. The following simplified loss equation shows the regularization rate's influence:

minimize(loss function + λ · regularization function)

Raising the regularization rate reduces overfitting but may make the model less accurate.

What is convex function?


A function in which the region above the graph of the function is a convex set. The prototypical convex
function is shaped something like the letter U (for example, x²). By contrast, a function with several dips
(multiple local minima) is not convex, because the region above its graph is not a convex set.

A strictly convex function has exactly one local minimum point, which is also the global minimum point.
The classic U-shaped functions are strictly convex functions. However, some convex functions (for
example, straight lines) are not U-shaped.

A lot of the common loss functions, including the following, are convex functions:

 L2 loss

 Log Loss

 L1 regularization

 L2 regularization

Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly
convex function. Similarly, many variations of stochastic gradient descent have a high probability
(though, not a guarantee) of finding a point close to the minimum of a strictly convex function.

The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.

Deep models are never convex functions. Remarkably, algorithms designed for convex optimization
tend to find reasonably good solutions on deep networks anyway, even though those solutions are not
guaranteed to be a global minimum.

What are L1 loss, L2 (squared) loss, L1 regularization, and L2 regularization?


L1 loss

Loss function based on the absolute value of the difference between the values that a model is
predicting and the actual values of the labels. L1 loss is less sensitive to outliers than L2 loss.

L2 (squared loss)

The loss function used in linear regression. (Also known as L2 Loss.) This function calculates the squares
of the difference between a model's predicted value for a labeled example and the actual value of the
label. Due to squaring, this loss function amplifies the influence of bad predictions. That is, squared
loss reacts more strongly to outliers than L1 loss.

L1 regularization

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the
weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or
barely relevant features to exactly 0, which removes those features from the model. Contrast with L2
regularization.

L2 regularization

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights.
L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to
0 but not quite to 0. (Contrast with L1 regularization.) L2 regularization always improves generalization
in linear models.
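
A brief sketch contrasting the two penalties with scikit-learn's Lasso (L1) and Ridge (L2); the data and alpha values are arbitrary:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: many coefficients driven to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrunk but rarely exactly 0

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))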

What is regularization, why do we use it, and give some examples of common methods?
What is Regularization and what kind of problems does regularization solve?
What are the advantages and disadvantages of using regularization methods like Ridge
Regression?
Why L1 regularizations causes parameter sparsity whereas L2 regularization does not?
Regularization in statistics or in the field of machine learning is used to include some extra
information (a penalty) in order to solve a problem in a better way. L1 and L2 regularization are generally
used to add constraints to optimization problems.

In the example shown above, H0 is a hypothesis. If you observe, the L1 constraint region has corners, so
the solution is very likely to land on a corner where some coefficients are exactly zero, while the smooth
L2 constraint region makes that unlikely. So in L1 variables are penalized all the way to zero more often
than in L2, which results in sparsity.

In other words, the L2 penalty squares the weights, so it shrinks large weights smoothly toward zero but
rarely makes them exactly zero, whereas the L1 penalty grows linearly and can drive weights to exactly zero.

Explain what regularization is and why it is useful.


Answer by Matthew Mayo.
Regularization is the process of adding a tuning parameter to a model to induce smoothness in order
to prevent overfitting. (See also KDnuggets posts on Overfitting.)

This is most often done by adding a constant multiple of a norm of the weight vector to the loss; the norm
is often the L1 (lasso) or L2 (ridge) norm, but it can in fact be any norm. The model should then minimize
the mean of the loss function calculated on this regularized training objective.

Xavier Amatriain presents a good comparison of L1 and L2 regularization here, for those interested.

Fig 1: Lp ball: As the value of p decreases, the size of the corresponding L-p space also decreases.

What is dropout regularization?


A form of regularization useful in training neural networks. Dropout regularization works by removing a
random selection of a fixed number of the units in a network layer for a single gradient step. The more
units dropped out, the stronger the regularization. This is analogous to training the network to
emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple
Way to Prevent Neural Networks from Overfitting.

What is early stopping?


A method for regularization that involves ending model training before training loss finishes
decreasing. In early stopping, you end model training when the loss on a validation dataset starts to
increase, that is, when generalization performance worsens.
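
A minimal Keras-style sketch combining dropout and early stopping; the data, layer sizes and hyperparameters are all arbitrary choices for illustration:

import numpy as np
import tensorflow as tf

# Hypothetical data: 1,000 examples with 20 features and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),            # drop 30% of the units at each training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt once validation loss stops improving for 5 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)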

Ridge Regression
Synonym for L2 regularization. The term ridge regularization is more frequently used in pure statistics
contexts, whereas L2 regularization is used more often in machine learning.

Least Absolute Shrinkage and Selection Operator (LASSO)


Elastic Net
Least Angle Regression (LARS)
Tikhonov Regularization
Feature
What are a feature and an example? How do they differ from a feature set and a feature vector?
Feature is an input variable used in making predictions.

Example is one row of a dataset. An example contains one or more features and possibly a label. See
also labeled example and unlabeled example.

A feature set is the group of features your machine learning model trains on. For example, postal code,
property size, and property condition might comprise a simple feature set for a model that predicts
housing prices.

A feature vector is the list of feature values representing an example passed into a model.

Representation is the process of mapping data to useful features.


What is label and labelled example? What is proxy label?
In supervised learning, the "answer" or "result" portion of an example. Each example in a labeled
dataset consists of one or more features and a label. For instance, in a housing dataset, the features
might include the number of bedrooms, the number of bathrooms, and the age of the house, while
the label might be the house's price. In a spam detection dataset, the features might include the
subject line, the sender, and the email message itself, while the label would probably be either "spam"
or "not spam."

An example that contains features and a label. In supervised training, models learn from labeled
examples

Proxy label

Data used to approximate labels not directly available in a dataset.

For example, suppose you want is it raining? to be a Boolean label for your dataset, but the dataset
doesn't contain rain data. If photographs are available, you might use pictures of people carrying
umbrellas as a proxy label for is it raining? However, proxy labels may distort results. For example, in
some places, it may be more common to carry umbrellas to protect against the sun than the rain.

What is feature engineering and feature extraction?


The process of determining which features might be useful in training a model, and then converting
raw data from log files and other sources into said features. In TensorFlow, feature engineering often
means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform.

Feature engineering is sometimes called feature extraction.

Feature extraction is also an overloaded term, having either of the following definitions:

 Retrieving intermediate feature representations calculated by an unsupervised or pretrained model (for example, hidden layer values in a neural network) for use in another model as input.

 Synonym for feature engineering.

How do I represent my data so that a program can learn from it?


What are continuous and discrete features? What are dense and sparse features?
A continuous feature is a floating-point feature with an infinite range of possible values. Contrast with
discrete feature.

A discrete feature is a feature with a finite set of possible values. For example, a feature whose values may
only be animal, vegetable, or mineral is a discrete (or categorical) feature. Contrast with continuous feature.

A dense feature is a feature in which most values are non-zero, typically a Tensor of floating-point values.
Contrast with sparse feature.

A sparse feature is a feature vector whose values are predominantly zero or empty. For example, a vector
containing a single 1 value and a million 0 values is sparse. As another example, words in a search query
could also be a sparse feature: there are many possible words in a given language, but only a few of them
occur in a given query.
How do you control model complexity?
What is binning and bucketing?
Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically
based on value range. For example, instead of representing temperature as a single continuous
floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature
data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into
one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
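
A pandas sketch of the temperature bucketing described above:

import pandas as pd

temps = pd.Series([3.2, 14.9, 15.1, 27.4, 30.1, 44.0])

# Three buckets: (0, 15], (15, 30] and (30, 50]
buckets = pd.cut(temps, bins=[0.0, 15.0, 30.0, 50.0], labels=["cold", "mild", "hot"])
print(buckets)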

What is synthetic feature?


A feature not present among the input features, but created from one or more of them. Kinds of
synthetic features include:

 Bucketing a continuous feature into range bins.

 Multiplying (or dividing) one feature value by other feature value(s) or by itself.

 Creating a feature cross.

Features created by normalizing or scaling alone are not considered synthetic features.

What is feature cross?


A synthetic feature formed by crossing (taking a Cartesian product of) individual binary features
obtained from categorical data or from continuous features via bucketing. Feature crosses help
represent nonlinear relationships.
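
An illustrative sketch of a simple feature cross built from two categorical columns (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({
    "city": ["SF", "SF", "NY", "NY"],
    "size_bucket": ["small", "large", "small", "large"],
})

# Cross the two categorical features, then one-hot encode the crossed feature
df["city_x_size"] = df["city"] + "_" + df["size_bucket"]
print(pd.get_dummies(df["city_x_size"]))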

How can you determine which features are the most important in your model?
Thuy Pham answers:

In applied machine learning, success depends significantly on the quality of data representation
(features). Highly correlated features can make the learning/sorting steps in the classification module
easy. Conversely, if the label classes are a very complex function of the features, it can be impossible to
build a good model [Dom 2012]. Thus so-called feature engineering, a process of transforming data
into features that are most relevant to the problem, is often needed.

A feature selection scheme often involves techniques to automatically select salient features from a
large exploratory feature pool. Redundant and irrelevant features are well known to cause poor
accuracy, so discarding these features should be the first task. Relevance is often scored using a
mutual information calculation. Furthermore, input features should offer a high level of
discrimination between classes. The separability of features can be measured by the distance or variance
ratio between classes. One recent work [Pham 2016] proposed a systematic voting-based feature
selection that is a data-driven approach incorporating the above criteria. This can be used as a common
framework for a wide class of problems.
A data-driven feature selection approach incorporating several saliency criteria [Pham 2016].

Another approach is penalizing on the features that are not very important (e.g., yield a high error
metric) when using regularization methods like Lasso or Ridge.

References:

[Dom 2012] P. Domingos. A few useful things to know about machine learning. Communications of the
ACM, 55(10):78–87, 2012. 2.4

[Pham 2016] T. T. Pham, C. Thamrin, P. D. Robinson, and P. H. W. Leong. Respiratory artefact removal
in forced oscillation measurements: A machine learning approach. Biomedical Engineering, IEEE
Transactions on, accepted, 2016.

What are feature vectors?


A feature vector is an n-dimensional vector of numerical features that represent some object. In
machine learning, feature vectors are used to represent numeric or symbolic characteristics, called
features, of an object in a mathematical, easily analyzable way.

What do you understand by feature vectors?


What is data normalization and why do we need it? Scaling
Normalization is the process of converting an actual range of values into a standard range of values,
typically -1 to +1 or 0 to 1. For example, suppose the natural range of a certain feature is 800 to 6,000.
Through subtraction and division, you can normalize those values into the range -1 to +1.

Scaling is a commonly used practice in feature engineering to tame a feature's range of values to
match the range of other features in the dataset. For example, suppose that you want all floating-
point features in the dataset to have a range of 0 to 1. Given a particular feature's range of 0 to 500,
you could scale that feature by dividing each value by 500.
I felt this one would be important to highlight. Data normalization is a very important pre-processing
step, used to rescale values into a specific range to ensure better convergence during
backpropagation. In general, it boils down to subtracting each feature's mean from its values and dividing
by the feature's standard deviation. If we don't do this, then some features (those with high magnitude)
will be weighted more heavily in the cost function: if a higher-magnitude feature changes by 1%, that
change is quite big, while for smaller-magnitude features it is insignificant. Normalization makes all
features weighted equally.
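
A short scikit-learn sketch of both kinds of rescaling (the values are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[800.0, 2.0], [3200.0, 5.0], [6000.0, 9.0]])  # hypothetical features

# Min-max scaling: squeeze each feature into the range 0 to 1
print(MinMaxScaler().fit_transform(X))

# Standardization: subtract each feature's mean and divide by its standard deviation
print(StandardScaler().fit_transform(X))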

What do you understand by long and wide data formats?


What is the difference between "long" ("tall") and "wide" format data?
Answer by Gregory Piatetsky.

In most data mining / data science applications there are many more records (rows) than features
(columns) - such data is sometimes called "tall" (or "long") data.

In some applications like genomics or bioinformatics you may have only a small number of records
(patients), e.g. 100, but perhaps 20,000 observations for each patient. The standard methods that work
for "tall" data will lead to overfitting, so special approaches are needed.

Fig 13. Different approaches for tall data and wide data, from presentation Sparse Screening for Exact
Data Reduction, by Jieping Ye.

The problem is not just reshaping the data (there are useful R packages for that), but avoiding false
positives by reducing the number of features in order to find the most relevant ones.

Approaches for feature reduction like Lasso are well covered in Statistical Learning with Sparsity: The
Lasso and Generalizations, by Hastie, Tibshirani, and Wainwright. (you can download free PDF of the
book)
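
A tiny pandas sketch converting a wide table to long ("tall") format (the columns are made up):

import pandas as pd

wide = pd.DataFrame({
    "patient": ["p1", "p2"],
    "gene_a": [1.2, 0.7],
    "gene_b": [3.4, 2.1],
})

# Wide -> long: one row per (patient, gene) measurement
long_format = wide.melt(id_vars="patient", var_name="gene", value_name="expression")
print(long_format)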

Differentiate between wide and tall data formats?


How would you go about doing an Exploratory Data Analysis (EDA)?
The goal of an EDA is to gather insights from the data before applying your predictive model. Basically, you want to do your EDA in a coarse-to-fine manner.

Start by gaining some high-level global insights. Check for imbalanced classes. Look at the mean and variance of each class. Check out the first few rows to see what the data is about. Run pandas df.info() to see which features are continuous or categorical, and their types (int, float, string). Next, drop unnecessary columns that won't be useful for analysis and prediction: columns that look useless, columns where many rows share the same value (they don't give us much information), or columns missing a lot of values. You can also fill in missing values with the most common value in a column, or with the median.

Now you can start making some basic visualizations. Start with high-level stuff: bar plots for features that are categorical and have a small number of groups, and bar plots of the final classes. Look at the most "general" features and create visualizations of these individual features to try to gain some basic insights.

Then get more specific. Create visualizations between features, two or three at a time, to see how features are related to each other. You can also run a PCA to see which features contain the most information. Group some features together to see their relationships: for example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features: for example, if feature A can be either "Female" or "Male", you can plot feature A against which cabin they stayed in to see whether males and females stay in different cabins. Beyond bar, scatter, and other basic plots, you can also look at PDFs/CDFs, overlaid plots, etc., and at statistics such as distributions, p-values, etc.

Finally, it's time to build the ML model. Start with simpler models like Naive Bayes and linear regression. If those perform poorly or the data is highly non-linear, move to polynomial regression, decision trees, or SVMs. Features can be selected based on their importance from the EDA. If you have lots of data, you can use a neural network. Check the ROC curve, precision and recall.

Explain Principal Component Analysis (PCA)?


What are confounding variables?
These are extraneous variables in a statistical model that correlate directly or inversely with
both the dependent and the independent variable. The estimate fails to account for the
confounding factor.
What is numerical data (continuous features)?
Features represented as integers or real-valued numbers. For example, in a real estate model, you
would probably represent the size of a house (in square feet or square meters) as numerical data.
Representing a feature as numerical data indicates that the feature's values have a mathematical
relationship to each other and possibly to the label. For example, representing the size of a house as
numerical data indicates that a 200 square-meter house is twice as large as a 100 square-meter house.
Furthermore, the number of square meters in a house probably has some mathematical relationship
to the price of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts
of the world are integers; however, integer postal codes should not be represented as numerical data
in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of
10000. Furthermore, although different postal codes do correlate to different real estate values, we
can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at
postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features.

Categorical
What is one-hot encoding?
A sparse vector in which:

 One element is set to 1.

 All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible
values. For example, suppose a given botany dataset chronicles 15,000 different species, each
denoted with a unique string identifier. As part of feature engineering, you'll probably encode those
string identifiers as one-hot vectors in which the vector has a size of 15,000.
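
A small sketch of one-hot encoding with scikit-learn (the category values are made up):

from sklearn.preprocessing import OneHotEncoder

species = [["oak"], ["pine"], ["birch"], ["pine"]]  # hypothetical string identifiers

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(species).toarray()  # each row has a single 1, the rest 0s
print(encoder.categories_)
print(one_hot)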

What are categorical variables (categorical data)?


Features having a discrete set of possible values. For example, consider a categorical feature named
house style, which has a discrete set of three possible values: Tudor, ranch, colonial. By representing
house style as categorical data, the model can learn the separate impacts of Tudor, ranch, and colonial
on house price.

Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied to a
given example. For example, a car maker categorical feature would probably permit only a single value
(Toyota) per example. Other times, more than one value may be applicable. A single car could be
painted more than one different color, so a car color categorical feature would likely permit a single
example to have multiple values (for example, red and white).

Categorical features are sometimes called discrete features.

Contrast with numerical data.

How will you find the correlation between a categorical variable and a continuous variable?
You can use the analysis of covariance (ANCOVA) technique to capture the association between a
categorical variable and a continuous variable.
Which technique is used to predict categorical responses?
Classification techniques are widely used in data mining to predict categorical responses.

Missing
How do you handle missing or corrupted data in a dataset?
You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide
to replace them with another value. In pandas, there are two very useful methods, isnull() and dropna(),
that will help you find columns with missing or corrupted data and drop those values. If you want to fill
the invalid values with a placeholder value (for example, 0), you can use the fillna() method.
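
For example, a minimal pandas sketch of the methods mentioned above (the DataFrame is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "city": ["NY", "SF", None]})

print(df.isnull().sum())    # count of missing values per column
print(df.dropna())          # drop rows containing any missing value
print(df.fillna({"age": df["age"].median(), "city": "unknown"}))  # fill with placeholders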
During analysis, how do you treat missing values?
The extent of the missing values is identified after identifying the variables with missing values. If any
patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and
meaningful business insights. If no patterns are identified, then the missing values can be substituted
with the mean or median values (imputation) or they can simply be ignored. There are various factors to
be considered when answering this question:

Understand the problem statement and the data first, and then give the answer. A default value can be
assigned, which may be the mean, minimum or maximum value; getting into the data is important.

If it is a categorical variable, a default value (for example, the most frequent category) is assigned.

If the data follow a known distribution, for example a normal distribution, impute with the mean value.

Whether to treat missing values at all is another important point to consider: if 80% of the values for a
variable are missing, you can answer that you would drop the variable instead of treating the missing
values.

What are your favourite imputation techniques to handle missing data?


Outlier
What are outliers?
Values distant from most other values. In machine learning, any of the following are outliers:

 Weights with high absolute values.

 Predicted values relatively far away from the actual values.

 Input data whose values are more than roughly 3 standard deviations from the mean.

Outliers often cause problems in model training. Clipping is one way of managing outliers.

How can outlier values be treated?


Outlier values can be identified by using univariate or other graphical analysis methods. If the number of
outlier values is small they can be assessed individually, but for a large number of outliers the values can
be substituted with either the 99th or the 1st percentile values. Note that not all extreme values are
outliers. The most common ways to treat outlier values are:

1) To change the value and bring it within a range (capping).

2) To simply remove the value.

What do you understand by outliers and inliers? What would you do if you find them in your
dataset?
How would you screen for outliers and what should you do if you find one?
Answer by Bhavya Geethika.

Some methods to screen outliers are

 z-scores,
 modified z-score,
 box plots,
 Grubb's test,
 Tietjen-Moore test,
 exponential smoothing,
 Kimber test for exponential distribution, and
 the moving window filter algorithm.

However, two of the robust methods in detail are:

Inter Quartile Range


An outlier is a data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the third
quartile (Q3) in a given data set.

High = (Q3) + 1.5 IQR

Low = (Q1) - 1.5 IQR

Tukey Method

It uses interquartile range to filter very large or very small numbers. It is practically the same method
as above except that it uses the concept of "fences". The two values of fences are:

Low outliers = Q1 - 1.5(Q3 - Q1) = Q1 - 1.5(IQR)

High outliers = Q3 + 1.5(Q3 - Q1) = Q3 + 1.5(IQR)

Anything outside of the fences is an outlier.
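A minimal sketch of the fence calculation just described (the data values are made up; 102 is the planted outlier):

import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = data[(data < low_fence) | (data > high_fence)]
print(low_fence, high_fence, outliers)   # 102 falls outside the fences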

When you find outliers, you should not remove them without a qualitative assessment, because that way you are altering the data and making it no longer pure. It is important to understand the context of the analysis, or more importantly the "why" question - why is an outlier different from the other data points?

This reason is critical. If outliers are attributed to error, you may throw them out, but if they signify a new trend or pattern, or reveal a valuable insight into the data, you should retain them.

What is clipping?
A technique for handling outliers. Specifically, reducing feature values that are greater than a set
maximum value down to that maximum value. Also, increasing feature values that are less than a
specific minimum value up to that minimum value.

For example, suppose that only a few feature values fall outside the range 40–60. In this case, you
could do the following:

 Clip all values over 60 to be exactly 60.

 Clip all values under 40 to be exactly 40.

In addition to bringing input values within a designated range, clipping can also be used to force gradient values into a designated range during training.
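A minimal sketch of the 40–60 clipping described above, using NumPy:

import numpy as np

values = np.array([12, 45, 58, 73, 40, 95])

clipped = np.clip(values, 40, 60)   # values below 40 become 40, values above 60 become 60
print(clipped)                      # [40 45 58 60 40 60]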

What are some ways I can make my model more robust to outliers?
Thuy Pham answers:

There are several ways to make a model more robust to outliers, from different points of view (data preparation or model building). In this question and answer, an outlier is assumed to be an unwanted, unexpected, or must-be-wrong value given current human knowledge (e.g. no one is 200 years old), rather than a rare event which is possible but rare.

Outliers are usually defined in relation to the distribution. Thus outliers could be removed in the pre-
processing step (before any learning step), by using standard deviations (for normality) or interquartile
ranges (for not normal/unknown) as threshold levels.


Moreover, data transformation (e.g. a log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record small values, Winsorization may be useful. This type of transformation (named after Charles P. Winsor (1895–1951)) has the same effect as clipping signals (i.e. it replaces extreme data values with less extreme values). Another option to reduce the influence of outliers is to use the mean absolute error rather than the mean squared error.

For model building, some models are resistant to outliers (e.g. tree-based approaches) or non-
parametric tests. Similar to the median effect, tree models divide each node into two in each split.
Thus, at each split, all data points in a bucket could be equally treated regardless of extreme values
they may have. The study [Pham 2016] proposed a detection model that incorporates interquartile
information of data to predict outliers of the data.

References:

[Pham 2016] T. T. Pham, C. Thamrin, P. D. Robinson, and P. H. W. Leong. Respiratory artefact removal
in forced oscillation measurements: A machine learning approach. IEEE Transactions on Biomedical
Engineering, 2016.

This Quora answer contains further information.

Imbalanced
What is class-imbalanced dataset? What is minority and majority class?
A binary classification problem in which the labels for the two classes have significantly different
frequencies. For example, a disease dataset in which 0.0001 of examples have positive labels and
0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51
of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced
problem.

Minority class is the less common label in a class-imbalanced dataset. For example, given a dataset
containing 99% non-spam labels and 1% spam labels, the spam labels are the minority class.

Majority class is the more common label in a class-imbalanced dataset. For example, given a dataset
containing 99% non-spam labels and 1% spam labels, the non-spam labels are the majority class.

How would you handle an imbalanced dataset?


I have an article about this! Check out #3 :)

There are a few things you can do to combat this:

 Use class weights in the loss function. Essentially, the under-represented classes receive higher weights in the loss function, such that any misclassifications for that particular class lead to a very high error in the loss function (see the sketch after this list).
 Over-sample: repeating some of the training examples that contain the under-represented class helps even out the distribution. This works best if the available data is small.
 Under-sample: you can simply skip some training examples that contain the over-represented class. This works best if the available data is very large.
 Data augmentation for the minority class: you can synthetically create more training examples for the under-represented class. For example, if you were detecting lethal weapons in videos, you could change some of the colours and lighting of the videos that belong to the class containing lethal weapons.
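A minimal, hedged sketch of two of these options in scikit-learn (class weighting and naive random over-sampling) on synthetic data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic imbalanced data: 95 negatives, 5 positives.
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.array([0] * 95 + [1] * 5)

# Option 1: class weights in the loss function.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: over-sample the minority class before training.
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=95, random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
clf_bal = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)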

What is data augmentation?


Artificially boosting the range and number of training examples by transforming existing examples to
create additional examples. For example, suppose images are one of your features, but your dataset
doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add
enough labeled images to your dataset to enable your model to train properly. If that's not possible,
data augmentation can rotate, stretch, and reflect each image to produce many variants of the
original picture, possibly yielding enough labeled data to enable excellent training.
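A minimal sketch of the idea using plain NumPy reflections and rotations (real pipelines would usually rely on a dedicated augmentation library):

import numpy as np

image = np.arange(9).reshape(3, 3)      # stand-in for a small grayscale image

flipped = np.fliplr(image)              # horizontal reflection
rotated = np.rot90(image)               # 90-degree rotation
augmented = [image, flipped, rotated]   # several variants produced from one original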

What is downsampling?
Overloaded term that can mean either of the following:

 Reducing the amount of information in a feature in order to train a model more efficiently. For
example, before training an image recognition model, downsampling high-resolution images
to a lower-resolution format.

 Training on a disproportionately low percentage of over-represented class examples in order to improve model training on under-represented classes. For example, in a class-imbalanced dataset, models tend to learn a lot about the majority class and not enough about the minority class. Downsampling helps balance the amount of training on the majority and minority classes.

What is confusion matrix?


An NxN table that summarizes how successful a classification model's predictions were; that is, the
correlation between the label and the model's classification. One axis of a confusion matrix is the label
that the model predicted, and the other axis is the actual label. N represents the number of classes. In
a binary classification problem, N=2. For example, here is a sample confusion matrix for a binary
classification problem:

                     Tumor (predicted)   Non-Tumor (predicted)
Tumor (actual)              18                     1
Non-Tumor (actual)           6                   452

The preceding confusion matrix shows that of the 19 samples that actually had tumors, the model
correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a
tumor (1 false negative). Similarly, of 458 samples that actually did not have tumors, 452 were
correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).

The confusion matrix for a multi-class classification problem can help you determine mistake patterns.
For example, a confusion matrix could reveal that a model trained to recognize handwritten digits
tends to mistakenly predict 9 instead of 4, or 1 instead of 7.

Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
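A minimal sketch of building such a matrix and the derived metrics with scikit-learn (the labels below are made up):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = tumor, 0 = non-tumor
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)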

What error metric would you use to evaluate how good a binary classifier is? What if the
classes are imbalanced? What if there are more than 2 groups?
Prasad Pore answers:

Binary classification involves classifying the data into two groups, e.g. whether or not a customer buys a particular product (Yes/No), based on independent variables such as gender, age, location, etc.

As the target variable is not continuous, a binary classification model predicts the probability of the target variable being Yes/No. To evaluate such a model, a tool called the confusion matrix is used, also called the classification or coincidence matrix. With the help of a confusion matrix, we can calculate
important performance measures:

 True Positive Rate (TPR) or Hit Rate or Recall or Sensitivity = TP / (TP + FN)
 False Positive Rate(FPR) or False Alarm Rate = 1 - Specificity = 1 - (TN / (TN + FP))
 Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Error Rate = 1 – accuracy or (FP + FN) / (TP + TN + FP + FN)
 Precision = TP / (TP + FP)
 F-measure: 2 / ( (1 / Precision) + (1 / Recall) )
 ROC (Receiver Operating Characteristics) = plot of FPR vs TPR
 AUC (Area Under the Curve)
 Kappa statistics

You can find more details about these measures here: The Best Metric to Measure Accuracy of
Classification Models.

All of these measures should be used with domain knowledge and balanced against each other; for example, a high TPR for predicting patients who don't have cancer will not help at all in diagnosing cancer.
In the same example of cancer diagnosis data, if only 2% or fewer of the patients have cancer, then this would be a case of class imbalance, as the percentage of cancer patients is very small compared to the rest of the population. There are 2 main approaches to handle this issue:

Use of a cost function: In this approach, a cost associated with misclassifying data is evaluated with the
help of a cost matrix (similar to the confusion matrix, but more concerned with False Positives and
False Negatives). The main aim is to reduce the cost of misclassifying. The cost of a False Negative is
always more than the cost of a False Positive. e.g. wrongly predicting a cancer patient to be cancer-
free is more dangerous than wrongly predicting a cancer-free patient to have cancer.

Total Cost = Cost of FN * Count of FN + Cost of FP * Count of FP

Use of different sampling methods: In this approach, you can use over-sampling, under-sampling, or hybrid sampling. In over-sampling, minority class observations are replicated to balance the data. Replication of observations can lead to overfitting, causing good accuracy on training data but lower accuracy on unseen data. In under-sampling, the majority class observations are removed, causing loss of information. This is helpful in reducing processing time and storage, but it is only useful if you have a large data set.

Find more about class imbalance here.

If there are multiple classes in the target variable, then a confusion matrix of dimensions equal to the
number of classes is formed, and all performance measures can be calculated for each of the classes.
This is called a multiclass confusion matrix. e.g. there are 3 classes X, Y, Z in the response variable, so
recall for each class will be calculated as below:

Recall_X = TP_X/(TP_X+FN_X)

Recall_Y = TP_Y/(TP_Y+FN_Y)

Recall_Z = TP_Z/(TP_Z+FN_Z)
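As a hedged sketch in scikit-learn (the class labels X, Y, Z and the predictions below are made up), the per-class recall values above can be computed directly:

from sklearn.metrics import recall_score, classification_report

y_true = ["X", "Y", "Z", "X", "Y", "Z", "X"]
y_pred = ["X", "Y", "Y", "X", "Z", "Z", "Y"]

# One recall value per class, i.e. Recall_X, Recall_Y, Recall_Z.
print(recall_score(y_true, y_pred, labels=["X", "Y", "Z"], average=None))
print(classification_report(y_true, y_pred))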

Hyperparameter optimization
What is a parameter and how does it differ from a hyperparameter?
A parameter is a variable of a model that the ML system trains on its own. For example, weights are parameters whose values the ML system gradually learns through successive training iterations. A parameter update is the operation of adjusting a model's parameters during training, typically within a single iteration of gradient descent.

Contrast with hyperparameter.

Hyperparameters are "knobs" that you tweak during successive runs of training a model. For example,
learning rate is a hyperparameter.

What is fine tuning?


Perform a secondary optimization to adjust the parameters of an already trained model to fit a new
problem. Fine tuning often refers to refitting the weights of a trained unsupervised model to a
supervised model.

What is checkpoint?
Data that captures the state of the variables of a model at a particular time. Checkpoints enable
exporting model weights, as well as performing training across multiple sessions. Checkpoints also
enable training to continue past errors (for example, job preemption). Note that the graph itself is not
included in a checkpoint.

Ensemble

What is ensemble?
A merger of the predictions of multiple models. You can create an ensemble via one or more of the
following:

 different initializations

 different hyperparameters

 different overall structure

Deep and wide models are a kind of ensemble.

What is the idea behind ensemble learning?


Prasad Pore answers:

"In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain
better predictive performance than could be obtained from any of the constituent learning algorithms
alone."

– Wikipedia.

Imagine you are playing the game "Who Wants to Be a Millionaire?" and have reached the final question, worth 1 million dollars. You have no clue about the question, but you still have the audience poll and phone-a-friend lifelines. Thank God. At this stage you don't want to take any risk, so what will you do to get a sure-shot right answer and become a millionaire?

You will use both lifelines, won't you? Say 70% of the audience says the right answer is D, and your friend also says the right answer is D with 90% confidence, because he is an expert in the area of the question. Using both lifelines gives you an average 80% confidence that D is correct and gets you closer to becoming a millionaire.

This is the approach of ensemble methods.

The famous Netflix Prize competition took almost 3 years before the goal of 10% improvement was
reached. The winners used gradient boosted decision trees to combine over 500 models.

In ensemble methods, the more diverse the models used, the more robust the ultimate result will be.

Using different models in an ensemble reduces the overall variance, thanks to differences in the populations sampled, in the hypotheses generated, in the algorithms used, and in the parametrization. There are 3 main widely used ensemble techniques:

Bagging = Bootstrap aggregation

Boosting

Stacking

So if you have different models built for the same data and the same response variable, you can use one of the above methods to build an ensemble model. As every model used in the ensemble has its own performance measures, some of the models may perform better than the ultimate ensemble model and some may perform worse than or equal to it. But overall, ensemble methods improve the accuracy and stability of the model, although at the expense of model interpretability.
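A minimal, hedged sketch of the idea (not the Netflix winners' method): scikit-learn's VotingClassifier combining a few diverse base models on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",          # average the predicted probabilities of the base models
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))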

For more on ensemble methods see:

Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results

Data Science Basics: An Introduction to Ensemble Learners

Random Forest
An ensemble approach to finding the decision tree that best fits the training data by creating many
decision trees and then determining the "average" one. The "random" part of the term refers to
building each of the decision trees from a random selection of features; the "forest" refers to the set
of decision trees.

Gradient Boosting Machines (GBM)


What is Boosting?
An ML technique that iteratively combines a set of simple and not very accurate classifiers (referred to as "weak" classifiers) into a classifier with high accuracy (a "strong" classifier) by upweighting the examples that the model is currently misclassifying.

Bootstrapped Aggregation (Bagging)


AdaBoost
Stacked Generalization (Blending)
Gradient Boosted Regression Trees (GBRT)

General
Analysis
What is data analysis?
Obtaining an understanding of data by considering samples, measurement, and visualization. Data
analysis can be particularly useful when a dataset is first received, before one builds the first model. It
is also crucial in understanding experiments and debugging problems with the system.

What is power analysis?


An experimental design technique for determining the sample size needed to detect an effect of a given size with a desired level of confidence.

What is root cause analysis?


Root cause analysis was initially developed to analyse industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault-sequence prevents the final undesirable event from recurring.
What is root cause analysis?
Answer by Gregory Piatetsky:

According to Wikipedia,
Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults
or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence
prevents the final undesirable event from recurring; whereas a causal factor is one that affects an
event's outcome, but is not a root cause.

Root cause analysis was initially developed to analyse industrial accidents, but is now widely used in
other areas, such as healthcare, project management, or software testing.

Here is a useful Root Cause Analysis Toolkit from the state of Minnesota.

Essentially, you can find the root cause of a problem and show the relationship of causes by
repeatedly asking the question, "Why?", until you find the root of the problem. This technique is
commonly called "5 Whys", although it can involve more or fewer than 5 questions.

Fig. 5 Whys Analysis Example, from The Art of Root Cause Analysis .

Why data cleaning plays a vital role in analysis?


Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process, because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data they generate. It might take up to 80% of the time just to clean the data, making it a critical part of the analysis task.

Differentiate between univariate, bivariate and multivariate analysis.


These are descriptive statistical analysis techniques which can be differentiated based on the number
of variables involved at a given point of time. For example, the pie charts of sales based on territory
involve only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the relationship between 2 variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analysing sales volume against spending can be considered an example of bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on
the responses is referred to as multivariate analysis.

Big data
Explain star schema.
It is a traditional database schema with a central table. Satellite tables map IDs to physical
names or descriptions and can be connected to the central fact table using the ID fields; these
tables are known as lookup tables and are principally useful in real-time applications, as they
save a lot of memory. Sometimes star schemas involve several layers of summarization to
recover information faster.
What is baseline model?
A model used as a reference point for comparing how well another model (typically, a more complex
one) is performing. For example, a logistic regression model might serve as a good baseline for a deep
model.

For a particular problem, the baseline helps model developers quantify the minimal expected
performance that a new model must achieve for the new model to be useful.

Is more data always better?


What are some of the common data quality issues when dealing with Big Data? What can be
done to avoid them or to mitigate their impact?
Anmol Rajpurohit answers:

The most common data quality issues observed when dealing with Big Data can be best understood in
terms of the key characteristics of Big Data – Volume, Velocity, Variety, Veracity, and Value.

Volume:

In the traditional data warehouse environment, comprehensive data quality assessment and reporting
was at least possible (if not, ideal). However, in the Big Data projects the scale of data makes it
impossible. Thus, the data quality measurements can at best be approximations (i.e. need to be
described in probability and confidence intervals, and not in terms of absolute values). We also need
to re-define most of the data quality metrics based on the specific characteristics of the Big Data
project so that those metrics can have a clear meaning, be measured (good approximation) and be
used for evaluating the alternative strategies for data quality improvement.

Despite the great volume of underlying data, it is not uncommon to find out that some desired data
was not captured or is not available for other reasons (such as high cost, delay in getting it, etc.). It is
ironic but true that data availability continues to be a prominent data quality concern in the Big Data era.

Velocity:

The tremendous pace of data generation and collection makes it incredibly hard to monitor data
quality within a reasonable overhead on time and resources (storage, compute, human effort, etc.).
So, by the time data quality assessment completes, the output might be outdated and of little use,
particularly if the Big Data project is to serve any real-time or near real-time business needs. In such
scenarios, you would need to re-define data quality metrics so that they are relevant as well as
feasible in the real-time context.

Sampling can help you gain speed in the data quality efforts, but this comes at the cost of bias (which eventually makes the end result less useful), because samples are rarely an accurate representation of the entire data. Fewer samples will give higher speed, but with a bigger bias.

Another impact of velocity is that you might have to do data quality assessments on-the-fly, i.e.
somewhere plugged-in within the data collection/transfer/storage processes; as the critical time-
constraint does not give you the privilege of making a copy of a selected data subset, storing it
elsewhere and running data quality assessments on it.

Variety:

One of the biggest data quality issues in Big Data is that the data includes several data types
(structured, semi-structured, and unstructured) coming in from different data sources. Thus, often a
single data quality metric will not be applicable for the entire data and you would need to separately
define data quality metrics for each data type. Moreover, assessing and improving the data quality of
unstructured or semi-structured data is way more tricky and complex than that of structured data. For
example, when mining physician notes from medical records across the world (related to a particular medical condition), even if the language (and the grammar) is the same, the meaning might be very different due to local dialects and slang. This leads to low data interpretability, another data quality measure.

Data from different sources often has serious semantic differences. For example, “profit” can have
widely varied definitions across the business units of an organization or external agencies. Thus, the
fields with identical names may not mean the same thing. This problem is made worse by the lack of
adequate and consistent meta-data from each data source. In order to make sense of data, you need
reliable metadata (such as to make sense of sales numbers from a store, you need other information
such as date-time, items purchased, coupons used, etc.). Usually, a lot of these data sources are
outside an organization and thus, it is very hard to ensure good metadata for such data.

Another common issue is syntactic inconsistencies. For example, “time-stamp” values from different
sources would be incompatible unless they are captured along with the time zone information.
Image source.

Veracity:

Veracity, one of the most overlooked Big Data characteristics, is directly related to data quality, as it
refers to the inherent biases, noise and abnormality in data. Because of veracity, the data values might
not be exact real values, rather they might be approximations. In other words, the data might have
some inherent impreciseness and uncertainty. Besides data inaccuracies, Veracity also includes data
consistency (defined by the statistical reliability of data) and data trustworthiness (based on data
origin, data collection and processing methods, security infrastructure, etc.). These data quality issues
in turn impact data integrity and data accountability.

While the other V’s are relatively well-defined and can be easily measured, Veracity is a complex
theoretical construct with no standard approach for measurement. In a way this reflects how complex
the topic of “data quality” is within the Big Data context.

Data users and data providers are often different organizations with very different goals and
operational procedures. Thus, it is no surprise that their notions of data quality are very different. In
many cases, the data providers have no clue about the business use cases of data users (data
providers might not even care about it, unless they are getting paid for the data). This disconnect
between data source and data use is one of the prime reasons behind the data quality issues
symbolized by Veracity.

Value:

The Value characteristic connects directly to the end purpose. Organizations are harnessing Big Data
for many diverse business pursuits, and those pursuits are the real drivers of how data quality is
defined, measured, and improved.

A common and old definition of data quality is "fitness for use" for the data consumer. This means that data quality depends on what you plan to do with the data. Thus, for the same data, two different organizations with different business goals will most likely have widely different measurements of data quality. This nuance is often not well understood – data quality is a "relative" term. A Big Data project might involve incomplete and inconsistent data; however, it is possible that those data quality issues do not impact the utility of the data towards the business goal. In such a case, the business would say that the data quality is great (and will not be interested in investing in data quality improvements). For example, for a producer of canned mashed potatoes, a batch of small potatoes would be of the same quality as a batch of big potatoes. However, for a fast food restaurant making fries, the quality of the two batches would be radically different.

The Value aspect also brings in the “cost-benefit” perspective to data quality – whether it would be
worth to resolve a given data quality issue, which issues should be resolved on priority, etc.

Putting it all together:

Data quality in Big Data projects is a very complex topic, where theory and practice often differ. I haven't come across any standard theory yet that is widely accepted. Rather, I see little interest in the industry towards this goal. In practice, data quality does play an important role in the design of Big Data architecture. All data quality efforts must start from a solid understanding of high-priority business use cases, and use that insight to navigate various trade-offs (samples given below) to optimize the quality of the final output.

Sample trade-offs related to data quality:

Is it worth improving the timeliness of data at the expense of data completeness and/or inadequate
assessment of accuracy?

Should we select data for cleaning based on cost of cleaning effort or based on how frequently the
data is used or based on its relative importance within the data models consuming it? Or, a
combination of those factors? What sort of combination?

Is it a good idea to improve data accuracy through getting rid of incomplete or erroneous data? While
removing some data, how do we ensure that no bias is getting introduced?

Given the enormous scope of work and (relatively!) very limited resources, one common way to run data quality efforts on Big Data projects is to adopt the baseline approach, in which the data users are surveyed to identify and document the bare minimum data quality needed to ensure that the business processes they support are not disrupted. These minimum satisfactory levels of data quality are referred to as the baseline, and the data quality efforts are focused on ensuring that the quality of each dataset does not fall below its baseline level. This is a good starting point, and you may later move into more advanced endeavors (based on business needs and available budget).

Summary of Recommendations to improve data quality in Big Data projects:

Identify and prioritize the business use cases (then, use them to define data quality metrics,
measurement methodology, improvement goals, etc.)

Based on a strong understanding of the business use cases and the Big Data architecture implemented
to achieve them, design and implement an optimal layer of data governance (data definitions,
metadata requirements, data ownership, data flow diagrams, etc.)

Document baseline quality levels for key data (think of “critical-path” diagram and “throughput-
bottleneck” assessment)

Define ROI for data quality efforts (in order to create feedback loop on the ROI metric to improve
efficiency and to sustain funding for data quality efforts)

Integrate data quality efforts (to achieve efficiency through minimizing redundancy)
Automate data quality monitoring (to reduce cost as well as to let employees stay focused on complex
tasks)

Do not rely on machine learning to automatically take care of poor data quality (machine learning is
science and not magic!)

Tell us about the biggest data set you have processed till date and for what kind of analysis.
How do you handle big data sets?
Machine Learning
What are differences between discriminative and generative models?
A model that predicts labels from a set of one or more features. More formally, discriminative models
define the conditional probability of an output given the features and weights; that is:

p(output | features, weights)

For example, a model that predicts whether an email is spam from features and weights is a
discriminative model.

The vast majority of supervised learning models, including classification and regression models, are
discriminative models.

Contrast with generative model.

Practically speaking, a model that does either of the following:

 Creates (generates) new examples from the training dataset. For example, a generative model
could create poetry after training on a dataset of poems. The generator part of a generative
adversarial network falls into this category.

 Determines the probability that a new example comes from the training set, or was created
from the same mechanism that created the training set. For example, after training on a
dataset consisting of English sentences, a generative model could determine the probability
that new input is a valid English sentence.

A generative model can theoretically discern the distribution of examples or particular features in a
dataset. That is:

p(examples)

Unsupervised learning models are generative.

What is few-shot learning and one-shot learning?


A machine learning approach, often used for object classification, designed to learn effective
classifiers from only a small number of training examples.

See also one-shot learning.

A machine learning approach, often used for object classification, designed to learn effective
classifiers from a single training example.
How does machine learning differ from traditional programming?
Input/Output in Machine Learning
Problem types in Machine Learning
How do you know which Machine Learning model you should use?
While one should always keep the “no free lunch theorem” in mind, there are some general
guidelines. I wrote an article on how to select the proper regression model here. This cheatsheet is
also fantastic!

Differentiate between Data Science , Machine Learning and AI.


Data Science vs Machine Learning vs Artificial Intelligence

Definition:
 Data Science: not exactly a subset of machine learning, but it uses machine learning to analyse data and make future predictions.
 Machine Learning: a subset of AI that focuses on a narrow range of activities.
 Artificial Intelligence: a wide term that focuses on applications ranging from Robotics to Text Analysis.

Role:
 Data Science: can take on a business role.
 Machine Learning: a combination of both business and technical aspects.
 Artificial Intelligence: a purely technical role.

Scope:
 Data Science: a broad term for diverse disciplines; not merely about developing and training models.
 Machine Learning: fits within the data science spectrum.
 Artificial Intelligence: a sub-field of computer science.

Relation to AI:
 Data Science: loosely integrated with AI.
 Machine Learning: a sub-field of AI, tightly integrated with it.
 Artificial Intelligence: a sub-field of computer science consisting of various tasks like planning, moving around in the world, recognizing objects and sounds, speaking, translating, performing social or business transactions, and creative work.

What is Machine Learning?


A program or system that builds (trains) a predictive model from input data. The system uses the
learned model to make useful predictions from new (never-before-seen) data drawn from the same
distribution as the one used to train the model. Machine learning also refers to the field of study
concerned with these programs or systems.

The simplest way to answer this question is: we give the data and an equation to the machine, and ask the machine to look at the data and identify the coefficient values in the equation.
For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
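A minimal sketch of exactly that idea, assuming scikit-learn is available (the data is synthetic, generated with m = 2 and c = 1):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10).reshape(-1, 1)
y = 2 * x.ravel() + 1            # true relationship: y = 2x + 1

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)   # learned m and c, approximately 2 and 1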

Can you use machine learning for time series analysis?


Yes, it can be used but it depends on the applications.

Write a function that takes in two sorted lists and outputs a sorted list that is their union.

The first solution that will come to your mind is to merge the two lists and sort them afterwards.

Python code-
def return_union(list_a, list_b):
    return sorted(list_a + list_b)

R code-
return_union <- function(list_a, list_b)
{
  list_c <- list(c(unlist(list_a), unlist(list_b)))
  return(list(list_c[[1]][order(list_c[[1]])]))
}

Generally, the tricky part of the question is not to use any sorting or ordering function. In that case
you will have to write your own logic to answer the question and impress your interviewer.

Python code-
def return_union(list_a, list_b):
    len1 = len(list_a)
    len2 = len(list_b)
    final_sorted_list = []
    j = 0
    k = 0

    for i in range(len1 + len2):
        if k == len1:
            final_sorted_list.extend(list_b[j:])
            break
        elif j == len2:
            final_sorted_list.extend(list_a[k:])
            break
        elif list_a[k] < list_b[j]:
            final_sorted_list.append(list_a[k])
            k += 1
        else:
            final_sorted_list.append(list_b[j])
            j += 1
    return final_sorted_list

A similar function can be written in R by following similar steps.

return_union <- function(list_a, list_b)
{
  # Initializing length variables
  len_a <- length(list_a)
  len_b <- length(list_b)
  len <- len_a + len_b

  # Initializing counter variables
  j <- 1
  k <- 1

  # Creating an empty list whose length equals the combined length of both lists
  list_c <- vector("list", len)

  # Here goes our for loop
  for (i in 1:len) {
    if (j > len_a) {
      list_c[i:len] <- list_b[k:len_b]
      break
    } else if (k > len_b) {
      list_c[i:len] <- list_a[j:len_a]
      break
    } else if (list_a[[j]] <= list_b[[k]]) {
      list_c[[i]] <- list_a[[j]]
      j <- j + 1
    } else if (list_a[[j]] > list_b[[k]]) {
      list_c[[i]] <- list_b[[k]]
      k <- k + 1
    }
  }
  return(list(unlist(list_c)))
}

What do you understand by Hypothesis in the context of Machine Learning?


Which is your favourite machine learning algorithm and why?
Visualization
What are your favourite data visualization tools?
What makes a good data visualization?
Gregory Piatetsky answers:
Note: This answer contains excerpts from the recent post What makes a good data visualization – a
Data Scientist perspective.

Data Science is more than just building predictive models - it is also about explaining the models and
using them to help people to understand data and make decisions. Data visualization is an integral
part of presenting data in a convincing way.

There is a ton of research of good data visualization and how people best perceive information - see
work by Stephen Few and many others.

Guidelines on improving human perception include:

position data along a common scale

bars are more effective than circles or squares in communicating size

color is more discernible than shape in scatterplots

avoid pie chart unless it is for showing proportions

avoid 3D charts and reduce chartjunk

Sunburst visualization is more effective for hierarchical plots

use small multiples (even though animation looks cool, it is less effective for understanding changing
data.)

See 39 studies about human perception, by Washington Post graphics editor for a lot more detail.

From Data Science point of view, what makes visualization important is highlighting the key aspects of
data - what are the most important variables, what is their relative importance, what are the changes
and trends.

Data visualization should be visually appealing but not at the expense of loading a chart with
unnecessary junk, like in this extreme example on the right.
Explain Edward Tufte's concept of "chart junk."
Answer by Gregory Piatetsky:

Chartjunk refers to all visual elements in charts and graphs that are not necessary to comprehend the
information represented on the graph, or that distract the viewer from this information.

The term chartjunk was coined by Edward Tufte in his 1983 book The Visual Display of Quantitative
Information.

Fig 15. Tufte writes: "an unintentional Necker Illusion, as two back planes optically flip to the front.
Some pyramids conceal others; and one variable (stacked depth of the stupid pyramids) has no label
or scale."
Here is a more modern example from exceluser where it is very hard to understand the column plot
because of workers and cranes that obscure them.

The problem with such decorations is that they force readers to work much harder than necessary to discover the meaning of the data.

How do we make a good data visualization?


To do that, choose the right type of chart for your data:

Line Charts to track changes or trends over time and show the relationship between two or more
variables.

Bar Charts to compare quantities of different categories.


Scatter Plots show joint variation of two data items.

Pie Charts to compare parts of a whole - use them sparingly, since people have a hard time comparing the areas of pie slices.

You can show additional variables on a 2-D plot using color, shape, and size

Use interactive dashboards to allow experiments with key variables

Here is an example of visualization of US Presidential Elections, 1976-2016, that shows multiple


variables at once: the electoral college votes difference (y-axis), the % popular vote difference (X-axis),
the size of the popular vote (circle area), winner party (color), and winner name and year (label). See
my post on What makes a good data visualization for more details.

US Presidential Elections, 1976-2016.

References:

What makes a good visualization, David McCandless, Information is Beautiful

5 Data Visualization Best Practices, GoodData

39 studies about human perception in 30 minutes, Kenn Elliott

Data Visualization for Human Perception, landmark work by Stephen Few (key ideas summarized here)
Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs).
How to efficiently represent 5 dimension in a chart (or in a video)?
Answer by Gregory Piatetsky:

There are many good tools for Data Visualization. R, Python, Tableau and Excel are among most
commonly used by Data Scientists.

Here are useful KDnuggets resources:

Visualization and Data Mining Software

Overview of Python Visualization Tools

21 Essential Data Visualization Tools

Top 30 Social Network Analysis and Visualization Tools

Tag: Data Visualization

There are many ways of representing more than 2 dimensions in a chart. A 3rd dimension can be shown with a 3D scatter plot, which can be rotated. You can also use color, shading, shape, and size. Animation can be used effectively to show the time dimension (change over time).

Here is a good example.


Fig 20a: 5-dimensional scatter plot of Iris data, with size: sepal length; color: sepal width; shape: class;
x-column: petal length; y-column: petal width, from here.

For more than 5 dimensions, one approach is Parallel Coordinates, pioneered by Alfred Inselberg.
Fig 20b: Iris data in parallel coordinates

See also

Quora: What's the best way to visualize high-dimensional data? and

pioneering work of Georges Grinstein and his colleagues on High-Dimensional Visualizations .

Of course, when you have a lot of dimensions, it is best to reduce the number of dimensions or
features first.
Work
What is pipeline?
The infrastructure surrounding a machine learning algorithm. A pipeline includes gathering the data,
putting the data into training data files, training one or more models, and exporting the models to
production.

What is dynamic model?


A model that is trained online in a continuously updating fashion. That is, data is continuously entering
the model.

Give an example of how you would use experimental design to answer a question about user
behavior.
Answer by Bhavya Geethika.

Step 1: Formulate the Research Question:


What are the effects of page load times on user satisfaction ratings?

Step 2: Identify variables:


We identify the cause & effect. Independent variable -page load time, Dependent variable- user
satisfaction rating

Step 3: Generate Hypothesis:


A lower page load time will increase the user satisfaction rating for a web page. Here the factor we analyze is page load time.
Fig 12: There is a flaw in your experimental design (cartoon from here)

Step 4: Determine Experimental Design.


We consider experimental complexity, i.e. whether to vary one factor at a time or multiple factors at once, in which case we use a factorial design (2^k design). A design is also selected based on the type of objective (comparative, screening, response surface) and the number of factors.

Here we also decide between within-participants, between-participants, and mixed-model designs. For example: there are two versions of a page, one with the Buy button (call to action) on the left, and the other version has this button on the right.

Within-participants design - both user groups see both versions.

Between-participants design - one group of users see version A & the other user group version B.

Step 5: Develop experimental task & procedure:


Detailed description of steps involved in the experiment, tools used to measure user behavior, goals
and success metrics should be defined. Collect qualitative data about user engagement to allow
statistical analysis.

Step 6: Determine Manipulation & Measurements

Manipulation: One level of factor will be controlled and the other will be manipulated. We also
identify the behavioral measures:

Latency - time between a prompt and the occurrence of the behavior (how long it takes for a user to click Buy after being presented with products).

Frequency - number of times a behavior occurs (number of times the user clicks on a given page within a time period).

Duration - length of time a specific behavior lasts (time taken to add all products).

Intensity - force with which a behavior occurs (how quickly the user purchased a product).

Step 7: Analyze results


Analyze the user behavior data and either support or contradict the hypothesis according to the observations made, e.g. how the majority of users' satisfaction ratings compared with page load times.

How to determine the influence of a Twitter user?


Gregory Piatetsky answers:

Social networks are at the center of today's web, and determining the influence in a social network is a
huge area of research. Twitter influence is a narrow area within the overall social network influence
research.
The influence of a Twitter user goes beyond the simple number of followers. We also want to examine
how effective are tweets - how likely they are to be retweeted, favorited, or the links inside clicked
upon. What exactly is an influential user depends on the definition - different types of influence
discussed included celebrities, opinion leaders, influencers, discussers, innovators, topical experts,
curators, commentators, and more.

A key challenge is to compute influence efficiently. An additional problem on Twitter is separating humans and bots.

Common measures used to quantify influence on Twitter include many versions of network centrality -
how important is the node within the network, and PageRank-based metrics.

KDnuggets Twitter Social Network, as visualized in NodeXL in May 2014.

Traditional network measures used include

Closeness Centrality, based on the length of the shortest paths from a node to everyone else. It
measures the visibility or accessibility of each node with respect to the entire network

Betweenness centrality considers for each node i all the shortest paths that should pass through i to
connect all the other nodes in the network. It measures the ability of each node to facilitate
communication within the network.

Other proposed measures include retweet impact (how likely is the tweet be retweeted) and
variations of PageRank, such as TunkRank - see A Twitter Analog to PageRank.

An important refinement to overall influence is looking at influence within a topic - done by Agilience
and RightRelevant. For instance, Justin Bieber may have high influence overall, but he is less influential
than KDnuggets in the area of Data Science.
Twitter provides a REST API which allows access to key measures, but with limits on the number of
requests and the data returned.

There were a number of websites that measured Twitter user influence, but many of their business models did not pan out, and many of them were acquired or went out of business. Ones which are currently active include the following:

Free:

Agilience (KDnuggets is #1 in Machine Learning, #1 in Data Mining, #2 in Data Science)

Klout, klout.com (KDnuggets Klout score is 79)

Influence Tracker, www.influencetracker.com , KDnuggets influence metric 39.2

Right Relevance - measures specific relevance of twitter users within a topic.

Paid:

Brandwatch (bought PeerIndex)

Hubspot

Simplymeasured

Relevant KDnuggets posts:

Agilience Top Data Mining, Data Science Authorities

12 Data Analytics Thought Leaders on Twitter

The 123 Most Influential People in Data Science

RightRelevance helps find key topics, top influencers in Big Data, Data Science, and Beyond

Relevant KDnuggets tags:

/tag/influencers

/tag/big-data-influencers

For a more in-depth analysis, see technical articles below:

What is a good measure of the influence of a Twitter user?, Quora

Measuring User Influence in Twitter: The Million Follower Fallacy, AAAI, 2010

Measuring user influence on Twitter: A survey, arXiv, 2015

Measuring Influence on Twitter, I. Anger and C. Kittl

A Data Scientist Explains How To Maximize Your Influence On Twitter, Business Insider, 2014
How would you explain to the senior management in your organization as to why a particular
data set is important?
What kind of data is important for specific business requirements and how, as a data scientist
will you go about collecting that data?
How can you ensure that you don’t analyse something that ends up producing meaningless
results?
What types of data are important for business needs?
Suppose you are given a data set, what will you do with it to find out if it suits the business
needs of your project or not.
How will you assess the statistical significance of an insight whether it is a real insight or just
by chance?
The statistical significance of an insight can be assessed using hypothesis testing: state a null hypothesis, compute the test statistic and its p-value, and treat the insight as significant only if the p-value falls below the chosen significance level (for example, 0.05).

In which libraries for Data Science in Python and R, does your strength lie?
Why do you want to pursue a career in data science?
What have you done to upgrade your skills in analytics?
What has been the most useful business insight or development you have found?
How regularly must an algorithm be updated?
You will want to update an algorithm when:
You want the model to evolve as data streams through infrastructure
The underlying data source is changing
There is a case of non-stationarity
Are you familiar with price optimization, price elasticity, inventory management, competitive
intelligence? Give examples.
Answer by Gregory Piatetsky:

Those are economics terms that are not frequently asked of Data Scientists but they are useful to
know.

Price optimization is the use of mathematical tools to determine how customers will respond to different prices for a company's products and services offered through different channels.

Big Data and data mining enable use of personalization for price optimization. Now companies like
Amazon can even take optimization further and show different prices to different visitors, based on
their history, although there is a strong debate about whether this is fair.

Price elasticity in common usage typically refers to

Price elasticity of demand, a measure of price sensitivity. It is computed as:


Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price.

Similarly, Price elasticity of supply is an economics measure that shows how the quantity supplied of a
good or service responds to a change in its price.
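A small worked example of that formula (the numbers are hypothetical): if a 10% price increase reduces the quantity demanded by 20%, the elasticity is -20% / 10% = -2, meaning demand is elastic.

# Hypothetical numbers: price rises by 10%, quantity demanded falls by 20%.
pct_change_quantity = -0.20
pct_change_price = 0.10

elasticity = pct_change_quantity / pct_change_price
print(elasticity)   # -2.0 -> |elasticity| > 1, so demand is elastic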

Inventory management is the overseeing and controlling of the ordering, storage and use of
components that a company will use in the production of the items it will sell as well as the overseeing
and controlling of quantities of finished products for sale.

Wikipedia defines

Competitive intelligence: the action of defining, gathering, analyzing, and distributing intelligence
about products, customers, competitors, and any aspect of the environment needed to support
executives and managers making strategic decisions for an organization.

Tools like Google Trends, Alexa, Compete, can be used to determine general trends and analyze your
competitors on the web.

Here are useful resources:

Competitive Intelligence Metrics, Reports by Avinash Kaushik

37 Best Marketing Tools to Spy on Your Competitors from Kissmetrics

10 best competitive intelligence tools from 10 experts

Which data scientists do you admire most? Which are your favourite data science start-ups?
Answer by Gregory Piatetsky:
This question does not have a correct answer, but here is my personal list of 12 Data Scientists I most
admire, not in any particular order.
Geoff Hinton, Yann LeCun, and Yoshua Bengio - for persevering with Neural Nets and starting the current Deep Learning revolution.

Demis Hassabis, for his amazing work on DeepMind, which achieved human or superhuman
performance on Atari games and recently Go.

Jake Porway from DataKind and Rayid Ghani from U. Chicago/DSSG, for enabling data science
contributions to social good.

DJ Patil, First US Chief Data Scientist, for using Data Science to make US government work better.

Kirk D. Borne for his influence and leadership on social media.

Claudia Perlich for brilliant work on ad ecosystem and serving as a great KDD-2014 chair.

Hilary Mason for great work at Bitly and inspiring others as a Big Data Rock Star.

Usama Fayyad, for showing leadership and setting high goals for KDD and Data Science, which helped
inspire me and many thousands of others to do their best.

Hadley Wickham, for his fantastic work on Data Science and Data Visualization in R, including dplyr,
ggplot2, and Rstudio.

There are too many excellent startups in Data Science area, but I will not list them here to avoid a
conflict of interest.

Here is some of our previous coverage of startups.

How would you create a taxonomy to identify key customer trends in unstructured data?
Tweet: Data Science Interview questions #1 - How would you create a taxonomy to identify key
customer trends in unstructured data? - http://ctt.ec/sdqZ0+

The best way to approach this question is to mention that it is good to check with the business owner
and understand their objectives before categorizing the data. Having done this, it is always good to
follow an iterative approach: pull new data samples, improve the model accordingly, validate it for accuracy, and solicit feedback from the stakeholders of the business. This helps ensure that your model produces actionable results and improves over time.

How would you develop a model to identify plagiarism?


Explain the life cycle of a data science project.
What makes a dataset gold standard?
In experimental design, is it necessary to do randomization? If yes, why?
What are the most important skills for a data scientist to have?
When you get a new data set, what do you do with it to see if it will suit your needs for a
given project?
How do you model a quantity you can’t observe?
Do you have any other projects that would be related here?
Here you’ll really draw connections between your research and their business. Is there anything you
did or any skills you learned that could possibly connect back to their business or the role you are
applying for? It doesn’t have to be 100% exact, just somehow related such that you can show that you
will be able to directly add lots of value.

Explain your current masters research?


What worked? What didn’t? Future directions? Same as the last question!

Python or R – Which one would you prefer for text analytics?


The best possible answer for this would be Python, because it has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.

The DataFrame is a popular datatype for representing datasets in pandas. A DataFrame is analogous to a table. Each column of the DataFrame has a name (a header), and each row is identified by a number.
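A minimal sketch of that structure (the column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "document": ["doc1", "doc2", "doc3"],
    "word_count": [120, 87, 240],
})

print(df.head())                 # each column has a header; each row has a numeric index
print(df["word_count"].mean())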
What were the business outcomes or decisions for the projects you worked on?
What unique skills you think can you add on to our data science team?
What data [at the company] would you go after and start working on?
What’s a project you would want to work on at our company? What data would you go after
to start working on it?
What unique skills do you think you’d bring to the team?
Other
Why is vectorization considered a powerful method for optimizing numerical code?
What is inter-rater agreement?
A measurement of how often human raters agree when doing a task. If raters disagree, the task
instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater
reliability. See also Cohen's kappa, which is one of the most popular inter-rater agreement
measurements.

When can parallelism make your algorithms run faster? When could it make your algorithms
run slower?
Anmol Rajpurohit answers:

Parallelism is a good idea when the task can be divided into sub-tasks that can be executed
independent of each other without communication or shared resources. Even then, efficient
implementation is key to achieving the benefits of parallelization. In real-life, most of the programs
have some sections that need to be executed in serialized fashion, and the parallelizable sub-tasks
need some kind of synchronization or data transfer. Thus, it is hard to predict whether parallelization
will actually make the algorithm run faster (than the serialized approach).

Parallelism always has overhead compared to the compute cycles required to complete the task sequentially. At a minimum, this overhead comprises dividing the task into sub-tasks and combining the results of the sub-tasks.

The performance of parallelism against sequential computing is largely determined by how the time
consumed by this overhead compares to the time saved due to parallelization.

Note: The overhead associated with parallelism is not just limited to the run-time of code, but also
includes the extra time required for coding and debugging (parallelism versus sequential code).

A widely-known theoretical approach to assessing the benefit of parallelization is Amdahl's law, which gives the following formula to measure the speedup of running sub-tasks in parallel (over different processors) versus running them sequentially (on a single processor):

Slatency(s) = 1 / ((1 - p) + p / s)

where:

Slatency is the theoretical speedup of the execution of the whole task;

s is the speedup of the part of the task that benefits from improved system resources;
p is the proportion of execution time that the part benefiting from improved resources originally
occupied.
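A minimal sketch of that formula in Python (the values of p and s below are made up for illustration):

def amdahl_speedup(p, s):
    """Theoretical overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# e.g. 90% of the task parallelizes perfectly over 8 processors:
print(amdahl_speedup(p=0.9, s=8))   # about 4.7x, far below the naive 8x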

To understand the implication of Amdahl’s law, consider how the theoretical speedup behaves as the
number of processor cores grows for tasks with different levels of achievable parallelization: the
overall speedup is always bounded above by 1 / (1 - p), no matter how many cores are added.

It is important to note that not every program can be effectively parallelized. In fact, very few
programs scale with perfect speedups because of the limitations imposed by sequential portions, inter-
process communication costs, etc. Usually, large datasets make a compelling case for parallelization.
However, it should not be assumed that parallelization will automatically lead to performance benefits;
instead, compare the parallel and sequential versions on a subset of the problem before investing
effort into full parallelization.

How can you iterate over a list and also retrieve element indices at the same time?
This can be done with the built-in enumerate function, which wraps any iterable (such as a list) and
yields (index, element) pairs, so each element comes with its position attached.
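A short example (illustrative values):

fruits = ["apple", "banana", "cherry"]

for index, value in enumerate(fruits):        # default start is 0
    print(index, value)

for index, value in enumerate(fruits, start=1):  # optional start offset
    print(index, value)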

What do you understand by fuzzy merging? Which language will you use to handle it?
What is broadcasting?
Expanding the shape of an operand in a matrix math operation to dimensions compatible for that
operation. For instance, linear algebra requires that the two operands in a matrix addition operation
must have the same dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of
length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix
of shape (m,n) by replicating the same values down each column.

For example, given the following definitions, linear algebra prohibits A+B because A and B have
different dimensions:

A = [[7, 10, 4],
     [13, 5, 9]]
B = [2]

However, broadcasting enables the operation A+B by virtually expanding B to:

[[2, 2, 2],
[2, 2, 2]]

Thus, A+B is now a valid operation:

[[ 7, 10, 4],   +   [[2, 2, 2],   =   [[ 9, 12,  6],
 [13,  5, 9]]        [2, 2, 2]]        [15,  7, 11]]

See the description of broadcasting in NumPy for more details.
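The same example runs directly in NumPy (a short sketch, not part of the original answer):

import numpy as np

A = np.array([[7, 10, 4],
              [13, 5, 9]])
B = np.array([2])          # shape (1,), virtually expanded to shape (2, 3)

print(A + B)
# [[ 9 12  6]
#  [15  7 11]]

# Broadcasting also expands a length-n row vector against an (m, n) matrix:
row = np.array([1, 2, 3])  # shape (3,)
print(A + row)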

What is an Eigenvalue and Eigenvector?


Eigenvectors are used for understanding linear transformations. In data analysis, we usually compute
the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a
particular linear transformation acts purely by flipping, compressing or stretching. The corresponding
eigenvalue can be thought of as the strength of the transformation in the direction of its eigenvector,
i.e. the factor by which vectors along that direction are stretched or compressed.
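A small NumPy sketch (random data used only for illustration):

import numpy as np

X = np.random.rand(100, 3)
cov = np.cov(X, rowvar=False)           # 3x3 covariance matrix of the columns

eigenvalues, eigenvectors = np.linalg.eig(cov)
print(eigenvalues)          # strength of the transformation along each direction
print(eigenvectors[:, 0])   # column vector: the direction for the first eigenvalue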

What is Keras?
A popular Python machine learning API. Keras runs on several deep learning frameworks, including
TensorFlow, where it is made available as tf.keras.
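A minimal sketch, assuming TensorFlow 2.x with tf.keras; the layer sizes and binary-classification setup are arbitrary:

import tensorflow as tf

# A tiny fully-connected model defined and compiled with the Keras API.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()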

What is Cloud TPU?


Tensor Processing Unit (TPU): an ASIC (application-specific integrated circuit) that optimizes the
performance of TensorFlow programs.

Cloud TPU: specialized accelerator technology for speeding up machine learning workloads on Google Cloud.

What is the Dataset API (tf.data)?


A high-level TensorFlow API for reading data and transforming it into a form that a machine learning
algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each
element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a
Dataset.

For details about the Dataset API, see Importing Data in the TensorFlow Programmer's Guide.
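A minimal sketch, assuming TensorFlow 2.x, where you iterate over the Dataset directly instead of through an explicit tf.data.Iterator; the features and labels are made up:

import tensorflow as tf

features = tf.constant([[1.0], [2.0], [3.0], [4.0]])
labels = tf.constant([0, 0, 1, 1])

# Build a Dataset of (feature, label) elements, then shuffle and batch it.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=4).batch(2)

for batch_features, batch_labels in dataset:
    print(batch_features.numpy(), batch_labels.numpy())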

What is Metrics API (tf.metrics)?


A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how
often a model's predictions match labels. When writing a custom Estimator, you invoke Metrics API
functions to specify how your model should be evaluated.
What is Layers API (tf.layers)?
A TensorFlow API for constructing a deep neural network as a composition of layers. The Layers API
enables you to build different types of layers, such as:

 tf.layers.Dense for a fully-connected layer.

 tf.layers.Conv2D for a convolutional layer.

When writing a custom Estimator, you compose Layers objects to define the characteristics of all the
hidden layers.

The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all
functions in the Layers API have the same names and signatures as their counterparts in the Keras
layers API.

What is feature column (tf.feature_column)?


A function that specifies how a model should interpret a particular feature. A list that collects the
output returned by calls to such functions is a required parameter to all Estimator constructors.

The tf.feature_column functions enable models to easily experiment with different representations of
input features. For details, see the Feature Columns chapter in the TensorFlow Programmers Guide.

"Feature column" is Google-specific terminology. A feature column is referred to as a "namespace" in


the VW system (at Yahoo/Microsoft), or a field.
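A minimal sketch, assuming the Estimator-era tf.feature_column API; the feature names ("price", "color"), boundaries and vocabulary are made up:

import tensorflow as tf

numeric_col = tf.feature_column.numeric_column("price")
bucketized_col = tf.feature_column.bucketized_column(
    numeric_col, boundaries=[10.0, 100.0])
vocab_col = tf.feature_column.categorical_column_with_vocabulary_list(
    "color", vocabulary_list=["red", "green", "blue"])

# The list of feature columns is what gets passed to an Estimator constructor.
feature_columns = [numeric_col, bucketized_col,
                   tf.feature_column.indicator_column(vocab_col)]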

What is graph and graph execution?


Graph: in TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are
directed and represent passing the result of an operation (a Tensor) as an operand to another
operation. Use TensorBoard to visualize a graph.

Graph execution: a TensorFlow programming environment in which the program first constructs a graph
and then executes all or part of that graph. Graph execution is the default execution mode in TensorFlow 1.x.

Contrast with eager execution.

What is eager execution?


A TensorFlow programming environment in which operations run immediately. By contrast, operations
called in graph execution don't run until they are explicitly evaluated. Eager execution is an imperative
interface, much like the code in most programming languages. Eager execution programs are
generally far easier to debug than graph execution programs.

Write a program in Python which takes input as the diameter of a coin and weight of the coin
and produces output as the money value of the coin.
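A minimal sketch; the coin table below uses US-coin-like diameter/weight values purely as placeholders, and a real solution would use the actual mint specifications plus a tolerance suited to the measuring equipment:

COINS = [
    # (diameter_mm, weight_g, value)
    (19.05, 2.500, 0.01),
    (21.21, 5.000, 0.05),
    (17.91, 2.268, 0.10),
    (24.26, 5.670, 0.25),
]

def coin_value(diameter_mm, weight_g, tolerance=0.05):
    """Return the value of the coin whose diameter and weight are within
    `tolerance` (relative) of a known specification, else None."""
    for d, w, value in COINS:
        if abs(diameter_mm - d) / d <= tolerance and abs(weight_g - w) / w <= tolerance:
            return value
    return None

if __name__ == "__main__":
    d = float(input("Diameter (mm): "))
    w = float(input("Weight (g): "))
    print("Value:", coin_value(d, w))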
What problems arise if the distribution of the new (unseen) test data is significantly different
than the distribution of the training data?
Gregory Piatetsky and Thuy Pham answer:

The main problem is that the predictions will be wrong!

If the new test data differs sufficiently from the training data in the key parameters of the prediction
model, then the predictive model is no longer valid.
The main reasons this can happen are sample selection bias, population drift, or non-stationary
environment.

a) Sample selection bias


Here the data is static, but the training examples have been obtained through a biased method, such
as non-uniform selection or non-random split of data into train and test.

If you have a large static dataset, then you should randomly split it into train/test data, and the
distribution of test data should be similar to training data.

b) Covariate shift aka population drift


Here the data is not static, with one population used as a training data, and another population used
for testing.
(Figure from http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf).

Sometimes the training data and test data are derived via different processes - e.g. a drug tested on
one population is given to a new population that may have significant differences. As a result, a
classifier trained on the first population will perform poorly on the second.

One proposed solution is to apply a statistical test to decide if the probabilities of target classes and
key variables used by the classifier are significantly different, and if they are, to retrain the model
using new data.
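One hedged way to implement such a check (the specific test is not named in the original answer) is a two-sample Kolmogorov-Smirnov test on a key feature, e.g. with SciPy; the data below is synthetic:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
new_feature = rng.normal(loc=0.3, scale=1.0, size=5000)   # shifted population

statistic, p_value = ks_2samp(train_feature, new_feature)
if p_value < 0.01:
    print("Distributions differ significantly -> consider retraining", p_value)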

c) Non-stationary environments
Training environment is different from the test one, whether it's due to a temporal or a spatial change.

This is similar to case b, but applies to situations where the data is not static - we have a stream of data
and periodically sample it to develop predictive models of future behavior. This happens in adversarial
classification problems, such as spam filtering and network intrusion detection, where spammers and
hackers constantly change their behavior in response. Another typical case is customer analytics, where
customer behavior changes over time: a telephone company develops a model for predicting customer
churn, or a credit card company develops a model to predict transaction fraud. Training data is
historical data, while the (new) test data is the current data.

Such models need to be retrained periodically. To determine when, compare the distribution of the key
variables of the predictive model in the old data (training set) and in the new data; if there is a
sufficiently significant difference, the model needs to be retrained.

For a more detailed and technical discussion, see references below.


References:

[1] Marco Saerens, Patrice Latinne, Christine Decaestecker: Adjusting the Outputs of a Classifier to
New a Priori Probabilities: A Simple Procedure. Neural Computation 14(1): 21-41 (2002)

[2] Machine Learning in Non-stationary Environments: Introduction to Covariate Shift Adaptation,


Masashi Sugiyama, Motoaki Kawanabe, MIT Press, 2012, ISBN 0262017091, 9780262017091

[3] Quora answer to What could be some issues if the distribution of the test data is significantly
different than the distribution of the training data?

[4] Dataset Shift in Classification: Approaches and Problems, Francisco Herrera invited talk, 2011.

[5] When Training and Test Sets are Different: Characterising Learning Transfer, Amos Storkey, 2013.

How can you deal with different types of seasonality in time series modelling?
Seasonality in a time series occurs when the series shows a repeated pattern over time - e.g., stationery
sales decrease during the holiday season, air-conditioner sales increase during the summer, and so on.

Seasonality makes your time series non-stationary because the average value of the variable differs
across time periods. Differencing the time series is generally regarded as the best method for removing
seasonality: seasonal differencing takes the numerical difference between a particular value and the
value at a periodic lag (e.g., a lag of 12 when monthly data has yearly seasonality).
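A minimal pandas sketch (the synthetic monthly series below is made up) showing a lag-12 seasonal difference:

import pandas as pd
import numpy as np

idx = pd.date_range("2015-01-01", periods=48, freq="MS")
sales = pd.Series(100 + 10 * np.sin(2 * np.pi * idx.month / 12) + np.arange(48), index=idx)

seasonal_diff = sales.diff(12)   # value minus the value 12 periods (one year) earlier
print(seasonal_diff.dropna().head())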

What is Interpolation and Extrapolation?


Interpolation is estimating a value that lies between two known values in a set of values. Extrapolation is
approximating a value by extending a known set of values or facts beyond the range that is known.
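A small NumPy sketch (values made up): np.interp handles interpolation within the known range, and a fitted line is one simple way to extrapolate beyond it:

import numpy as np

xp = np.array([1.0, 2.0, 3.0])
fp = np.array([10.0, 20.0, 30.0])

print(np.interp(2.5, xp, fp))   # interpolation: 25.0, inside the known range

# np.interp clips outside the range, so extrapolate with a fitted line instead:
slope, intercept = np.polyfit(xp, fp, deg=1)
print(slope * 4.0 + intercept)  # extrapolation: 40.0, outside the known range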

Locally Weighted Learning (LWL)


Contents
My questions ........................................................................................................................................... 1
Whose job is it to make sure you have data? .................................................................................. 1
Who gets fired if all your insights aren’t used for anything? ........................................................... 1
Who picks the tools you use and makes sure they play nice with all the other infrastructure? ...... 1
Introduction ............................................................................................................................................. 1
Dimensionality reduction ......................................................................................................................... 3
1. What are dimensions? .................................................................................................................. 5
1. What is dimension reduction? ...................................................................................................... 5
1. Explain dimensionality reduction, where it’s used, and its benefits? ............................................ 5
2. What is the curse of dimensionality? ............................................................................................ 5
2. How do you combat the curse of dimensionality? ........................................................................ 6
3. What is the advantage of performing dimensionality reduction before fitting an SVM? ............... 6
4. Principal Component Analysis (PCA) ............................................................................... 6
ICA........................................................................................................................................................ 8
Partial Least Squares Regression (PLSR)............................................................................................... 8
Sammon Mapping................................................................................................................................ 8
Multidimensional Scaling (MDS) .......................................................................................................... 8
Projection Pursuit ................................................................................................................................ 8
Principal Component Regression (PCR) ............................................................................................... 8
Partial Least Squares Discriminant Analysis ......................................................................................... 8
Mixture Discriminant Analysis (MDA) .................................................................................................. 8
Quadratic Discriminant Analysis (QDA) ................................................................................................ 8
Regularized Discriminant Analysis (RDA) ............................................................................................. 8
Flexible Discriminant Analysis (FDA) .................................................................................................... 8
Linear Discriminant Analysis (LDA) ....................................................................................................... 8
Classification ............................................................................................................................................ 8
What is classification model? ........................................................................................................... 8
What is class? Negative class and Positive class? .............................................................................. 8
What is binary classification?............................................................................................................ 8
What is decision boundary? ............................................................................................................. 9
What is classification threshold (decision threshold)? ...................................................................... 9
What is one-vs.-all? .......................................................................................................................... 9
What is rotational invariance, translational invariance and size invariance?..................................... 43
Decision Tree ....................................................................................................................................... 9
What is decision tree? ...................................................................................................................... 9
Explain the steps in making a decision tree. ................................................................................... 10
How do you work towards a random forest?.................................................................................. 10
Classification and Regression Tree (CART) ..................................................................................... 11
Iterative Dichotomiser 3 (ID3)........................................................................................................ 11
C4.5 ................................................................................................................................................ 11
C5.0 ................................................................................................................................................ 11
Chi-squared Automatic Interaction Detection (CHAID).................................................................. 11
Decision Stump .............................................................................................................................. 11
Conditional Decision Trees ............................................................................................................ 11
M5.................................................................................................................................................. 11
Bayes.................................................................................................................................................. 11
1. Naïve Bayes ................................................................................................................................ 11
1. Is Naïve Bayes bad? If yes, under what aspects........................................................................... 11
1. What is prior belief? ................................................................................................................... 11
1. What do you understand by conjugate-prior with respect to Naïve Bayes? ................................ 11
2. What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?
....................................................................................................................................................... 11
Averaged One-Dependence Estimators (AODE) ............................................................................ 11
Bayesian Belief Network (BBN) ...................................................................................................... 11
Gaussian Naïve Bayes .................................................................................................................... 11
Multinomial Naïve Bayes ............................................................................................................... 11
What is Bayesian Neural Network (BN)? ......................................................................................... 11
Instance based ................................................................................................................................... 11
k-Nearest Neighbour (kNN) ............................................................................................................ 12
Learning Vector Quantization (LVQ) .............................................................................................. 12
Logistic Regression............................................................................................................................. 11
What is logistic regression? What log loss is for? ............................................................................ 11
What is cross-entropy? .................................................................................................................. 12
Support Vector Machines .................................................................................................................. 12
What is Kernel Support Vector Machines (KSVMs)? ....................................................................... 12
Give some situations where you will use an SVM over a RandomForest Machine Learning algorithm
and vice-versa. ............................................................................................................................... 12
Clustering ............................................................................................................................................... 12
Explain this clustering algorithm? ................................................................................................... 12
What is clustering? ......................................................................................................................... 12
How will you define the number of clusters in a clustering algorithm? ........................................... 13
In unsupervised learning, if a ground truth about a dataset is unknown, how can we determine the
most useful number of clusters to be? ........................................................................................... 14
What is the difference between Cluster and Systematic Sampling? .............................................. 16
What is the difference between Supervised Learning and Unsupervised Learning? ....................... 16
What is similarity measure? ........................................................................................................... 16
K-means ............................................................................................................................................. 17
What is K-means? How can you select K for K-means? ................................................................... 17
How will you find the right K for K-means? ..................................................................................... 18
What is centroid? What is centroid-based clustering?.................................................................... 18
K-Medians .......................................................................................................................................... 18
What is k-median? ......................................................................................................................... 18
Mean-Shift ......................................................................................................................................... 19
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) .............................................. 19
Expectation Maximization (EM) using Gaussian Mixture Models (GMM)............................................ 20
Hierarchical Clustering (Agglomerative clustering and Divisive clustering) ......................................... 20
Regression ............................................................................................................................................. 20
What is regression? ........................................................................................................................ 20
Explain tradeoffs between different types of regression models and different types of
classification models. ..................................................................................................................... 30
What is linear regression? .............................................................................................................. 21
What is Linear Regression? ............................................................................................................. 21
What are the basic assumptions to be made for linear regression?................................................ 21
What are the assumptions required for linear regression? ............................................................. 21
What is multicollinearity and how you can overcome it?................................................................ 21
What are the drawbacks of the linear model? ................................................................................ 21
What is generalized linear model?................................................................................................. 31
How will you explain logistic regression to an economist, physics scientist and biologist? ............. 21
What is logistic regression? Or State an example when you have used logistic regression recently.21
What is logits? ................................................................................................................................ 21
Is it possible to perform logistic regression with Microsoft Excel? ................................................ 31
How would you validate a model you created to generate a predictive model of a quantitative
outcome variable using multiple regression. .................................................................................. 21
You created a predictive model of a quantitative outcome variable using multiple regressions. What
are the steps you would follow to validate the model? .................................................................. 30
How do you decide whether your linear regression model fits the data? ....................................... 30
How can you assess a good logistic model? .................................................................................... 30
How you can make data normal using Box-Cox transformation? .................................................. 30
Explain about the box cox transformation in regression models. .................................................. 30
Least squares regression ................................................................................................................ 31
Linear Regression ............................................................................................................................... 31
Ordinary Least Squares Regression (OLSR) ........................................................................................ 31
Stepwise Regression .......................................................................................................................... 31
Multivariate Adaptive Regression Splines (MARS) ............................................................................. 31
Locally Estimated Scatterplot Smoothing (LOESS) ............................................................................. 31
Deep Learning ........................................................................................................................................ 31
What is Tower? .............................................................................................................................. 31
Activation functions............................................................................................................................ 31
What is activation function? .......................................................................................................... 31
Linear ............................................................................................................................................. 32
What is Rectified Linear Unit (ReLU)? ............................................................................................. 32
Why is ReLU better and more often used than Sigmoid in Neural Networks? ................................ 32
Step function ................................................................................................................................. 32
Threshold logic............................................................................................................................... 32
What is Sigmoid Function? ............................................................................................................ 32
What is log-odds? .......................................................................................................................... 32
Optimization techniques .................................................................................................................... 33
What is optimizer? ......................................................................................................................... 33
Stochastic Gradient Descent .......................................................................................................... 33
What is Mini- batch stochastic gradient descent (SGD)? ............................................................... 33
Stochastic Gradient Descent (SGD) with momentum .................................................................... 33
Adam.............................................................................................................................................. 33
RMSprop ........................................................................................................................................ 33
Adadelta......................................................................................................................................... 33
Gradient descent ............................................................................................................................... 33
What is loss surface? How does gradient descent work? .............................................................. 33
What is gradient and gradient descent? ........................................................................................ 33
What is exploding gradient problem and vanishing gradient problem? ........................................ 34
What is convex optimization and convex set? ............................................................................... 34
What is AdaGrad? .......................................................................................................................... 35
Do gradient descent methods always converge to same point? ................................................... 35
Neural Networks ................................................................................................................................ 35
1. What is Neural network? ............................................................................................................ 35
1. What is deep model and deep neural network? How do I build a deep neural network? ........... 35
1. What is perceptron? ................................................................................................................... 35
1. What is layer? ............................................................................................................................. 36
1. What is input layer, dense layer (fully connected layer) and output layer? What is depth and
width? ............................................................................................................................................ 36
1. What is calibration layer? ........................................................................................................... 36
What is active learning? ................................................................................................................. 36
2. Feedforward Neural Networks (FFN) .......................................................................................... 37
2. What is backpropagation? .......................................................................................................... 37
3. What is epoch? ........................................................................................................................... 37
3. What is learning rate? ................................................................................................................ 37
3. Learning Rate Decay ................................................................................................................... 37
3. What is co-adaption?.................................................................................................................. 37
3. Dropout ...................................................................................................................................... 37
3. Pruning ....................................................................................................................................... 37
3. What is batch and batch size? .................................................................................................... 38
3. What is Batch Normalization?..................................................................................................... 38
3. What is batch normalization and why does it work? ................................................................... 38
4. What is Long Short-Term Memory? ............................................................................................ 38
4. What is forget gate? ................................................................................................................... 39
Skip-gram ....................................................................................................................................... 39
5. Transfer Learning........................................................................................................................ 39
Radial Basis Function Network (RBFN) ........................................................................................... 39
Hopfield Network........................................................................................................................... 39
Artificial Neural Network (ANN) ......................................................................................................... 39
Self-Organizing Map (SOM)............................................................................................................ 39
Convolutional Neural Network (CNN) ................................................................................................ 39
6. What is Convolution? ................................................................................................................. 39
6. What is Convolutional Neural Network? ..................................................................................... 39
6. What is convolutional filter and convolutional layer? How convolutional operation works?....... 40
6. Why would you use many small convolutional kernels such as 3x3 rather than a few large ones?
....................................................................................................................................................... 41
6. Why do we use convolutions for images rather than just FC layers?........................................... 41
6. What makes CNNs translation invariant?.................................................................................... 42
How CNN use shared weights as a extension across space to standard Neural Network? ............ 42
7. What is pooling? ......................................................................................................................... 42
7. Max Pooling ................................................................................................................................ 43
7. Why do we have max-pooling in classification CNNs?................................................................. 43
Why do segmentation CNNs typically have an encoder-decoder style / structure? ...................... 43
What is the significance of Residual Networks? ............................................................................ 43
What is depthwise separable convolutional neural network (sepCNN)? ...................................... 43
Recurrent Neural Network (RNN) ....................................................................................................... 43
What is Recurrent Neural Network and timestep? ........................................................................ 44
What is special about RNN which makes it good in recognize sequences in time (speech signal,
texts)? ............................................................................................................................................ 45
How short memory works in RNN? ................................................................................................ 45
Recursive Neural Network .................................................................................................................. 45
Generative Adversarial Networks ....................................................................................................... 45
What is Generative Adversarial Networks (GAN)? ......................................................................... 45
What is Wasserstein loss?.............................................................................................................. 45
What is discriminator? ................................................................................................................... 45
Deep Boltzmann Machine (DBM) ...................................................................................................... 45
Deep Belief Networks (DBN) .............................................................................................................. 45
Stacked Auto-Encoders ...................................................................................................................... 45
Reinforcement ................................................................................................................................... 45
What is reinforcement learning? ................................................................................................... 45
What is candidate sampling? Full softmax, softmax ...................................................................... 46
Markov Decision Processes ................................................................................................................ 46
Recommender algorithms ..................................................................................................................... 46
What are Recommender Systems? ................................................................................................. 46
What is a recommendation engine? How does it work? ................................................................. 46
What is candidate generation, scoring and re-ranking? .................................................................. 47
What is Collaborative filtering? ...................................................................................................... 47
What are items, item matrix and user matrix? ............................................................................... 48
What is matrix factorization?.......................................................................................................... 48
What is Weighted Alternating Least Squares (WALS)? .................................................................... 49
NLP text processing ............................................................................................................................... 48
What is Natural language understanding? ...................................................................................... 49
Continuous Bag Of Words .............................................................................................................. 49
What is bag of words? .................................................................................................................... 49
What is N-gram and bigram? .......................................................................................................... 50
What are embeddings? .................................................................................................................. 50
What is embedding space?............................................................................................................. 50
What is crash blossom? ................................................................................................................. 51
What is sentiment analysis? .......................................................................................................... 51
Statistics ................................................................................................................................................. 51
How would you use either the extreme value theory, Monte Carlo simulations or mathematical
statistics (or anything else) to correctly estimate the chance of a very rare event? ........................ 51
Explain the use of Combinatorics in data science. .......................................................................... 52
What is the Law of Large Numbers? ............................................................................................... 52
What is Pearson correlation coefficient? How to calculate it having two lists, regression lines etc.?
....................................................................................................................................................... 52
What does P-value signify about the statistical data? ..................................................................... 52
Are expected value and mean value different? ............................................................................. 52
Explain what resampling methods are and why they are useful. Also explain their limitations. ... 52
How are confidence intervals constructed and how will you interpret them? ................................ 52
Parametric Tests................................................................................................................................. 52
Mean Tests .................................................................................................................................... 52
Variance Tests ................................................................................................................................ 52
Population proportion ................................................................................................................... 52
Non-parametric tests ......................................................................................................................... 52
Properties tests .............................................................................................................................. 52
Comparison tests ........................................................................................................................... 53
Distributions ....................................................................................................................................... 53
What is the difference between skewed and uniform distribution? ............................................. 53
What do you understand by the term Normal Distribution? ......................................................... 53
Correlations ........................................................................................................................................ 54
Parametric ..................................................................................................................................... 54
Non-parametric ............................................................................................................................. 54
Approaches .......................................................................................................................................... 54
Bayesian ......................................................................................................................................... 54
Frequentist..................................................................................................................................... 54
Likelihood....................................................................................................................................... 54
A/B tests ............................................................................................................................................ 54
How will you explain an A/B test to an engineer who does not know statistics?............................. 54
What is the goal of A/B Testing?..................................................................................................... 54
What is the goal of A/B Testing?..................................................................................................... 54
How can you prove that one improvement you've brought to an algorithm is really an improvement
over not doing anything?................................................................................................................ 54
In an A/B test, how can we ensure that assignment to the various buckets is truly random? ......... 56
How would you conduct an A/B test on an opt-in feature? ............................................................ 57
Model selection and Validation ............................................................................................................. 58
Bias .................................................................................................................................................... 58
What is bias (ethics/fairness)? ....................................................................................................... 58
What is reporting bias? .................................................................................................................. 59
What is prediction bias? ................................................................................................................ 59
What is confirmation bias (experimenter’s bias)? ......................................................................... 59
What is bias (math)? ...................................................................................................................... 59
What is group attribution bias, out-group homogeneity bias and in-group bias? ......................... 59
What is Automation bias? .............................................................................................................. 60
What are bias and variance, and what are their relation to modeling data?................................. 60
What’s the trade-off between bias and variance? ......................................................................... 62
What are the types of biases that can occur during sampling? ..................................................... 62
Explain selective bias. .................................................................................................................... 62
What is selection bias? .................................................................................................................. 63
What is the importance of having a selection bias? ...................................................................... 63
What is selection bias, why is it important and how can you avoid it? .......................................... 63
How do data management procedures like missing data handling make selection bias worse?... 63
What is the difference between bias and underfitting? And, analogously, what is the difference
between variance and overfitting? Do the terms of each pair mean the same thing? If not, what is
the difference? ............................................................................................................................... 64
Error ................................................................................................................................................... 64
What is convergence?.................................................................................................................... 64
What is the difference between squared error and absolute error? ............................................. 65
What do you understand by statistical power of sensitivity and how do you calculate it? ........... 65
What is accuracy? .......................................................................................................................... 65
What is statistical power? .............................................................................................................. 65
Can you cite some examples where a false negative is more important than a false positive? ......... 65
Can you cite some examples where a false positive is more important than a false negative? ......... 66
Now what if they have sent it to false positive cases? ................................................................... 66
Can you cite some examples where both false positive and false negatives are equally important?
....................................................................................................................................................... 66
Explain what a false positive and a false negative are. Why is it important to differentiate these
from each other? ........................................................................................................................... 67
Explain what precision and recall are. How do they relate to the ROC curve? .............................. 67
What is AUC (Area under the ROC Curve)? .................................................................................... 69
Is it better to have too many false negatives or too many false positives? ................................... 70
Is it better to have too many false positives, or too many false negatives? Explain. ..................... 70
What do you understand by Recall and Precision? ........................................................................ 70
What error metric would you use to evaluate how good a binary classifier is? ............................ 71
What method do you use to determine whether the statistics published in an article (or appeared
in a newspaper or other media) are either wrong or presented to support the author's point of
view, rather than correct, comprehensive factual information on a specific subject? .................. 71
A test has a true positive rate of 100% and false positive rate of 5%. There is a population with a
1/1000 rate of having the condition the test identifies. Considering a positive test, what is the
probability of having that condition? ............................................................................................. 74
Under- and Overfitting....................................................................................................................... 74
Explain overfitting and underfitting and how to combat them? .................................................... 74
How can you overcome Overfitting? ............................................................................................. 75
Why might it be preferable to include fewer predictors over many? ............................................ 75
What is overfitting and how to avoid it? ........................................................................................ 76
Explain what is overfitting and how would you control for it ........................................................ 78
Validation ........................................................................................................................................... 80
What is validation? ........................................................................................................................ 80
How do I determine whether my model is effective? .................................................................... 80
What is perplexity? ........................................................................................................................ 80
What is convenience sampling?..................................................................................................... 80
Discuss various numerical optimization techniques. Show understanding of training, testing, and
validation of results........................................................................................................................ 80
Can you explain the difference between a Test Set and a Validation Set? .................................... 80
Explain cross-validation.................................................................................................................. 81
What is cross-validation? ............................................................................................................... 82
Why is resampling done? ............................................................................................................... 82
Explain what resampling methods are and why they are useful. Also explain their limitations. ... 82
Metrics ............................................................................................................................................... 83
What is metric? .............................................................................................................................. 83
Entropy .......................................................................................................................................... 83
Gini index ....................................................................................................................................... 83
Information gain ............................................................................................................................ 83
Variance reduction......................................................................................................................... 83
Classification error ......................................................................................................................... 83
Selection ............................................................................................................................................ 83
Mallow’s Cp ................................................................................................................................... 83
Akaike Information Criterion ......................................................................................................... 83
Bayesian Information Criterion ...................................................................................................... 83
Can you write the formula to calculate R-squared? ........................................................................ 83
Adjusted R^2 .................................................................................................................................. 84
Cross-Validation ............................................................................................................................. 84
Optimization of data and model ............................................................................................................ 84
What is model, model capacity and model function?.................................................................... 84
What is loss (cost), and how do I measure it? What is objective function? ................................... 84
What is empirical risk minimization (ERM) and Structural risk minimization (SRM)? ........................ 85
Regularization .................................................................................................................................... 85
What is regularization and regularization rate? ............................................................................. 85
What is convex function?............................................................................................................... 85
What is L1, L2 (squared loss) and L1 regularization and L2 regularization? ................................... 86
What is regularization, why do we use it, and give some examples of common methods? .......... 87
What is Regularization and what kind of problems does regularization solve? ............................. 87
What are the advantages and disadvantages of using regularization methods like Ridge
Regression?.................................................................................................................................... 87
Why L1 regularizations causes parameter sparsity whereas L2 regularization does not? ............. 87
Explain what regularization is and why it is useful. ........................................................................ 87
What is dropout regularization? .................................................................................................... 88
What is early stopping? ................................................................................................................. 88
Ridge Regression ............................................................................................................................ 88
Least Absolute Shrinkage and Selection Operator (LASSO) ........................................................... 88
Elastic Net ...................................................................................................................................... 88
Least Angle Regression (LARS) ....................................................................................................... 88
Tikhonov Regularisation ................................................................................................................. 88
Feature .............................................................................................................................................. 88
What are a feature and an example? How does a feature differ from a feature set and a feature vector? ... 88
What are a label and a labelled example? What is a proxy label? ............................................... 89
What is feature engineering and feature extraction? .................................................................... 89
How do I represent my data so that a program can learn from it?................................................ 89
What are continuous and discrete features? What are dense and sparse features? .................. 89
How do you control model complexity? ........................................................................................ 90
What are binning and bucketing? ................................................................................................ 90
What is a synthetic feature? ........................................................................................................ 90
What is a feature cross? .............................................................................................................. 90
How can you determine which features are the most important in your model? ......................... 90
What are feature vectors? ............................................................................................................. 91
What do you understand by feature vectors? ............................................................................... 91
What is data normalization (scaling) and why do we need it? .................................................... 91
What do you understand by long and wide data formats? ............................................................ 92
What is the difference between "long" ("tall") and "wide" format data? ...................................... 92
Differentiate between wide and tall data formats? ....................................................................... 93
How would you go about doing an Exploratory Data Analysis (EDA)? ........................................... 93
Explain Principal Component Analysis (PCA)? ................................................................................ 93
What are confounding variables? .................................................................................................. 93
What is numerical data (continuous features)? ............................................................................. 93
Categorical ......................................................................................................................................... 94
What is one-hot encoding?............................................................................................................ 94
What are categorical variables (categorical data)? ........................................................................ 94
How will you find the correlation between a categorical variable and a continuous variable? ..... 94
You can use the analysis of covariance technique to find the correlation between a categorical
variable and a continuous variable. ............................................................................................... 94
Which technique is used to predict categorical responses? .......................................................... 94
Missing ............................................................................................................................................... 94
How do you handle missing or corrupted data in a dataset? ........................................................ 94
During analysis, how do you treat missing values? ........................................................................ 95
What are your favourite imputation techniques to handle missing data? .................................... 95
Outlier ................................................................................................................................................ 95
What are outliers? ......................................................................................................................... 95
How can outlier values be treated? ............................................................................................... 95
What do you understand by outliers and inliers? What would you do if you find them in your
dataset? ......................................................................................................................................... 95
How would you screen for outliers and what should you do if you find one? ............................... 95
How would you screen for outliers and what should you do if you find one? ............................... 96
What is clipping?............................................................................................................................ 96
What are some ways I can make my model more robust to outliers?........................................... 96
Imbalanced ........................................................................................................................................ 97
What is a class-imbalanced dataset? What are the minority and majority classes? ................... 97
How would you handle an imbalanced dataset? ........................................................................... 98
What is data augmentation?.......................................................................................................... 98
What is downsampling? ................................................................................................................. 98
What is a confusion matrix? ........................................................................................................ 98
What error metric would you use to evaluate how good a binary classifier is? What if the classes
are imbalanced? What if there are more than 2 groups? .............................................................. 99
Hyperparameter optimization ......................................................................................................... 100
What is a parameter and how does it differ from a hyperparameter? ...................................... 100
What is fine tuning? ..................................................................................................................... 100
What is checkpoint? .................................................................................................................... 100
Ensemble ......................................................................................................................................... 101
What is ensemble? ...................................................................................................................... 101
What is the idea behind ensemble learning?............................................................................... 101
Random Forest............................................................................................................................. 102
Gradient Boosting Machines (GBM) ............................................................................................ 102
What is Boosting? ........................................................................................................................ 102
Bootstrapped Aggregation (Bagging) ........................................................................................... 102
AdaBoost...................................................................................................................................... 102
Stacked Generalization (Blending) ............................................................................................... 102
Gradient Boosted Regression Trees (GBRT) ................................................................................. 102
General ................................................................................................................................................ 102
Analysis ............................................................................................................................................ 102
What is data analysis?.................................................................................................................. 102
What is power analysis? ............................................................................................................... 102
What is root cause analysis? ........................................................................................................ 102
Why does data cleaning play a vital role in analysis? .................................................................. 103
Differentiate between univariate, bivariate and multivariate analysis. ......................................... 103
Big data ............................................................................................................................................ 104
Explain star schema. ..................................................................................................................... 104
What is baseline model? .............................................................................................................. 104
Is more data always better? ......................................................................................................... 104
What are some of the common data quality issues when dealing with Big Data? What can be done
to avoid them or to mitigate their impact? ................................................................................... 104
Tell us about the biggest data set you have processed to date and for what kind of analysis. .... 108
How do you handle big data sets? ................................................................................................ 108
Machine Learning ............................................................................................................................ 108
What are differences between discriminative and generative models? ...................................... 108
What is few-shot learning and one-shot learning? ...................................................................... 108
How does machine learning differ from traditional programming? ............................................ 109
Input/Output in Machine Learning .............................................................................................. 109
Problem types in Machine Learning ............................................................................................ 109
How do you know which Machine Learning model you should use? .......................................... 109
Differentiate between Data Science, Machine Learning and AI. ................................................ 109
What is Machine Learning? ......................................................................................................... 109
Can you use machine learning for time series analysis? .............................................................. 110
What do you understand by a Hypothesis in the context of Machine Learning? ....................... 111
Which is your favourite machine learning algorithm and why?................................................... 111
Visualization ..................................................................................................................................... 111
What are your favourite data visualization tools? ....................................................................... 111
What makes a good data visualization? ....................................................................................... 111
Explain Edward Tufte's concept of "chart junk." .......................................................................... 113
How do we make a good data visualization? ............................................................................... 114
Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs).
How can you efficiently represent 5 dimensions in a chart (or in a video)? ............................... 116
Work ................................................................................................................................................ 119
What is a pipeline? ..................................................................................................................... 119
What is a dynamic model? ......................................................................................................... 119
Give an example of how you would use experimental design to answer a question about user
behavior. ...................................................................................................................................... 119
How to determine the influence of a Twitter user?..................................................................... 120
How would you explain to senior management in your organization why a particular data set is important? ... 123
What kind of data is important for specific business requirements, and how, as a data scientist, will you go about collecting that data? ... 123
How can you ensure that you don’t analyse something that ends up producing meaningless
results? ........................................................................................................................................ 123
What types of data are important for business needs? ............................................................... 123
Suppose you are given a data set: what will you do with it to find out whether it suits the business needs of your project? ... 123
How will you assess the statistical significance of an insight, that is, whether it is a real insight or just chance? ... 123
In which libraries for Data Science in Python and R does your strength lie? ............................. 123
Why do you want to pursue a career in data science? ................................................................ 123
What have you done to upgrade your skills in analytics? ............................................................ 123
What has been the most useful business insight or development you have found? ................... 123
How regularly must an algorithm be updated? ........................................................................... 123
Are you familiar with price optimization, price elasticity, inventory management, competitive
intelligence? Give examples. ........................................................................................................ 123
Which data scientists do you admire most? Which are your favourite data science start-ups? . 124
How would you create a taxonomy to identify key customer trends in unstructured data? ....... 125
How would you develop a model to identify plagiarism? ............................................................ 125
Explain the life cycle of a data science project............................................................................. 125
What makes a dataset gold standard?......................................................................................... 125
In experimental design, is it necessary to do randomization? If yes, why? ................................. 125
What are the most important skills for a data scientist to have? ................................................ 125
When you get a new data set, what do you do with it to see if it will suit your needs for a given
project?........................................................................................................................................ 125
How do you model a quantity you can’t observe? ...................................................................... 125
Do you have any other projects that would be related here? ..................................................... 125
Explain your current masters research? ...................................................................................... 125
Python or R – Which one would you prefer for text analytics? ................................................... 125
What were the business outcomes or decisions for the projects you worked on? ..................... 126
What unique skills do you think you can add to our data science team? .................................. 126
What data [at the company] would you go after and start working on? ..................................... 126
What’s a project you would want to work on at our company? What data would you go after to
start working on it? ...................................................................................................................... 126
What unique skills do you think you’d bring to the team? .......................................................... 126
Other................................................................................................................................................ 126
Why is vectorization considered a powerful method for optimizing numerical code? ................ 126
What is inter-rater agreement? ................................................................................................... 126
When can parallelism make your algorithms run faster? When could it make your algorithms run
slower? ........................................................................................................................................ 126
How can you iterate over a list and also retrieve element indices at the same time?................. 127
What do you understand by Fuzzy merging? Which language will you use to handle it? .......... 127
What is broadcasting? ................................................................................................................. 127
What are an Eigenvalue and an Eigenvector? ............................................................................ 128
What is Keras? ............................................................................................................................. 128
What is Cloud TPU? ..................................................................................................................... 128
What is the Dataset API (tf.data)? .............................................................................................. 128
What is the Metrics API (tf.metrics)? ......................................................................................... 128
What is the Layers API (tf.layers)? ............................................................................................. 129
What is a feature column (tf.feature_column)? ........................................................................ 129
What are a graph and graph execution? ................................................................................... 129
What is eager execution? ............................................................................................................ 129
Write a program in Python that takes the diameter and weight of a coin as input and produces the monetary value of the coin as output. ... 129
What problems arise if the distribution of the new (unseen) test data is significantly different from the distribution of the training data? ... 129
How can you deal with different types of seasonality in time series modelling? ........................ 131
What is Interpolation and Extrapolation? .................................................................................... 131
Locally Weighted Learning (LWL) ................................................................................................. 131