
Why this Workshop?

Overview
Technical Round
Skill Assessment Round
Behavioral Round
Conclusion
Are you preparing for interviews for analytics and data science roles in the coming days? I bet you are!
Why would you attend this workshop otherwise? Here is something I will confess: I envy you! Yes,
you heard it right. I envy you for two reasons:

1. Firstly, you have access to this workshop, which I never had. I wish someone had created it
for me. I figured out the tips and tricks which I discuss in this workshop through experience,
through my mentors, or through reading. Believe me, there is no guide or workshop as
comprehensive as this one!
2. Secondly, you are joining the data science industry in an era when the exponential growth
of data has just taken off. It is just the right time to join the industry and be part of the coming
growth. When I passed out of college, no one had heard of the term data science -
statisticians ruled data experiments, and any tool beyond Excel was either geeky or far too
expensive. You are stepping into an ecosystem where people help each other openly and
freely for the betterment of everyone. The timing couldn't have been better!

Types of Interviews
Overview

Recruiters use various practices to assess a candidate's fit for the job. Remember, core
technical knowledge will get you through just one stage of the interview rounds. Recruiters expect you to
know a lot more than that. Beyond technical knowledge, they want you to be well informed, good
with numbers, empathetic, a problem solver, friendly, cooperative, etc.
These interview stages are set up with the most effective evaluation tests. They can be verbal, written
or presentation driven. Thus, it's essential for you to learn about these devils coming your way.
Following are the types:

Technical Interview: As the name suggests, such interviews assess & rate your subject
knowledge, i.e. questions on statistics, data science and machine learning are asked. The
range of technical questions varies according to the job role you've applied for. Candidates
might also be required to demonstrate their prowess with tools (R/Python/SAS/MATLAB).
Generally, recruiters unsettle candidates with basic questions asked in a tricky way. Don't
worry, such questions are covered in this workshop.

Behavioral Interview: Behavioral interviews are conducted to assess your non-technical
skills such as your behaviour, attitude, learning appetite, thought process and ownership of
situations that you will encounter in the workplace. Over the years, the behavioral interview has
become a crucial part of screening job candidates. As a matter of fact, even before a
candidate is evaluated technically, employers want to predict the candidate's future job
performance based on previous behaviour. Simply put, the employer wants to assess whether
you are ready to face workplace challenges. There are several ways to succeed in
behavioral interviews, but the most common route to failure is lying about yourself.
You can learn more about this in the upcoming Behavioral section.

Skill Assessment Interviews: Such interviews offer the most intriguing, controversial and
challenging questions to solve. They comprise logical puzzles, guess-estimates, business
case studies etc. Solving such problems doesn't require technical knowledge, only an
analytical mindset. The ideal skill required to ace such interview rounds is math. As long as
you struggle with numbers, your interview progress will stall at this stage. But don't worry,
this workshop has several practice questions to help you get better.

Resume Preparation
Step 2: Tailor your resume
Once your basics are in place, it is time to customize your resume according to the job role
you are applying for. Candidates often use one resume for all the job profiles they apply to.
This leads to nothing but disappointing results. The sole reason: employers are more
interested in the skills which can get their job done than in your general, unrelated skills.

To begin with, it's a good practice to emphasize and highlight the projects which showcase the
skills required for the role you are applying for. There are broadly two classes of roles available in
the analytics industry currently:
1. Technical data science roles: These roles require you to spend time working on / leading
data science projects. You need to slice and dice data to come up with insights for your
customers / stakeholders. If you are applying to such roles, here are a few tips:

1. Focus on showcasing your key projects: Identify and select a group of projects (5-10)
and mention only them. Highlight the key benefits or uniqueness of each of these projects.
2. Illustrate depth of knowledge: These roles require you to possess deep knowledge
of a topic.
3. Showcase technical skills: Mention the types of tools you have worked with and your
level of expertise with them.
4. Get (meaningful) certifications and add them to your profile: look out for courses
running on platforms like Coursera, Udacity and edX and complete them along with their
assignments.
5. Create a GitHub profile and share your past work: share whatever you can on GitHub
(e.g. projects done while taking courses on Coursera, edX etc.), code from Kaggle
competitions, and tutorials or blogs you have written in the past.

2. Non-technical data science roles: These comprise roles which are either client facing
or related to selling analytics solutions. If you are applying for jobs like these, you don't
necessarily need to showcase your expertise in a programming language. What you need
to showcase are the skills which are critical to that role. For example, a client-facing
analytics consultant would need to be exceptionally good at structured thinking. A pre-sales
person would need the skill to influence people with data-based stories. Here are a few
tips to make your resume stand out for these roles:

1. Focus on breadth of knowledge rather than depth: These roles typically require
you to know the entire spectrum of data science fields, so that you can understand
the client's requirement well and advise accordingly. Therefore, you should focus on
showcasing the breadth of your knowledge rather than its depth.
2. Emphasize the business impact from your previous engagements: This could
be the biggest sale you made, the number of clients you consulted or any metric
which directly impacts the top line (or bottom line in some rare cases).
3. If you have written blogs / views on the subject, you should include them in your
resume too.
Step 3: Stepping up the game
By now, you should have a basic resume ready and you should have emphasized the right skills
for the role. Next, focus on the examples and techniques below to take your resume to the next
level. Consider these examples as high-risk, high-return strategies. You should not attempt them
until your basic resume is ready.

If you do these things well, you can get tremendous returns. However, if you do a bad job of
executing them, they can certainly backfire. The main purpose of these techniques is to
distinguish your resume from the pile of competitive resumes which employers receive for good
positions.

Example 1: Create an Infographic about yourself


Here is a good example of an infographic designed by a graphic designer. He has highlighted his
skills and knowledge of tools.

Why does an infographic like this work? It works because creating such a piece of design shows
that you can take a lot of data and represent it in a clear and effective manner, a skill every data
scientist or manager needs to possess.
Example 2: Creating a dashboard / story in the tool of your choice
Consider your resume as your real estate for showcasing yourself in a limited space, something
similar to what you need to do in business dashboards. So, if you can create an interactive
dashboard about yourself (with drill-downs), you can again drive your point home. Here is a
resume from Dmitry Gudkov, which was available on the internet:

You can also use the story-telling features of QlikSense or Tableau. This not only shows that you can
visualize data effectively, but also that you know about the latest features in some of these tools.

Here is another example of a resume which an applicant created while applying for a digital
marketing manager job profile. Google Analytics is the most commonly used tool for digital
marketing, so he created his resume in the form of a dashboard which closely resembles the
Google Analytics dashboard.
With these examples, you should now understand what we meant by a high-risk, high-return
strategy. If you do pursue something like this, you will need to spend a lot of time deciding what
information should stay on the resume and what should go. However, that is time well spent.

After all, if you do create a nice visualization about yourself, it will definitely set you apart from the
crowd.

R/Python
Overview
In analytics, having theoretical knowledge is valuable. But failing to unify it with practical
implementation does no good. Employers prefer and appoint candidates who are equally proficient
at doing practical tasks and at understanding the underlying concepts. In simple words, you must be
good at coding. R and Python are the most widely used languages in the analytics industry today.

According to a recent salary survey by O'Reilly Media, 80% of the respondents in the chosen random
sample said they use either R or Python for data analysis and model building. If you are still unclear
about your choice of tool, this comprehensive comparison should give you enough motivation to get
started.

In this section, you'll take exclusive skill tests designed to evaluate and enhance your
proficiency in R or Python. These skill tests contain questions on data exploration, data
visualization, data manipulation and feature engineering. You can take either test, or both
if you are proficient in both languages.

Success Rule: While taking the skill test, try to actually solve the questions instead of guessing the
answers. This will help you retain how to solve such data manipulation tasks for longer.

Take Test
For R Users: Skilltest in R for Data Science
For Python Users: Skilltest in Python for Data Science

Overview
Statistics is a crucial aspect of analytics / data science. In fact, recruiters from analytics consulting /
boutique firms particularly assess a candidate on concepts of statistical modeling rather than
machine learning. Surprising but true: most analytics employers still rely on statistical
modeling today.

One reason is that statistical models are easier to interpret, explain and implement in real life, unlike
the black-box models created by machine learning. However, in most cases, irrespective of the
specific job profile, a candidate is expected to know statistics in great detail.

You can't neglect studying statistics if you want to become a data scientist. But where is statistics
used in machine learning? Some students ask this question! Well, how do you evaluate an ML model?
You use statistical concepts to determine its performance. There are several instances where it is
evident that ML can't survive without statistics.

To help you become solid in statistics, this section provides an exclusive skill test on
statistics. This test is designed to help you understand core statistical concepts in an interactive
way.

Take Test
Statistics Skill Test I
Statistics Skill Test II

Machine Learning
Q1. You are given a train data set having 1000 columns and 1 million rows. The data set is
based on a classification problem. Your manager has asked you to reduce the dimension of
this data so that model computation time can be reduced. Your machine has memory
constraints. What would you do? (You are free to make practical assumptions.)

Answer: Processing high-dimensional data on a limited-memory machine is a strenuous task; your
interviewer would be fully aware of that. Following are the methods you can use to tackle such a
situation:

1. Since we have limited RAM, we should close all other applications on our machine,
including the web browser, so that most of the memory can be put to use.
2. We can randomly sample the data set. This means we can create a smaller data set, let's
say having 1,000 variables and 300,000 rows, and do the computations on it.
3. To reduce dimensionality, we can separate the numerical and categorical variables and
remove the correlated variables. For numerical variables, we'll use correlation. For
categorical variables, we'll use the chi-square test.
4. Also, we can use PCA and pick the components which explain the maximum variance in
the data set.
5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible
option.
6. Building a linear model using Stochastic Gradient Descent is also helpful.
7. We can also apply our business understanding to estimate which predictors can impact
the response variable. But this is an intuitive approach; failing to identify useful predictors
might result in a significant loss of information.

Note: For points 5 & 6, make sure you read about online learning algorithms & Stochastic Gradient
Descent. These are advanced methods.
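A minimal sketch of points 2-4 in Python, using a synthetic stand-in for the data set; the column names, sample size and thresholds here are illustrative assumptions, not part of the original question.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real 1,000-column, 1-million-row data set
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10_000, 20)), columns=[f"x{i}" for i in range(20)])
df["x19"] = df["x0"] * 0.95 + rng.normal(scale=0.1, size=len(df))  # a deliberately correlated column
df["target"] = rng.integers(0, 2, size=len(df))

# Point 2: random row sample to fit in memory
sample = df.sample(n=3_000, random_state=42)
X = sample.drop(columns="target")

# Point 3: drop one variable from every highly correlated numeric pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# Point 4: keep only the components explaining ~95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(StandardScaler().fit_transform(X))
print(X_reduced.shape)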

Q2. Is rotation necessary in PCA? If yes, why? What will happen if you don't rotate the
components?

Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between the
variance captured by the components. This makes the components easier to interpret. Not to forget,
that's the motive of PCA, where we aim to select fewer components (than features) which can
explain the maximum variance in the data set. Rotation doesn't change the relative locations of the
components; it only changes the actual coordinates of the points.

If we don't rotate the components, the effect of PCA will diminish and we'll have to select a larger
number of components to explain the variance in the data set.

Know more: PCA

Q3. You are given a data set. The data set has missing values which are spread within 1 standard
deviation of the median. What percentage of the data would remain unaffected? Why?

Answer: This question has enough hints for you to start thinking! Since the data is spread around the
median, let's assume it's a normal distribution. In a normal distribution, ~68% of the data lies within
1 standard deviation of the mean (which coincides with the median and mode), which leaves ~32% of
the data unaffected. Therefore, ~32% of the data would remain unaffected by the missing values.
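A quick numerical check of the 68% / 32% figures (a small illustrative snippet, not part of the original answer):

from scipy.stats import norm

within_one_sd = norm.cdf(1) - norm.cdf(-1)                    # P(-1 sd < X < +1 sd) for a normal distribution
print(f"inside 1 sd: {within_one_sd:.1%}")                    # ~68.3%
print(f"outside 1 sd (unaffected): {1 - within_one_sd:.1%}")  # ~31.7%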

Q4. You are given a data set on cancer detection. You've built a classification model and
achieved an accuracy of 96%. Why shouldn't you be happy with your model's performance?
What can you do about it?

Answer: If you have worked on enough data sets, you should deduce that cancer detection results
in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of
performance, because 96% (as given) might reflect only the correct prediction of the majority class,
while our class of interest is the minority class (4%): the people who actually have cancer. Hence,
in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity
(True Negative Rate) and the F measure to determine the class-wise performance of the classifier.
If the minority class performance is found to be poor, we can undertake the following steps:

1. We can use undersampling, oversampling or SMOTE to make the data balanced.
2. We can alter the prediction threshold by performing probability calibration and finding an
optimal threshold using the AUC-ROC curve.
3. We can assign weights to the classes such that the minority class gets a larger weight.
4. We can also use anomaly detection.

Know more: Imbalanced Classification
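A minimal sketch in Python of the kind of class-wise evaluation described above; the data is synthetic and purely illustrative, and class weighting stands in for point 3.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, f1_score
from sklearn.model_selection import train_test_split

# 96% / 4% class split, mimicking the question
X, y = make_classification(n_samples=5000, weights=[0.96, 0.04], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" gives the minority class a larger weight
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
print("sensitivity (TPR):", recall_score(y_te, y_pred))
print("specificity (TNR):", tn / (tn + fp))
print("F1 for the minority class:", f1_score(y_te, y_pred))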

Q5. Why is naive Bayes so "naive"?

Answer: Naive Bayes is so "naive" because it assumes that all of the features in a data set are
equally important and independent. As we know, these assumptions are rarely true in a real-world
scenario.

Q6. Explain prior probability, likelihood and marginal likelihood in the context of the naive Bayes
algorithm.

Answer: Prior probability is nothing but the proportion of the dependent (binary) variable in the data
set. It is the closest guess you can make about a class without any further information. For example:
in a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not
spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email would be
classified as spam.

Likelihood is the probability of observing a particular attribute given the class. For example: the
probability that the word "FREE" is used in a spam message is a likelihood. Marginal likelihood is the
probability that the word "FREE" is used in any message.
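A tiny worked example with made-up counts, just to show how the three quantities relate via Bayes' rule:

# Toy spam corpus; the counts are invented purely for illustration
n_spam, n_ham = 70, 30                 # 70% spam, 30% not spam
spam_with_free, ham_with_free = 42, 3  # messages containing the word "FREE"

prior = n_spam / (n_spam + n_ham)                               # P(spam)
likelihood = spam_with_free / n_spam                            # P("FREE" | spam)
marginal = (spam_with_free + ham_with_free) / (n_spam + n_ham)  # P("FREE")

posterior = prior * likelihood / marginal                       # Bayes' rule: P(spam | "FREE")
print(prior, likelihood, marginal, posterior)                   # 0.7, 0.6, 0.45, 0.933...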

Q7. You are working on a time series data set. Your manager has asked you to build a high-
accuracy model. You start with the decision tree algorithm, since you know it works fairly
well on all kinds of data. Later, you try a time series regression model and get higher
accuracy than the decision tree model. Can this happen? Why?

Answer: Time series data is known to possess linearity. On the other hand, a decision tree algorithm
is known to work best at detecting non-linear interactions. The reason the decision tree failed to
provide robust predictions is that it couldn't capture the linear relationship as well as the regression
model did. Therefore, we learned that a linear regression model can provide robust predictions
provided the data set satisfies its linearity assumptions.

Q8. You are assigned a new project which involves helping a food delivery company save
more money. The problem is, the company's delivery team isn't able to deliver food on time. As
a result, their customers get unhappy. And, to keep them happy, the company ends up delivering
food for free. Which machine learning algorithm can save them?

Answer: You might have started hopping through the list of ML algorithms in your mind. But
wait! Such questions are asked to test your machine learning fundamentals.
This is not a machine learning problem. This is a route optimization problem. A machine learning
problem consists of three things:

1. There exists a pattern.
2. You cannot solve it mathematically (even by writing exponential equations).
3. You have data on it.

Always look for these three factors to decide whether machine learning is the right tool to solve a
particular problem.

Q9. You are given a data set. The data set contains many variables, some of which are highly
correlated, and you know about it. Your manager has asked you to run PCA. Would you
remove the correlated variables first? Why?

Answer: Chances are, you might be tempted to say no, but that would be incorrect. Correlated
variables have a substantial effect on PCA because, in the presence of correlated variables, the
variance explained by a particular component gets inflated.

For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this
data set, the first principal component would exhibit twice the variance it would exhibit with
uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those
variables, which is misleading.

Q10. How is kNN different from k-means clustering?

Answer: Don't get misled by the "k" in their names. You should know that the fundamental difference
between these two algorithms is that k-means is unsupervised in nature and kNN is supervised in
nature. k-means is a clustering algorithm; kNN is a classification (or regression) algorithm.

The k-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous
and the points in each cluster are close to each other. The algorithm tries to maintain enough separability
between these clusters. Due to its unsupervised nature, the clusters have no labels.

The kNN algorithm tries to classify an unlabeled observation based on its k (which can be any number)
surrounding neighbors. It is also known as a lazy learner because it involves minimal training of the
model; it doesn't use the training data to build a general model before seeing the unseen data.

Q11. How are True Positive Rate and Recall related? Write the equation.

Answer: True Positive Rate = Recall. Yes, they are equal, with the formula TP / (TP + FN).
Know more: Evaluation Metrics

Q12. You have built a multiple regression model. Your model's R² isn't as good as you wanted.
For improvement, you remove the intercept term and your model's R² jumps from 0.3 to 0.8. Is this
possible? How?

Answer: Yes, it is possible. We need to understand the significance of the intercept term in a regression
model. The intercept term represents the model's prediction without any independent variable, i.e. the
mean prediction. The formula is R² = 1 - Σ(y - ŷ)² / Σ(y - ȳ)², where ŷ is the predicted value and ȳ is the
mean of y.

When the intercept term is present, the R² value evaluates your model with respect to the mean model.
In the absence of the intercept term, ȳ drops out of the formula and the model is compared against zero
instead; with the larger denominator Σy², the ratio Σ(y - ŷ)² / Σy² becomes smaller than it should, resulting
in a higher R².
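A minimal sketch showing the effect numerically on synthetic data; statsmodels reports an "uncentered" R² when the intercept is dropped, which is exactly the comparison-against-zero described above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=5, size=200)          # predictor far from zero
y = 10 + 0.2 * x + rng.normal(scale=3, size=200)   # weak relationship, large intercept

with_intercept = sm.OLS(y, sm.add_constant(x)).fit()
without_intercept = sm.OLS(y, x).fit()             # regression forced through the origin

print("R² with intercept   :", round(with_intercept.rsquared, 3))   # small
print("R² without intercept:", round(without_intercept.rsquared, 3))  # much larger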

Q13. After analyzing the model, your manager has informed you that your regression model is
suffering from multicollinearity. How would you check whether he's right? Without losing any
information, can you still build a better model?

Answer: To check for multicollinearity, we can create a correlation matrix to identify & remove variables
having a correlation above 75% (deciding the threshold is subjective). In addition, we can calculate the
VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no
multicollinearity, whereas a value >= 10 implies serious multicollinearity. Also, we can use
tolerance as an indicator of multicollinearity.

But removing correlated variables might lead to loss of information. In order to retain those
variables, we can use penalized regression models like ridge or lasso regression. We can also add
some random noise to the correlated variables so that they become different from each other.
However, adding noise might affect the prediction accuracy, so this approach should be used
carefully.

Know more: Regression
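A minimal sketch of the VIF check with statsmodels; the data is synthetic (x2 is deliberately almost a copy of x1), and the thresholds above remain judgement calls.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.1, size=500),   # highly collinear with x1
                  "x3": rng.normal(size=500)})

exog = sm.add_constant(X)
for i, col in enumerate(exog.columns):
    if col != "const":
        print(col, "VIF =", round(variance_inflation_factor(exog.values, i), 1))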


Q14. A rise in global average temperature led to a decrease in the number of pirates around the
world. Does that mean that the decrease in the number of pirates caused the climate change?

Answer: After reading this question, you should have understood that this is a classic case of
causation versus correlation. No, we can't conclude that the decrease in the number of pirates caused
the climate change, because there might be other factors (lurking or confounding variables) influencing
this phenomenon.

Therefore, there might be a correlation between global average temperature and the number of pirates,
but based on this information we can't say that pirates died out because of the rise in global average
temperature.

Know more: Causation and Correlation

Q15. While working on a data set, how do you select important variables? Explain your
methods.

Answer: Following are the methods of variable selection you can use (a short sketch of one of them
follows below):

1. Remove the correlated variables prior to selecting important variables.
2. Use linear regression and select variables based on p values.
3. Use Forward Selection, Backward Selection or Stepwise Selection.
4. Use Random Forest or Xgboost and plot a variable importance chart.
5. Use Lasso Regression.
6. Measure information gain for the available set of features and select the top n features
accordingly.
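A short sketch of the Random Forest route (point 4), using a built-in sklearn data set purely for illustration:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Rank features by the importance the forest assigns them
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))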

Q16. What is the difference between covariance and correlation?

Answer: Correlation is the standardized form of covariance.

Covariances are difficult to compare. For example: if we calculate the covariance of salary ($) and
age (years), the value we get depends on the units of the two variables and can't be compared
across variable pairs with unequal scales. To combat this, we calculate the correlation, which always
lies between -1 and 1, irrespective of the variables' scales.
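A quick numerical illustration with synthetic salary/age data: the covariance changes with the units, the correlation does not.

import numpy as np

rng = np.random.default_rng(0)
age_years = rng.uniform(22, 60, size=1000)
salary_usd = 20_000 + 1_500 * age_years + rng.normal(scale=10_000, size=1000)

print("cov(salary, age)     :", np.cov(salary_usd, age_years)[0, 1])          # large, unit-dependent
print("cov in $ thousands   :", np.cov(salary_usd / 1000, age_years)[0, 1])   # changes with the scale
print("corr(salary, age)    :", np.corrcoef(salary_usd, age_years)[0, 1])     # in [-1, 1]
print("corr in $ thousands  :", np.corrcoef(salary_usd / 1000, age_years)[0, 1])  # unchanged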

Q17. Is it possible to capture the correlation between a continuous and a categorical variable? If
yes, how?

Answer: Yes, we can use the ANCOVA (analysis of covariance) technique to capture the association
between continuous and categorical variables.

Q18. Running a binary classification tree algorithm is the easy part. Do you know how tree
splitting takes place, i.e. how does the tree decide which variable to split on at the root node
and at succeeding nodes?
Answer: A classification tree makes decisions based on the Gini index and node entropy. In simple
words, the tree algorithm finds the best possible feature which can divide the data set into the purest
possible child nodes.

The Gini measure says: if we select two items from a population at random, then they must be of the
same class, and the probability of this is 1 if the population is pure. We can calculate Gini as follows:

1. Calculate Gini for the sub-nodes, using the formula: sum of the squares of the probabilities of
success and failure (p² + q²).
2. Calculate the Gini for the split using the weighted Gini score of each node of that split.

Entropy is the measure of impurity, given (for a binary class) by:

Entropy = -p·log₂(p) - q·log₂(q)

Here p and q are the probabilities of success and failure respectively in that node. Entropy is zero when a
node is homogeneous and maximum when both classes are present in a node in a 50%-50% split. Lower
entropy is desirable.
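Minimal helper functions for the two impurity measures above (a small illustrative sketch):

import math

def gini_purity(p: float) -> float:
    """p² + q² as used in the text: 1 for a pure node, 0.5 for a 50/50 node."""
    q = 1 - p
    return p ** 2 + q ** 2

def entropy(p: float) -> float:
    """-p·log2(p) - q·log2(q): 0 for a pure node, 1 for a 50/50 node."""
    q = 1 - p
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - q * math.log2(q)

print(gini_purity(0.5), entropy(0.5))   # 0.5, 1.0
print(gini_purity(1.0), entropy(1.0))   # 1.0, 0.0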

Q19. What is a convex hull? (Hint: think SVM)

Answer: In the case of linearly separable data, the convex hulls represent the outer boundaries of the two
groups of data points. Once the convex hulls are created, we get the maximum margin hyperplane (MMH)
as a perpendicular bisector between the two convex hulls. The MMH is the line which attempts to create
the greatest separation between the two groups.

Q20. You are given a data set consisting of variables with more than 30% missing values.
Let's say, out of 50 variables, 8 variables have more than 30% of their values missing. How will you
deal with them?

Answer: We can deal with them in the following ways:

1. Assign a unique category to the missing values; who knows, the missing values might reveal
some trend.
2. We can simply remove them.
3. Or, we can sensibly check their distribution against the target variable, and if we find any pattern,
we'll keep those missing values and assign them a new category while removing the others.
Q21. The "People who bought this also bought..." recommendations seen on Amazon are a result
of which algorithm?

Answer: The basic idea for this kind of recommendation engine comes from collaborative filtering.
A collaborative filtering algorithm considers user behavior when recommending items. It exploits the
behavior of other users and items in terms of transaction history, ratings, selection and purchase
information. Other users' behavior and preferences over the items are used to recommend items to
the new users. In this case, the features of the items are not known.

Know more: Recommender System

Q22. What do you understand by Type I vs Type II error?

Answer: A Type I error is committed when the null hypothesis is true and we reject it, also known as a
False Positive. A Type II error is committed when the null hypothesis is false and we accept it, also
known as a False Negative.
In the context of the confusion matrix, we can say a Type I error occurs when we classify a value as
positive (1) when it is actually negative (0). A Type II error occurs when we classify a value as negative
(0) when it is actually positive (1).

Q23. You are working on a classification problem. For validation purposes, you've randomly
split the training data set into train and validation sets. You are confident that your model will
work incredibly well on unseen data since your validation accuracy is high. However, you are
shocked by the poor test accuracy. What went wrong?

Answer: In the case of classification problems, we should always use stratified sampling instead of
random sampling. Random sampling doesn't take into consideration the proportions of the target
classes. On the contrary, stratified sampling helps maintain the distribution of the target variable in
the resulting samples as well.
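In scikit-learn this is a one-argument fix: pass the target to stratify. A small sketch with synthetic, heavily imbalanced labels:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 950 + [1] * 50)            # heavily imbalanced target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)
# Both splits keep the original 5% minority share
print("minority share in train:", y_tr.mean(), "in validation:", y_val.mean())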

Q24. You have been asked to evaluate a regression model based on R², adjusted R² and
tolerance. What will be your criteria?

Answer: Tolerance (1 / VIF) is used as an indicator of multicollinearity. It is an indicator of the percentage
of variance in a predictor which cannot be accounted for by the other predictors. Large values of tolerance
are desirable.
We will consider adjusted R² as opposed to R² to evaluate model fit, because R² increases as we add
more variables, irrespective of any improvement in prediction accuracy. Adjusted R², however, only
increases if an additional variable improves the accuracy of the model; otherwise it stays the same. It is
difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For
example, a gene mutation data set might result in a lower adjusted R² and still provide fairly good
predictions, compared to stock market data where a lower adjusted R² implies that the model is not
good.

Q25. In k-means or kNN, we use Euclidean distance to calculate the distance between nearest
neighbors. Why not Manhattan distance?
Answer: We don't use Manhattan distance because it measures distance only along horizontal and
vertical directions, so it carries direction restrictions. The Euclidean metric, on the other hand, can be
used in any space to calculate distance. Since the data points can be present in any dimension,
Euclidean distance is a more viable option.

Example: think of a chessboard. The movement made by a rook is measured by Manhattan distance,
because it moves only along the ranks and files (vertically and horizontally).

Q26. Explain machine learning to me like I'm a 5-year-old.

Answer: It's simple. It's just like how babies learn to walk. Every time they fall down, they learn
(unconsciously) and realize that their legs should be straight and not bent. The next time
they fall down, they feel pain. They cry. But they learn not to stand like that again. In order to avoid
that pain, they try harder. To succeed, they even seek support from the door or wall or anything near
them, which helps them stand firm.
This is how a machine works & develops intuition from its environment.

Note: The interviewer is only trying to test whether you have the ability to explain complex concepts
in simple terms.

Q27. I know that a linear regression model is generally evaluated using adjusted R² or the F statistic.
How would you evaluate a logistic regression model?

Answer: We can use the following methods:

1. Since logistic regression is used to predict probabilities, we can use the AUC-ROC curve along
with the confusion matrix to determine its performance.
2. Also, the metric analogous to adjusted R² in logistic regression is AIC. AIC is a measure of
fit which penalizes the model for the number of model coefficients. Therefore, we always prefer the
model with the minimum AIC value.
3. Null deviance indicates the response predicted by a model with nothing but an intercept;
residual deviance indicates the response predicted by a model when the independent variables are
added. The lower the value, the better the model.

Know more: Logistic Regression
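A minimal sketch of points 1 and 2, using sklearn for the AUC-ROC and confusion matrix and statsmodels for AIC; the data is synthetic and evaluated on the training set purely for illustration.

import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]
print("AUC-ROC:", roc_auc_score(y, probs))
print("confusion matrix:\n", confusion_matrix(y, clf.predict(X)))

# AIC comes from a statsmodels fit of the same logistic model
logit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print("AIC:", logit.aic)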

Q28. Do you suggest that treating a categorical variable as a continuous variable would result in
a better predictive model?

Answer: For better predictions, a categorical variable can be treated as a continuous variable only
when the variable is ordinal in nature.

Q29. What do you understand by the Bias-Variance trade-off?

Answer: The error emerging from any model can be mathematically broken down into three components:
bias error, variance error and irreducible error.
Bias error quantifies how much, on average, the predicted values differ from the
actual values. A high bias error means we have an under-performing model which keeps missing
important trends. Variance, on the other side, quantifies how different the predictions made on the same
observation are from each other. A high-variance model will over-fit your training population
and perform badly on any observation beyond the training data.

Q30. OLS is to linear regression what maximum likelihood is to logistic regression. Explain the
statement.

Answer: OLS and maximum likelihood are the methods used by the respective regression techniques
to approximate the unknown parameter (coefficient) values. In simple words,
Ordinary Least Squares (OLS) is a method used in linear regression which approximates the
parameters by minimizing the sum of squared distances between the actual and predicted values.
Maximum likelihood helps in choosing the values of the parameters which maximize the likelihood of
producing the observed data.

Overview
If you are reading this and you have 0-3 years of work experience in the analytics industry, you should pay
more attention to this section.

Analytics companies evaluate freshers & less experienced professionals using questions which
require logical, structured, analytical & creative skills. You could say such questions compensate for
your inexperience in dealing with real-world business situations. Hence, you need to be extremely
careful while attempting such questions.

Such questions can be broadly divided into 3 categories:

Guess Estimates
Business Case Studies
Puzzles

Let's understand them one by one before proceeding to the practice session.

Guess estimates: A good thing about these questions is that there is no definite answer. Hence, such a
question is not meant to check your final answer, but the approach taken to reach it. The key to
solving a guess-estimate question is structured thinking. Structured thinking is the process of putting a
framework around an unstructured problem. Having a structure not only helps an analyst understand the
problem at a macro level, it also helps by identifying areas which require deeper understanding.
Business Case Study: Case studies are designed to give you a taste of real-world analytics. In
other words, they comprise business situations supported by numbers, and you are required to
understand, define and solve the problem.

Puzzles: Who doesn't love solving puzzles? If you don't, you are most likely to struggle in this
interview round. Puzzles are asked to assess your creativity & out-of-the-box thinking. Recruiters
want candidates who can think differently and bring new ideas to the table. There couldn't be a better
way than puzzles to assess this skill.

Before you start practicing with solved questions, check out this 10-minute video sharing a magical
approach to solving any guess-estimate question:

Guess Estimates
Case 1: Estimate the number of cigarettes consumed monthly in India

Solution: A good proxy in such problems is the population of India, i.e. ~1 billion. Following is an
effective way to segment this population:

Following were the key considerations in building the segmentation and the intermediate guesses:

1. The rural population consumes far fewer cigarettes than the urban population because of the
difference in purchasing power.
2. Males consume more cigarettes than females in both urban and rural populations.
3. Children below 16 years consume a negligible number of cigarettes.
4. The male-to-female ratio in urban areas is closer to 1 than that in rural areas.
5. The male-to-female ratio in younger generations is closer to 1 than that in older ones. This is
because of the increase in awareness levels.
6. The bulk of the population starts smoking after getting a job, and hence the average number of
cigarettes is higher in older groups.
7. The total number of cigarettes from the supply side also comes to around 10 billion a month, which
gives a good sense check on the final number.

Case 2: Estimate the number of WhatsApp Android application installs

Solution: A good proxy for this problem is the world population, i.e. ~7 billion. Following is a possible
approach to this problem:

Solving guesstimate questions requires a firm hold on arithmetic. Such questions often leave students
confused about the number of zeroes to put. Make sure you practice enough to avoid making such
mistakes.
Case 3: Estimate the number of tennis balls bought in India per month

Solution: A good proxy for this problem is the number of cities in India, i.e. ~1700. The catch in this
problem is to analyze all the places where tennis balls are used. Once we have the number of tennis
balls used monthly, we can easily find the number of tennis balls bought in a month using the lifetime of
a tennis ball.

Following is an effective way to segment this population:

Following were the key considerations in building the segmentation and the intermediate guesses:

1. Rural areas have a negligible number of tennis courts.
2. Metro cities have the highest number of sectors.
3. For each sector in metro cities, the number of grounds for both tennis and cricket is higher.
This is because of both the bigger area and the higher buying capacity in metros.
4. The number of balls consumed per ground is higher in metros because of the higher
engagement there.
Question:
Estimate the number of black-colored cars in New Delhi (you are free to make logical assumptions).

Case Study
https://www.youtube.com/watch?v=A9fw6R4GcDQ

Call Center Optimization - Level Easy

Introduction

What is the biggest turn-off when you call a customer service center seeking support on an
issue?

For most of us, it is the waiting time. The line "Your call is important to us" offers little consolation
to the customer who is waiting on the line.

Let's say you are looking at comparable Internet Service Providers: A, B and C. What will be your
main considerations while choosing one of them? First will be the internet speed, and second will be
the customer support. Customer support is supremely important for any company, whether it is a
telecom company, Internet Service Provider, bank, insurance company or an e-commerce firm. It
is the assurance from the company that anything going wrong will be resolved as soon as possible.

So, we all understand that the customer service center is probably the second most important
consideration, just after the actual product. Customer service is also one of the biggest contributors
to the cost component for any firm. So, within the same budget, we can make customer service
better using analytics. Let's try looking at it with a case study. Note that all the numbers are
simulated to bring out a concept and do not come from a real case.

Types of channels used in customer service


All companies provide multiple channels through which customers can reach out / connect with
them. Here are a few of these channels:

1. Call Center: This is the most expensive channel which customers use to reach out to companies.
Every company wants to minimize this channel's cost to reduce operational expense, but it's a
necessary evil. Every customer wants a human respondent to truly assure them that everything will be
resolved. This preference might change over time, but as of now we have to accept the truth that
machines are not as good as humans at understanding a customer's problem, acknowledging emotions
and providing customized assurance.

2. IVR: This is a cheaper channel and is used by almost every company. In general, you reach the IVR
first, and if the customer's query is not resolved on the IVR, he/she finally reaches the call center.

3. Social Media: This might seem like the cheapest channel for receiving complaints from a customer;
however, it is the most dangerous channel today. The reason: everyone who comes to the social
media page can see these complaints. You are no longer in a 1-to-1 conversation with the
customer.
4. E-Mail: This is probably the easiest-to-handle and cheapest-to-resolve medium for any company.
However, it is one of the least preferred channels for the customer to reach out to the company.

Other channels might include brick-and-mortar branches / outlets, 1-on-1 customer relationship
managers etc. The most important of all is the call center (calling process), which every company
needs to maintain, but at the minimum possible cost.

So, how can we optimize the expense of a call center? To optimize this problem, you first need to
understand that we are dealing with two entities here: customers and callers. And to optimize the
customer-caller pairing, you need to understand how customers differ from each other and how
callers differ from each other.

Customers: What might influence the time a customer takes on a call?

Because customers might not call the call center frequently, the more important attributes will be their
demographics and relationship with the company. Let's consider a bank. For a bank, these
attributes can be:

1. Relationship balance of the customer: how valuable is the customer for the bank?
2. Type of product which the customer holds: is the product more prone to disputes, or is it a plain
vanilla deposit product?
3. Region of the customer: customers from rural areas might take a longer time to understand the
solution provided by the caller.
4. Tenure of the customer: a new customer might have more questions than a tenured
customer.

Beyond these, there can be past call data which might be useful:

1. Last call duration
2. Time elapsed since the last call
3. Reason for the last call
4. Whether the reason was resolved after the last call or the ticket is still open

Callers: What might make a caller take more time than others?

Even though the time taken by callers might not be very different from each other, some level of
segmentation can definitely be found. Here are a few variables which can influence the amount of
time a caller takes:

1. Vintage of the caller: a new caller might take a longer time to resolve the same query.
2. Training done by the caller: caller training might help them understand and resolve
customer queries faster than others.
3. Location of the call center: onshore vs. offshore call centers might have different resolution times,
because callers in the same country as the customer can easily understand the accent and might
understand customer issues better.
4. Utilization rate of the caller: a caller who is highly utilized might have lower efficiency
compared to others.

How do we start modeling?

What is the objective function when we are trying to optimize call center efficiency? Here are a
couple of objective functions we could use:

1. Call duration: this directly hits the total cost.
2. Customer satisfaction: it might not directly hit the cost or revenue, but it definitely hits the
bottom line in the longer term.

So, how do we solve for two objective functions? In general, to make it simpler, we take one of the
objective functions as a constraint and the other as the main objective function. We will try to create a
formulation which can be solved as an assignment problem. For simplicity, we will not take
customer satisfaction into account in this analysis.

Sample Problem Statement

You have 7 types of callers and 7 types of customers. Assuming that all seven customers call at the
same time, how do you assign a caller to each of these customers so that the total time for the call
center is the least?
All numbers shown here represent time in minutes. Imagine this problem scaling to an extent where
thousands of callers respond and hundreds of thousands of customers call. We probably need a more
scientific way to solve this problem.

Let's first see what the total time is if the 1st caller gets assigned to the 1st customer, the 2nd to the
2nd, and so on. The total time becomes 23 + 84 + 91 + 82 + 67 + 63 + 6 = 416 minutes, which is
59.4 minutes/customer.
Now, let's try to optimize this problem using the assignment problem solution known as the Hungarian
method. The steps are:

1. Reduce every row by the minimum of that row.

2. Reduce every column by the minimum of that column.

3. Find a row which has only one zero and make the assignment. Also cancel all the zeros in
the same column. In case two rows turn out to have a zero in the same column, leave them to
be assigned at the end.

4. Now select the same cells in the main grid and do the total time calculation.

Doing the calculation, the total time now comes out to 165 minutes, which is roughly 23.6
minutes/customer. This is a far better assignment than the random allotment.
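For larger instances, the same assignment can be solved directly with SciPy's implementation of this method. The 7×7 matrix below is a made-up illustration (the original table is not reproduced here, only its diagonal values); rows are callers, columns are customers, values are minutes.

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([
    [23, 40, 55, 32, 71, 44, 60],
    [35, 84, 21, 45, 30, 52, 41],
    [50, 33, 91, 27, 38, 65, 29],
    [62, 28, 34, 82, 25, 31, 47],
    [39, 57, 42, 36, 67, 26, 53],
    [48, 22, 59, 41, 37, 63, 35],
    [31, 46, 24, 58, 43, 49,  6],
])

# Hungarian method: minimize the total assigned cost
caller_idx, customer_idx = linear_sum_assignment(cost)
total = cost[caller_idx, customer_idx].sum()
print("optimal total time:", total, "minutes,",
      round(total / len(caller_idx), 1), "minutes/customer on average")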
Feeling challenged? If you want to practice more, work through the solved case studies below:

Call Center Optimization - Level Medium


Analytics Business Case
o Case 1
o Case 2
o Case 3
Analytics Interview Case Study
Dawn of Taxi Aggregators
Optimize Product Pricing
Taxi Replacement

Puzzles
Before you start practicing with solved puzzles, check out this 10-minute video sharing secrets to
solving any brain teaser:

https://www.youtube.com/watch?v=l2_IuKaj5vQ

Puzzle 1: Who took that coconut?

Ten people land on a deserted island. There they find lots of coconuts and a monkey. During their
first day they gather coconuts and put them all in a community pile. After working all day they decide
to sleep and divide them into ten equal piles the next morning.

That night one castaway wakes up hungry and decides to take his share early. After dividing up the
coconuts he finds he is one coconut short of ten equal piles. He also notices the monkey holding one
more coconut. So he tries to take the monkey's coconut to have a total evenly divisible by 10.
However, when he tries to take it the monkey conks him on the head with it and kills him.

Later another castaway wakes up hungry and decides to take his share early. On the way to the
coconuts he finds the body of the first castaway, which pleases him because he will now be entitled
to 1/9 of the total pile. After dividing them up into nine piles he is again one coconut short and tries to
take the monkey's slightly bloodied coconut. The monkey conks the second man on the head and
kills him.
One by one each of the remaining castaways goes through the same process, until the 10th person
to wake up gets the entire pile for himself. What is the smallest possible number of coconuts in the
pile, not counting the monkey's?

Answer: 2519
Logic: LCM (lowest common multiple) of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, minus 1. The LCM gives the smallest
number divisible by all of these numbers, and subtracting one gives a pile that is always exactly one
coconut short of an even split, which is the number of coconuts initially in the pile.
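A one-line check of that logic in Python (math.lcm needs Python 3.9+):

from math import lcm

print(lcm(*range(1, 11)) - 1)   # LCM(1..10) - 1 = 2519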

Puzzle 2: Hat Riddle

A stark raving mad king tells his 100 wisest men he is about to line them up and place either
a red or a blue hat on each of their heads. Once lined up, they must not communicate amongst
themselves. Nor may they attempt to look behind them or remove their own hat.

The king tells the wise men that they will be able to see all the hats in front of them. They will not be
able to see the color of their own hat or the hats behind them, although they will be able to hear the
answers from all those behind them.

The king will then start with the wise man at the back and ask, "What color is your hat?" The wise man
will only be allowed to answer "red" or "blue", nothing more. If the answer is incorrect, the wise
man will be silently killed. If the answer is correct, the wise man may live but must remain
absolutely silent.

The king will then move on to the next wise man and repeat the question.

The king makes it clear that if anyone breaks the rules then all the wise men will die, and then allows the
wise men to consult before lining them up. The king listens in while the wise men consult each other
to make sure they don't devise a plan to cheat. Communicating anything more than their guess of
red or blue, by coughing or shuffling, would be breaking the rules.

What is the maximum number of men they can be guaranteed to save?

Answer: 99.
Logic: The first wise man counts all the red hats he can see (A) and answers "blue" if the
number is odd or "red" if the number is even. Each subsequent wise man keeps track of the number
of red hats announced behind him (B), and counts the number of red hats in front of him (C). If A was
even, and B and C are either both even or both odd, then the wise man answers "blue"; otherwise he
answers "red". If A was odd, and B and C are either both even or both odd, then the wise man answers
"red"; otherwise he answers "blue". This guarantees everyone except the first man, who has a 50/50
chance of surviving.
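A small simulation of this parity strategy (an illustrative sketch, not part of the original text): the first man encodes the parity of the red hats he sees, and everyone after him deduces his own hat exactly, so at least 99 are always saved.

import random

def run_trial(n=100):
    hats = [random.choice(["red", "blue"]) for _ in range(n)]  # index 0 = back of the line
    saved = 0

    # First wise man: "red" if he sees an even number of red hats, else "blue"
    first_guess = "red" if sum(h == "red" for h in hats[1:]) % 2 == 0 else "blue"
    saved += first_guess == hats[0]          # he is only saved by luck
    parity_even = first_guess == "red"

    reds_behind = 0                          # red hats already (correctly) announced
    for i in range(1, n):
        reds_ahead = sum(h == "red" for h in hats[i + 1:])
        known = reds_behind + reds_ahead
        if parity_even:
            guess = "blue" if known % 2 == 0 else "red"
        else:
            guess = "red" if known % 2 == 0 else "blue"
        saved += guess == hats[i]
        reds_behind += guess == "red"
    return saved

print(min(run_trial() for _ in range(1000)))   # always >= 99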
Puzzle 3: Are you ready for the final one?
A king wants his daughter to marry the smartest of 3 extremely intelligent young princes, and so the
king's wise men devised an intelligence test.

All the princes are gathered into a room and seated, facing one another, and are shown 2 black hats
and 3 white hats. They are blindfolded, and 1 hat is placed on each of their heads, with the
remaining hats hidden in a different room. The king tells them that the first prince to deduce the color
of his hat without removing it or looking at it will marry his daughter. A wrong guess will mean death.
The blindfolds are then removed.

You are one of the princes. You see 2 white hats on the other princes' heads. After some time you
realize that the other princes are unable to deduce the color of their hats, or are unwilling to guess.
What color is your hat?

Note: You know that your competitors are very intelligent and want nothing more than to marry the
princess. You also know that the king is a man of his word, and he has said that the test is a fair test
of intelligence and bravery.

Answer: White.
Logic: The king would not select two white hats and one black hat. This would mean two princes
would each see one black hat and one white hat, and you would be at a disadvantage if you were the
only prince wearing a black hat: if you were wearing the black hat, it would not take long for one of the
other princes to deduce he was wearing a white hat.

If an intelligent prince saw a white hat and a black hat, he would eventually realize that the king
would never select two black hats and one white hat either, since any prince seeing two black hats
would instantly know he was wearing a white hat. Therefore, if a prince can see one black hat, he can
work out that he is wearing white.

Therefore the only fair test is for all three princes to be wearing white hats. After waiting some time
just to be sure, you can safely assert you are wearing a white hat.

More Puzzles:

Commonly asked - Part I


Commonly asked - Part II
20 Challenging Interview Questions
Question:
Out of 10 coins, one weighs less than the others. You have a balance scale. How can you
determine which coin weighs less in 3 weighings? Now, how would you do it if you didn't know
whether the odd coin weighs less or more?

Overview
Regardless of how much technical knowledge you've gained, if you can't narrate (relevant) stories
about yourself, be ready for a tough time clearing this round.
In behavioral interview rounds, recruiters want to know about you as a person based on your past
accomplishments. This activity not only reveals your thought process as a person; your past actions,
choices and decisions also give them a worthy indicator of your future performance. They want to
understand your tactics for solving problems. They want to know if you've dealt with similar problems
in the past.

Alongside this, recruiters are keen to find out, job duties aside, whether you could be a good fit in the
company's culture. For recruiters, it's a crucial task to assess the overlap between the company's vision
and the candidate's aspirations. The larger it is, the higher the chances of you getting hired.

Behavioral interviews can be broadly classified into 5 categories:

Teamwork Interview Questions: Such questions are asked to gauge your fit as a team player
in the company.
Leadership Interview Questions: Such questions are asked to help you demonstrate your
proven leadership qualities in the workplace.
Handling Conflict Interview Questions: Such questions are asked to determine your
diplomatic approach towards tackling critical workplace situations. Nothing except
experience can help you master this talent.
Problem Solving Interview Questions: Such questions are asked to test your ability to
deal with problems arising in the workplace. Some people lose their calm when they find themselves
in adverse situations, and eventually walk out of the office. Can you endure? That's what
recruiters want to see!
Biggest Failure Interview Questions: Such questions are asked to assess your attitude,
spirit & state of mind while dealing with failures in your life. Do you give up easily? Do you
not pursue things further? That's what recruiters want to know. If you say you've never failed in
life, all it says about you is that you've never succeeded.

Up next are some incredible tips to help you ace behavioral interview rounds with flying colors!

Question:
The recruiter tells you that your aspirations don't sync with the company's goals and that you wouldn't
be a good fit for the organization. What will you say?

Read Tips
Sadly, there is nothing immediate you can do to ace behavioral interview questions. It's a gradual
process of life events & experiences which makes you a good storyteller in the interview room.

Below are the tips you should keep in mind & practice in your day-to-day work / study routine:

1. Be Honest: Not everyone has been through every situation in life. The recruiter knows it. So,
don't lie about situations; they'll know when you do. Instead, be honest about your
achievements, failures & aspirations.
2. Experiment More: Some students fail to answer behavioral questions because they have
never found themselves in such circumstances. That's a red flag for you. A recruiter would
never hire a candidate who hasn't failed, handled pressure, led people or solved
problems. If you are in a similar situation, start experimenting more in life.
3. Motivation is the key: Do you know what is common among all 5 categories of behavioral
interviews? Every question in some way tries to test your motivation. Are you motivated?
Because if you are motivated, you'll give your best at solving problems, coping with failure
and working with people.
4. Learn to connect situations: The ability to analyze and connect situations can help you
fetch brownie points in the interview. For example, a recruiter asks you about a situation.
Chances are you haven't faced that exact situation till now, but there might be instances in
your life which are similar to it. Hence, you should learn to connect situations to compensate for
your life / work inexperience.
5. Use Structured Thinking: Most students don't know where to start their
answers. Moreover, they get lost while speaking. Do you feel the same? It means you
are suffering from a lack of structured thinking. Check out tools for improving structured
thinking here.
6. Talk about failures: Don't describe all the situations with happy endings. Mix them up with
your failures and your key learnings from them. This is important because the recruiter will
understand that you know how to deal with failures, learn from mistakes and become
better.
7. Use the STAR Technique: It's the most effective and cardinal strategy often used by candidates
to answer behavioral questions. Regardless of the question asked, this strategy helps you
put a framework around your answer, which helps the recruiter understand your story better.
The STAR technique is as follows:
o ST (Situation or Task) - Describe a situation you were in, or a task you were
required to do. Make sure the situation or task is related to the job you've applied for,
and not some random life situation. Provide enough details about your situation;
don't assume the recruiter already knows about it. You can describe situations /
tasks from your past volunteering work, jobs, internships, live projects, industry
projects etc.
o A (Action) - For the situation described above, what actions did you take to improve
it? Did you face any challenges while undertaking those actions? Describe them too.
Make sure you clearly convey the actions you took personally, not a team
effort or the actions you could have taken.
o R (Result) - For the actions taken above, what were the results? Did you achieve
what you wanted to? Did you fail? What did you learn? All these questions must be
answered at this stage.

Question:
Suppose your manager forces you to work after office hours. Even after your repeated requests to
leave the office by 7 pm at the latest so that you can spend time with your family, he doesn't care.
What would you do?

Conclusion
We believe this is the most comprehensive & interactive analytics interview guide available today.
This guide is meant to prepare you well for your upcoming analytics interviews. It covers a broad
range of proven subjects / questions which are asked in analytics interviews across the world,
though their forms may vary.
Today, every college student who has heard of analytics / data science wants to become a data
scientist. Don't you?

Then you must know that the path to becoming a noted & powerful data scientist is arduous. You will
face constant failures, rejections and disappointments along the way. But don't let that stop you, because
nothing worth having comes easy.

To ace your first or next job interview, you should not only know this workshop's solved
questions by heart, but also interact with data scientists and learn more from their work experiences.

Before you finish this workshop, check out this video where a person describes how he sailed through
the job interview process and finally got hired as a data scientist:

https://www.youtube.com/watch?time_continue=1&v=3BRLGRqj8ps
