
DATATHON

BAI

Group No. 10
Kundan Thakan – 1811060
Konark Patel – 1811071
Anvi Gupta – 1811096
Sourabh Choraria – 1811317
Jimmitesh Singla - 1811362
Improving Lead Generation at Eureka Forbes Using Machine Learning Algorithms

Executive summary

Eureka Forbes, founded in 1982, is the current market leader, with product lines such as water
purification, vacuum cleaning, air purification and security solutions. However, it is facing
tough competition from new and local players in the industry. The company currently pursues
every lead physically and therefore incurs a high cost. It is now looking for a cost-effective
distribution system.

Its website is a comprehensive one-stop solution for information about the different
products and services the company offers. Though millions of potential customers visit the
website regularly, they do not translate into prospects. Shashank Sinha, the Chief
Transformation Officer, is of the opinion that the team should use the data it has collected
on visitors to predict the probability of conversion, and send a representative to the door of
only those who have a high probability of making a purchase. A similar analysis can be
extended to further applications such as one-to-one marketing and short-listing leads
according to the sales budget.
Q.1 Perform descriptive analytics on the training data, write the insights based on the
descriptive analytics (3 points)

Averages of different variables were computed for both converted and non-converted consumers to
understand which variables could influence the chances of a customer's conversion. All the numbers
mentioned below are averages.
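The comparison of averages described above can be sketched as follows. This is only an illustration: the rows are made up and are not the case data, `converted` is an assumed name for the target column, and `offer_page_top` and `dsls` are borrowed from the regression output in the Appendix.

```python
from collections import defaultdict

def group_means(rows, group_key, value_keys):
    """Average each value column separately for every value of group_key."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        group = row[group_key]
        counts[group] += 1
        for key in value_keys:
            sums[group][key] += row[key]
    return {g: {k: sums[g][k] / counts[g] for k in value_keys} for g in counts}

# Made-up rows, for illustration only (not the case data).
visitors = [
    {"converted": 1, "offer_page_top": 20.0, "dsls": 5},
    {"converted": 1, "offer_page_top": 10.0, "dsls": 7},
    {"converted": 0, "offer_page_top": 4.0, "dsls": 12},
    {"converted": 0, "offer_page_top": 6.0, "dsls": 10},
]

means = group_means(visitors, "converted", ["offer_page_top", "dsls"])
for group, values in sorted(means.items()):
    print("converted =", group, values)
```

Each insight in the list that follows corresponds to one pair of such group averages.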

1. Time spent on air purifier & water purifier page

Analyzing the dataset, we find that consumers who got converted spent significantly more
time (26.75) on the air purifier page than non-converted consumers (2.886). Likewise,
consumers who got converted spent more time (183) on the water purifier page than
non-converted consumers (83). Converted customers therefore spent more time on both the air
and water purifier product pages than non-converted customers. Converted consumers also
spent more time on average on the water purifier page than on the air purifier page. Given
that Eureka Forbes is more widely known for water purifiers, this insight seems logical.

2. Time spent on checkout page

Consumers who purchased the product spent less time on the checkout page than those
who did not get converted. Converted consumers spent 9 units as compared to 3.7 units for
non-converted users. (The reason is explained in Q2-c.) This is also logical, as consumers with a
higher intent to buy the product did so immediately, whereas consumers who were
unsure about buying spent more time on the page and eventually dropped the idea of
purchasing the product.

3. Time spent on customer service amc login

Consumers who purchased the product spent less time (1.9) on the customer service amc login page
than those who did not get converted (11.2). This may suggest that consumers who have
already bought a product know their login details and log on directly. Those who have not
purchased any product might just want to check the features of the annual maintenance contract
(amc). Another reason could be that consumers who bought the product had already
evaluated its features in advance and hence did not spend time on the login page,
whereas the non-converted customers were still evaluating the features of the products.

4. Days since last session

Consumers who purchased the product visited the website more often (6.6 days since last session)
than those who did not get converted (11 days since last session). Intuitively, this suggests that
consumers who got converted make their decision fast and often visit the website either to
order or to get more information about the product. The converted consumers might also have been
evaluating other products and hence felt the need to visit the page more frequently.

5. Time spent on offer page

Consumers who purchased the product spent more time (13.8) on the offer page than those who did
not get converted (5.6). Consumers who are interested in buying a product want to know the
offers available for it and spend more time on the offer page to get the full value of an
offer. Those who are not interested just glance at the page and do not spend more time on it,
as they don't intend to buy the product.

6. Users 30 days pageviews history

Consumers who purchased the product have a higher 30-day pageviews history (42.5) than those
who did not get converted (27.5). This shows that consumers who got converted visited more
pages than those who did not, as they would be checking out multiple products to
compare before purchasing. Those who did not get converted might, for example, have just
landed on the website through a Google search for a generic product rather than an
Eureka Forbes specific product.

7. Time spent on security solutions page

Consumers who purchased the product spent less time on the security solutions page (avg 0) than
those who did not get converted (0.99). Eureka Forbes is famous for air and water purifiers, and
most consumers who got converted probably wanted to buy only these products and did not visit
the security solutions product page, while those who did not get converted might have wanted to
check out all the products available from the company (window shopping).

8. Total duration (in seconds) of users' sessions

Consumers who purchased the product have a higher average total session duration (511)
than those who did not get converted (372.4). It is intuitive that those who got converted
would spend more time on different pages such as product, checkout, amc login etc.

9. Time spent on store locator page

Consumers who purchased the product spent less time on the store locator page (0) than those who
did not get converted (3). This shows that consumers who got converted did not need the
store locator, as they purchased the product directly online. The more indecisive consumers
wanted to visit a store to get more details and compare the products and hence remained
non-converted on the platform.

10. Time spent on success book demo page

Consumers who purchased the product spent considerably more time on the success-book-demo
page (9.4) than those who did not get converted (0.6). This shows that consumers who
purchased a product wanted a demo of it and spent more time on booking the demo.
Q.2 How will Kashif test the following claims? (3 points)

a. Customers using mobile, desktop, and tablet are equally distributed

Using a chi-square test on the observed and expected counts

Observed value:
Row Labels     desktop    mobile     tablet    Grand Total
0                 1801      9152         82          11035
1                    4        61          0             65
Grand Total       1805      9213         82          11100

Corresponding distribution:
Row Labels     desktop    mobile     tablet
0                 100%       99%       100%
1                   0%        1%         0%
Grand Total       100%      100%       100%

Expected value:
Row Labels     desktop    mobile     tablet    Grand Total
0              1794.43   9159.05   81.51982          11035
1             10.56982     53.95    0.48018             65
Grand Total       1805      9213         82          11100

Corresponding distribution:
Row Labels     desktop    mobile     tablet
0                  99%       99%        99%
1                   1%        1%         1%
Grand Total       100%      100%       100%

Statistic for the chi-square distribution = 0.008

H0: Devices are equally distributed
H1: Devices are not equally distributed

Since the statistic is less than the critical value, we cannot reject the null hypothesis; hence, devices
are equally distributed.
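For reference, the expected counts and the chi-square statistic for a table like the one above can be worked out as in the sketch below. It assumes the blank tablet cell for converted visitors is 0 and uses the closed-form upper-tail probability that holds for 2 degrees of freedom. The values it prints need not coincide with the figure reported above, whose derivation is not shown in this report.

```python
import math

def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

def p_value_two_dof(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)

# Rows: not converted (0), converted (1); columns: desktop, mobile, tablet.
observed = [[1801, 9152, 82], [4, 61, 0]]
statistic, dof, expected = chi_square_independence(observed)
p_value = p_value_two_dof(statistic)
print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print("statistic:", round(statistic, 3), "dof:", dof, "p-value:", round(p_value, 4))
```

The null hypothesis is rejected at the 5% level only if the p-value falls below 0.05.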

b. Repeat visitors are as likely to convert as new customers

Using the full dataset downloaded from the link in the case.

Observed value, corresponding distribution:
Row Labels        New    Repeated
0               99.4%       99.7%
1                0.6%        0.3%
Grand Total      100%        100%

Expected value:
Row Labels        New    Repeated
0               99.6%       99.6%
1                0.4%        0.4%
Grand Total      100%        100%

Statistic for the chi-square distribution = 0.000

H0: Repeat visitors are as likely to convert as new customers
H1: Repeat visitors are not as likely to convert as new customers

Since the statistic is less than the critical value, we cannot reject the null hypothesis; hence, repeat
visitors are as likely to convert as new customers.

c. Customers who convert spend more time on the website.

From the logistic regression explained in Q3, we find that sessionDuration has a positive coefficient
whereas sessionDuration_hist has a negative one. This means that the more decisive the customer is, the
more time they spend in the particular session in which they make the purchase decision.
However, if the previous (hist) time spent is large, this implies that the customer is just
fishing/exploring and does not have a serious intent to buy the product.
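To make the signs concrete, a logistic regression coefficient can be turned into a multiplier on the odds of conversion. The sketch below takes the two estimates from the first coefficient table in the Appendix and, as an assumption, treats both variables as measured in seconds and looks at a 60-second change with everything else held fixed.

```python
import math

# Estimates copied from the first coefficient table in the Appendix.
coefficients = {
    "sessionDuration": 0.00125844,
    "sessionDuration_hist": -0.00021122,
}

def odds_multiplier(beta, delta):
    """Factor by which the odds change when the variable increases by delta."""
    return math.exp(beta * delta)

# Assumption: durations are in seconds, so delta = 60 means one extra minute.
for name, beta in coefficients.items():
    print(name, round(odds_multiplier(beta, 60), 4))
```

A multiplier above 1 raises the odds of conversion and a multiplier below 1 lowers them, which matches the reading of the two signs given above.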

The output of the logistic regression is given in the Appendix.


Q.3 Build machine learning models that can be used for improving the lead generation
at Eureka Forbes. State model accuracies and insights gained from each model. (10
points)

• Imputation of missing values:

For imputing blank values in the dataset, we used the K nearest neighbour (KNN) algorithm
with k=5. We imputed missing values with the median of the nearest neighbours rather than
the mean, because a few very large values skew the mean, whereas the first quartile, median
and third quartile of most variables are zero. (Refer Appendix)
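One way to read this step is sketched below: for each row with a missing value, find the k closest complete rows and take the median of their values. This is a simplified illustration and not the tool used for the report. It assumes a single column with gaps, unscaled Euclidean distance over the other columns, and made-up data.

```python
import math
from statistics import median

def knn_median_impute(rows, target, k=5):
    """Fill None in column `target` with the median of that column over the
    k nearest complete rows. The other columns are assumed to have no gaps."""
    complete = [row for row in rows if row[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(list(row))
            continue

        def distance(other, row=row):
            # Euclidean distance over every column except the one being imputed.
            return math.sqrt(sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(row, other))
                if i != target
            ))

        nearest = sorted(complete, key=distance)[:k]
        new_row = list(row)
        new_row[target] = median(n[target] for n in nearest)
        filled.append(new_row)
    return filled

# Made-up rows: [sessions, visited_demo_page]; the last row has a missing value.
data = [[1, 0], [2, 0], [3, 1], [100, 50], [2.5, None]]
imputed = knn_median_impute(data, target=1, k=3)
print(imputed[-1])
```

With a mean instead of a median, a single large neighbour would pull the imputed value away from zero, which is the concern raised above.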

• Data balancing
The data is highly imbalanced: 99% of the website visitors did not convert and only 1% did.
Therefore, we used both under sampling and over sampling to balance the data. Over sampling
alone would lead to high repetition, because we would have to replicate the same 65 records
of converted visitors many times. Under sampling alone would significantly reduce the size
of the training data: assuming an ideal ratio of 3 non-converts to each convert, we would have
only 260 records in total. Finally, we took an equal percentage of converts and non-converts
in the training data.
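A minimal sketch of combining the two sampling directions is given below. The class sizes come from the observed table in Q2-a, while `size_per_class` is a hypothetical target since the final training size is not stated here.

```python
import random

def balance(majority, minority, size_per_class, seed=0):
    """Under sample the majority class and over sample the minority class
    so that both end up with size_per_class records."""
    rng = random.Random(seed)
    under = rng.sample(majority, size_per_class)                   # without replacement
    over = [rng.choice(minority) for _ in range(size_per_class)]   # with replacement
    combined = under + over
    rng.shuffle(combined)
    return combined

# Placeholder records: (label, record id). 11035 non-converts and 65 converts.
non_converts = [(0, i) for i in range(11035)]
converts = [(1, i) for i in range(65)]

# size_per_class = 1000 is a hypothetical choice for illustration.
training = balance(non_converts, converts, size_per_class=1000)
print(len(training), sum(label for label, _ in training))
```

Choosing a moderate `size_per_class` limits both the repetition of the 65 converted records and the loss of non-converted records.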

• Feature engineering
The engineered features are listed below. They are all significant in the final model
and have significantly reduced the null deviance.

1. Session duration per session (sd.s) – This feature was derived by dividing the total
duration of the user's sessions by the total number of sessions. If the potential customer is
spending less time per session, he/she is more certain of the product he/she wants to
purchase and therefore the conversion probability is higher.

2. Session duration history per session (sd.s_hist) – This feature was derived by dividing
the 30-day session duration history by the total number of sessions. If recently (i.e. in the
last 30 days) the customer has been spending more time per session, he/she might be getting
more serious about the purchase.

3. Number of pages viewed per session (p.s) – This feature was derived by dividing the
number of pageviews per user by the total number of sessions. The base variable given in
the data, Pageviews, was insignificant in the model. However, when divided by the
number of sessions, it became significant. If the customer is exploring the website
more, he/she might be collecting information more holistically and is therefore more likely
to convert.

4. Number of pages viewed in last 30 days per session (p.s_hist) – This feature was
derived by dividing the number of pageviews in the last 30 days per user by the total number of
sessions. The base variable given in the data, Pageviews_hist, was insignificant in the
model. However, when divided by the number of sessions, it became significant.

5. Access by mobile device type or non-mobile device type (device.enon_mobile) – We
developed a CART tree model in order to determine the importance of variables. Access by
device type was used for branching, so the data was divided into only 2 types, i.e. mobile
and non-mobile (desktop and tablet).

6. Access by referral source channel or not (sourceMedium.e) – We
developed a CART tree model in order to determine the importance of variables. Access by
referral source was used for branching, so the data was divided into only 2 types, i.e.
referral and non-referral (google_cpc, google_organic, direct-none, facebook_social).
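The six features above can be sketched as a single function. Several details are assumptions rather than facts from the dataset: the raw column names `device` and `sourceMedium`, the use of the `sessions` column as the denominator for all four ratios, and the fallback of 0.0 when a visitor has no sessions.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed fallback)."""
    return numerator / denominator if denominator else 0.0

def engineer_features(visitor):
    """Derive the per-session ratios and the two binary flags for one visitor."""
    sessions = visitor["sessions"]
    return {
        "sd.s": safe_ratio(visitor["sessionDuration"], sessions),
        "sd.s_hist": safe_ratio(visitor["sessionDuration_hist"], sessions),
        "p.s": safe_ratio(visitor["pageviews"], sessions),
        "p.s_hist": safe_ratio(visitor["pageviews_hist"], sessions),
        # 1 for desktop or tablet, 0 for mobile.
        "device.enon_mobile": int(visitor["device"] != "mobile"),
        # 1 when the visitor arrived through a referral, 0 otherwise.
        "sourceMedium.ereferral": int(visitor["sourceMedium"] == "referral"),
    }

# Made-up visitor record, for illustration only.
example = {
    "sessions": 3, "sessionDuration": 600, "sessionDuration_hist": 1200,
    "pageviews": 12, "pageviews_hist": 30,
    "device": "desktop", "sourceMedium": "referral",
}
features = engineer_features(example)
print(features)
```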

• Model Creation:

Logistic Regression Model

AUC of the model on test data

Confusion Matrix

Logistic regression model accuracy is 59%.

Decision Tree Model

Confusion Matrix of Model

Accuracy of Decision Tree model is 91.3%.

AUC of Model

Random Forest Model

Confusion Matrix of Model

Accuracy of Random Forest Model is 99.6%.

AUC of Model

Adaptive Boosting Model

Error Matrix

Accuracy of Adaptive Boosting Model is 75.3%.

AUC of Model

Business Insights from various models

1. Logistic regression Model:
• All the service variables, such as visiting the request page, amc page and contactus page,
are positively correlated with conversion, indicating serious customers buying the product
• People spending time looking at other products are not serious buyers (negative
correlation)
• The variable source.referral has the highest beta, showing that referral plays an important
role in conversion

2. Decision Tree Model:
• The tree helped us formulate two of the engineered variables (device.enon_mobile and
sourceMedium.e)
• Requesting more demo sessions leads to higher conversion
• The tree also shows that all the engineered features are important, as they appear at the
top of the tree

3. Random Forest Model:
• From the RF model we can see that the session variables are the most important
• Pages per session is also important, whereas raw page views are not

Usage of Model:

• We will use an ensembling method, that is, we will get results from multiple models (Logistic
Regression, Random Forest and Adaptive Boosting) and take a vote weighted by
accuracy. This is because the models consider different variables important.

We will not use the Tree model because of its weaker performance (it has the lowest AUC, 0.7).
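A vote weighted by accuracy could look like the sketch below. The weights are the accuracies stated in the model section; the simple majority-of-weight rule and the model keys are assumptions made for illustration.

```python
def weighted_vote(predictions, weights):
    """Return 1 (lead) when the models voting 1 hold more than half of the total weight.

    predictions: model name -> predicted class (0 or 1)
    weights:     model name -> weight, here the model's accuracy
    """
    total = sum(weights[name] for name in predictions)
    in_favour = sum(weights[name] for name, label in predictions.items() if label == 1)
    return 1 if in_favour > total / 2 else 0

# Accuracies reported for the three models kept in the ensemble.
weights = {"logistic": 0.59, "random_forest": 0.996, "adaboost": 0.753}

# Two made-up sets of predictions for a single visitor.
print(weighted_vote({"logistic": 1, "random_forest": 0, "adaboost": 1}, weights))
print(weighted_vote({"logistic": 0, "random_forest": 1, "adaboost": 0}, weights))
```

Under this rule a single model, even the most accurate one, cannot outvote the other two.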

Q.4 Based on the different model results, what would be your final recommendation to Kashif?
(4 points)

We built 4 different machine learning models in order to improve lead generation at Eureka Forbes.
We compared different parameters of the models, and the results are given below.

The company has to spend money to chase each potential lead. So, the error of
misclassifying a potential non-lead as a lead is more important than the overall error. Therefore, it is
important to look at the error of misclassifying the y=1 case.
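The error measures discussed here can be derived from a confusion matrix as sketched below. Because the phrase "misclassifying as lead" can be read in more than one way, the sketch computes both the share of non-leads flagged as leads and the share of true leads that are missed. The counts are made up for illustration.

```python
def error_rates(tp, fp, tn, fn):
    """Return (overall error, false-lead rate, missed-lead rate).

    tp: leads predicted as leads        fp: non-leads predicted as leads
    tn: non-leads predicted as non-leads  fn: leads predicted as non-leads
    """
    total = tp + fp + tn + fn
    overall = (fp + fn) / total
    false_lead_rate = fp / (fp + tn) if (fp + tn) else 0.0
    missed_lead_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return overall, false_lead_rate, missed_lead_rate

# Made-up counts, for illustration only.
overall, false_lead, missed_lead = error_rates(tp=8, fp=20, tn=170, fn=2)
print(round(overall, 3), round(false_lead, 3), round(missed_lead, 3))
```

On imbalanced data the overall error is dominated by the non-lead class, which is why the class-specific rates are examined separately.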

Model                  AUC     Overall Error    Error of misclassifying as lead consumer
Logistic Regression    0.84    41%              0%
Decision Tree          0.7     8.7%             70%
Random Forest          0.92    0.4%             40%
Adaptive boosting      0.94    24.7%            0%

From the above table, it can be seen that the adaptive boosting model is best suited for unbalanced
data. Its error of misclassifying as lead consumer is 0% and its AUC is 0.94, the highest of all the
models. Logistic regression also has a 0% error of misclassifying as lead consumer, but it has a
lower AUC of 0.84.

Recommendations based on model:

1. Session duration per session is the most important variable for deciding lead generation. It is
important to make sessions interactive for consumers. The company can add information
about its products on the webpage in order to engage and inform consumers.

2. As seen from the insights, consumers who spend more time on multiple product pages do
not get converted to leads. So, the company should not pass on leads for consumers who
are browsing through multiple product webpages. A lead should be generated once a consumer
starts spending more time on one product page than a threshold limit.

3. Lead conversion is higher when the consumer has been referred to the website. The company
should increase engagement with referral leads at an early stage.

4. Since consumers from referrals have high conversion, the company should incentivize referrals and
should take steps to increase its net promoter score (NPS).

5. Consumers who use mobile to access the website have a higher probability of getting converted to
leads. So, the company should improve its mobile website to make it more accessible. Eureka
Forbes can also develop a mobile app to engage more consumers.

6. Consumers who got converted spend more time on the offer page. Eureka Forbes can partner
with financial institutions to get offers on various payment options. In the era of e-commerce,
consumers look for payment offers, and platforms like Amazon and Flipkart give many offers.
Appendix:
Variable                                       1st Quartile   Median   3rd Quartile   Mean     Maximum
Bounces_Hist                                   0              0        1              1.55     210
Help_me_buy_evt_count                          0              0        0              0.244    19
Pageviews_hist                                 4              9        26             27.6     580
Paid_Hist                                      0              1        2              1.74     27
Phone_clicks_evt_count_hist                    0              0        0              0.075    12
SessionDuration_hist                           89             477      1904           2011     88846
Sessions_hist                                  1              2        6              6.316    233
Visited_air_purifier_page_hist                 0              0        0              0.128    12
Visited_checkout_page_hist                     0              0        0              0.494    15
Visited_contactus_hist                         0              0        0              0.07     15
Visited_customer_service_amc_login_hist        0              0        0              0.181    21
Visited_customer_service_request_login_hist    0              0        0              0.053    20
Visited_demo_page_hist                         0              0        1              0.769    18
Visited_offer_page_hist                        0              0        0              0.397    16
Visited_security_solutions_page_hist           0              0        0              0.019    3
Visited_storelocator_hist                      0              0        0              0.032    8
Visited_vacuum_cleaner_page_hist               0              0        0              0.717    20
Visited_water_purifier_page_hist               0              0        1              1.62     26

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -3.25118416 0.17768337 -18.298 < 2e-16 ***


DemoReqPg_CallClicks_evt_count -17.85495372 864.28367732 -0.021 0.983518

air_purifier_page_top 0.00415941 0.00138159 3.011 0.002607 **

bounces -0.21492122 0.07259194 -2.961 0.003070 **

bounces_hist -0.09609688 0.02403169 -3.999 6.37e-05 ***

checkout_page_top -0.00288890 0.00187738 -1.539 0.123855

contactus_top 0.00019618 0.00054971 0.357 0.721176

countryi 0.68913787 0.17050408 4.042 5.30e-05 ***

customer_service_amc_login_top -0.03430404 0.01002248 -3.423 0.000620 ***

customer_service_request_login_top -0.00108166 0.00108689 -0.995 0.319645

demo_page_top -0.00185950 0.00018301 -10.160 < 2e-16 ***

device.enon_mobile -4.23925799 0.39718819 -10.673 < 2e-16 ***

dsls -0.00817475 0.00199611 -4.095 4.22e-05 ***

fired_DemoReqPg_CallClicks_evt 19.69396363 864.28370851 0.023 0.981821

fired_help_me_buy_evt -4.86378012 0.65251917 -7.454 9.07e-14 ***

fired_phone_clicks_evt -0.71599563 0.34257451 -2.090 0.036614 *

goal4Completions -31.47158874 1403.69230354 -0.022 0.982112

help_me_buy_evt_count 2.59499343 0.32768415 7.919 2.39e-15 ***

help_me_buy_evt_count_hist -0.48905568 0.09438811 -5.181 2.20e-07 ***

offer_page_top -0.00168366 0.00129784 -1.297 0.194536

pageviews -0.01181204 0.02076766 -0.569 0.569512

pageviews_hist 0.00784485 0.00516589 1.519 0.128866

paid 1.50969242 0.14148625 10.670 < 2e-16 ***

paid_hist -0.52967004 0.04236657 -12.502 < 2e-16 ***

phone_clicks_evt_count 0.67182962 0.24998605 2.687 0.007200 **

phone_clicks_evt_count_hist 0.61237235 0.10946508 5.594 2.22e-08 ***

security_solutions_page_top 0.00250247 1.13775733 0.002 0.998245

sessionDuration 0.00125844 0.00021452 5.866 4.45e-09 ***

sessionDuration_hist -0.00021122 0.00004871 -4.337 1.45e-05 ***

sessions 0.14267448 0.08047760 1.773 0.076254 .


sessions_hist 0.07645416 0.02349908 3.253 0.001140 **

sd.s -0.00133716 0.00026940 -4.963 6.92e-07 ***

sd.s_his 0.00077925 0.00014068 5.539 3.04e-08 ***

p.s 0.11045646 0.02845922 3.881 0.000104 ***

p.s_hist -0.05508644 0.01899208 -2.900 0.003726 **

sourceMedium.ereferral 10.32736287 0.60575702 17.049 < 2e-16 ***

storelocator_top 0.00709582 0.54685063 0.013 0.989647

successbookdemo_top 0.01392691 0.00344655 4.041 5.33e-05 ***

vacuum_cleaner_page_top -0.00017817 0.00047166 -0.378 0.705610

visited_air_purifier_page -2.17435787 0.61699492 -3.524 0.000425 ***

visited_air_purifier_page_hist 0.22480431 0.13001590 1.729 0.083800 .

visited_checkout_page -1.31869560 0.41522733 -3.176 0.001494 **

visited_checkout_page_hist -1.33594852 0.10683986 -12.504 < 2e-16 ***

visited_contactus 1.38800448 0.23086836 6.012 1.83e-09 ***

visited_contactus_hist -0.19864674 0.15676092 -1.267 0.205085

visited_customer_service_amc_login 0.69374863 0.39940279 1.737 0.082393 .

visited_customer_service_amc_login_hist -0.07999978 0.11542701 -0.693 0.488261

visited_customer_service_request_login 1.43899179 0.32433922 4.437 9.14e-06 ***

visited_customer_service_request_login_hist 0.60270382 0.16017007 3.763 0.000168 ***

visited_demo_page 1.34634018 0.13747743 9.793 < 2e-16 ***

visited_demo_page_hist 0.75000545 0.05087913 14.741 < 2e-16 ***

visited_offer_page -0.29541621 0.19158848 -1.542 0.123090

visited_offer_page_hist -0.16724712 0.07775156 -2.151 0.031473 *

visited_security_solutions_page -19.53636502 605.07425119 -0.032 0.974243

visited_security_solutions_page_hist -15.08781082 341.57220452 -0.044 0.964768

visited_storelocator -19.97168993 428.58546344 -0.047 0.962833

visited_storelocator_hist -20.32854172 335.10473085 -0.061 0.951627

visited_successbookdemo 32.84048630 1403.69227439 0.023 0.981335

visited_vacuum_cleaner_page -0.36860437 0.23711216 -1.555 0.120052


visited_vacuum_cleaner_page_hist 0.08376279 0.06511974 1.286 0.198342

visited_water_purifier_page 1.40577580 0.12570812 11.183 < 2e-16 ***

visited_water_purifier_page_hist 0.34336653 0.03838287 8.946 < 2e-16 ***

water_purifier_page_top -0.00098056 0.00021860 -4.486 7.27e-06 ***

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -3.2045382 0.1581746 -20.259 < 2e-16 ***

air_purifier_page_top 0.0045749 0.0012324 3.712 0.000205 ***

bounces -0.2584955 0.0636975 -4.058 4.95e-05 ***

bounces_hist -0.0960539 0.0202121 -4.752 2.01e-06 ***

countryi 0.6594093 0.1597345 4.128 3.66e-05 ***

customer_service_amc_login_top -0.0505902 0.0103466 -4.890 1.01e-06 ***

demo_page_top -0.0014143 0.0001589 -8.901 < 2e-16 ***

device.enon_mobile -4.8346324 0.4276306 -11.306 < 2e-16 ***

fired_help_me_buy_evt -4.5730778 0.5651232 -8.092 5.86e-16 ***

fired_phone_clicks_evt -1.0924387 0.3510129 -3.112 0.001857 **

help_me_buy_evt_count 2.4556866 0.2833272 8.667 < 2e-16 ***

help_me_buy_evt_count_hist -0.4191308 0.0778048 -5.387 7.17e-08 ***

paid 1.4388370 0.1296427 11.098 < 2e-16 ***

paid_hist -0.4972631 0.0384225 -12.942 < 2e-16 ***

phone_clicks_evt_count 0.8569251 0.2605416 3.289 0.001005 **

phone_clicks_evt_count_hist 0.5046040 0.1006385 5.014 5.33e-07 ***

sessionDuration 0.0007238 0.0001717 4.215 2.50e-05 ***

sessionDuration_hist -0.0002177 0.0000375 -5.805 6.42e-09 ***

sessions 0.1580555 0.0583974 2.707 0.006799 **

sessions_hist 0.0848723 0.0171010 4.963 6.94e-07 ***


sd.s -0.0010719 0.0002263 -4.736 2.18e-06 ***

sd.s_his 0.0005375 0.0001211 4.437 9.11e-06 ***

p.s 0.0811029 0.0127542 6.359 2.03e-10 ***

p.s_hist -0.0258690 0.0134435 -1.924 0.054320 .

sourceMedium.ereferral 10.7000621 0.5666758 18.882 < 2e-16 ***

successbookdemo_top 0.0254676 0.0025929 9.822 < 2e-16 ***

visited_air_purifier_page -2.1551119 0.5702502 -3.779 0.000157 ***

visited_air_purifier_page_hist 0.2637309 0.1171688 2.251 0.024394 *

visited_checkout_page -1.4778892 0.2984918 -4.951 7.38e-07 ***

visited_checkout_page_hist -1.2101764 0.0942133 -12.845 < 2e-16 ***

visited_contactus 1.4404839 0.1768549 8.145 3.79e-16 ***

visited_customer_service_amc_login 1.0285116 0.3760540 2.735 0.006238 **

visited_customer_service_request_login 1.3927490 0.2586894 5.384 7.29e-08 ***

visited_customer_service_request_login_hist 0.7436959 0.1181880 6.292 3.12e-10 ***

visited_demo_page 1.4809747 0.1242353 11.921 < 2e-16 ***

visited_demo_page_hist 0.7169562 0.0451215 15.889 < 2e-16 ***

visited_offer_page_hist -0.3081055 0.0727708 -4.234 2.30e-05 ***

visited_water_purifier_page 1.1663008 0.1138703 10.242 < 2e-16 ***

visited_water_purifier_page_hist 0.4022705 0.0356589 11.281 < 2e-16 ***

water_purifier_page_top -0.0003673 0.0001818 -2.020 0.043338 *
