The Smart Recruit

AnalyticsVidhya organised a weekend hackathon called The Smart Recruit, which was held on
23rd-24th July, 2016.
I won the previous hackathon, The Seer's Accuracy, and was hoping to do well in this one too.
Problem
The problem was to identify which agents would be successful in making sales of financial
products. So, it was a binary classification problem.
Data
Like the previous hackathon, the data seemed quite good and promising.
Train and test data consisted of agent applications with data about the application, the manager
and a few related features about them.
Model
I'm sure most participants just went ahead and dumped the data into XGB-type models with a lot
of scores hovering in the 0.63 - 0.66 AUC range.
I tried to get a robust/stable validation framework, like I mentioned in AV's article on Winning
Tips. Didn't seem to work/help. The CV/LB scores were all over the place in my first few
submissions.
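One way to see how noisy a validation score is, sketched here under my own assumptions rather than taken from the original solution: compute the AUC of a fixed set of predictions on repeated stratified folds and look at the fold-to-fold spread. The function names (auc, stratified_folds, auc_spread) are hypothetical, and the sketch uses only the Python standard library.

```python
import random
from statistics import mean, pstdev

def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def stratified_folds(labels, k, rng):
    """Split row indices into k folds, each with roughly the overall class ratio."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def auc_spread(labels, scores, k=5, repeats=10, seed=0):
    """Mean and standard deviation of per-fold AUC over repeated stratified splits."""
    rng = random.Random(seed)
    fold_aucs = []
    for _ in range(repeats):
        for fold in stratified_folds(labels, k, rng):
            fold_aucs.append(auc([labels[i] for i in fold],
                                 [scores[i] for i in fold]))
    return mean(fold_aucs), pstdev(fold_aucs)
```

If the standard deviation returned by auc_spread is as large as the gap between two candidate models, neither the CV score nor a small public LB can reliably tell them apart.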
That's when I decided to take a step back and inspect the data in detail. It was evident to me that
there would be a huge LB shake-up due to the variance between the CV and LB scores. Hence, it didn't
make much sense to spend too much time trying to optimize models. Instead, I tried
to look for some pattern/feature which could boost my score beyond the expected error margins.
And that's exactly what happened. A simple plot of the target variable showed a pattern, which
seemed too good to be true. I tried a feature using this and my CV jumped to 0.8... and that was
the feature that ultimately proved to be the winning one.
Here's the plot that changed everything:

This is the plot of the target variable for the first four days. A clear pattern exists where you see
most of the 1's at the beginning of a day and most of the 0's at the end of the day. You can plot
the target variable of any single day and observe a similar trend.
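The same pattern can be checked numerically without a plot. The sketch below is a rough illustration, not the author's code: it assumes each row carries a day key and a 0/1 target, and that rows are kept in their original file order. The function name target_rate_by_position and the bucket count are hypothetical.

```python
from collections import defaultdict

def target_rate_by_position(rows, n_buckets=4):
    """rows: iterable of (day, target) pairs in original file order.

    Splits each day's rows into n_buckets equal position bands
    (0 = start of day) and returns the mean target per band,
    pooled over all days.
    """
    by_day = defaultdict(list)
    for day, target in rows:
        by_day[day].append(target)

    sums = [0] * n_buckets
    counts = [0] * n_buckets
    for targets in by_day.values():
        n = len(targets)
        for pos, t in enumerate(targets):
            bucket = pos * n_buckets // n  # always in 0 .. n_buckets - 1
            sums[bucket] += t
            counts[bucket] += 1
    # Bands that received no rows are reported as NaN.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]
```

If the trend described above holds, the returned rates should fall steadily from the first band to the last.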
Leakage? Possible. Hidden trend? Possible. At first I was convinced it was leakage and a data
preparation issue, but later, felt there was a possibility that applications received towards the end
of the day are more likely to be rejected than ones received early.
Either way, I refined this pattern into a feature called Order_Percentile in my code, which turned
out to be the most important feature.
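The post does not spell out how Order_Percentile is computed, so the following is only a guess at its general shape: each row's position within its day, divided by the number of rows for that day. It assumes the file order within a day is meaningful; the function name order_percentile and the exact formula are my assumptions, and the linked solution may differ.

```python
from collections import Counter

def order_percentile(days):
    """days: list of day keys, one per row, in original file order.

    Returns, for each row, its 1-based position within its day divided
    by that day's row count, so values lie in (0, 1]. ASSUMPTION: this is
    an illustrative definition, not necessarily the author's.
    """
    totals = Counter(days)   # rows per day
    seen = Counter()         # rows encountered so far per day
    result = []
    for day in days:
        seen[day] += 1
        result.append(seen[day] / totals[day])
    return result
```

Because the value is normalised by the day's row count, days with very different application volumes end up on the same scale, which is what lets a tree model pick up an "early in the day vs. late in the day" split.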
My final model was a single XGBoost with 14 features, with the other 13 being cleaned up
features from the raw variables. I achieved a CV of 0.887 which was in the same range as the
LB. I'd have liked to try out some more parameter tuning and ensembling, but with the limited
duration of a hackathon, there wasn't any time left.
GitHub
View My Complete Solution
Results
I stood 1st on the public LB with 0.885. My good friend and rival SRK, who teamed up with
Kaggler Mark Landry, was 2nd with 0.876, and the team of Kanishk Agarwal and Yaasna Dua
was 3rd with 0.839. No other team figured out the winning feature, and their scores
were below 0.71.
The rankings held the same on the private LB, but it was much closer, with SRK-Mark scoring
0.7647 and me scoring 0.7658.
My username is 'vopani'.
View Complete Results
Views
This is my 2nd AV win on the trot, and while it's not the best way to win, I'm happy I could find
a useful winning feature in the data.
Congrats to the ever consistent SRK, who also happens to be someone I'm chasing on Kaggle :-)
Fun weekend, bonus to win it, and looking forward to the next hackathon, where I'll be on a hat-trick!
An interesting coincidence: I got the exact same score on the public LB (0.8856) in the previous
hackathon too, The Seer's Accuracy!
External Links
View AV article on the winners
View 2nd place solution by SRK
View 3rd place solution by Kanishk Agarwal