
MIS 6334.001 Spring 2017


Advanced Business Analytics with SAS
PROJECT

Group 1
Abhinav Aravabhumi
Megha Nellutla
Kavya Chandershekar
Prateek Sahu
1. Reading books.txt, generating the count dataset and printing the first 10 records.

Solution:

Code for reading the text file and the subsequent steps

LIBNAME project 'C:\mis6334';


DATA project.books (drop=VAR15);
infile 'C:\mis6334\project\books.txt' delimiter='09'x MISSOVER DSD
lrecl=50000 firstobs=2 IGNOREDOSEOF;
informat userid best32. ;
informat education best32. ;
informat region $1. ;
informat hhsz best32. ;
informat age best32. ;
informat income best32. ;
informat child best32. ;
informat race best32. ;
informat country best32. ;
informat domain $20. ;
informat date best32. ;
informat product $132. ;
informat qty best32. ;
informat price best32. ;
informat VAR15 $1. ;
format userid best12. ;
format education best12. ;
format region $1. ;
format hhsz best12. ;
format age best12. ;
format income best12. ;
format child best12. ;
format race best12. ;
format country best12. ;
format domain $20. ;
format date best12. ;
format product $132. ;
format qty best12. ;
format price best12. ;
format VAR15 $1. ;
input
userid
education
region $
hhsz
age
income
child
race
country
domain $
date
product $
qty
price
VAR15 $
;
RUN;

data barnesnoble;
set project.books;
if domain = "barnesandnoble.com";
run;

proc means data=barnesnoble NOPRINT;


class userid;
id education region hhsz income child race country;
output out=barnesnoblesum
sum(qty) = NumBooks;
run;

Data barnesnoblesum;
set barnesnoblesum (drop = _TYPE_ _FREQ_);
if userid = . then delete;
run;

PROC PRINT data=barnesnoblesum (obs=10);


run;
Results:

2. Build an NBD model, ignoring the demographic variables. Report your results.

Solution:

Code for counting the number of amazon books purchased

data amazonbookcount;
set project.books;
if domain = "amazon.com";
run;

proc means data=amazonbookcount NOPRINT;


class userid;
id education region hhsz income child race country;
output out=amazonbookssum
sum(qty) = NumBooksamazon;
run;

Data amazonbookssum;
set amazonbookssum (drop = _TYPE_ _FREQ_);
if userid = . then delete;
run;
/* merging barnesandnoble count dataset with amazon count dataset */

data bothbooks;
merge amazonbookssum barnesnoblesum;
by userid;
if NumBooks = . then NumBooks = 0;
run;

proc means data=bothbooks NOPRINT;


class NumBooks;
output out=nbdmodel
n(userid) = peoplecount;
run;

data nbdmodel;
set nbdmodel (drop= _TYPE_ _FREQ_);
if NumBooks = . then delete;
run;

/*Printing the first ten observations */

proc print data=nbdmodel (obs=10);


run;

Results:
Code for NBD model

proc nlmixed data=nbdmodel;
parms alpha=1 r=1;
/* formula for P(X=x|r,alpha) */
form = (gamma(r+NumBooks)/(gamma(r)*fact(NumBooks)))
       *((alpha/(alpha+1))**r)*(1/(alpha+1))**NumBooks;
ll = peoplecount*log(form);
model peoplecount ~ general(ll);
run;
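
The same likelihood can also be written directly on the log scale. The sketch below is an alternative formulation, not the code that produced the reported estimates: it assumes the nbdmodel dataset built above and replaces gamma() and fact() with lgamma(), which avoids numeric overflow when NumBooks is large. The bounds statement is an added assumption that keeps both parameters positive during optimization.

```sas
proc nlmixed data=nbdmodel;
  parms alpha=1 r=1;
  bounds alpha > 0, r > 0;   /* assumption: restrict both parameters to be positive */
  /* log P(X = NumBooks | r, alpha), using lgamma(x+1) = log(x!) */
  logp = lgamma(r + NumBooks) - lgamma(r) - lgamma(NumBooks + 1)
         + r*log(alpha/(alpha + 1)) - NumBooks*log(alpha + 1);
  ll = peoplecount*logp;     /* weight by the number of users at this count */
  model peoplecount ~ general(ll);
run;
```

Mathematically this is the same objective as peoplecount*log(form), so the estimates should agree up to optimizer tolerance.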

The optimal values of alpha and r are 0.1299 and 0.09723, respectively. The fit statistics and
parameter estimates are shown in the screenshot below.

3. Calculate the values of (i) Reach, (ii) Average Frequency, and (iii) Gross Ratings Points
(GRPs) based on the NBD Model. Show your work.

Solution:

To calculate reach, we start from the NBD probability formula:

P(X=x|r,alpha) = [Γ(r+x) / (Γ(r) x!)] (alpha/(alpha+1))^r (1/(alpha+1))^x

For x = 0 this reduces to

P(X=0|r,alpha) = (alpha/(alpha+1))^r = (0.1299/(0.1299+1))^0.09723 = 0.81032

Reach is the probability of at least one purchase: P(X>0) = 1 - 0.81032 = 0.18968

Hence, reach = 0.18968*100 = 18.97%

To calculate the average frequency, we need the expected number of purchases:

E(X) = r/alpha = 0.09723/0.1299 = 0.7484

Average frequency is the expected number of purchases among those reached:

average frequency = E(X)/P(X>0) = 0.7484/0.1897 = 3.945

Gross rating points (GRPs) are 100 times the expected number of purchases:

GRPs = 100*0.7484 = 74.84

Hence, GRPs = 74.84, which also equals reach × average frequency (18.97 × 3.945 ≈ 74.84).
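
The arithmetic above can be reproduced in a short DATA step. This is a minimal sketch: the dataset name nbd_exposure and the variable names are illustrative, and the parameter values are typed in by hand from the calculation above rather than read from the PROC NLMIXED output.

```sas
data nbd_exposure;
  alpha = 0.1299;                    /* values as used in the calculation above */
  r     = 0.09723;
  p0       = (alpha/(alpha + 1))**r; /* P(X = 0) */
  reach    = 100*(1 - p0);           /* percent of users with at least one purchase */
  ex       = r/alpha;                /* expected number of purchases, E(X) */
  avg_freq = ex/(1 - p0);            /* expected purchases among those reached */
  grps     = 100*ex;                 /* same as reach*avg_freq */
run;

proc print data=nbd_exposure noobs;
run;
```

Because the inputs are rounded estimates, the printed values may differ from the hand calculation in the last digit.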

4. Build a Poisson regression model using the demographic information (customer
characteristics) provided. Report your results. What are the managerial takeaways: which
customer characteristics seem to be important?
Optional: You have the flexibility in choosing the variables to include. If you wish to do so,
you can choose to eliminate some (via feature selection, for example) or create new ones
(from the variables you have available, for example, fraction of weekend purchases). This is
optional for this project, but if you do anything along these lines, please provide your
justification.

Solution:

Code for Poisson regression model

Data poissonbooks;
set bothbooks (drop=NumBooksamazon);
run;

/* checking for missing values */

/* N counts non-missing values, NMISS counts missing ones */
Proc means data=poissonbooks N NMISS;
var education;
run;

/* fixing missing region values */

data poissonbooks;
set poissonbooks;
if region='*' then region=' '; /* region is a character variable, so missing is a blank */
run;
/* building Poisson Regression Model */

proc nlmixed data=poissonbooks;


/* m stands for lambda */
parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0;
m=m0*exp(b1*region+b2*hhsz+b3*age+b4*income+b5*child+b6*race+b7*country);
ll = NumBooks*log(m)-m-log(fact(NumBooks));
model NumBooks ~ general(ll);
run;

Results:

Managerial Takeaways:

We can infer the following from the above parameter estimates.


● Household size is not significant at the 5% significance level, as its p-value is much
greater than 0.05.
● The variables that are significant in our analysis, and which explain the heterogeneity in
lambda (purchasing behavior on barnesandnoble.com), are age, region, income, child, and
race, as their p-values are less than 0.05.
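
As a cross-check on the hand-coded likelihood, a Poisson regression can also be fit with PROC GENMOD. The sketch below is not the model reported above: it assumes the same poissonbooks dataset and, as a labeled difference, declares region in a CLASS statement so that it is treated as categorical, whereas the PROC NLMIXED code gives it a single coefficient. The estimates for region will therefore not match one-to-one.

```sas
proc genmod data=poissonbooks;
  class region;   /* assumption: treat region as categorical */
  model NumBooks = region hhsz age income child race country
        / dist=poisson link=log;
run;
```

PROC GENMOD drops observations with missing covariates, so the number of observations used should be compared with the PROC NLMIXED run before comparing estimates.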

5. Next, we start the setup for developing an NBD regression model. What is the formula
for the log-likelihood expression, LL?

Solution:

LL = log( (gamma(r+NumBooks)/(gamma(r)*fact(NumBooks)))
          * ((alpha/(alpha+expBx))**r)
          * ((expBx/(alpha+expBx))**NumBooks) )

where expBx =
exp((b1*region)+(b2*hhsz)+(b3*age)+(b4*income)+(b5*child)+(b6*race)+(b7*country))

6. Build an NBD regression model using the demographic information provided. Report
your results. What are the managerial takeaways: which customer characteristics seem to
be important?
Optional: As with the Poisson regression, you have the flexibility in choosing the variables
to include. If you wish to do so, you can choose to eliminate some (via feature selection, for
example) or create new ones (from the variables you have available, for example, fraction
of weekend purchases). This is optional for this project, but if you do anything along these
lines, please provide your justification.

Solution:

Code for NBD regression model

/* Building an NBD Regression Model for books dataset*/


proc nlmixed data=poissonbooks;
parms r=1 a=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0;
expBX=exp(b1*region+b2*hhsz+b3*age+b4*income+b5*child+b6*race+b7*country);
ll = log(gamma(r+NumBooks))-log(gamma(r))-
log(fact(NumBooks))+r*log(a/(a+expBX))+NumBooks*log(expBX/(a+expBX));
model NumBooks ~ general(ll);
run;
Results:

Managerial Takeaways:

We can infer the following from the above parameter estimates.


● Education is not included in the model, as it has too many missing values.
● Region and race are the only variables that are significant at the 5% significance level
(p-value less than 0.05). Their coefficients are negative, which means that they are
negatively related to the number of books purchased.
● The variables that are not significant in our analysis, and which cannot explain the
heterogeneity in lambda (purchasing behavior on barnesandnoble.com), are country, age,
household size, child, and income, as their p-values are much greater than 0.05.
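
The NBD regression can likewise be cross-checked with PROC GENMOD using a negative binomial distribution. This is a sketch under the same assumptions as before (poissonbooks dataset, region declared as categorical, which differs from the PROC NLMIXED code above), not the model that produced the reported results.

```sas
proc genmod data=poissonbooks;
  class region;   /* assumption: treat region as categorical */
  model NumBooks = region hhsz age income child race country
        / dist=negbin link=log;
run;
```

PROC GENMOD parameterizes the variance as mu + k*mu**2 and reports k as the dispersion parameter, so k corresponds to 1/r in the NBD formulation used here.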

7. Are there any significant differences between the results from the Poisson and NBD
regressions? If so, what exactly is the difference? Discuss what you believe about the
cause(s) of the difference.

Solution:
Difference between NBD regression and Poisson regression models:

Comparing the fit statistics of the two models shows the differences.
Poisson Regression model Fit Statistics

NBD Regression model Fit Statistics

To compare the models, we need the log-likelihood and BIC (Bayesian information criterion)
values of both. We can infer the following from these values.

● The log-likelihood for the Poisson regression model is -18834 and its BIC is 37751,
whereas the log-likelihood for the NBD regression model is -8364.5 and its BIC is 16820.
A higher log-likelihood and a lower BIC both indicate a better model, and here the two
criteria agree: the NBD regression is the better fit.
● Region and race are the only significant variables in the NBD regression for the number
of books purchased, whereas in the Poisson regression model the purchasing behavior
depends on age, region, income, child, and race. The variables that are not significant in
the NBD model are country, age, household size, child, and income, whereas the only
variable that is not significant in the Poisson regression model is household size.
● The difference between the two models is likely due to their different assumptions.
The Poisson distribution assumes that the mean and variance are equal. When the data
show a variance greater than the mean (overdispersion), negative binomial regression is
more flexible than Poisson regression: the NBD model has shape and rate parameters that
adjust the variance independently of the mean. Hence, the NBD model is appropriate for
count data with overdispersion. If the variance is equal to the mean, Poisson regression
would be more appropriate.
● Another inference is that the Poisson model was better at predicting larger numbers of
books and the NBD model was better at predicting smaller numbers of books. However,
neither model was a perfect fit for the observed purchasing behavior.
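
The equal mean and variance assumption discussed above can be checked directly on the count variable. A minimal sketch, assuming the poissonbooks dataset used for both regressions:

```sas
proc means data=poissonbooks n mean var;
  var NumBooks;   /* a variance well above the mean suggests overdispersion */
run;
```

This compares the unconditional mean and variance only. Overdispersion in the regression setting is judged relative to the fitted means, so this is a quick diagnostic rather than a formal test.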

8. Briefly summarize what you learned from this project. This is an open-ended question,
so please include anything you found worthwhile: relating to the modeling tool (SAS), the
modeling process, insights from the modeling, any managerial takeaways that were
insightful to you, and so on.

Solution:

We have learnt the following from the project:

● Building advanced predictive models in SAS, especially count models.
● Identifying a business problem or business case and building several models to come up
with solutions and to understand what companies could do differently.
● Testing rigorously, since selecting and rejecting variables on a trial basis can give
different results. Testing also reveals which variables are actually significant in the
analysis and which are not.
● Using Proc nlmixed to find the shape and rate parameters that maximize the
log-likelihood expression.
● Learning that different models produce different results on the same data. No model is
perfect, so trying different models to find the best fit for the business case is the key.
● Appreciating the importance of significance, p-values, and other statistical concepts, and
how p-values help determine which variables in a model are significant.
● Understanding how, using customer demographics alone, we can build models and
make key business decisions.
● Understanding how to use dummy variables.
● Gaining hands-on SAS modelling experience, which is crucial for the industry.
● Finally, working with a group of diverse individuals with different approaches to the
problems, which can strengthen understanding of different statistical procedures.
