
Maximum Likelihood Estimation (MLE)
 Most statistical methods are designed to
minimize error.
 Choose the parameter values that minimize
predictive error: |y − y′| or (y − y′)²
 Maximum likelihood estimation seeks the
parameter values that are most likely to have
produced the observed distribution.
Likelihood and PDFs
 For a continuous variable, the likelihood of
a particular value is obtained from the PDF
(probability density function).

[Figure: example PDFs for the Normal and Gamma distributions]
Likelihood ≠ Probability
(for continuous distributions)

[Figure: continuous PDF; P(x = 2) ≈ 0, while P(x ≤ 2) is the shaded area under the curve]


Maximum likelihood
estimates of parameters
 For MLE, the goal is to determine the most
likely values of the population parameters
(e.g., µ, σ, ρ, β, …) given observed sample
values (e.g., x-bar, s, r, b, …).
• Any model’s parameters (e.g., β in linear
regression; a, b, c, etc. in nonlinear models;
weights in backprop) can be estimated using MLE.
Likelihood is based on shape of
the d.v.’s distribution!!!
 ANOVA, Pearson’s r, t-test, regression…
all assume that the d.v. is normally distributed.
 Under those conditions, the LSE (least squares
estimate) is the MLE.
 If the d.v. is not normally distributed, the
LSE is not the MLE.
 So, the first step is to determine the shape of
the distribution of your d.v.
Step 1: Identify the distribution
 Normal, lognormal, beta, gamma, binomial,
multinomial, Weibull, Poisson, exponential….
 AAAHHH!
 Precision isn’t critical unless the sample size is
huge.
 Most stats packages can fit a d.v. distribution
using various distribution classes.
 In R, do a “distribution” analysis and then try
various distribution fits
 Note, 0 and negative values are illegal for some fits.
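Absent a stats package, the comparison can be done by hand: fit each candidate distribution by MLE and compare log-likelihoods on the same data. A minimal Python sketch (data values made up; only normal vs. exponential compared, both of which have closed-form MLEs):

```python
import math

# Hypothetical positive, right-skewed measurements.
data = [0.2, 0.5, 0.7, 1.1, 1.9, 3.5, 6.0]
n = len(data)

# Normal fit: MLEs of mean and variance in closed form.
mu = sum(data) / n
var = sum((x - mu) ** 2 for x in data) / n
ll_normal = sum(-0.5 * math.log(2 * math.pi * var)
                - (x - mu) ** 2 / (2 * var) for x in data)

# Exponential fit: MLE of the rate is 1 / sample mean.
rate = 1 / mu
ll_expon = sum(math.log(rate) - rate * x for x in data)

print(ll_normal, ll_expon)  # higher log-likelihood = better-fitting shape
```

For this skewed, strictly positive sample the exponential fit scores higher, which is the kind of signal a "distribution" analysis gives you. (Note the exponential fit would fail on 0 or negative values, per the caveat above.)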
Step 2: Choose analysis
 If only looking at linear models and fixed
effects, use GLM.
 GLM allows you to specify the dv distribution
type.
 (You’ll know you have an MLE method if the
output includes likelihoods).
 Random effects will be considered later.
 Otherwise, you need to modify your fitting
method to use a different loss function.
GLM: Distributions and
Link Functions
 The distribution specifies the nature of the error
distribution.
 The link function g provides the relationship between
the linear predictor Xβ and the mean of the distribution:
g(E(Y)) = Xβ, i.e., E(Y) = g⁻¹(Xβ)
 Most often, the distribution determines the best link
function (aka “canonical link functions”).
 For example, a Y whose mean is constrained between 0 and 1
(binomial) needs a link whose inverse is likewise
constrained (logit, probit, comp loglog).
Distributions and Link Functions
 Distributions
 In JMP: Normal, binomial (0/1), Poisson (count
data), exponential (positive continuous)
 In R: Gaussian, binomial, Gamma,
inverse.gaussian, poisson, quasi, quasibinomial,
quasipoisson
 Link functions
 In JMP: Identity, logit, probit, log, reciprocal,
power, comp loglog
 In R: lots... Depends on distribution.
 See help(family) to read more.
http://www.mathworks.com/products/demos/statistics/glmdemo.html
Canonical Link Functions
 Normal → identity; Poisson → log; binomial → logit;
Gamma → inverse (1/µ); inverse Gaussian → 1/µ²
Step 3: Loss functions
 LSE uses (y − y′)² as the loss function and tries
to minimize the sum of this quantity across
rows → SSE.
 MLE loss functions depend on the assumed
distribution of the d.v.
MLE Loss functions
 The likelihood function is the joint probability
of all of the data.
 For example, P(µ=2) for row 1, and P(µ=2) for row
2, and P(µ=2) for row 3…
 Which equals:
  ∏_{i=1}^{N} P_i(µ = 2)
 It’s mathematically easier to deal with sums, so
we’ll take the log of that quantity:
  log( ∏_{i=1}^{N} P_i(µ = 2) ) = ∑_{i=1}^{N} log( P_i(µ = 2) )
MLE Loss functions, cont.
 Now, we have something that can be computed
for each row and summed…
 …but we want the maximum of that last equation,
whereas loss functions should be minimized.
 Easy! We’ll just negate it.
 Negative log likelihood becomes our loss function.
So, once you know the PDF…
 …take the log of the function and negate it.
 This doesn’t change the point of the
maximum/minimum of the PDF…
[Figure: the PDF, log(PDF), and −log(PDF) curves; all three peak/trough at the same parameter value]
Step 4: Find Parameter Values
that Maximize Likelihood
(i.e., that minimize the negative log likelihood)
 There are no general closed-form solutions, so
iterative methods are again used.
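One such iterative method can be sketched in a few lines of Python: plain gradient descent on the negative log-likelihood of a normal sample with known sd (data values made up; real software uses more robust optimizers such as Newton–Raphson or IRLS, but the idea is the same).

```python
import math

data = [4.2, 5.1, 3.8, 4.9, 5.0]
sigma = 1.0

def nll(mu):
    # Negative log-likelihood of a normal sample with known sigma.
    return sum(0.5 * math.log(2 * math.pi * sigma**2)
               + (x - mu) ** 2 / (2 * sigma**2) for x in data)

def nll_grad(mu):
    # Derivative of the NLL with respect to mu.
    return sum((mu - x) / sigma**2 for x in data)

mu = 0.0                  # arbitrary starting value
for _ in range(200):      # iterate: step downhill on the NLL
    mu -= 0.05 * nll_grad(mu)

print(mu)  # converges to the sample mean, 4.6
```

The iterations converge to the sample mean, which is exactly the closed-form MLE for this toy case; iterative fitting matters when no such closed form exists.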
Step 5: Model comparison in MLE
 Model choice using AIC/BIC
 AIC: -2LL + 2k; BIC: -2LL + k*ln(n)
 Computation of likelihood ratio (LR)
 L(model 1) / L(model 2)
 USE LR ONLY FOR NESTED MODELS!

[Figure: Model 1 vs. Model 2 compared at LR = 2, LR = 1, and LR = .2]
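The model-comparison arithmetic above is straightforward to compute once the fitting software reports log-likelihoods. A minimal Python sketch (the log-likelihoods, parameter counts, and sample size are made up for illustration):

```python
import math

# Hypothetical fitted log-likelihoods for two nested models.
ll_small, k_small = -120.4, 3   # restricted model, 3 parameters
ll_big, k_big = -117.9, 5       # full model, 5 parameters
n = 100                         # sample size

def aic(ll, k):
    # AIC: -2LL + 2k
    return -2 * ll + 2 * k

def bic(ll, k, n):
    # BIC: -2LL + k*ln(n)
    return -2 * ll + k * math.log(n)

# Likelihood ratio on the log scale (valid for nested models only):
lr_stat = 2 * (ll_big - ll_small)
print(aic(ll_big, k_big), bic(ll_big, k_big, n), lr_stat)
```

Note how the two criteria can disagree: here the 2.5-point gain in log-likelihood is enough to give the bigger model the lower AIC, but BIC's heavier k·ln(n) penalty still favors the smaller model.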
MLE = The Future
