Poisonous
Fish
Brian
Bahmanyar
Linear
Regression
Project,
Dr.Chance
October
December,
2014
This
project
was
completed
in
3
parts:
Part
1:
Weighted
Single
Regression
~
pages
2-10
Part
2:
Weighted
Multiple
Regression
~
pages
11-27
Part
3:
Weighted
Logistic
Regression
~
pages
28-47
PART
1:
Introduction
Mercury
poisoning
is
a
medical
condition
caused
by
exposure
to
mercury
or
its
compounds.
Mercury
is
a
heavy
metal
occurring
in
several
forms,
all
of
which
can
produce
toxic
effects
in
high
enough
doses.
Toxic
effects
include
damage
to
the
brain,
kidneys
and
lungs.
The
type
and
degree
of
symptoms
exhibited
depend
upon
the
individual
toxin,
the
dose,
and
the
method
and
duration
of
exposure.
The
consumption
of
fish
is
by
far
the
most
significant
source
of
ingestion-related
mercury
exposure
in
humans.
(http://en.wikipedia.org/wiki/Mercury_poisoning)
In
my
investigation
I
look
at
factors
contributing
to
the
mercury
concentration
of
Largemouth
Bass
in
Florida
lakes.
I
was
interested
to
see
if
there
are
statistically
significant
predictors
for
the
mercury
contents
of
these
fish.
This
would
allow
fishermen
to
have
a
better
sense
of
whether
a
Largemouth
Bass
is
safe
to
consume
based
on
various
factors
measured
from
the
water.
I
found
data
for
this
investigation
at
The
Data
and
Story
Library
(DASL).
Those
who
collected
the
data
studied
53
different
Florida
lakes
to
examine
the
factors
that
influence
the
level
of
mercury
contamination
in
Largemouth
Bass.
Unfortunately,
there
is
no
indication
as
to
whether
this
was
a
random
sample
of
53
lakes;
we
may
not
be
able
to
generalize
our
data
to
some
larger
population.
They
collected
water
samples
from
the
middle
of
each
lake
in
August
1990
and
then
again
in
March
1991.
The
pH
level,
the
amount
of
chlorophyll,
calcium,
and
alkalinity
were
measured
in
each
sample.
The
average
of
the
August
and
March
values
was
recorded.
Next,
a
sample
of
fish
was
taken
from
each
lake
with
sample
sizes
ranging
from
4
to
44
fish
(it is also unknown whether these samples were random).
The
age
of
each
fish
and
mercury
concentration
in
the
muscle
tissue
was
measured.
Since
fish
absorb
mercury
over
time,
older
fish
will
tend
to
have
higher
concentrations.
Thus,
to
make
a
fair
comparison
of
the
fish
in
different
lakes,
the
investigators
used
a
regression
estimate
of
the
expected
mercury
concentration
in
a
three-year-old
fish
as
the
standardized
value
for
each
lake.
Finally,
in
10
of
the
53
lakes,
the
age
of
the
individual
fish
could
not
be
determined
and
the
average
mercury
concentration
of
the
sampled
fish
was
used
instead
of
the
standardized
value.
(http://lib.stat.cmu.edu/DASL/Datafiles/MercuryinBass.html)
The
observational
units
in
this
study
are
the
samples
of
Largemouth
Bass
collected
from
the
various
Florida
lakes.
In
Part
I
of
this
project
I
will
focus
on
the
relationship
between
the
alkalinity
and
the
average
mercury
concentration
in
the
sample
of
Largemouth
Bass.
Alkalinity
is
the
name
given
to
the
quantitative
capacity
of
an
aqueous
solution
to
neutralize
an
acid.
Measuring
alkalinity
is
important
in
determining
a
stream's
ability
to
neutralize
acidic
pollution
from
rainfall
or
wastewater
(http://en.wikipedia.org/wiki/Alkalinity).
I
will
use
the
alkalinity
of
the
lake,
measured
in
mg/L,
as
an
explanatory
variable
in
an
effort
to
explain
some
of
the
variability
in
the
mercury
concentrations
in
the
Largemouth
Bass,
measured
in
parts
per
million
in
the
muscle
tissue.
Due
to
the
fact
that
alkalinity
helps
neutralize
acidic
pollution,
I
predict
that
higher
alkalinity
levels
in
a
lake
will
be
associated
with
lower
concentrations
of
mercury
in
the
Largemouth
Bass
that
live
there.
Table 1: Multivariate Correlations (Weight: No.samples)
                    Avg_Mercury   Log10-alkalinity
Avg_Mercury              1.0000            -0.6729
Log10-alkalinity        -0.6729             1.0000
According to Table 1, the correlation coefficient, r, is about -0.673. A correlation coefficient of -0.673 tells us that there is a reasonably strong, negative, linear relationship between the average mercury concentration in the bass and the base 10 log of alkalinity.
Now
we
will
look
at
the
individual
distributions
of
the
important
variables
in
our
model.
Summary Statistics (Avg_Mercury)
Mean              0.5271698
Std Dev           0.3410356
Std Err Mean      0.0468448
Upper 95% Mean    0.6211709
Lower 95% Mean    0.4331688
From
the
histogram
of
average
mercury
concentrations
we
can
see
that
the
data
are
skewed
to
the
right.
This
means
that
most
of
the
lakes
we
sampled
contained
Largemouth
Bass
that
had
low
average
mercury
concentrations,
and
few
lakes
had
Bass
with
high
average
mercury
concentrations.
The 53 sampled lakes contained Largemouth Bass with an estimated mean average mercury concentration of 0.527 parts/million.
(estimated
not
in
the
sense
that
the
mean
is
a
prediction,
estimated
in
the
sense
that
researchers
used
a
sample
of
bass
to
estimate
mercury
concentrations
of
all
bass
in
the
lake)
Histogram: Alkalinity
Summary Statistics (Alkalinity)
Mean              37.530189
Std Dev           38.203527
Std Err Mean      5.247658
Upper 95% Mean    48.060385
Lower 95% Mean    26.999993
From
the
histogram
of
alkalinity
levels
we
can
see
that
the
data
are
skewed
to
the
right.
This
means
that
most
of
the
lakes
we
sampled
contained
low
alkalinity
levels,
and
few
lakes
had
high
levels
of
alkalinity.
The
53
sampled
lakes
had
an
average
alkalinity
of
37.53
mg/L.
Summary Statistics (Sample Size/Observation)
Mean              13.056604
Std Dev           8.5606773
Std Err Mean      1.1758995
Upper 95% Mean    15.416219
Lower 95% Mean    10.696989
Due
to
the
fact
that
observations
vary
by
sample
size,
it
is
important
to
look
at
the
distribution
of
these
weights.
The
majority
of
observations
of
mercury
concentrations
come
from
samples
of
between
5
and
15
Largemouth
Bass.
This
may
or
may
not
be
representative
of
all
the
Largemouth
Bass
in
the
lake
depending
on
how
they
were
selected,
and
how
many
of
these
fish
there
are
in
the
lake
total.
On
average,
the
researchers
who
collected
the
data,
measured
mercury
concentrations
from
samples
of
about
13
Largemouth
Bass.
Referring
to
the
histogram
in
Figure
4
we
can
see
that
there
is
indeed
right
skew
in
the
studentized
residuals.
This
suggests
that
I
should
try
another
transformation
to
correct
for
this.
I
will
now
look
at
the
base
10
log
of
average
mercury
by
the
base
10
log
of
alkalinity.
Figure 5 shows us a scatterplot of the data (Log10-avg. mercury by Log10-alkalinity) after the second transformation. The data still seem linear. Table 2 shows us the output for a lack of fit test. We observed an F-Ratio of 0.8962, which led to a large p-value of 0.664. At the alpha equals 0.05 level, we don't have significant evidence that the linear model is not appropriate. However, there are a few observations, circled in blue, that seem to have much larger residuals than the rest of the data. The residuals for these data will be looked at in more detail when I evaluate unusual observations.
We
can
assume
independence
because
the
average
mercury
concentration
in
the
sample
of
bass
in
one
Florida
Lake
should
not
have
an
effect
on
the
mercury
concentrations
of
the
bass
in
other
lakes
because
they
are
isolated
bodies
of
water.
To
summarize,
this
model
is
behaving
fairly
well.
Using
a
weighted
regression
will
give
more
influence
to
the
observations
that
came
from
larger
samples
of
Largemouth
Bass.
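As a rough illustration of how the weights enter the fit, the following is a minimal sketch of weighted least squares for a single predictor. The function name and the four data points are hypothetical and chosen only for illustration; they are not the DASL data, and the sketch is not the JMP or Minitab routine used in this project.

```python
def weighted_simple_regression(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x.

    Each squared residual is multiplied by its weight, so observations
    with larger weights pull the fitted line toward themselves.
    """
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical values for illustration only; these are NOT the DASL data.
log_alkalinity = [0.1, 0.8, 1.4, 1.9]
log_mercury = [0.20, -0.10, -0.45, -0.60]
fish_sampled = [5, 12, 24, 8]  # weights: number of bass sampled per lake

b0, b1 = weighted_simple_regression(log_alkalinity, log_mercury, fish_sampled)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

With equal weights this reduces to ordinary least squares, which is one way to check the effect of weighting on a given data set.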
The
monotonically
decreasing
trend
in
the
data
was
taken
care
of
by
taking
the
base
10
log
of
alkalinity.
I
then
ran
into
some
non-normality
in
the
residuals
that
I
dealt
with
by
taking
the
base
10
log
of
average
mercury.
Unfortunately, there still appears to be some unequal variance in the residuals, shown in Figure 7, which I would say is the weakest point of my model.
Summary of Fit
RSquare                        0.50667
RSquare Adj                    0.496997
Root Mean Square Error         0.847766
Mean of Response               -0.36503
Observations (or Sum Wgts)     692
Linear Fit
log(10)[avg. mercury] - hat = 0.2472067 - 0.4630143*log(10)[alkalinity]
A 10-fold increase in the alkalinity of a Florida lake is associated with a 10^0.463 = 2.904 multiplicative decrease in the predicted median average mercury concentration. In other words, a 10-fold increase in the alkalinity of a Florida lake multiplies the predicted median mercury concentration found in Largemouth Bass living in the lake by 10^-0.463 = 0.344, an estimated 65.6% decrease.
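The back-transformation of a log-log slope can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name tenfold_effect is hypothetical, and the slope is the value reported in the Linear Fit output.

```python
def tenfold_effect(slope):
    """Effect on the median response of a 10-fold increase in the predictor
    when both variables are on a base 10 log scale."""
    factor = 10 ** slope                  # multiplicative change in the median
    percent_change = (factor - 1) * 100   # same change expressed as a percentage
    return factor, percent_change


slope = -0.4630143  # slope from the Linear Fit output
factor, percent_change = tenfold_effect(slope)
print(f"multiplicative factor = {factor:.3f}")
print(f"percent change        = {percent_change:.1f}%")
```

A percentage decrease can never exceed 100%, which is a quick sanity check on any interpretation of a negative log-log slope.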
Because I am using a log-log model, the intercept is more difficult to interpret. We can interpret the intercept as follows: Florida lakes with log(10)[alkalinity] equal to 0 have Largemouth Bass with a predicted log(10)[average mercury] concentration of 0.247. However, in order to get this interpretation in terms of alkalinity and average mercury we must manipulate the regression equation. Our regression equation, which can be seen in the Linear Fit output above, is log(10)[avg. mercury] - hat = 0.247 - 0.463*log(10)[alkalinity]. I must first remove the log(10)[alkalinity] term from the equation, and we can do this by letting alkalinity equal 1 so that log(10)[alkalinity] = log(10)[1] = 0. Now we are left with log(10)[avg. mercury] - hat = 0.247. After raising 10 to the power of both sides we get [avg. mercury] = 10^0.247 = 1.766. This means that Florida lakes with 1 mg/L of alkalinity contain Largemouth Bass with a predicted median average mercury concentration of 1.766 parts/million. We have data for Lake Trafford, which has 1.2 mg/L of alkalinity, so it is not much of an extrapolation to make a prediction for the average mercury concentrations of Largemouth Bass in Florida lakes with 1 mg/L of alkalinity. I would say the intercept is meaningful in this context.
Referring to the Summary of Fit output we can see that R-Square is 0.507. This means that 50.7% of the variability in the base 10 log of mercury concentrations in Largemouth Bass in Florida lakes is explained by this regression model on the base 10 log of alkalinity level (the other 49.3% is unexplained variability).
Table 3: Top 5 Leverages
Lake                 hi
Griffin              0.1736525592
East Tohopekaliga    0.1164689823
Trout                0.102453403
Brick                0.0757299358
Tohopekaliga         0.0757299358
Table
3
displays
the
lakes
with
the
top
5
leverages.
Lakes Griffin, East Tohopekaliga, and Trout all have leverages greater than 5/53 = 0.094. However, lakes East Tohopekaliga and Trout have similar hat values, 0.116 and 0.102 respectively, which are not much greater than those of the other observations. Lake Griffin, on the other hand, has a leverage of 0.174, which separates it from the leverages of the other observations. I would consider Lake Griffin an observation with high leverage; depending on whether it has a large residual, it may greatly influence the model.
Table 4: Top 5 Studentized Residuals
Lake          Stud. Residual
Puzzle        2.6355479835
Farm-13       2.3036408905
Apopka        2.0195020498
Parker        2.0071976662
Deer Point    1.9428364132

Cook's Distance (five largest values)
0.1994561949
0.1529166686
0.1310342084
0.1087293064
0.070541742
Table
4
displays
the
lakes
with
the
5
largest
studentized
residuals.
Lake
Puzzle
has
the
largest
studentized
residual
of
2.635,
but
it
did
not
have
a
high
leverage
so
it
likely
won't
be
the
most
influential
observation.
The
top
5
studentized
residuals
are
all
around
2,
which
is
fairly
high.
This
means
that
there
are
some
lakes
in
our
data
that
the
model
does
not
predict
Largemouth
Bass
mercury
concentrations
for
very
well.
From
Table
3
and
Table
4,
we
can
see
that
most
observations
with
high
leverages
did
not
tend
to
have
very
large
residuals,
and
vice
versa,
so
there
were
not
any
extremely
influential
observations.
Table 5 displays the lakes with the 5 largest Cook's Distances. Although East Tohopekaliga has the largest Cook's Distance of 0.199, it is not too far from the Cook's Distances of the other influential observations, so I don't think it was too much of a problem in the regression analysis.
Where 10^β1 represents the true multiplicative change in median average mercury associated with a 10-fold change in alkalinity.
I
ran
a
two
sided
test
because
I
did
not
have
any
initial
conjectures
of
the
direction
of
the
relationship
between
log(10)[average
mercury]
and
log(10)[alkalinity].
Parameter Estimates
Term                Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
Intercept           0.2472067    0.090525    2.73      0.0087*    0.06547      0.4289433
Log10-alkalinity    -0.463014    0.063976    -7.24     <.0001*    -0.591451    -0.334578
For
the
regression
of
log(10)[average
mercury]
vs.
log(10)[alkalinity]
the
observed
slope
of
-0.463
led
to
a
test
statistic
of
-7.24
(which
follows
a
t-distribution
with
53-2=51
degrees
of
freedom)
and
yielded
a
p-value
of
<0.0001
(from
Parameter
Estimates
output).
If there were truly no association between log(10)[average mercury] and log(10)[alkalinity], we would only expect to see a sample slope as extreme as ours less than 0.01% of the time due to random chance.
At
the
99%
confidence
level,
we
have
extremely
strong
evidence
that
there
exists
a
genuine
relationship
between
the
mercury
concentration
in
Largemouth
Bass,
from
Florida
lakes,
and
the
alkalinity
level
of
the
lake.
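The t ratio and the 95% interval in the Parameter Estimates output are tied together by simple arithmetic, sketched below. The function names are hypothetical, and the critical value is not looked up from a t table; it is recovered from the reported upper limit, which is an assumption made here to keep the sketch free of external libraries.

```python
def t_ratio(estimate, std_error):
    """Test statistic for H0: slope = 0."""
    return estimate / std_error


def confidence_interval(estimate, std_error, critical_value):
    """Two-sided interval: estimate +/- critical_value * std_error."""
    margin = critical_value * std_error
    return estimate - margin, estimate + margin


# Values copied from the Parameter Estimates output for Log10-alkalinity
estimate = -0.463014
std_error = 0.063976
upper_95 = -0.334578

t_stat = t_ratio(estimate, std_error)

# Critical value implied by the reported upper limit (assumption: the
# software used the same multiplier for both limits, df = 53 - 2 = 51)
implied_t_star = (upper_95 - estimate) / std_error

lower, upper = confidence_interval(estimate, std_error, implied_t_star)
print(f"t ratio = {t_stat:.2f}, implied t* = {implied_t_star:.3f}")
print(f"95% interval for the slope: ({lower:.6f}, {upper:.6f})")
```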
We are 95% confident that each 10-fold increase in alkalinity (mg/L) is associated with a multiplicative change in the predicted median mercury concentration in Largemouth Bass by a factor of between 10^-0.591 = 0.256 and 10^-0.335 = 0.462, using the lower and upper 95% limits for the slope from the Parameter Estimates output.
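The interval for the slope is on the log scale, so each endpoint has to be raised as a power of 10 before it can be read as a multiplicative factor. A minimal sketch, using the Lower 95% and Upper 95% values from the Parameter Estimates output (the helper name back_transform is hypothetical):

```python
def back_transform(log10_value):
    """Convert a base 10 log-scale slope limit into a multiplicative factor."""
    return 10 ** log10_value


# 95% limits for the Log10-alkalinity slope from the Parameter Estimates output
lower_log, upper_log = -0.591451, -0.334578

lower_factor = back_transform(lower_log)
upper_factor = back_transform(upper_log)

# Express the same interval as percentage decreases in the median
largest_decrease = (1 - lower_factor) * 100
smallest_decrease = (1 - upper_factor) * 100

print(f"factor between {lower_factor:.3f} and {upper_factor:.3f}")
print(f"decrease between {smallest_decrease:.1f}% and {largest_decrease:.1f}%")
```

Because these are ratios of concentrations, the back-transformed limits are unitless rather than measured in parts/million.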
      Alkalinity   log(10)[Alk]   Sample Size   Lower 95% Mean   Upper 95% Mean   Lower 95% Indv.   Upper 95% Indv.
53    25           1.398          13.05         -0.466           -0.335           -0.876            0.076
I
wanted
to
estimate
the
mean
and
individual
response
prediction
for
a
fictional
lake
in
Florida
with
an
alkalinity
of
25
mg/L.
Due
to
the
fact
that
I
used
weighted
regression
I
had
to
assign
a
weight
to
this
observation
and
I
used
the
average
of
the
weights
from
the
other
observations.
So
I
am
making
predictions
for
a
lake
with
25
mg/L
where
an
imaginary
sample
of
13
Largemouth
Bass
was
taken.
We are 95% confident that all Florida lakes with alkalinity of 25 mg/L will contain Largemouth Bass with an expected average mercury concentration of between 10^-0.466 = 0.342 parts/million and 10^-0.335 = 0.462 parts/million. We are 95% confident that a Florida lake with alkalinity of 25 mg/L will contain Largemouth Bass with an average mercury concentration of between 10^-0.876 = 0.133 parts/million and 10^0.076 = 1.191 parts/million.
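The point prediction behind these intervals can be reproduced from the fitted equation. The sketch below is only a check of the arithmetic: the function name is hypothetical, the coefficients come from the Linear Fit output, and the interval limits are copied from the prediction output rather than recomputed.

```python
import math

INTERCEPT = 0.2472067   # from the Linear Fit output
SLOPE = -0.4630143      # from the Linear Fit output


def predicted_log10_mercury(alkalinity_mg_per_l):
    """Fitted value of log10(average mercury) for a given alkalinity."""
    return INTERCEPT + SLOPE * math.log10(alkalinity_mg_per_l)


fitted_log = predicted_log10_mercury(25)
fitted_ppm = 10 ** fitted_log  # back-transformed point prediction

# Log-scale limits copied from the prediction output, then back-transformed
mean_interval = (10 ** -0.466, 10 ** -0.335)
indiv_interval = (10 ** -0.876, 10 ** 0.076)

print(f"fitted log10(mercury) = {fitted_log:.3f}")
print(f"back-transformed prediction = {fitted_ppm:.3f} parts/million")
print(f"mean interval       = ({mean_interval[0]:.3f}, {mean_interval[1]:.3f})")
print(f"individual interval = ({indiv_interval[0]:.3f}, {indiv_interval[1]:.3f})")
```

The fitted log value falls at the midpoint of the log-scale mean interval, but the back-transformed prediction is not the midpoint of the back-transformed interval, since powers of 10 stretch the upper side.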
PART
1:
Conclusion
There
is
statistically
significant
evidence
that
there
exists
a
genuine
relationship
between
alkalinity
and
mercury
concentrations
in
Largemouth
Bass,
for
Florida
Lakes.
As
mentioned
earlier
we
may
not
be
able
to
generalize
our
findings
to
all
Florida
lakes
depending
on
how
the
lakes
were
selected.
Regardless
of
whether
we
are
able
to
generalize,
we
cannot
draw
a
cause-and-effect
conclusion
because
a
randomized
experiment
was
not
conducted.
There
may
have
been
other
confounding
variables
influencing
this
relationship.
There
is
unequal
variance
in
the
residuals
that
could
be
affecting
our
regression
equation,
parameter
estimates,
and
confidence
intervals.
None
of
the
lakes
observed
were
extremely
unusual,
relative
to
the
other
lakes
in
the
sample.
The
3
observations
circled
in
Figure
5
(Lake
Parker,
Lake
Apopka,
and
Lake
Farm-13
from
left
to
right)
all
have
about
the
same
alkalinity
level
and
similarly
large
negative
residuals.
This
may
warrant
some
further
investigation;
perhaps
they
have
some
similarities
that
I
could
include
in
the
analysis
to
improve
the
model.
The
only
other
question
I
would
ask
about
this
data
is
whether
this
trend
can
truly
be
generalized
to
all
Florida
Lakes.
If
I
had
more
data
from
lakes
of
different
states
I
may
even
be
able
to
create
a
model
that
could
be
generalized
to
lakes
across
the
nation.
PART
2:
Introduction
In
Part
I
of
this
project
I
focused
on
the
relationship
between
the
alkalinity
and
the
average
mercury
concentration
in
the
sample
of
Largemouth
Bass.
Alkalinity
is
the
name
given
to
the
quantitative
capacity
of
an
aqueous
solution
to
neutralize
an
acid.
Measuring
alkalinity
is
important
in
determining
a
stream's
ability
to
neutralize
acidic
pollution
from
rainfall
or
wastewater
(http://en.wikipedia.org/wiki/Alkalinity).
I
was
able
to
conclude
that
there
is
statistically
significant
evidence
that
there
exists
a
genuine
relationship
between
alkalinity
and
mercury
concentrations
in
Largemouth
Bass,
for
Florida
Lakes.
Relevant
computer
output
supporting
my
conclusion
is
below.
Figure
1:
Regression
Plot
(log(10)
Average
Mercury
by
log(10)
Alkalinity)
Weight:
No.samples
Summary of Model
S = 0.847766      R-Sq = 50.67%      R-Sq(adj) = 49.70%
PRESS = 39.8274   R-Sq(pred) = 46.4%

Regression Equation: log(merc)

Coefficients:
Term        Coef         SE Coef      T
Constant    0.247207     0.0905251    2.73081
loc(alk)    -0.463014    0.0639758    -7.23734

Analysis of Variance:
Source         DF   Seq SS     Adj SS     Adj MS     F         P
Regression      1   37.6452    37.6452    37.6452    52.3791   0.000000
  loc(alk)      1   37.6452    37.6452    37.6452    52.3791   0.000000
Error          51   36.6540    36.6540    0.7187
  Lack-of-Fit  49   35.0573    35.0573    0.7155     0.8962    0.664189
  Pure Error    2   1.5967     1.5967     0.7984
Total          52   74.2992
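Several of the summary values above follow directly from the Analysis of Variance sums of squares. The following sketch recomputes R-Sq, S, and the lack-of-fit F ratio from the printed table; the variable names are my own, and only the numbers come from the output.

```python
import math

# Sums of squares and degrees of freedom from the Analysis of Variance output
ss_regression, ss_error, ss_total = 37.6452, 36.6540, 74.2992
df_error = 51
ss_lack_of_fit, df_lack_of_fit = 35.0573, 49
ss_pure_error, df_pure_error = 1.5967, 2

# R-Sq: share of the total (weighted) variability explained by the model
r_squared = ss_regression / ss_total

# S: square root of the mean square error
s = math.sqrt(ss_error / df_error)

# Lack-of-fit F: lack-of-fit mean square over pure-error mean square
lack_of_fit_f = (ss_lack_of_fit / df_lack_of_fit) / (ss_pure_error / df_pure_error)

print(f"R-Sq = {r_squared:.4f}")
print(f"S = {s:.6f}")
print(f"lack-of-fit F = {lack_of_fit_f:.4f}")
```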
In
this
part
of
my
project
I
hope
to
build
a
better
model
to
predict
more
of
the
variability
in
mercury
concentrations
of
Largemouth
Bass,
in
Florida
Lakes,
by
using
more
explanatory
variables.
Correlations: Avg_Mercury, Alkalinity, Calcium, Chlorophyll
(cell contents: correlation, with p-value beneath)

              Avg_Mercury   Alkalinity   Calcium
Alkalinity    -0.594
              0.000
Calcium       -0.401        0.833
              0.003         0.000
Chlorophyll   -0.491        0.478        0.410
              0.000         0.000        0.002
Alkalinity,
Calcium,
and
Chlorophyll
(the
quantitative
explanatory
variables)
all
seem
to
have
a
strong
negative
concave
up
association
with
the
average
mercury
contents
of
the
samples
of
Largemouth
Bass
(the
response
variable).
Unfortunately
the
Alkalinity,
Calcium,
and
Chlorophyll
levels
are
also
associated
with
each
other
so
they
might
be
explaining
a
lot
of
the
same
variability
in
average
mercury
content.
Aside
from
the
issue
of
multicollinearity,
the
relationships
between
alkalinity,
calcium,
and
chlorophyll
with
average
mercury
are
not
linear.
This
suggests
that
transformations
will
need
to
be
made
on
the
variables
in
our
model.
I
had
no
prior
conjecture
about
how
the
associations
between
these
variables
were
going
to
behave.
From
the
matrix
scatter
plot
we
can
identify
some
seemingly
unusual
observations
(annotated
with
arrows).
In
the
plot
of
average
mercury
by
alkalinity
there
appears
to
be
two
unusual
observations;
from
left
to
right
they
are
Lake
Harney
and
Lake
Puzzle
respectively.
Both
these
Florida
lakes
appear
to
have
higher
mercury
concentrations
than
other
lakes
with
similar
levels
of
alkalinity.
In
the
plot
of
average
mercury
by
calcium
there
appears
to
be
three
unusual
observations;
from
left
to
right
they
are
Lake
Talquin,
Lake
Harney
and
Lake
Puzzle
respectively.
These
three
Florida
lakes
appear
to
have
higher
mercury
concentrations
than
other
lakes
with
similar
levels
of
calcium.
There
does
not
appear
to
be
any
unusual
observations
in
plot
of
average
mercury
by
chlorophyll.
My
original
data
set
did
not
have
a
categorical
variable,
but
I
did
have
a
quantitative
pH
variable.
The
pH
measures
how
acidic
or
basic
a
substance
is,
where
a
pH
between
1
and
7
is
acidic,
7
is
neutral,
and
between
7
and
14
is
basic.
I
created
a
categorical
variable
from
this
variable
by
using
1,0
indicator
parameterization
for
whether
a
lake
was
acidic
(1
for
acidic,
0
otherwise).
From
these
scatterplots,
because
the
data
is
not
linearly
associated,
it
is
hard
to
see
if
whether
a
lake
is
acidic
has
an
effect
on
the
average
mercury
concentration;
or
how
acidity
affects
the
magnitude
of
the
change
in
average
mercury
resulting
from
constant
increases
in
the
alkalinity,
calcium,
and
chlorophyll
levels
of
a
lake.
These are 3 scatterplots produced by graphing the square root of average mercury against log base 10 alkalinity, log base 10 calcium, and log base 10 chlorophyll.
This
corrects
the
nonlinearity
present
in
the
3
previous
scatterplots
so
we
can
better
see
the
effect
of
acidic
lakes
on
average
mercury
concentration.
If
an
interaction
was
significant
it
would
mean
that
the
effect
of
the
explanatory
variable
in
question
(alkalinity,
calcium,
or
chlorophyll)
on
average
mercury
concentration
is
different
for
acidic
lakes,
opposed
to
basic
or
neutral
lakes.
In
the
scatterplots
of
sqrt(mercury)
by
log(alkalinity),
and
sqrt(mercury)
by
log(calcium)
it
does
not
seem
like
there
is
evidence
of
a
difference
in
the
effect
of
alkalinity
on
average
mercury
depending
on
whether
the
lake
is
acidic
or
not.
The
lines
fitting
each
group
in
both
the
graphs
seem
about
parallel,
any
deviation
from
perfect
parallel
lines
is
likely
due
to
random
chance.
The
interaction
between
log(chlorophyll)
and
the
acidic
indicator
seems
to
be
the
most
significant,
as
the
two
regression
lines
in
the
scatterplot
of
sqrt(mercury)
by
log(chlorophyll)
are
not
that
close
to
parallel.
However
I
do
not
think
the
interaction
will
be
significant
because
the
data
in
the
scatterplot
still
seems
to
follow
the
same
overall
trend.
It is also worth noting that from these scatterplots it appears that basic or neutral lakes (7-14 on the pH scale) seem to have higher levels of alkalinity, calcium, and chlorophyll than acidic lakes.
This
can
be
seen
by
observing
that
the
black
dots,
indicating
a
non-acidic
lake,
are
more
often
than
not
in
the
right
half
of
the
x-values
for
all
these
scatterplots.
Regression Equation
Avg_Mercury = 0.606622 - 0.00501981 Alkalinity + 0.00283653 Calcium - 0.00147748 Chlorophyll + 0.187879 acidic - 0.00216577 chlorophyll*acidic

Coefficients
Term                  Coef         SE Coef     T          P       VIF
Constant              0.606622     0.111369    5.44696    0.000
Alkalinity            -0.005020    0.001947    -2.57877   0.013   4.54862
Calcium               0.002837     0.002699    1.05096    0.299   3.26451
Chlorophyll           -0.001477    0.001793    -0.82403   0.414   1.84176
acidic                0.187879     0.123221    1.52474    0.134   2.93183
chlorophyll*acidic    -0.002166    0.004895    -0.44247   0.660   1.80388

Summary of Model
S = 0.913213      R-Sq = 47.98%      R-Sq(adj) = 42.45%
PRESS = 52.4426   R-Sq(pred) = 30.41%
Above
are
standardized
residual
plots
of
a
weighted
regression
of
average
mercury,
by
alkalinity,
calcium,
chlorophyll,
acidic,
and
the
interaction
between
acidic
and
chlorophyll.
Weights
are
the
number
of
fish
sampled
from
the
Florida
lakes
used
to
Regression Equation
sqrt(mercury) = 1.14335 - 0.31862 log(alkalinity) + 0.0945459 log(calcium) - 0.142475 log(chlorophyll) - 0.0897938 acidic + 0.163463 loc(chlorophyll)*acidic

Coefficients
Term                       Coef        SE Coef     T          P       VIF
Constant                   1.14335     0.146689    7.79437    0.000
log(alkalinity)            -0.31862    0.095922    -3.32166   0.002   4.88083
log(calcium)               0.09455     0.085773    1.10228    0.276   3.97675
log(chlorophyll)           -0.14248    0.082786    -1.72101   0.092   3.97249
acidic                     -0.08979    0.139334    -0.64445   0.522   9.44431
loc(chlorophyll)*acidic    0.16346     0.101493    1.61059    0.114   6.70924

Summary of Model
S = 0.575348      R-Sq = 58.32%      R-Sq(adj) = 53.88%
PRESS = 20.6954   R-Sq(pred) = 44.55%
After
performing
the
transformations
the
graphs
of
the
standardized
residuals
look
much
better.
The
linearity
assumption
is
now
met
and
can
be
verified
from
the
scatterplots
of
the
square
root
of
mercury
vs
the
log
base
10
alkalinity,
log
base
10
calcium,
and
log
base
10
chlorophyll
in
the
previous
section
Descriptive
Statistics.
The
normal
probability
plot
shows
that
the
standardized
residuals
are
more
normal
as
there
is
no
pattern
of
curvature
about
the
normal
line.
There
is
no
longer
fanning
in
the
standardized
residual
versus
fits
plot.
In
fact
the
standardized
residuals
are
randomly
scattered
above
and
below
the
horizontal
zero
line
and
the
majority
of
the
standardized
residuals
are
between
-2
and
2,
so
we
can
assume
equal
variance.
There
is
no
need
to
refer
to
the
versus
order
plot
as
the
observations
were
not
sampled
in
any
order
that
could
violate
the
independence
condition.
This
is
a
suitable
transformation
as
it
fixes
the
assumptions
that
were
violated
in
the
previous,
untransformed,
model.
All the explanatory variables in my model had VIFs around or above 4, so there is evidence of some multicollinearity. To correct for this I could try centering the variables about their means. If that does not fix the issue I could run a ridge regression.
After
consulting
with
Dr.
Chance,
I
now
understand
that
the
multicollinearity
is
a
sign
of
the
high
linear
correlation
between
alkalinity
and
calcium,
and
that
centering
would
not
help
much.
However,
I
had
already
done
a
significant
part
of
the
project
with
the
centered
variables
and
did
not
have
adequate
time
to
redo
the
analysis.
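For reference, a variance inflation factor is defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below applies that definition to the simplest case: a hypothetical model containing only two predictors, where R_j^2 is the squared pairwise correlation. The 0.833 is the correlation between untransformed alkalinity and calcium from the correlation output; the function name is my own, and the result is not one of the VIFs reported by Minitab for the larger models.

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor given R^2 of a predictor regressed on the others."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Pairwise correlation between Alkalinity and Calcium from the correlation output
r_alk_calcium = 0.833

# Assumption: a two-predictor model, so R_j^2 equals the squared correlation
vif_two_predictors = vif_from_r_squared(r_alk_calcium ** 2)
print(f"VIF with only alkalinity and calcium in the model: {vif_two_predictors:.2f}")
```

Since the formula depends only on the correlations among predictors, shifting a predictor by a constant leaves its VIF unchanged unless the model contains product terms built from it, which is consistent with the advice that centering would not help much with the alkalinity and calcium overlap.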
To
see
if
I
can
drop
any
variables
from
my
model
to
simplify
it,
without
losing
accuracy,
I
ran
a
best
subsets
in
Minitab.
I
forced
the
model
to
include
log
base
10
of
alkalinity
and
the
acidic
indicator.
Response is sqrt(mercury)
The following variables are included in all models: log(alkalinity) acidic
Candidate predictors: log(calcium), log(chlorophyll), loc(chlorophyll)*acidic

Vars   R-Sq   R-Sq(adj)   Mallows Cp   S         Predictors included
1      51.0   48.0        8.7          0.17659   X
1      50.7   47.7        9.0          0.17709   X
2      55.6   51.9        5.6          0.16973   X X
2      52.7   48.7        8.8          0.17533   X X
3      57.1   52.5        6.0          0.16873   X X X
From
the
output
above
we
can
see
that
the
smallest
Mallows
Cp
is
5.6
which
is
from
the
model
that
includes
log
base
10
chlorophyll,
and
the
interaction
between
log
base
10
chlorophyll
and
the
acidic
indicator
in
addition
to
log
base
10
alkalinity
and
acidic.
Although
the
model
with
3
additional
variables
does
have
the
largest
R-
square,
and
R-square
adjusted,
the
values
are
not
much
larger
so
the
model
is
therefore
not
worth
the
extra
complexity,
and
loss
in
degrees
of
freedom.
In
light
of
this,
I
will
remove
log
base
10
calcium
from
my
model.
Coefficients
Term                      Coef         SE Coef     T         P       VIF
Constant                  0.675107     0.051917    13.0035   0.000
centered log(alk)         -0.235820    0.059786    -3.9444   0.000   1.88764
centered log(chlor)       -0.144974    0.082940    -1.7479   0.087   3.96951
acidic                    0.084728     0.064288    1.3179    0.194   2.00161
cent log(chlor)*acidic    0.174205     0.101250    1.7205    0.092   2.97181

Summary of Model
S = 0.576635      R-Sq = 57.24%      R-Sq(adj) = 53.67%
PRESS = 19.7925   R-Sq(pred) = 46.97%
When log(alkalinity) and log(chlorophyll) are at their means we predict that the average mercury concentration in the muscle tissues of fish will be 0.675^2 = 0.456 parts per million for non-acidic lakes. Holding the other variables constant, a 10-fold increase in alkalinity (a one-unit increase in centered log base 10 alkalinity) is associated with a predicted change of -0.236 in sqrt(mercury). For non-acidic lakes, a 10-fold increase in chlorophyll (a one-unit increase in centered log base 10 chlorophyll) is associated with a predicted change of -0.145 in sqrt(mercury). At the mean level of log(chlorophyll), and for any level of alkalinity, acidic lakes are associated with a predicted sqrt(mercury) that is 0.085 higher than for non-acidic lakes. Each unit increase in log(chlorophyll) for acidic lakes is associated with a change in sqrt(mercury) that is 0.174 larger than for non-acidic lakes.
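The coefficient interpretations above can be checked by evaluating the fitted equation directly. This is a minimal sketch: the function name and argument names are hypothetical, and the coefficients are the ones printed in the Coefficients output for the centered model.

```python
def predicted_sqrt_mercury(centered_log_alk, centered_log_chlor, acidic):
    """Fitted sqrt(mercury) from the centered model.

    acidic is the 1/0 indicator (1 for acidic lakes, 0 otherwise).
    """
    return (
        0.675107
        - 0.235820 * centered_log_alk
        - 0.144974 * centered_log_chlor
        + 0.084728 * acidic
        + 0.174205 * centered_log_chlor * acidic
    )


# Non-acidic lake with both log predictors at their means
baseline_sqrt = predicted_sqrt_mercury(0.0, 0.0, 0)
baseline_ppm = baseline_sqrt ** 2  # undo the square root transformation

# Acidic lake with centered log(alkalinity) = 0.5 and centered log(chlorophyll) = 0.5
example_sqrt = predicted_sqrt_mercury(0.5, 0.5, 1)

print(f"baseline: sqrt = {baseline_sqrt:.3f}, mercury = {baseline_ppm:.3f} ppm")
print(f"acidic example: sqrt = {example_sqrt:.3f}")
```

Squaring is needed to return to parts per million because every coefficient in this model is on the square root scale.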
57.24
%
of
the
variation
in
the
average
mercury
concentrations
of
largemouth
bass
in
Florida
lakes
can
be
explained
by
using
the
model
with
centered
log
base
10
alkalinity,
centered
log
base
10
chlorophyll,
the
acidic
indicator,
and
the
interaction
with
this
indicator
and
the
centered
log
base
10
chlorophyll.
I observed s = 0.577, which means that the observed square root mercury concentrations typically deviate from their predicted values by about 0.577 in the weighted analysis.
Descriptive Statistics: sqrt(mercury)
Variable         Mean     Minimum   Median   Maximum
sqrt(mercury)    0.6844   0.2000    0.6928   1.1533
(output
in
the
beginning
of
this
report).
The R-Square adjusted for the multiple regression model is 53.67%, an increase of about 4 percentage points over the 49.70% from the simple regression model. After adjusting for the penalties associated with adding more predictors to the model, there isn't much of an improvement.
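The penalty mentioned above is built into the adjusted R-Square formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of lakes and p the number of predictors. A short sketch (function name hypothetical) that reproduces both reported values from the unadjusted R-Sq figures:

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2: shrinks R^2 according to the number of predictors used."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty


n_lakes = 53

# Part 1: one predictor, R-Sq = 50.67%
adj_simple = adjusted_r_squared(0.5067, n_lakes, 1)

# Part 2: four predictors, R-Sq = 57.24%
adj_multiple = adjusted_r_squared(0.5724, n_lakes, 4)

print(f"simple model:   R-Sq(adj) = {adj_simple:.4f}")
print(f"multiple model: R-Sq(adj) = {adj_multiple:.4f}")
print(f"difference = {(adj_multiple - adj_simple) * 100:.1f} percentage points")
```

Note that the two models use different transformations of the response, so the comparison of these two values is only a rough guide.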
To
test
if
the
additional
variables
significantly
improve
upon
the
simple
linear
regression
model
we
must
conduct
a
partial
F-Test.
Weighted analysis using weights in No.samples
Analysis of Variance
Source                      DF   Seq SS     Adj SS     Adj MS     F         P
Regression                   4   21.3636    21.3636    5.34089    16.0624   0.000000
  centered log(alk)          1   18.7219    5.1732     5.17324    15.5582   0.000260
  centered log(chlor)        1   0.4548     1.0159     1.01591    3.0553    0.086869
  acidic                     1   1.2026     0.5776     0.57755    1.7370    0.193780
  cent log(chlor)*acidic     1   0.9843     0.9843     0.98432    2.9603    0.091775
Error                       48   15.9604    15.9604    0.33251
Total                       52   37.3240
Ho:
B(centered
log(chlor))
=
B(acidic)
=
B(cent
log(chlor)*acidic)
=
0
(the
additional
variables
do
not
significantly
improve
the
model
from
Part
1)
Ha:
at
least
one
of
the
population
slope
coefficients
in
the
null
hypothesis
have
a
non-zero
slope
(the
additional
variables
significantly
improve
the
model
from
Part
1)
-Where
B(centered
log(chlor))
represents
the
true
change
in
the
average
mercury
content
with
every
one
unit
increase
in
(centered
log(chlor)).
-Where
B(acidic)
represents
the
true
difference
in
average
mercury
content
between
acidic
and
non-acidic
lakes
with
the
same
values
of
the
other
explanatory
variables
-Where
B(cent
log(chlor)*acidic)
represents
the
true
difference
of
the
effect
of
chlorophyll
level
on
the
average
mercury
content
for
acidic
versus
non-acid
lakes
F = [(0.4548 + 1.2026 + 0.9843) / 3] / 0.33251 = 2.648, df = (3, 48)
Cumulative Distribution Function
P( X <= x )
0.940513
The
output
above
gives
us
the
probability
less
than
F
=
2.648.
Therefore
the
p-value,
the
probability
greater
than
or
equal
to
F
=
2.648,
equals
1
-
0.941
=
0.059.
This
is
slightly
above
the
0.05
alpha
level
so
we
do
not
have
statistically
significant
evidence
at
the
95%
confidence
level
that
the
additional
variables
significantly
improve
upon
the
model
with
just
log(alkalinity).
In
other
words
we
do
not
have
evidence
at
the
5%
significance
level
that
the
full
model
is
significantly
better
than
the
reduced
model.
However,
in
order
to
draw
a
statistically
significant
conclusion
I
will
use
the
0.1
alpha
level.
At
the
90%
confidence
level
there
is
significant
evidence
that
the
additional
variables
improve
upon
the
model
with
just
log(alkalinity).
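The partial F statistic used here is the average of the sequential sums of squares for the added terms divided by the mean square error of the full model. The sketch below reproduces it from the Analysis of Variance output; the function name is hypothetical, and it assumes the added terms were entered after centered log(alk), as they are in the output, so that their Seq SS values measure what they add to the Part 1 model.

```python
def partial_f(extra_sums_of_squares, mse_full):
    """Partial F statistic for a group of terms added last to a model.

    extra_sums_of_squares: sequential SS of each added term
    mse_full: mean square error (Adj MS for Error) of the full model
    """
    numerator_df = len(extra_sums_of_squares)
    numerator = sum(extra_sums_of_squares) / numerator_df
    return numerator / mse_full


# Seq SS for centered log(chlor), acidic, and cent log(chlor)*acidic
seq_ss_added = [0.4548, 1.2026, 0.9843]
mse_full = 0.33251  # Error Adj MS, with 48 degrees of freedom

f_stat = partial_f(seq_ss_added, mse_full)
print(f"partial F = {f_stat:.3f} on ({len(seq_ss_added)}, 48) degrees of freedom")
```

The p-value is then the upper tail area of the F distribution with those degrees of freedom, which the report obtains from the Cumulative Distribution Function output.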
Coefficients
Term                      Coef         SE Coef     T         P       VIF
Constant                  0.675107     0.051917    13.0035   0.000
centered log(alk)         -0.235820    0.059786    -3.9444   0.000   1.88764
centered log(chlor)       -0.144974    0.082940    -1.7479   0.087   3.96951
acidic                    0.084728     0.064288    1.3179    0.194   2.00161
cent log(chlor)*acidic    0.174205     0.101250    1.7205    0.092   2.97181
Ho: B(cent log(chlor)*acidic) = 0
Ha: B(cent log(chlor)*acidic) ≠ 0
Where
B(cent
log(chlor)*acidic)
represents
the
true
difference
of
the
effect
of
chlorophyll
level
on
the
average
mercury
content
for
acidic
versus
non-acid
lakes.
Our observed slope of 0.174 led to a t-statistic of 1.721, which follows a t-distribution with n - p - 1 = 53 - 4 - 1 = 48 df. This yielded a p-value of 0.092, which is not significant at the alpha equals 0.05 level.
From
the
graphs
in
the
Descriptive
Statistics
section
I
noticed
that
the
interaction
between
log(chlorophyll)
and
the
acidic
indicator
seemed
to
be
the
strongest
but
I
did
not
think
it
would
be
significant,
and
it
is
not
at
the
alpha
equals
0.05
level.
To test if the categorical variable significantly improved the model we must conduct a partial F-Test. (It wouldn't make sense to just test if B(acidic) = 0 because the acidic indicator would still be in the interaction term.)
Ho:
B(acidic)
=
B(cent
log(chlor)*acidic)
=
0
(whether
or
not
a
lake
is
acidic
did
not
significantly
improve
the
model)
Ha:
at
least
one
of
the
population
slope
coefficients
in
the
null
hypothesis
have
a
non-zero
slope
(whether
or
not
a
lake
is
acidic
significantly
improve
the
model)
-Where
B(centered
log(chlor))
represents
the
true
change
in
the
average
mercury
content
with
every
one
unit
increase
in
(centered
log(chlor)).
F = [(1.2026 + 0.9843) / 2] / 0.33251 = 3.288, df = (2, 48)
Cumulative Distribution Function
F distribution with 2 DF in numerator and 48 DF in denominator
x
3.288
P( X <= x )
0.954107
The
output
above
gives
us
the
probability
less
than
F
=
3.288.
Therefore
the
p-value,
the
probability
greater
than
or
equal
to
F
=
3.288,
equals
1
-
0.954
=
0.046.
With
a
p-value
less
than
0.05
we
can
reject
the
null
hypothesis.
Whether
or
not
a
lake
is
acidic
is
information
that
is
significantly
improving
the
model.
In other words, the effects of alkalinity and chlorophyll levels on the response are affected by whether or not the lake is acidic.
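The partial F-test above can be reproduced by hand from the sequential sums of squares in the ANOVA table. The sketch below is only an illustration: the function names are my own, and the closed-form tail probability it uses holds only when the numerator has exactly 2 degrees of freedom, which happens to be the case for this test.

```python
def partial_f(extra_ss, num_df, mse_full):
    """Partial F statistic: mean extra sum of squares over the full-model MSE."""
    return (extra_ss / num_df) / mse_full


def f_upper_tail_2df(f_value, den_df):
    """P(F >= f_value) for an F distribution with 2 numerator df.

    With 2 numerator df the upper tail has the closed form
    (1 + 2 * f / den_df) ** (-den_df / 2). Other numerator df would
    need an incomplete beta function instead.
    """
    return (1.0 + 2.0 * f_value / den_df) ** (-den_df / 2.0)


# Sequential SS for acidic and cent log(chlor)*acidic (the last two terms entered)
extra_ss = 1.2026 + 0.9843
f_stat = partial_f(extra_ss, num_df=2, mse_full=0.33251)
p_value = f_upper_tail_2df(f_stat, den_df=48)

print(f"F = {f_stat:.3f}, p-value = {p_value:.3f}")
```

Using the sequential (rather than adjusted) sums of squares works here because the two terms being tested were the last ones entered into the model.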
The
true
square
root
average
mercury
concentration
for
all
acidic
lakes
with
centered
log(alkalinity)
=
0.5,
centered
log(chlorophyll)
=
0.5,
where
6
fish
were
sampled,
is
between
0.538
parts
per
million
and
0.776
parts
per
million.
We
predict
that
the
square
root
average
mercury
concentration
for
an
acidic
lake
with
centered
log(alkalinity)
=
0.5,
centered
log(chlorophyll)
=
0.5,
where
6
fish
were
sampled,
is
between
0.168
parts
per
million
and
1.145
parts
per
million.
There
was
no
reason
in
particular
why
I
chose
to
create
intervals
for
an
acidic
lake
where
6
fish
were
sampled
and
centered
log(alkalinity)
=
0.5
and
centered
log(chlorophyll)
=
0.5.
I
decided
by
looking
at
the
distribution
of
these
variables
and
picked
arbitrary
values
within
the
range
of
the
values
that
I
observed.
sqrt(mercury)   Fit        SE Fit      Residual   St Resid
1.07703         0.858042   0.0306985   0.218991   2.65754   R
0.43589         0.378840   0.0533835   0.057050   0.77187   X
1.04881         0.487801   0.0401344   0.561008   3.15392   R
Observations
14
and
40,
Lake
East
Tohopekaliga
and
Lake
Puzzle
respectively,
have
large
standardized
residuals.
I
stored
the
deleted
t
residuals
into
my
worksheet
so
I
could
run
tests
of
significance
on
the
outliers.
Lake
East
Tohopekaliga
has
a
deleted
t
residual
of
2.848
which
follows
a
t-distribution
with
n - p - 2 = 53 - 4 - 2 = 47
degrees
of
freedom.
We
can
test
the
null
hypothesis
that
Lake
East
Tohopekaliga
is
not
an
extreme
outlier
against
the
alternative
hypothesis
that
it
is.
Cumulative Distribution Function
Student's t distribution with 47 DF
x
2.848
P( X <= x )
0.996747
Our deleted t-residual of 2.848 leads to a two-sided p-value of (1 - 0.9967)*2 = 0.0033*2 = 0.0066, but we should adjust this p-value because we chose this observation for testing precisely because its residual was among the most extreme. Multiplying the p-value by the sample size (a Bonferroni adjustment) corrects for this. Therefore our adjusted p-value = 0.0066*53 = 0.35. We do not have statistically significant evidence that Lake East Tohopekaliga is an extreme outlier.
Lake
Puzzle
has
a
deleted
t
residual
of
3.505
which
follows
a
t-distribution
with
n - p - 2 = 53 - 4 - 2 = 47
degrees
of
freedom.
We
can
test
the
null
hypothesis
that
Lake
Puzzle
is
not
an
extreme
outlier
against
the
alternative
hypothesis
that
it
is.
Cumulative Distribution Function
Student's t distribution with 47 DF
x
3.505
P( X <= x )
0.999493
Our deleted t-residual of 3.505 leads to a two-sided p-value of (1 - 0.9995)*2 = 0.0005*2 = 0.001, but again we should adjust this p-value because we selected the observation for its extreme residual. Multiplying the p-value by the sample size (a Bonferroni adjustment) corrects for this. Therefore our adjusted p-value = 0.001*53 = 0.053. This is not statistically significant at the alpha equals 0.05 level, although it would be at the alpha equals 0.10 level, so there is only weak evidence that Lake Puzzle is an extreme outlier.
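The two outlier tests follow the same arithmetic, so they can be checked with one small helper. This is a sketch under stated assumptions: the function name is hypothetical, and the cumulative probabilities are simply copied from the Minitab output above rather than computed from the t distribution.

```python
def bonferroni_outlier_p(cdf_at_abs_t, n_obs):
    """Return (two-sided p-value, Bonferroni-adjusted p-value) for a deleted t residual.

    cdf_at_abs_t is P(T <= |t|) for the reference t distribution.
    The adjustment multiplies by the number of observations and caps at 1.
    """
    two_sided = 2.0 * (1.0 - cdf_at_abs_t)
    return two_sided, min(1.0, n_obs * two_sided)


# CDF values copied from the Minitab output (t distribution with 47 df)
raw_et, adj_et = bonferroni_outlier_p(0.996747, n_obs=53)  # Lake East Tohopekaliga
raw_pz, adj_pz = bonferroni_outlier_p(0.999493, n_obs=53)  # Lake Puzzle

print(f"East Tohopekaliga: raw {raw_et:.4f}, adjusted {adj_et:.3f}")
print(f"Puzzle:            raw {raw_pz:.4f}, adjusted {adj_pz:.3f}")
```

The adjusted values differ slightly from the hand calculations in the text only because the text rounds the intermediate p-values before multiplying by 53.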
Because
the
conclusions
were
not
significant
at
the
alpha
equals
0.05
level
I
would
not
remove
either
of
these
lakes
from
the
data
set.
Referring to the dotplot of the Cook's distances above, there is no evidence of highly influential observations. No Cook's distances are even close to 0.5.
Based
on
what
I
have
learned
in
this
report
I
would
reduce
my
model
to
the
simple
linear
regression
of
the
square
root
of
average
mercury
on
the
log
base
10
alkalinity.
There
wasn't
much
of
a
difference
in
the
R-Square
adjusted
values
between
the
simple
linear
regression
and
the
multiple
regression.
This
indicates
that
the
extra
variables
are
not
explaining
enough
unexplained
variability
in
average
mercury
concentration
to
be
worth
the
extra
complexity,
and
loss
in
degrees
of
freedom.
To
further
support
this
decision
the
partial
F-Test
I
ran
in
the
previous
section
to
test
if
the
variables
I
added
to
the
model
were
significant,
yielded
a
p-value
of
0.059
which
would
not
allow
us
to
reject
the
null
hypothesis
with
95%
confidence.
In
other
words
we
do
not
have
evidence
that
the
full
model
is
significantly
better
than
the
reduced
model
at
the
5%
significance
level.
I believe the simple linear regression model is okay; it has an R-Square of about 50%, which is decent.
PART
2:
Conclusion
General Regression Analysis: sqrt(mercury) versus log(alkalinity)
Weighted analysis using weights in No.samples
Regression Equation
sqrt(mercury) = 1.13217 - 0.326523 log(alkalinity)
Coefficients
Term              Coef       SE Coef     T         P       VIF
Constant          1.13217    0.0644895   17.5558   0.000
log(alkalinity)   -0.32652   0.0455759   -7.1644   0.000   1
Summary of Model
S = 0.603943   R-Sq = 50.16%   R-Sq(adj) = 49.18%
PRESS = 20.3895   R-Sq(pred) = 45.37%
After
adding
the
additional
variables
to
our
model
the
accuracy
did
not
change
much,
which
is
why
I
chose
to
use
the
simple
linear
regression
model
of
the
square
root
of
average
mercury
by
log
base
10
alkalinity.
The
reduced
model
is
valid
and
certainly
significant
at
the
5%
significance
level,
with
a
p-value
of
about
0.
Using
the
model
is
significantly
better
than
using
the
mean
average
mercury
concentration
to
predict
the
mercury
concentrations
of
Largemouth
Bass
in
other
lakes.
Our
model
tells
us
that
Florida
lakes
with
higher
alkalinity
levels
tend
to
have
Largemouth
Bass
with
lower
concentrations
of
mercury
in
their
muscle
tissues.
Overall
I
think
the
final
model
is
doing
a
good
job,
but
it
can
be
much
improved.
There
is
still
about
50%
of
the
variability
in
the
average
mercury
concentration
that
is
left
unexplained.
I
tried
to
bring
in
variables
in
an
attempt
to
explain
this
variability
but
there
were
problems
such
as
multicollinearity.
If
I
were
to
analyze
this
data
again
in
the
future
I
would
try
to
bring
in
additional
explanatory
variables
that
were
orthogonal
to
alkalinity,
so
they
would
explain
a
completely
different
part
of
the
variability
in
the
average
mercury
concentration.
PART
3:
Introduction
In
Part
I
of
this
project
I
focused
on
the
relationship
between
the
alkalinity
and
the
average
mercury
concentration
in
the
sample
of
Largemouth
Bass.
Alkalinity
is
the
name
given
to
the
quantitative
capacity
of
an
aqueous
solution
to
neutralize
an
acid.
Measuring
alkalinity
is
important
in
determining
a
stream's
ability
to
neutralize
acidic
pollution
from
rainfall
or
wastewater
(http://en.wikipedia.org/wiki/Alkalinity).
I
was
able
to
conclude
that
there
is
statistically
significant
evidence
that
there
exists
a
genuine
relationship
between
alkalinity
and
mercury
concentrations
in
Largemouth
Bass,
for
Florida
Lakes.
In
Part
II
of
this
project
I
added
additional
predictor
variables
to
the
model
such
as
calcium,
chlorophyll,
and
whether
the
lake
was
acidic.
Unfortunately
there
was
much
correlation
between
alkalinity
and
both
calcium
and
chlorophyll.
After
adjusting
for
alkalinity,
calcium
and
chlorophyll
were
not
explaining
much
more
variability
in
the
average
mercury
concentration.
My
binary
predictor
variable,
whether
or
not
a
lake
was
acidic,
was
not
significant
nor
was
its
interaction
with
chlorophyll.
The
more
complex
model
I
built
in
Part
II
was
not
significantly
better
than
the
simple
linear
regression
model
with
alkalinity.
Thus
I
reduced
my
model
to
the
square
root
of
average
mercury
regressed
on
the
log
base
10
of
alkalinity.
Weighted analysis using weights in No.samples
Regression Equation
sqrt(mercury) = 1.13217 - 0.326523 log(alkalinity)
Summary of Model
S = 0.603943   R-Sq = 50.16%   R-Sq(adj) = 49.18%
PRESS = 20.3895   R-Sq(pred) = 45.37%
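To make the reduced model concrete, the regression equation can be turned into a small prediction function. This is a minimal sketch: the function name and the example alkalinity of 50 mg/L are my own choices for illustration, and squaring the fitted value is a simple back-transformation from the square-root scale, not an unbiased estimate of mean mercury.

```python
import math

# Coefficients copied from the fitted regression equation above
INTERCEPT = 1.13217
SLOPE = -0.326523


def predicted_mercury(alkalinity_mg_per_l):
    """Back-transformed prediction of average mercury (ppm) from alkalinity (mg/L)."""
    sqrt_fit = INTERCEPT + SLOPE * math.log10(alkalinity_mg_per_l)
    return sqrt_fit ** 2


# 50 mg/L is an arbitrary illustrative value, not one singled out by the regression output
example = predicted_mercury(50)
print(f"Predicted average mercury at 50 mg/L alkalinity: {example:.3f} ppm")
```

Because the slope is negative, larger alkalinity values give smaller predictions, which matches the direction of the relationship described in the conclusion of Part 2.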
In
this
part
of
the
project
I
will
use
predictor
variables
to
predict
whether
the
average
mercury
concentration
of
a
lake
will
be
at
a
poisonous
level.
After
much
research,
I
could
not
find
a
straightforward
answer
to
how
much
mercury
in
parts
per
million
in
the
muscle
tissue
of
Largemouth
Bass
is
poisonous
for
humans.
In
fact
the
level
of
mercury
that
would
poison
someone
depends
on
other
factors
such
as
body
weight
of
that
individual.
In
order
to
create
a
binary
response
variable
from
the
average
mercury
concentration
variable
I
would
have
to
pick
a
cutoff
value
for
an
acceptable
level
of
mercury.
I will use the median, or 50th percentile, of my average mercury concentrations as this cutoff value for an acceptable mercury level.
Figure: Histogram of Average Mercury
The
distribution
of
average
mercury
is
skewed
to
the
right,
and
the
median
average
mercury
value
is
0.48.
For
the
purposes
of
this
project
I
will
consider
an
average
mercury
concentration
greater
than
0.48
parts
per
million
dangerous.
This is not a value that truly separates poisonous from non-poisonous Largemouth Bass; I repeat, I am only using this cutoff for the purposes of this project so I can demonstrate logistic regression.
I
believe
that
as
the
alkalinity
level
of
a
lake
increases
the
probability
that
the
fish
from
that
lake
will
have
a
poisonous
level
of
mercury
decreases.
The
graph
above
illustrates
the
distribution
of
chlorophyll
between
the
two
levels
of
the
binary
response
variable,
poison.
Harmful
indicates
an
average
mercury
concentration
above
the
median
of
0.48
parts
per
million,
and
harmless
indicates
an
average
mercury
equal
or
less
than
0.48
parts
per
million.
The
observations
have
dot
sizes
proportional
to
the
number
of
fish
that
were
sampled
to
calculate
the
given
average
mercury
concentration.
We can see some discrimination between successes (harmless) and failures (harmful); however, there doesn't seem to be much.
All
the
observations
with
chlorophyll
levels
greater
than
the
blue
line
(about
40)
have
harmless
mercury
levels,
but
below
this
line
there
are
many
harmless
and
harmful
observations.
Although
the
mean
chlorophyll
for
the
harmless
group
is
likely
larger
than
for
the
harmful
group,
I
would
like
to
see
more
discrimination
in
the
data;
perhaps
chlorophyll
is
not
a
good
predictor
of
whether
or
not
the
Largemouth
Bass
from
a
Florida
lake
with
have
a
harmless
level
of
mercury.
I
will
now
investigate
the
distribution
of
calcium
between
the
harmless
and
harmful
levels
of
average
mercury.
The
graph
above
illustrates
the
distribution
of
calcium
between
the
two
levels
of
the
binary
response
variable,
poison.
We
can
see
very
little
discrimination
between
successes
(harmless)
and
failures
(harmful).
Granted
that
there
are
more
harmless
observations
at
higher
calcium
levels,
there
are
still
a
few
harmful
lakes
with
high
calcium
levels.
To
the
left
of
the
blue
line
(calcium
about
30)
there
are
many
harmful
and
harmless
observations.
Calcium
levels
are
not
doing
a
good
job
of
separating
harmful
from
harmless
lakes.
Chlorophyll, from the previous graph, seems like a better predictor of whether a lake is harmless than calcium.
I
will
now
investigate
the
distribution
of
alkalinity
between
the
harmless
and
harmful
levels
of
average
mercury.
The
graph
above
illustrates
the
distribution
of
alkalinity
between
the
two
levels
of
the
binary
response
variable,
poison.
We
can
see
decent
discrimination
between
successes
(harmless)
and
failures
(harmful).
To
the
right
of
the
blue
line
(alkalinity
of
about
35)
there
are
many
harmless,
and
only
two
harmful,
lakes.
To
the
left
of
this
line
there
are
still
many
both
harmless
and
harmful
lakes,
but
lakes
where
the
average
mercury
was
calculated
based
on
larger
sample
sizes,
and
are
likely
more
accurate,
tended
to
be
harmful.
Alkalinity
seems
to
be
the
best
quantitative
variable
I
have
to
discriminate
between
harmful
and
harmless
levels
of
mercury
concentrations
from
the
Largemouth
Bass
in
Florida
Lakes.
I
will
further
investigate
this
discrimination
with
some
numerical
summaries.
The
Florida
lakes
with
Largemouth
Bass
that
contained
a
harmless
average
level
of
mercury
had
an
average
alkalinity
of
58.01
mg/L
with
a
standard
deviation
of
40.69.
The
Florida
lakes
with
Largemouth
Bass
that
contained
a
harmful
average
level
of
mercury
had
an
average
alkalinity
of
16.26
mg/L
with
a
standard
deviation
of
19.76.
The
alkalinity
levels
in
lakes
with
harmless
fish
varied
almost
twice
as
much
on
average
as
the
alkalinity
levels
in
lakes
with
harmful
fish.
This
difference
in
deviation
can
be
seen
in
the
previous
plot
of
Poison
vs.
Alkalinity;
the
spread
of
the
alkalinity
levels
for
harmless
lakes
is
much
larger
than
for
harmful
lakes.
However,
despite
the
differences
in
deviation,
there
seems
to
be
a
significant
difference
in
the
alkalinity
levels
between
the
lakes
with
harmless
and
harmful
levels
of
average
mercury.
We
can
test
this
observed
difference
in
alkalinity
by
running
a
two
sample
t-test
with
the
following
hypotheses, where μ denotes the true mean alkalinity of each group.
H0: μ(harmful) - μ(harmless) = 0, Ha: μ(harmful) - μ(harmless) ≠ 0
Our
observed
difference
of
-41.75
led
to
a
t-statistic
of
-4.78,
which
follows
a
t-
distribution
with
37.92
degrees
of
freedom
and
led
to
a
p-value
of
less
than
0.0001.
There
is
overwhelming
evidence
that
there
is
truly
a
difference
in
the
alkalinity
levels
between
Florida
lakes
containing
Largemouth
Bass
with
harmless
vs
harmful
levels
of
average
mercury.
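The t-statistic and the fractional degrees of freedom reported above are what an unequal-variance (Welch) two-sample t-test produces from the group summaries. The sketch below is an illustration only: the group sizes of 26 harmful and 27 harmless lakes are an assumption on my part (they are not stated next to the summary statistics), chosen because they add up to the 53 lakes in the data.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Means and SDs are from the text; n = 26 (harmful) and n = 27 (harmless) are assumed
t_stat, df = welch_t(16.26, 19.76, 26, 58.01, 40.69, 27)
print(f"t = {t_stat:.2f}, df = {df:.2f}")
```

Under those assumed group sizes the result agrees with the reported t-statistic of -4.78 and roughly 37.9 degrees of freedom, which suggests the test did not pool the two variances.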
The
graph
above
is
a
mosaic
plot
and
contingency
table
showing
the
differences
in
the
proportion
of
harmless
lakes
depending
on
whether
the
lake
was
acidic.
As
a
reminder
my
original
data
contained
a
pH
variable
that
I
recoded
as
acidic/non-
acidic.
There
does
seem
to
be
a
difference
in
the
proportion
of
harmless
lakes
between
these
two
groups;
77.27% of non-acidic lakes were harmless, while only 32.26% of acidic lakes were harmless.
To
test
if
this
difference
is
statistically
significant
I
will
conduct
a
two
sample
z-test
with
the
following
hypotheses, where π denotes the true proportion of harmless lakes in each group.
H0: π(non-acidic) - π(acidic) = 0, Ha: π(non-acidic) - π(acidic) ≠ 0
Our
observed
difference
led
to
a
z-statistic
that
yielded
a
p-value
of
0.002.
There
is
overwhelming
evidence
that
there
is
truly
a
difference
in
the
proportion
of
lakes
containing
Largemouth
Bass
with
harmless
average
mercury
levels
between
acidic
and
non-acidic
Florida
lakes.
We observe an odds ratio of (17/5) / (10/21) = 7.14. This tells us that the odds of a non-acidic lake containing Largemouth Bass with a harmless average mercury level were 7.14 times larger than the odds for an acidic lake.
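The odds ratio can be recomputed directly from the contingency table. In this sketch the cell counts (17 harmless and 5 harmful non-acidic lakes, 10 harmless and 21 harmful acidic lakes) are inferred from the reported percentages of 77.27% and 32.26% and the total of 53 lakes, so treat them as an assumption rather than as values read off the table.

```python
def odds(successes, failures):
    """Odds of success: number of successes per failure."""
    return successes / failures


# Counts inferred from 77.27% = 17/22 and 32.26% = 10/31 (assumption)
odds_non_acidic = odds(17, 5)   # harmless vs harmful, non-acidic lakes
odds_acidic = odds(10, 21)      # harmless vs harmful, acidic lakes
odds_ratio = odds_non_acidic / odds_acidic

print(f"odds (non-acidic) = {odds_non_acidic:.2f}")
print(f"odds (acidic)     = {odds_acidic:.3f}")
print(f"odds ratio        = {odds_ratio:.2f}")
```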
Although
there
were
significant
differences
between
harmful
and
harmless
lakes
for
both
alkalinity
and
the
acidic
binary
indicator,
there
is
likely
going
to
be
much
correlation
between
these
variables
because
alkalinity
is
a
measure
of
the
capacity
of
an
aqueous
solution
to
neutralize
an
acid.
Poison
is
defined
as
the
level
of
average
mercury
in
the
muscle
tissue
of
Largemouth
Bass
in
a
Florida
lake,
either
a
harmless
or
harmful
level.
Above
is
the
weighted
logistic
regression
of
this
poison
variable
on
alkalinity,
where
a
harmless
lake
is
treated
as
a
success
and
observations
are
weighted
proportional
to
the
size
of
the
sample
of
bass
that
was
used
to
calculate
the
average
mercury
level.
Fitted
model
equation:
ln(odds)~hat = -1.772 + 0.050(alkalinity)
This
equation
can
be
used
to
find
the
predicted
odds,
and
probability,
of
a
particular
lake
being
harmless,
as
far
as
mercury
concentration
in
Largemouth
Bass,
based
on
the
alkalinity
of
the
lake.
For
example,
we
could
predict
the
probability
that
a
lake
with
an
alkalinity
of
50
mg/L
will
contain
Largemouth
Bass
with
harmless
mercury
levels,
on
average
(there was no reason in particular why I chose 50 mg/L; I simply picked an arbitrary value within the range of my alkalinity values).
ln(odds)~hat = -1.772 + 0.050(50) = 0.728
odds~hat = e^0.728 = 2.071
probability~hat = 2.071/(1 + 2.071) = 0.674
We predict that about 67.4% of all Florida lakes with an alkalinity of 50 mg/L will be harmless.
So
if
we
select
one
lake
with
an
alkalinity
of
50
mg/L,
we
would
predict
that
there
is
a
67.4%
chance
that
the
average
mercury
concentration
of
the
Largemouth
Bass
in
the
lake
is
at
a
harmless
level.
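The conversion from log odds to odds to probability used in this example can be written as a short function. This is a minimal sketch with a hypothetical function name; the default coefficients are the rounded values from the fitted model equation, so results may differ in the last digit from what the software reports.

```python
import math


def predict_harmless(alkalinity, intercept=-1.772, slope=0.050):
    """Return (log odds, odds, probability) that a lake is harmless."""
    log_odds = intercept + slope * alkalinity
    odds = math.exp(log_odds)
    probability = odds / (1.0 + odds)
    return log_odds, odds, probability


log_odds_50, odds_50, prob_50 = predict_harmless(50)
print(f"ln(odds) = {log_odds_50:.3f}, odds = {odds_50:.3f}, probability = {prob_50:.3f}")
```

Calling `predict_harmless(0)` gives the odds and probability implied by the intercept alone.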
Above
is
a
graph
of
the
estimated
probabilities
of
harmless
vs.
alkalinity.
It
is
apparent
that
as
alkalinity
increases
the
probability
that
the
lake
contains
Largemouth
Bass
with
a
harmless
level
of
mercury
increases.
There
is
a
sharp
increase
from
about
alkalinity
0
to
80
mg/L,
and
then
we
see
the
rate
of
change
in
the
probability
with
respect
to
alkalinity
level
off.
As
a
sanity
check
all
probabilities
are
between
0
and
1,
which
is
a
good
sign.
It
may
be
worth
noting
that
it
appears
that
observations
which
come
from
larger
sample
sizes,
marked
by
their
proportional
dot
sizes,
tend
to
be
on
the
lower
end
of
the
range
of
alkalinity
levels.
JMP
reports
that
the
odds
of
a
lake
being
harmless
increases
by
a
multiplicative
factor
of
1.051
for
every
1
mg/L
increase
in
alkalinity.
I
can
verify
this
output
by
calculating
e^0.050
=
1.051,
which
matches
the
odds
ratio
in
the
table
above.
This
tells
us
that
every
one
mg/L
increase
in
alkalinity
is
associated
with
a
1.051
multiplicative
increase
in
the
odds
that
the
lakes
contain
Largemouth
Bass
with
a
harmless
level
of
mercury.
I
am
95%
confident
that
the
true
change
in
the
odds
that
a
lake
contains
harmless
Largemouth
Bass
associated
with
a
1
mg/L
increase
in
alkalinity
is
between
1.043
and
1.06.
This is of course a small change in the odds because a 1-unit change in alkalinity isn't going to have much of an effect on the odds that a lake contains harmless Largemouth Bass.
However
we
can
look
at
the
effect
that
a
20
mg/L
increase
in
alkalinity
would
have
on
the
odds
that
the
lakes
contain
harmless
bass
by
calculating
e^(20*0.050) = e^1 = 2.718.
This
tells
us
that
every
20
mg/L
increase
in
alkalinity
is
associated
with
a
2.718
multiplicative
increase
in
the
odds
that
the
lakes
contain
Largemouth
Bass
with
a
harmless
level
of
mercury.
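Both odds-ratio calculations above are the same operation: exponentiate the slope multiplied by the size of the increase. The helper below is an illustrative sketch (the function name is mine) using the rounded slope of 0.050 from the fitted equation.

```python
import math


def odds_multiplier(slope, increase):
    """Multiplicative change in the odds for a given increase in the predictor."""
    return math.exp(slope * increase)


per_1_mg = odds_multiplier(0.050, 1)    # 1 mg/L increase in alkalinity
per_20_mg = odds_multiplier(0.050, 20)  # 20 mg/L increase in alkalinity

print(f"1 mg/L:  odds multiplied by {per_1_mg:.3f}")
print(f"20 mg/L: odds multiplied by {per_20_mg:.3f}")
```

The 20 mg/L multiplier equals the 1 mg/L multiplier raised to the 20th power, which is why small per-unit odds ratios can still describe a large effect over the observed range of alkalinity.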
Fitted
model
equation:
ln(odds)~hat = -1.772 + 0.050(alkalinity)
The intercept of this model tells us that lakes with 0 mg/L alkalinity have odds of e^-1.772 = 0.170, or a probability of 0.170/1.170 = 0.145, of containing Largemouth Bass with harmless average levels of mercury.
To
test
the
significance
of
alkalinity
as
a
predictor
variable
I
will
run
a
test
with
the
following
hypothesis:
H0 : β(alkalinity) = 0
In other words the true change in log odds of a harmless lake with respect to alkalinity is 0. This means each increase in alkalinity does not affect the odds of lakes containing harmless Largemouth Bass (odds ratio = e^0 = 1, a multiplicative change of 1).
Ha : β(alkalinity) ≠ 0
In other words the true change in log odds of a harmless lake with respect to alkalinity is not 0. This means each increase in alkalinity has some effect on the odds of lakes containing harmless Largemouth Bass (odds ratio = e^k, where k is a non-zero constant).
From the parameter estimates output above, alkalinity's observed slope of 0.050 had a standard error of 0.004; this led to a chi-square statistic of 138.93, which follows a chi-square distribution with 1 degree of freedom.
The
test
statistic
yields
a
p-value
<
0.0001,
so
we
can
reject
the
null
hypothesis
at
the
5%
significance
level.
At
the
5%
significance
level
there
is
very
strong
evidence
that
alkalinity
has
an
effect
on
the
true
odds
ratio,
and
probability,
that
the
lakes
will
contain
bass
with
a
harmless
level
of
average
mercury.
This
test
conclusion
is
consistent
with
the
confidence
interval
for
the
true
odds
ratio
I
found
above
because
1
is
not
inside
the
interval,
thus
we
are
95%
confident
that
alkalinity
has
an
effect
on
the
odds
ratio.
From
output
above
we
can
see
that
my
model's
misclassification
rate
is
about
20%.
This
can
be
verified
from
the
confusion
matrix.
Initially
I
was
shocked
to
see
such
large
numbers
in
the
confusion
matrix
because
my
sample
size
was
only
53.
However
I
believe
it
has
to
do
with
running
the
weighted
logistic
regression.
I think that a correct prediction for a lake where the average mercury was calculated based on a sample of 20 bass, for example, counts as 20 correct predictions in the confusion matrix. The misclassification rate is the ratio of incorrect predictions to total predictions, which works out to 0.2023, or 20.23%.
Obviously
in
a
perfect
world
this
rate
will
be
close
to
zero,
which
0.2023
is
not,
but
we
should
expect
some
prediction
errors.
Overall I think 0.2023 is a subpar misclassification rate; it means I am making about 1 wrong prediction for every 5 predictions I make.
The
whole
model
test
in
this
case
should
be
the
same
as
testing
the
slope
of
alkalinity
because
it
was
the
only
predictor
in
the
model,
but
I
will
run
it
anyway.
H0
:
The
model
is
not
useful
in
predicting
the
odds
of
lakes
containing
harmless
bass.
Ha
:
The
model
is
useful
in
predicting
the
odds
of
lakes
containing
harmless
bass.
Our
observed
chi-square
statistic
is
297.192,
which
follows
a
chi-square
distribution
with
1
degree
of
freedom,
and
led
to
a
p-value
of
<
0.0001
(which
matches
the
p-
value
from
the
significance
test
of
the
slope
of
alkalinity).
We
can
conclude
with
near
certainty
that
the
model
is
useful.
Next
we
will
look
at
the
Deviance
and
Pearson
residuals
to
investigate
unusual
observations.
Above
are
graphs
of
the
standardized
Deviance,
and
standardized
Pearson
residuals
by
the
predicted
probabilities
that
lakes
will
be
harmless.
I
have
annotated
blue
horizontal
lines
at
residual
values
-2
and
2.
Both
these
graphs
look
awful;
the
majority
of
the
residuals
in
both
graphs
are
outside
of
the
blue
lines
meaning
they
have
a
residual
with
absolute
value
greater
than
2.
It
may
be
worth
noting
that
it
seems
that
there
are
more
positive
than
negative
residuals,
so
the
logistic
model
is
underestimating
the
probability
of
success
more
often
than
it
overestimates.
I
was
not
expecting
the
residuals
to
behave
in
this
manner,
and
after
much
thought
I
could
not
find
a
reason
to
explain
why
nearly
all
the
residuals
are
greater
than
2
in
absolute
value.
Lake Puzzle, the most unusual observation, is circled and contains Largemouth Bass with a harmful level of average mercury despite the lake's high alkalinity level of 87.6 mg/L.
After
doing
some
research
I
found
the
lake
is
called
Lake
Puzzle,
because
the
navigable
portions
of
the
lake
change
seasonally
depending
on
the
amount
of
rainfall.
When
the
waters
recede,
previously
known
boat
routes
can
be
hindered
by
new,
submersed,
sandbars
and
deep-water
channels
that
are
completely
different
from
the
year
before.
This
could
explain
its
large
residual;
perhaps
fish
were
collected
from
a
new
portion
of
the
lake
where
they
were
recently
exposed
to
some
high
levels
of
mercury
despite
the
lake's
overall
high
level
of
alkalinity.
Below
are
coded
plots
comparing
Alkalinity
and
Calcium
(the
best
quantitative
predictors
determined
from
the
descriptive
statistics
section)
by
the
level
of
the
binary
response
variable
(harmless/harmful).
From
these
graphs
we
can
tell
that
alkalinity
is
a
better
predictor
than
calcium
(which
was
established
in
the
descriptive
statistics
section)
because
it
discriminates
more
between
harmless
and
harmful
lakes.
From
the
multiple
boxplot
graph
of
Poison
vs.
Alkalinity
&
Calcium
there
is
a
larger
difference
between
median
alkalinity
than
median
calcium
between
harmless
and
harmful
levels
of
average
mercury.
It
is
safe
to
assume
that
this
relationship
is
consistent
for
the
respective
difference
in
means
as
well.
In
an
effort
to
find
the
best
balance
between
fit
and
model
simplicity
I
will
use
an
informal
selective
backward
elimination.
Above
is
relevant
output
for
a
weighted
logistic
regression
with
all
the
explanatory
variables
I
can
make
use
of.
Alkalinity,
calcium,
and
chlorophyll
are
all
measured
from
the
water
of
a
lake.
Acidic(char)
is
a
categorical
variable,
made
from
an
originally
quantitative
pH
variable,
which
has
value
yes
if
the
lake
is
acidic
and
no
otherwise.
From
the
Parameter
Estimates
output
chlorophyll
has
the
largest
p-value
of
0.218.
This
tells
us
that
after
adjusting
for
the
other
explanatory
variables
After
removing
chlorophyll
from
the
model
all
other
explanatory
variables
stayed
significant.
In
fact
the
misclassification
rate
of
0.1315
did
not
change
at
all;
this
tells
us
that
we
had
the
same
prediction
accuracy
without
chlorophyll
in
the
model.
I
will
select
this
model
as
the
one
that
balances
fit
and
simplicity
the
best,
as
at
the
5%
significance
level
all
the
explanatory
variables
are
significant
predictors
of
the
odds,
or
the
probability,
of
lakes
having
bass
with
a
harmless
level
of
average
mercury.
I
will
use
this
model
to
predict
the
odds,
and
probability,
that
non-acidic
lakes
with
50
mg/L
alkalinity,
and
20
mg/L
calcium
will
contain
Largemouth
Bass
with
a
harmless
level
of
average
mercury.
Fitted
model
equation:
ln(odds)~hat= -1.536+0.104(alk)-0.09(calc)+0.634(acidic[n])
ln(odds)~hat= (-1.536+0.634)+0.104(50)-0.09(20)
ln(odds)~hat= 2.498
odds~hat = e^2.498 = 12.158
probability~hat = 12.158/13.158 = 0.924
I
predict
that
non-acidic
lakes
with
50
mg/L
alkalinity
and
20
mg/L
calcium
have
a
12.158
odds,
or
a
0.924
probability,
of
containing
Largemouth
Bass
with
a
harmless
average
level
of
mercury.
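The worked prediction above can be wrapped in a function so other lakes can be plugged in. This is a sketch under stated assumptions: the function and argument names are hypothetical, and the indicator is passed as a number because the text only shows that non-acidic lakes contribute +0.634; the value to use for acidic lakes depends on how the software coded the indicator, which the output above does not state.

```python
import math

# Rounded coefficients from the fitted model equation above
INTERCEPT = -1.536
B_ALK = 0.104
B_CALC = -0.090
B_ACIDIC_NO = 0.634


def predict_harmless(alk, calc, acidic_no_code):
    """Return (log odds, odds, probability) of a harmless lake.

    acidic_no_code is the numeric value of the acidic[no] term;
    the worked example uses 1 for a non-acidic lake.
    """
    log_odds = INTERCEPT + B_ALK * alk + B_CALC * calc + B_ACIDIC_NO * acidic_no_code
    odds = math.exp(log_odds)
    return log_odds, odds, odds / (1.0 + odds)


log_odds, odds, prob = predict_harmless(alk=50, calc=20, acidic_no_code=1)
print(f"ln(odds) = {log_odds:.3f}, odds = {odds:.3f}, probability = {prob:.3f}")
```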
To
test
if
the
overall
model
is
useful
I
will
run
a
whole
model
test
with
the
following
hypothesis:
H0 : β(alkalinity) = 0 and β(calcium) = 0 and β(acidic[no]) = 0
Ha : at least one of the population slopes in the null hypothesis does not equal 0
Our
observed
chi-square
statistic
is
382.048,
which
follows
a
chi-square
distribution
with
3
degrees
of
freedom,
and
yielded
a
p-value
of
<
0.0001.
At
the
5%
significance
level,
we
have
overwhelming
evidence
that
at
least
one
of
the
population
slopes
is
not
equal
to
zero.
In
other
words
the
overall
model
is
useful.
To
measure
the
effectiveness
of
each
predictor
in
the
model,
I
will
run
3
separate
significance
tests.
Before
I
run
these
tests
I
should
adjust
my
significance
level
so
that
I
can
be
at
an
overall
5%
significance
level.
Using
the
Bonferroni
adjustment
I
will
test
each
individual
parameter
at
the
5/3
=
1.667%
significance
level.
First
I
will
test
Alkalinity
with
the
following
hypothesis:
H0 : β(alkalinity) = 0, after adjusting for calcium and the acidic indicator
Ha : β(alkalinity) ≠ 0, after adjusting for calcium and the acidic indicator
We
observed
a
slope
of
0.104
and
a
standard
error
of
0.013.
This
led
to
a
chi-square
statistic
of
67.20,
which
follows
a
chi-square
distribution
with
1
degree
of
freedom.
This
statistic
led
to
a
very
small
p-value
of
<
0.0001.
At
the
1.667%
significance
level
we
have
overwhelming
evidence
that
the
population
slope
of
alkalinity
is
not
equal
to
zero.
Therefore
alkalinity
is
an
effective
predictor
of
the
odds,
or
probability,
that
lakes
will
contain
bass
with
harmless
levels
of
average
mercury.
Next
I
will
test
Calcium
with
the
following
hypothesis:
H0 : β(calcium) = 0, after adjusting for alkalinity and the acidic indicator
Ha : β(calcium) ≠ 0, after adjusting for alkalinity and the acidic indicator
We
observed
a
slope
of
-0.090
and
a
standard
error
of
0.014.
This
led
to
a
chi-square
statistic
of
41.12,
which
follows
a
chi-square
distribution
with
1
degree
of
freedom.
This
statistic
led
to
a
very
small
p-value
of
<
0.0001.
At
the
1.667%
significance
level
we
have
overwhelming
evidence
that
the
population
slope
of
calcium
is
not
equal
to
zero.
Therefore
calcium
is
an
effective
predictor
of
the
odds,
or
probability,
that
lakes
will
contain
bass
with
harmless
levels
of
average
mercury.
Finally
I
will
test
the
acidic
indicator
with
the
following
hypothesis:
H0 : β(acidic[no]) = 0, after adjusting for alkalinity and calcium
Ha : β(acidic[no]) ≠ 0, after adjusting for alkalinity and calcium
We
observed
a
slope
of
0.634
and
a
standard
error
of
0.122.
This
led
to
a
chi-square
statistic
of
27.19,
which
follows
a
chi-square
distribution
with
1
degree
of
freedom.
This
statistic
led
to
a
very
small
p-value
of
<
0.0001.
At
the
1.667%
significance
level
we
have
overwhelming
evidence
that
the
population
slope
of
acidic[no]
is
not
equal
to
zero.
Therefore
whether
or
not
lakes
are
acidic
is
an
effective
predictor
of
the
odds,
or
probability,
that
lakes
will
contain
bass
with
harmless
levels
of
average
mercury.
I
will
now
interpret
the
confidence
intervals
for
these
parameters
to
give
us
an
idea
on
how
much
each
of
them
affect
the
odds
that
lakes
will
be
harmless.
After adjusting for calcium and the acidic indicator, I am 95% confident that a 1 mg/L increase in alkalinity is associated with between a 1.083 and 1.139 multiplicative increase in the odds that the lakes will contain Largemouth Bass with a harmless level of average mercury (these bounds are the exponentiated endpoints of the confidence interval for the slope).
After adjusting for alkalinity and the acidic indicator, I am 95% confident that a 1 mg/L increase in calcium is associated with the odds that the lakes will contain Largemouth Bass with a harmless level of average mercury being multiplied by a factor between 0.889 and 0.939, which is a decrease (these bounds are the exponentiated endpoints of the confidence interval for the slope).
After adjusting for alkalinity and calcium, non-acidic lakes are associated with between 2.208 and 5.743 times higher odds of containing Largemouth Bass with a harmless level of average mercury than acidic lakes.
Below
is
relevant
output
for
a
drop
in
deviance
test
for
the
quadratic
alkalinity
term:
H0 : β(alk^2) = 0, after adjusting for calcium, alkalinity, and the acidic indicator
Ha : β(alk^2) ≠ 0, after adjusting for calcium, alkalinity, and the acidic indicator
We
observe
a
drop
in
deviance
of
573.356 - 556.812 = 16.544.
This
drop
in
deviance
is
our
chi-square
statistic,
which
follows
a
chi-square
distribution
with
1
degree
of
freedom.
I
used
an
online
chi-square
distribution
probability
calculator
at
https://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html.
With
a
p-value
of
about
0
we
can
reject
the
null
hypothesis
at
the
alpha
equals
0.05
level.
In
other
words
the
quadratic
term
is
significant
and
is
helpful
in
predicting
the
odds
of
success.
The
sign
of
the
term
is
positive
so
this
tells
us
that
at
higher
alkalinity
levels,
alkalinity
has
a
greater
effect
on
the
odds,
and
probability,
that
lakes
will
contain
fish
whose
average
mercury
concentration
is
at
a
harmless
level.
Also
between
the
two
models
we
see
a
drop
in
the
AICc
of
582.189 - 568.088 = 14.101.
Therefore
I
will
adopt
the
quadratic
term
into
my
model.
Next
I
will
carry
out
a
drop
in
deviance
test
for
the
interaction
of
alkalinity
and
the
acidic
indicator.
H0 : β(acidic[no]*cent. alk) = 0, after adj. for calcium, the acidic indicator, and alk
Ha : β(acidic[no]*cent. alk) ≠ 0, after adj. for calcium, the acidic indicator, and alk
We
observe
a
drop
in
deviance
of
573.356 - 573.248 = 0.108.
This
drop
in
deviance
is
our
chi-square
statistic,
which
follows
a
chi-square
distribution
with
1
degree
of
freedom.
With
a
p-value
of
about
0.742
we
cannot
reject
the
null
hypothesis
at
any
reasonable
level
of
significance.
In other words the interaction term is not a significant predictor of the odds of success. The sign of the term is positive, so if it had been significant it would have told us that each increase in alkalinity would have a larger effect on the odds of success for non-acidic lakes than for acidic lakes.
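The two drop-in-deviance p-values can be checked without an online calculator. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper tail can be written with the complementary error function; this shortcut applies only to 1 degree of freedom, and the function name is my own.

```python
import math


def chi2_upper_tail_1df(x):
    """P(X >= x) for a chi-square distribution with 1 degree of freedom.

    If Z is standard normal then Z**2 is chi-square(1), so
    P(Z**2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Deviances copied from the text above
p_quadratic = chi2_upper_tail_1df(573.356 - 556.812)    # alkalinity-squared term
p_interaction = chi2_upper_tail_1df(573.356 - 573.248)  # acidic[no]*cent. alk term

print(f"quadratic term:   p = {p_quadratic:.6f}")
print(f"interaction term: p = {p_interaction:.3f}")
```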
Output
for
my
final
model
is
below:
The misclassification rate of my final model is 0.146, which can be confirmed from the confusion matrix ((79 + 22)/(241 + 350 + 79 + 22) = 0.146). This is an improvement of about 6 percentage points over the misclassification rate of the model with only alkalinity.
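The misclassification arithmetic can be sketched as follows. The four counts are the ones quoted from the confusion matrix above; the function name is hypothetical, and the code does not assume which of the two correct counts belongs to harmless or harmful lakes.

```python
def misclassification_rate(correct_counts, incorrect_counts):
    """Share of (weighted) predictions that fall in the off-diagonal cells."""
    wrong = sum(incorrect_counts)
    total = wrong + sum(correct_counts)
    return wrong / total


final_rate = misclassification_rate(correct_counts=[241, 350], incorrect_counts=[79, 22])
alkalinity_only_rate = 0.2023  # reported earlier for the single-predictor model

print(f"final model: {final_rate:.3f}")
print(f"improvement: {100 * (alkalinity_only_rate - final_rate):.1f} percentage points")
```

The counts sum to far more than 53 because each lake is weighted by the number of bass sampled from it.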
Next
we
will
look
at
plots
of
residuals
for
my
final
model.
Once
again,
like
in
the
single
predictor
model,
the
plots
of
both
the
standardized
deviance
and
the
standardized
Pearson
residuals
look
bad.
The
majority
of
the
residuals
in
both
plots
are
beyond
2
in
absolute
value
(indicated
by
the
blue
horizontal
lines).
The
most
unusual
observation,
Lake
Sampson,
is
circled
in
blue.
After
observation
I
realized
Lake
Sampson
had
an
unusually
high
calcium
level
relative
to
its
alkalinity
level,
which
could
explain
why
my
model
greatly
underestimated
the
odds
of
it
containing
harmless
Largemouth
Bass.
PART
3:
Conclusion
In
this
report
I
wanted
to
find
the
best
model
to
predict
the
odds,
or
probability,
of
lakes
containing
Largemouth
Bass
with
a
harmless
level
of
average
mercury
in
their
muscle
tissue.
Again
I
could
not
find
the
cutoff
value
for
a
safe
amount
of
mercury
in
Largemouth
Bass
so
I
used
the
median
average
mercury
value
as
the
cutoff
between
a
harmless
and
harmful
level.
I
would
recommend
using
the
multiple
logistic
model
because
although
it
is
a
bit
more
complicated
it
reduced
the
misclassification
rate
by
about
6
percentage
points
and
dropped
the
AICc
by
about
14.
I
am
still
puzzled
why
there
were
so
many
observations
with
extremely
high
residuals
in
my
model,
when
it
seemed
to
be
fairly
accurate.
This
is
what
I
believe
is
the
weakest
point
of
my
model.
For
future
analysis
I
would
like
to
bring
in
more
variables
to
make
a
more
accurate
model
and
perhaps
reduce
the
number
of
observations
with
such
large
residuals.
A
larger
sample
size
is
always
nice,
but
if
I
could
collect
data
on
lakes
outside
of
Florida
I
may
be
able
to
create
a
more
generalizable
model.
Calcium
has
a
p-value
of
0.1198,
after
adjusting
for
the
other
predictors
so
I
will
remove
it
and
re-run
the
model.
The
misclassification
rate
is
about
0.561,
which
is
very
high.
This
means
we
are
only
correctly
predicting
whether
or
not
a
lake
contained
harmless
bass
for
about
every
other
observation.
The
second
model
I
ran
was
with
alkalinity,
the
acidic
indicator,
and
alkalinity-squared.
The
output
is
below:
Now alkalinity becomes insignificant at the 5% significance level, which was surprising to me because
I
found
that
it
distinguished
the
best
between
harmless
and
harmful
lakes
in
the
beginning
of
my
project.
However,
because
the
p-value
of
0.081
is
not
too
far
away
from
0.05
and
the
quadratic
term
(alkalinity-squared)
is
significant
I
decided
to
keep
it
in
the
model.
(Perhaps
alkalinity
was
not
significant
because
of
its
obvious
correlation
with
alkalinity-squared.)
Nevertheless
removing
calcium
did
not
improve
the
model's
misclassification
rate,
which
is
still
a
large
0.561.
I
would
not
recommend
this
model
over
the
multiple
logistic
model
that
I
decided
to
use
in
my
conclusion.
The
fitted
model
equations
are
as
follows:
ln(high/med) = -0.495 - 0.015(alk) + 0.0002(alk^2) - 0.473(acidic[no])
ln(low/med) = 0.852 - 0.015(alk) + 0.0002(alk^2) - 0.473(acidic[no])