Regression Project

Identifying
Poisonous Fish

Brian Bahmanyar
Linear Regression Project, Dr. Chance
October-December, 2014

This project was completed in 3 parts:
Part 1: Weighted Single Regression
Part 2: Weighted Multiple Regression
Part 3: Weighted Logistic Regression

PART 1: Introduction

Mercury poisoning is a medical condition caused by exposure to mercury or its
compounds. Mercury is a heavy metal occurring in several forms, all of which can produce toxic
effects in high enough doses. Toxic effects include damage to the brain, kidneys and lungs. The
type and degree of symptoms exhibited depend upon the individual toxin, the dose, and the
method and duration of exposure. The consumption of fish is by far the most significant source
of ingestion-related mercury exposure in humans.
(http://en.wikipedia.org/wiki/Mercury_poisoning)

In my investigation I look at factors contributing to the mercury concentration of
Largemouth Bass in Florida lakes. I was interested to see if there are statistically significant
predictors for the mercury contents of these fish. This would allow fishermen to have a better
sense of whether a Largemouth Bass is safe to consume based on various factors measured
from the water.

I found data for this investigation at The Data and Story Library (DASL). Those who
collected the data studied 53 different Florida lakes to examine the factors that influence the
level of mercury contamination in Largemouth Bass. Unfortunately, there is no indication as to
whether this was a random sample of 53 lakes; we may not be able to generalize our data to
some larger population. They collected water samples from the middle of each lake in August
1990 and then again in March 1991. The pH level, the amount of chlorophyll, calcium, and
alkalinity were measured in each sample. The average of the August and March values was
recorded. Next, a sample of fish was taken from each lake, with sample sizes ranging from 4 to
44 fish (it is also unknown whether these samples were random). The age of each fish and the
mercury concentration in its muscle tissue were measured. Since fish absorb mercury over time, older
fish will tend to have higher concentrations. Thus, to make a fair comparison of the fish in
different lakes, the investigators used a regression estimate of the expected mercury
concentration in a three-year-old fish as the standardized value for each lake. Finally, in 10 of
the 53 lakes, the age of the individual fish could not be determined and the average mercury
concentration of the sampled fish was used instead of the standardized value.
(http://lib.stat.cmu.edu/DASL/Datafiles/MercuryinBass.html) The observational units in this
study are the samples of Largemouth Bass collected from the various Florida lakes.

In Part I of this project I will focus on the relationship between the alkalinity and the
average mercury concentration in the sample of Largemouth Bass. Alkalinity is the name given
to the quantitative capacity of an aqueous solution to neutralize an acid. Measuring alkalinity is
important in determining a stream's ability to neutralize acidic pollution from rainfall or
wastewater (http://en.wikipedia.org/wiki/Alkalinity). I will use the alkalinity of the lake,
measured in mg/L, as an explanatory variable in an effort to explain some of the variability in
the mercury concentrations in the Largemouth Bass, measured in parts per million in the muscle
tissue. Due to the fact that alkalinity helps neutralize acidic pollution, I predict that higher
alkalinity levels in a lake will be associated with lower concentrations of mercury in the
Largemouth Bass that live there.

PART 1: Descriptive Statistics

Figure 1: Average Mercury by Alkalinity

I predicted the negative association between average mercury and alkalinity; however, I did not
anticipate the non-linear relationship. I fit a local smoother with smoothness (alpha) equal to
0.5 to better visualize the curvature in the data. To take care of this sort of monotonically
decreasing curvature in Figure 1, I could have decreased the power of average mercury or
decreased the power of alkalinity.

Figure 2: Average Mercury by log(10) Alkalinity
Weight: No.samples

I regressed average mercury on the base 10 log of alkalinity, and the relationship seems to be
reasonably linear. I ran a weighted model, using the number of Largemouth Bass contributing to
each observation's mercury average as the weights. This way samples with more Largemouth Bass
will have more influence on the relationship. The smoother definitely looks more linear; however,
there is still some curving around the middle of the line in Figure 2. After looking at the
residuals, further transformations may be needed.
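
The weighting idea can be sketched in a few lines. This is a minimal illustration of weighted least squares for a single predictor, not the JMP procedure used in the project; the function name and the toy numbers are hypothetical and are not the lake data.

```python
def weighted_linear_fit(x, y, w):
    """Return (intercept, slope) minimizing sum(w_i * (y_i - b0 - b1 * x_i) ** 2)."""
    total_w = sum(w)
    # Weighted means of the predictor and the response.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Weighted cross-product and weighted sum of squares.
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy values: log10 alkalinity, average mercury, fish per sample.
log_alk = [0.5, 1.0, 1.5, 2.0]
avg_merc = [1.0, 0.8, 0.4, 0.3]
n_fish = [4, 12, 12, 44]

b0_weighted, b1_weighted = weighted_linear_fit(log_alk, avg_merc, n_fish)
b0_equal, b1_equal = weighted_linear_fit(log_alk, avg_merc, [1, 1, 1, 1])
```

Comparing the weighted and equal-weight fits on the same toy values shows how observations backed by more fish pull the line toward themselves.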

Table 1: Multivariate Correlations
Weight: No.samples

                   Avg_Mercury   Log10-alkalinity
Avg_Mercury             1.0000            -0.6729
Log10-alkalinity       -0.6729             1.0000

According to Table 1 the correlation coefficient, r, is about -0.673. A correlation coefficient of
-0.673 tells us that there is a reasonably strong, negative, linear relationship between the
average mercury concentration in the bass and the base 10 log of alkalinity.

Now we will look at the individual distributions of the important variables in our model.

Histogram: Average Mercury

Summary Statistics: Average Mercury
Mean             0.5271698
Std Dev          0.3410356
Std Err Mean     0.0468448
Upper 95% Mean   0.6211709
Lower 95% Mean   0.4331688

From the histogram of average mercury concentrations we can see that the data are skewed to
the right. This means that most of the lakes we sampled contained Largemouth Bass that had
low average mercury concentrations, and few lakes had Bass with high average mercury
concentrations. The 53 sampled lakes contained Largemouth Bass with an estimated
average, average mercury concentration of 0.527 parts/million (estimated not in the sense that
the mean is a prediction, but in the sense that researchers used a sample of bass to
estimate the mercury concentrations of all bass in the lake).

Histogram: Alkalinity

Summary Statistics: Alkalinity
Mean             37.530189
Std Dev          38.203527
Std Err Mean     5.247658
Upper 95% Mean   48.060385
Lower 95% Mean   26.999993

From the histogram of alkalinity levels we can see that the data are skewed to the right. This
means that most of the lakes we sampled contained low alkalinity levels, and few lakes had high
levels of alkalinity. The 53 sampled lakes had an average alkalinity of 37.53 mg/L.

Histogram: Sample Size per Observation

Summary Statistics: Sample Size/Observation
Mean             13.056604
Std Dev          8.5606773
Std Err Mean     1.1758995
Upper 95% Mean   15.416219
Lower 95% Mean   10.696989

Due to the fact that observations vary by sample size, it is important to look at the distribution
of these weights. The majority of observations of mercury concentrations come from samples of
between 5 and 15 Largemouth Bass. This may or may not be representative of all the
Largemouth Bass in the lake depending on how they were selected, and how many of these fish
there are in the lake in total. On average, the researchers who collected the data measured
mercury concentrations from samples of about 13 Largemouth Bass.
PART 1: Residual Analysis

Figure 3: Normality Plot of Stu. Residuals
(for model Average Mercury by log(10) Alkalinity)

Figure 3 displays the non-normality of the studentized residuals. The residuals clearly bend
around the normal line and reach outside of the 95% bands.

Figure 4: Stu. Residual Average Mercury

Referring to the histogram in Figure 4 we can see that there is indeed right skew in the
studentized residuals. This suggests that I should try another transformation to correct for this. I
will now look at the base 10 log of average mercury by the base 10 log of alkalinity.

Figure 5: Log(10) Average Mercury by Log(10) Alkalinity
Weight: No.samples
Axes: Log10-alkalinity (horizontal), Log10-avg. mercury (vertical)

Figure 5 shows us a scatterplot of the data after the second transformation. The data still
seem linear. Table 2 shows us the output for a lack of fit test. We observed an F-Ratio of
0.8962, which led to a large p-value of 0.664. At the alpha equals 0.05 level, we do not have
significant evidence that the linear model is not appropriate. However, there are a few
observations, circled in blue, that seem to have much larger residuals than the rest of the
data. The residuals for these data will be looked at in more detail when I evaluate unusual
observations.

Table 2: Lack Of Fit

Source        DF   Sum of Squares   Mean Square    F Ratio
Lack Of Fit   49        35.057333      0.715456     0.8962
Pure Error     2         1.596713      0.798357   Prob > F
Total Error   51        36.654047                   0.6642
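
The F Ratio in Table 2 can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below only redoes that arithmetic with the values printed in the table; the variable names are my own, and the p-value is not recomputed because that would need an F-distribution routine.

```python
# Values copied from Table 2 (lack of fit output).
ss_lack_of_fit, df_lack_of_fit = 35.057333, 49
ss_pure_error, df_pure_error = 1.596713, 2

# Each mean square is a sum of squares divided by its degrees of freedom.
ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
ms_pure_error = ss_pure_error / df_pure_error

# The lack-of-fit F ratio compares the two mean squares (about 0.896).
f_ratio = ms_lack_of_fit / ms_pure_error

# The two pieces add back up to the total error sum of squares.
ss_total_error = ss_lack_of_fit + ss_pure_error
```

An F ratio below 1 means the lack-of-fit variation is no larger than the pure-error variation, which is why the test gives no evidence against the linear form.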
Figure 6: Normality Plot of Stu. Residuals
(for model log(10) Average Mercury by log(10) Alkalinity)

Figure 6 is a normality plot of the studentized residuals for the log(10) alkalinity model.
The residuals stay close to the normal line and do not go outside of the 95% bands. The
studentized residuals are now approximately normally distributed. The transformation corrected
the non-normality.

Figure 7: Stud. Residuals by Log(10) Alkalinity

Although the relationship between log(10) average mercury and log(10) alkalinity appears
linear and the stud. residuals follow a normal distribution, Figure 7 shows that the residuals
seem to have unequal variance. We can see a fan in the residuals in Figure 7. The residuals
vary more at higher values of log(10) alkalinity.

We can assume independence because the average mercury concentration in the sample of
bass in one Florida Lake should not have an effect on the mercury concentrations of the bass in
other lakes because they are isolated bodies of water.

To summarize, this model is behaving fairly well. Using a weighted regression will give more
influence to the observations that came from larger samples of Largemouth Bass. The
monotonically decreasing trend in the data was taken care of by taking the base 10 log of
alkalinity. I then ran into some non-normality in the residuals that I dealt with by taking the base
10 log of average mercury. Unfortunately, there still appears to be some unequal variance in the
residuals, shown in Figure 7, which I would say is the weakest point of my model.

PART 1: Linear Model

Figure 8: Regression Plot
(log(10) Average Mercury by log(10) Alkalinity)
Weight: No.samples

Summary of Fit
RSquare                       0.50667
RSquare Adj                   0.496997
Root Mean Square Error        0.847766
Mean of Response              -0.36503
Observations (or Sum Wgts)    692

Linear Fit
log(10)[avg. mercury] - hat = 0.2472067 - 0.4630143*log(10)[alkalinity]

A 10-fold increase in the alkalinity of a Florida lake is associated with a 10^0.463 = 2.904-fold
multiplicative decrease in the predicted median, average mercury concentration. In other words,
a 10-fold increase in the alkalinity of a Florida lake multiplies the predicted median by
10^(-0.463) = 0.344, an estimated 65.6% decrease in the mercury concentrations found in
Largemouth Bass living in the lake.

Due to the fact that I am using a log-log model the intercept is more difficult to interpret. We
can interpret the intercept as Florida lakes with 0 log(10)alkalinity have Largemouth Bass with a
predicted average log(10)average mercury concentration of 0.247. However, in order to get this
interpretation in terms of alkalinity and average mercury we must manipulate the regression
equation. Our regression equation, which can be seen in the Linear Fit output above, is
log(10)[avg. mercury] - hat = 0.247 - 0.463*log(10)[alkalinity]. I must first remove the
log(10)[alkalinity] term from the equation, and we can do this by letting alkalinity equal 1 so that
log(10)[alkalinity] = log(10)[1] = 0. Now we are left with
log(10)[avg. mercury] - hat = 0.247. After raising 10 to the power of both sides we get
[avg. mercury] - hat = 10^0.247 = 1.766. This means that Florida lakes with 1 mg/L of alkalinity
contain Largemouth Bass with a predicted median, average mercury concentration of 1.766
parts/million. We have data for Lake Trafford, which has 1.2 mg/L of alkalinity, so it is not
much of an extrapolation to make a prediction for the average mercury concentrations of
Largemouth Bass in Florida lakes with 1 mg/L of alkalinity. I would say the intercept is
meaningful in this context.
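
The back-transformation from the log-log scale can be checked with a short calculation. This is a sketch that plugs the coefficients from the Linear Fit output into a helper function; the function and variable names are hypothetical and not part of the JMP output.

```python
import math

# Coefficients from the Linear Fit output (log10-log10 scale).
INTERCEPT = 0.2472067
SLOPE = -0.4630143


def predicted_median_mercury(alkalinity_mg_per_l):
    """Back-transform the fitted line to parts/million for a given alkalinity."""
    log_prediction = INTERCEPT + SLOPE * math.log10(alkalinity_mg_per_l)
    return 10 ** log_prediction


# Factor applied to the predicted median for every 10-fold increase in alkalinity.
tenfold_factor = 10 ** SLOPE                      # about 0.344
percent_decrease = (1 - tenfold_factor) * 100     # about 65.6

# Setting alkalinity to 1 mg/L removes the slope term, leaving 10 ** INTERCEPT.
prediction_at_one = predicted_median_mercury(1.0)  # about 1.77
```

Because the model is linear on the log scale, the ratio of predictions at 10 mg/L and 1 mg/L equals `tenfold_factor` regardless of the starting alkalinity.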

Referring to the Summary of Fit output we can see that R-Square is 0.507. This means that 50.7%
of the variability in the base 10 log of mercury concentrations in Largemouth Bass in Florida lakes
is explained by this regression model on the base 10 log of alkalinity level (the other 49.3% is
unexplained variability).

Table 3: Top 5 Leverages

Lake                hi
Griffin             0.1736525592
East Tohopekaliga   0.1164689823
Trout               0.102453403
Brick               0.0757299358
Tohopekaliga        0.0757299358

Table 3 displays the lakes with the top 5 leverages. Lakes Griffin, East Tohopekaliga, and Trout
all have leverages greater than 5/53 = 0.094. However, lakes East Tohopekaliga and Trout have
similar hat values, 0.116 and 0.102 respectively, which are not much greater than those of the
other observations. Lake Griffin, on the other hand, has a leverage of 0.174, which separates it
from the leverages of the other observations. I would consider Lake Griffin an observation with
high leverage; if it also has a large residual, it may greatly influence the model.
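
The leverage screen can be written as a simple filter. The sketch below uses the hat values from Table 3 and the 0.094 cutoff quoted in the text, read here as 5 divided by the 53 lakes; the variable names are hypothetical.

```python
# Hat values copied from Table 3.
leverages = {
    "Griffin": 0.1736525592,
    "East Tohopekaliga": 0.1164689823,
    "Trout": 0.102453403,
    "Brick": 0.0757299358,
    "Tohopekaliga": 0.0757299358,
}

n_lakes = 53
cutoff = 5 / n_lakes  # about 0.094, the threshold used in the text

# Keep only the lakes whose leverage exceeds the cutoff, in table order.
high_leverage_lakes = [lake for lake, h in leverages.items() if h > cutoff]
```

Only the first three lakes pass the filter, which matches the discussion of Table 3.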

Table 4: Top 5 Studentized Residuals

Lake         Stud. Residual
Puzzle       2.6355479835
Farm-13      2.3036408905
Apopka       2.0195020498
Parker       2.0071976662
Deer Point   1.9428364132

Table 5: Top 5 Cook's Distances

Lake                Cook's Distance
East Tohopekaliga   0.1994561949
Farm-13             0.1529166686
Puzzle              0.1310342084
Tohopekaliga        0.1087293064
Deer Point          0.070541742

Table 4 displays the lakes with the 5 largest studentized residuals. Lake Puzzle has the largest
studentized residual of 2.635, but it did not have a high leverage so it likely won't be the most
influential observation. The top 5 studentized residuals are all around 2, which is fairly high. This
means that there are some lakes in our data for which the model does not predict Largemouth Bass
mercury concentrations very well. From Table 3 and Table 4, we can see that most
observations with high leverages did not tend to have very large residuals, and vice versa, so
there were not any extremely influential observations. Table 5 displays the lakes with the 5
largest Cook's Distances. Although East Tohopekaliga has the largest Cook's Distance of 0.199, it
is not too far from the Cook's Distances of the other influential observations, so I don't think it
was too much of a problem in the regression analysis.
PART 1: Statistical Inference

The population of interest is Largemouth Bass in all Florida lakes. Unfortunately, the researchers
who collected the data did not indicate whether the 53 selected Florida lakes were randomly
selected, so we don't know if our sample of lakes is representative of all Florida lakes. Therefore we
should not generalize our findings to the population. Also, it was not specified whether the samples of
Largemouth Bass collected from each lake were random. The researchers may have
systematically selected bass from a particular part of the lake where mercury concentrations did
not represent the average mercury concentrations of the lake as a whole. Therefore we may not
even be able to generalize the average mercury concentration in a sample of Largemouth Bass to
all the Largemouth Bass in a particular Florida lake.

H0: β1 = 0 (there is no relationship between log(10)[alkalinity] and average log(10)[average
mercury])
Ha: β1 ≠ 0 (there is a relationship between log(10)[alkalinity] and average log(10)[average
mercury])

Where 10^β1 represents the true multiplicative change in median, average mercury associated
with a 10-fold change in alkalinity. I ran a two-sided test because I did not have any initial
conjectures of the direction of the relationship between log(10)[average mercury] and
log(10)[alkalinity].

Parameter Estimates

Term               Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept          0.2472067   0.090525    2.73      0.0087*    0.06547     0.4289433
Log10-alkalinity   -0.463014   0.063976    -7.24     <.0001*    -0.591451   -0.334578

For the regression of log(10)[average mercury] vs. log(10)[alkalinity] the observed slope of
-0.463 led to a test statistic of -7.24 (which follows a t-distribution with 53-2=51 degrees of
freedom) and yielded a p-value of <0.0001 (from the Parameter Estimates output). If there were truly
no association between log(10)[average mercury] and log(10)[alkalinity], we would only expect
to see a sample slope as extreme as ours less than 0.01% of the time due to random chance. At
the 1% significance level, we have extremely strong evidence that there exists a genuine
relationship between the mercury concentration in Largemouth Bass, from Florida lakes, and
the alkalinity level of the lake. We are 95% confident that each 10-fold increase in alkalinity (mg/L)
multiplies the predicted median mercury concentration in Largemouth Bass by a factor of between
10^(-0.591) = 0.256 and 10^(-0.335) = 0.462, that is, a decrease of between 53.8% and 74.4%.
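
The test statistic and the back-transformed confidence limits can be reproduced from the Parameter Estimates output. This is a sketch of the arithmetic only, using the printed estimate, standard error, and 95% limits for the slope; the variable names are my own.

```python
# Slope row of the Parameter Estimates output.
estimate = -0.463014
std_error = 0.063976
lower_95, upper_95 = -0.591451, -0.334578

# The t ratio is the estimate divided by its standard error (about -7.24).
t_ratio = estimate / std_error
degrees_of_freedom = 53 - 2

# Raising 10 to each limit gives the multiplicative factor per 10-fold
# increase in alkalinity; these are unitless ratios, not parts/million.
factor_low = 10 ** lower_95    # about 0.256
factor_high = 10 ** upper_95   # about 0.463
```

Both factors are below 1, so the whole interval corresponds to a decrease in predicted median mercury as alkalinity rises.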

Table 6: JMP Data

ID   Alkalinity   log(10)[Alk]   Sample Size   Lower 95% Mean   Upper 95% Mean   Lower 95% Indv.   Upper 95% Indv.
53   25           1.398          13.05         -0.466           -0.335           -0.876            0.076

I wanted to estimate the mean response and an individual prediction for a fictional lake in Florida
with an alkalinity of 25 mg/L. Because I used weighted regression, I had to assign a
weight to this observation, and I used the average of the weights from the other observations. So
I am making predictions for a lake with 25 mg/L where an imaginary sample of 13 Largemouth
Bass was taken. We are 95% confident that Florida lakes with an alkalinity of 25 mg/L
contain Largemouth Bass with an expected average mercury concentration of between
10^(-0.466) = 0.342 parts/million and 10^(-0.335) = 0.462 parts/million. We are 95% confident that
a single Florida lake with an alkalinity of 25 mg/L will contain Largemouth Bass with an average
mercury concentration of between 10^(-0.876) = 0.133 parts/million and 10^0.076 = 1.191
parts/million.
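
The interval endpoints in Table 6 are on the log10 scale, so each one has to be raised to a power of 10 before it can be read in parts/million. A small sketch of that step, with a hypothetical helper name:

```python
import math


def back_transform_interval(lower_log10, upper_log10):
    """Convert an interval on the log10 scale to the original scale."""
    return 10 ** lower_log10, 10 ** upper_log10


# The fictional lake has 25 mg/L of alkalinity.
log_alkalinity = math.log10(25)  # about 1.398, as listed in Table 6

# Limits copied from Table 6.
mean_interval = back_transform_interval(-0.466, -0.335)        # mean response
individual_interval = back_transform_interval(-0.876, 0.076)   # single lake
```

The individual interval is much wider than the mean interval because it also has to cover lake-to-lake variation around the fitted line.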
PART 1: Conclusion

There is statistically significant evidence that there exists a genuine relationship
between alkalinity and mercury concentrations in Largemouth Bass, for Florida Lakes. As
mentioned earlier, we may not be able to generalize our findings to all Florida lakes, depending
on how the lakes were selected. Regardless of whether we are able to generalize, we cannot
draw a cause-and-effect conclusion because a randomized experiment was not conducted. There
may have been other confounding variables influencing this relationship. There is unequal
variance in the residuals that could be affecting our regression equation, parameter estimates,
and confidence intervals. None of the lakes observed were extremely unusual, relative to the
other lakes in the sample. The 3 observations circled in Figure 5 (Lake Parker, Lake Apopka, and
Lake Farm-13 from left to right) all have about the same alkalinity level and similarly large
negative residuals. This may warrant some further investigation; perhaps they have some
similarities that I could include in the analysis to improve the model. The only other question I
would ask about this data is whether this trend can truly be generalized to all Florida Lakes. If I
had more data from lakes in different states, I might even be able to create a model that could be
generalized to lakes across the nation.

In Part I of this project I focused on the relationship between the alkalinity and the
average mercury concentration in the sample of Largemouth Bass. Alkalinity is the
name given to the quantitative capacity of an aqueous solution to neutralize an acid.
Measuring alkalinity is important in determining a stream's ability to neutralize
acidic pollution from rainfall or wastewater
(http://en.wikipedia.org/wiki/Alkalinity). I was able to conclude that there is
statistically significant evidence that there exists a genuine relationship between
alkalinity and mercury concentrations in Largemouth Bass, for Florida Lakes.
Relevant computer output supporting my conclusion is below.

Figure 1: Regression Plot
(log(10) Average Mercury by log(10) Alkalinity)
Weight: No.samples

Summary of Model
S = 0.847766
PRESS = 39.8274
R-Sq = 50.67%
R-Sq(adj) = 49.70%
R-Sq(pred) = 46.4%

Regression Equation
log(merc) = 0.247207 - 0.463014 loc(alk)

Coefficients
Term        Coef        SE Coef     T          P
Constant    0.247207    0.0905251   2.73081    0.009
loc(alk)    -0.463014   0.0639758   -7.23734   0.000

Analysis of Variance
Source          DF   Seq SS    Adj SS    Adj MS    F         P
Regression       1   37.6452   37.6452   37.6452   52.3791   0.000000
  loc(alk)       1   37.6452   37.6452   37.6452   52.3791   0.000000
Error           51   36.6540   36.6540   0.7187
  Lack-of-Fit   49   35.0573   35.0573   0.7155    0.8962    0.664189
  Pure Error     2   1.5967    1.5967    0.7984
Total           52   74.2992

In this part of my project I hope to build a better model to predict more of the
variability in mercury concentrations of Largemouth Bass, in Florida Lakes, by using
more explanatory variables.

Correlations: Avg_Mercury, Alkalinity, Calcium, Chlorophyll

              Avg_Mercury   Alkalinity   Calcium
Alkalinity         -0.594
                    0.000
Calcium            -0.401        0.833
                    0.003        0.000
Chlorophyll        -0.491        0.478     0.410
                    0.000        0.000     0.002

Cell Contents: Pearson correlation
               P-Value

Alkalinity, Calcium, and Chlorophyll (the quantitative explanatory variables) all
seem to have a strong negative concave up association with the average mercury
contents of the samples of Largemouth Bass (the response variable). Unfortunately
the Alkalinity, Calcium, and Chlorophyll levels are also associated with each other so
they might be explaining a lot of the same variability in average mercury content.
Aside from the issue of multicollinearity, the relationships between alkalinity,
calcium, and chlorophyll with average mercury are not linear. This suggests that
transformations will need to be made on the variables in our model. I had no prior
conjecture about how the associations between these variables were going to
behave. From the matrix scatter plot we can identify some seemingly unusual
observations (annotated with arrows). In the plot of average mercury by alkalinity
there appear to be two unusual observations; from left to right they are Lake
Harney and Lake Puzzle respectively. Both these Florida lakes appear to have higher
mercury concentrations than other lakes with similar levels of alkalinity. In the plot
of average mercury by calcium there appear to be three unusual observations;
from left to right they are Lake Talquin, Lake Harney and Lake Puzzle respectively.
These three Florida lakes appear to have higher mercury concentrations than other
lakes with similar levels of calcium. There do not appear to be any unusual
observations in the plot of average mercury by chlorophyll.

My original data set did not have a categorical variable, but I did have a quantitative
pH variable. The pH measures how acidic or basic a substance is, where a pH
between 1 and 7 is acidic, 7 is neutral, and between 7 and 14 is basic. I created a
categorical variable from this variable by using 1,0 indicator parameterization for
whether a lake was acidic (1 for acidic, 0 otherwise).
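
The indicator coding can be sketched as follows. The pH and chlorophyll readings below are hypothetical stand-ins, not the lake data, and the function name is my own; the second list shows how an interaction column between the indicator and a quantitative variable is formed.

```python
def acidic_indicator(ph):
    """Return 1 for an acidic lake (pH below 7), 0 for a neutral or basic lake."""
    return 1 if ph < 7 else 0


# Hypothetical readings for three lakes.
ph_values = [5.8, 7.0, 8.4]
chlorophyll = [3.2, 12.5, 40.1]

acidic = [acidic_indicator(ph) for ph in ph_values]

# Interaction column: equals chlorophyll for acidic lakes and 0 otherwise.
chlorophyll_x_acidic = [c * a for c, a in zip(chlorophyll, acidic)]
```

With this coding, the coefficient on the interaction column measures how much the chlorophyll slope differs for acidic lakes compared with non-acidic lakes.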

From these scatterplots, because the
data is not linearly associated, it is
hard to see if whether a lake is acidic
has an effect on the average mercury
concentration; or how acidity affects
the magnitude of the change in
average mercury resulting from
constant increases in the alkalinity,
calcium, and chlorophyll levels of a
lake.

These are 3 scatterplots produced by graphing the square root of average
mercury with log base 10 alkalinity, log base 10 calcium, and log base 10
chlorophyll. This corrects the nonlinearity present in the 3 previous scatterplots so
we can better see the effect of acidic lakes on average mercury concentration. If an
interaction was significant it would mean that the effect of the explanatory variable
in question (alkalinity, calcium, or chlorophyll) on average mercury concentration is
different for acidic lakes, opposed to basic or neutral lakes. In the scatterplots of
sqrt(mercury) by log(alkalinity), and sqrt(mercury) by log(calcium) it does not
seem like there is evidence of a difference in the effect of alkalinity on average
mercury depending on whether the lake is acidic or not. The lines fitting each group
in both the graphs seem about parallel, any deviation from perfect parallel lines is
likely due to random chance. The interaction between log(chlorophyll) and the
acidic indicator seems to be the most significant, as the two regression lines in the
scatterplot of sqrt(mercury) by log(chlorophyll) are not that close to parallel.
However I do not think the interaction will be significant because the data in the
scatterplot still seems to follow the same overall trend. It is also worth noting that
from these scatterplots it appears that basic or neutral lakes, 7-14 on the pH scale,
seem to have higher levels of alkalinity, calcium, and chlorophyll than acidic lakes.
This can be seen by observing that the black dots, indicating a non-acidic lake, are
more often than not in the right half of the x-values for all these scatterplots.

PART 2: Model Formulation

Weighted analysis using weights in No.samples

Regression Equation
Avg_Mercury = 0.606622 - 0.00501981 Alkalinity + 0.00283653 Calcium - 0.00147748 Chlorophyll
              + 0.187879 acidic - 0.00216577 chlorophyll*acidic

Coefficients
Term                 Coef        SE Coef    T          P       VIF
Constant             0.606622    0.111369   5.44696    0.000
Alkalinity           -0.005020   0.001947   -2.57877   0.013   4.54862
Calcium              0.002837    0.002699   1.05096    0.299   3.26451
Chlorophyll          -0.001477   0.001793   -0.82403   0.414   1.84176
acidic               0.187879    0.123221   1.52474    0.134   2.93183
chlorophyll*acidic   -0.002166   0.004895   -0.44247   0.660   1.80388

Summary of Model
S = 0.913213
PRESS = 52.4426
R-Sq = 47.98%
R-Sq(pred) = 30.41%
R-Sq(adj) = 42.45%

Above are standardized residual plots of a weighted regression of average mercury
by alkalinity, calcium, chlorophyll, acidic, and the interaction between acidic and
chlorophyll. Weights are the number of fish sampled from the Florida lakes used to
calculate the average mercury concentration in the lake. Linearity is violated, as is
evident from the scatterplots of average mercury vs alkalinity, calcium, and
chlorophyll in the previous section Descriptive Statistics. The normal probability
plot is not terrible, but it shows some signs of non-normality by the slight curving of
the residuals about the diagonal normal line. This slight non-normality can be seen
in the histogram of standardized residual which shows evidence of some right skew.
We see fanning in the standardized residual vs fits graph which is evidence of
unequal variance in the residuals. Specifically as the fitted values increase the
variability of the residuals also increases. There is no need to refer to the versus
order plot as the observations were not sampled in any order that could violate the
independence condition.

After some trial and error, and guidance from Dr. Chance, I have concluded that the
most appropriate transformation is to regress the square root of average mercury
on the log base 10 of alkalinity, the log base 10 of calcium, the log base 10 of
chlorophyll, the acidic indicator, and the interaction between this indicator and log
base 10 chlorophyll.

Regression Equation
sqrt(mercury) = 1.14335 - 0.31862 log(alkalinity) + 0.0945459 log(calcium)
                - 0.142475 log(chlorophyll) - 0.0897938 acidic
                + 0.163463 loc(chlorophyll)*acidic

Coefficients
Term                      Coef       SE Coef    T          P       VIF
Constant                  1.14335    0.146689   7.79437    0.000
log(alkalinity)           -0.31862   0.095922   -3.32166   0.002   4.88083
log(calcium)              0.09455    0.085773   1.10228    0.276   3.97675
log(chlorophyll)          -0.14248   0.082786   -1.72101   0.092   3.97249
acidic                    -0.08979   0.139334   -0.64445   0.522   9.44431
loc(chlorophyll)*acidic   0.16346    0.101493   1.61059    0.114   6.70924

Summary of Model
S = 0.575348
PRESS = 20.6954
R-Sq = 58.32%
R-Sq(pred) = 44.55%
R-Sq(adj) = 53.88%

After performing the transformations the graphs of the standardized residuals look
much better. The linearity assumption is now met and can be verified from the
scatterplots of the square root of mercury vs the log base 10 alkalinity, log base 10
calcium, and log base 10 chlorophyll in the previous section Descriptive Statistics.
The normal probability plot shows that the standardized residuals are more normal
as there is no pattern of curvature about the normal line. There is no longer fanning
in the standardized residual versus fits plot. In fact the standardized residuals are
randomly scattered above and below the horizontal zero line and the majority of the
standardized residuals are between -2 and 2, so we can assume equal variance.
There is no need to refer to the versus order plot as the observations were not
sampled in any order that could violate the independence condition. This is a
suitable transformation as it fixes the assumptions that were violated in the
previous, untransformed, model.

All the explanatory variables in my model had VIFs around or above 4, so there is
evidence of some multicollinearity. To correct for this I could try centering the
variables about their means. If that does not fix the issue I could run a ridge
regression. After consulting with Dr. Chance, I now understand that the
multicollinearity is a sign of the high linear correlation between alkalinity and
calcium, and that centering would not help much. However, I had already done a
significant part of the project with the centered variables and did not have adequate
time to redo the analysis.
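
A small calculation shows why centering cannot remove this kind of multicollinearity. The sketch below assumes the simplest case of only two predictors, where VIF = 1 / (1 - r^2); it uses the unweighted correlation of 0.833 between alkalinity and calcium reported earlier, so it illustrates the idea rather than reproducing the Minitab VIFs. The toy lists and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(r):
    """Variance inflation factor when a model has exactly two predictors."""
    return 1 / (1 - r ** 2)


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Correlation between alkalinity and calcium from the correlation output.
vif_alkalinity_calcium = vif_two_predictors(0.833)  # about 3.27

# Hypothetical toy data: centering leaves the correlation unchanged.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
r_raw = pearson_r(x, y)
r_centered = pearson_r(center(x), center(y))
```

Since the correlation between two distinct predictors is the same before and after centering, a VIF driven by that correlation stays the same as well.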

To see if I can drop any variables from my model to simplify it, without losing
accuracy, I ran a best subsets in Minitab. I forced the model to include log base 10 of
alkalinity and the acidic indicator.

Response is sqrt(mercury)
The following variables are included in all models: log(alkalinity) acidic

Candidate variables, in column order: log(calcium), log(chlorophyll), loc(chlorophyll)*acidic

Vars   R-Sq   R-Sq(adj)   Mallows Cp   S         Included
1      51.0   48.0        8.7          0.17659   X
1      50.7   47.7        9.0          0.17709   X
2      55.6   51.9        5.6          0.16973   X X
2      52.7   48.7        8.8          0.17533   X X
3      57.1   52.5        6.0          0.16873   X X X

From the output above we can see that the smallest Mallows Cp is 5.6 which is from
the model that includes log base 10 chlorophyll, and the interaction between log
base 10 chlorophyll and the acidic indicator in addition to log base 10 alkalinity and
acidic. Although the model with 3 additional variables does have the largest R-
square, and R-square adjusted, the values are not much larger so the model is
therefore not worth the extra complexity, and loss in degrees of freedom. In light of
this, I will remove log base 10 calcium from my model.

PART 2: Model Description

The following is the regression analysis of the full model. I ran it with centered
variables to help with multicollinearity and it did reduce the VIFs a bit.

Regression Equation
sqrt(mercury) = 0.675107 - 0.23582 centered log(alk) - 0.144974 centered log(chlor)
                + 0.0847283 acidic + 0.174205 cent log(chlor)*acidic

Coefficients
Term                     Coef        SE Coef    T         P       VIF
Constant                 0.675107    0.051917   13.0035   0.000
centered log(alk)        -0.235820   0.059786   -3.9444   0.000   1.88764
centered log(chlor)      -0.144974   0.082940   -1.7479   0.087   3.96951
acidic                   0.084728    0.064288   1.3179    0.194   2.00161
cent log(chlor)*acidic   0.174205    0.101250   1.7205    0.092   2.97181

Summary of Model
S = 0.576635
PRESS = 19.7925
R-Sq = 57.24%
R-Sq(pred) = 46.97%
R-Sq(adj) = 53.67%

When log(alkalinity) and log(chlorophyll) are at their means we predict that the
average, average mercury concentration in the muscle tissues of fish will be
0.675^2 = 0.456 parts per million for non-acidic lakes. I would predict a -0.236
change in sqrt(mercury) for each one-unit increase in the centered log base 10 of
alkalinity, that is, when alkalinity is multiplied by 10. For non-acidic lakes I would
predict a -0.145 change in sqrt(mercury) for each one-unit increase in the centered
log base 10 of chlorophyll, that is, when chlorophyll is multiplied by 10. When
log(chlorophyll) is at its mean, acidic lakes are associated with a predicted
sqrt(mercury) that is 0.085 higher than for non-acidic lakes with the same alkalinity.
Each unit increase in log(chlorophyll) for acidic lakes is associated with a change in
sqrt(mercury) that is 0.174 higher than for non-acidic lakes. 57.24% of the variation
in the square root of the average mercury concentrations of Largemouth Bass in
Florida lakes can be explained by using the model with centered log base 10
alkalinity, centered log base 10 chlorophyll, the acidic indicator, and the interaction
between this indicator and the centered log base 10 chlorophyll. I observed
s = 0.577, which means that we can expect an observed square root mercury
concentration to differ from the model's prediction by about 0.577.

Descriptive Statistics: sqrt(mercury)
Variable         Mean  Minimum  Median  Maximum
sqrt(mercury)  0.6844   0.2000  0.6928   1.1533

Considering that the average square root mercury concentration is 0.6844, a
typical deviation of 0.577 is large. This means that predictions will not be very
accurate. The R-Square adjusted for the simple regression model was 49.18%
(output in the beginning of this report). The R-Square adjusted for the multiple
regression model is 53.67%. There is an increase of about 4.5 percentage points.
After adjusting for the penalties associated with adding more predictors to the
model there isn't much of an improvement.
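
The adjusted R-Square comparison can be checked by hand. The sketch below is a minimal Python illustration, not Minitab output: the function name `adjusted_r2` is hypothetical, and it assumes the usual formula 1 - (1 - R-Sq)(n - 1)/(n - p - 1) with n = 53 lakes, using the R-Sq values reported for the full model (57.24%) and the simple model (50.16%).

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


n_lakes = 53  # number of Florida lakes in the data set

# R-Sq values copied from the two regression outputs in this report.
full_adj = adjusted_r2(0.5724, n_lakes, p=4)    # alkalinity, chlorophyll, acidic, interaction
simple_adj = adjusted_r2(0.5016, n_lakes, p=1)  # alkalinity only

print(f"full model:   {100 * full_adj:.2f}%")
print(f"simple model: {100 * simple_adj:.2f}%")
print(f"difference:   {100 * (full_adj - simple_adj):.2f} percentage points")
```

Under these assumptions the penalty for the three extra predictors removes most of the apparent gain in R-Sq.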

PART 2: Statistical Inference

To test if the overall model is statistically significant we must conduct an
overall F-Test.
Analysis of Variance
Source                    DF   Seq SS   Adj SS   Adj MS        F         P
Regression                 4  21.3636  21.3636  5.34089  16.0624  0.000000
  centered log(alk)        1  18.7219   5.1732  5.17324  15.5582  0.000260
  centered log(chlor)      1   0.4548   1.0159  1.01591   3.0553  0.086869
  acidic                   1   1.2026   0.5776  0.57755   1.7370  0.193780
  cent log(chlor)*acidic   1   0.9843   0.9843  0.98432   2.9603  0.091775
Error                     48  15.9604  15.9604  0.33251
Total                     52  37.3240

Ho: B(centered log(alk)) = B(centered log(chlor)) = B(acidic)
    = B(cent log(chlor)*acidic) = 0
(the model is not statistically significant)
Ha: at least one of the population slope coefficients in the null hypothesis is
non-zero
(the model is statistically significant)
-Where B(centered log(alk)/centered log(chlor)) represents the true change in the
average mercury content with every one unit increase in (centered
log(alk)/centered log(chlor)).
-Where B(acidic) represents the true difference in average mercury content
between acidic and non-acidic lakes with the same values of the other explanatory
variables
-Where B(cent log(chlor)*acidic) represents the true difference of the effect of
chlorophyll level on the average mercury content for acidic versus non-acidic lakes

The observed slopes led to an F-Statistic of 16.062, which follows an F-Distribution
with df = (4,48).
This F-Statistic leads to a p-value of about 0. With a p-value of about zero we can
reject the null hypothesis at the 5% significance level. There is overwhelming
evidence that the overall model is statistically significant. In other words, at
least one of the slope values for the population is not equal to zero.
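
The overall F-Statistic is the ratio of the regression mean square to the error mean square. A minimal Python sketch, using the sums of squares and degrees of freedom from the Analysis of Variance output above (the variable names are illustrative only):

```python
# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_regression, df_regression = 21.3636, 4
ss_error, df_error = 15.9604, 48
ss_total = 37.3240

ms_regression = ss_regression / df_regression  # mean square for regression
ms_error = ss_error / df_error                 # mean square error (MSE)

f_overall = ms_regression / ms_error           # overall F-Statistic, df = (4, 48)
r_squared = ss_regression / ss_total           # share of variation explained

print(f"F = {f_overall:.3f}, R-Sq = {100 * r_squared:.2f}%")
```

The same table therefore reproduces both the F-Statistic and the R-Sq reported in the model summary.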

To test if the additional variables significantly improve upon the simple linear
regression model we must conduct a partial F-Test.

Source                    DF   Seq SS   Adj SS   Adj MS        F         P
Regression                 4  21.3636  21.3636  5.34089  16.0624  0.000000
  centered log(alk)        1  18.7219   5.1732  5.17324  15.5582  0.000260
  centered log(chlor)      1   0.4548   1.0159  1.01591   3.0553  0.086869
  acidic                   1   1.2026   0.5776  0.57755   1.7370  0.193780
  cent log(chlor)*acidic   1   0.9843   0.9843  0.98432   2.9603  0.091775
Error                     48  15.9604  15.9604  0.33251
Total                     52  37.3240

Ho: B(centered log(chlor)) = B(acidic) = B(cent log(chlor)*acidic) = 0
(the additional variables do not significantly improve the model from Part 1)
Ha: at least one of the population slope coefficients in the null hypothesis is
non-zero
(the additional variables significantly improve the model from Part 1)

-Where B(centered log(chlor)) represents the true change in the average mercury
content with every one unit increase in (centered log(chlor)).
-Where B(acidic) represents the true difference in average mercury content
between acidic and non-acidic lakes with the same values of the other explanatory
variables
-Where B(cent log(chlor)*acidic) represents the true difference of the effect of
chlorophyll level on the average mercury content for acidic versus non-acidic lakes

F = [Seq SS(added terms) / df(added terms)] / MSE(full)
  = [(0.4548 + 1.2026 + 0.9843) / 3] / 0.33251
  = 2.648, df = (3,48)

Cumulative Distribution Function
F distribution with 3 DF in numerator and 48 DF in denominator

    x  P( X <= x )
2.648     0.940513

The output above gives us the probability of an F value less than 2.648.
Therefore the p-value, the probability of an F value greater than or equal to 2.648,
equals 1 - 0.941 = 0.059.
This is slightly above the 0.05 alpha level, so we do not have statistically
significant evidence at the 5% significance level that the additional variables
improve upon the model with just log(alkalinity). In other words, we cannot conclude
at the 5% significance level that the full model is significantly better than
the reduced model. However, if I instead use the more lenient 0.1 alpha level, there
is significant evidence at the 10% significance level that the additional variables
improve upon the model with just log(alkalinity).
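
The partial F-Statistic above can be reproduced from the sequential sums of squares, because the three added terms entered the model after log(alkalinity) and each carries 1 degree of freedom. The sketch below is illustrative only; the helper name `partial_f` is hypothetical, and the p-value is taken from the reported cumulative probability rather than recomputed.

```python
def partial_f(seq_ss_added: list[float], mse_full: float) -> float:
    """Partial F for terms entered last, each with 1 df, from their sequential SS."""
    df_added = len(seq_ss_added)
    return (sum(seq_ss_added) / df_added) / mse_full


# Seq SS for centered log(chlor), acidic, and cent log(chlor)*acidic.
f_added = partial_f([0.4548, 1.2026, 0.9843], mse_full=0.33251)

# Upper-tail p-value from the reported P(X <= 2.648) = 0.940513.
p_added = 1 - 0.940513

print(f"F = {f_added:.3f} on (3, 48) df, p-value = {p_added:.3f}")
```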
To test if the interaction variable is statistically significant we can conduct a
t-test on the individual interaction variable.

Coefficients
Term                        Coef   SE Coef        T      P      VIF
Constant                0.675107  0.051917  13.0035  0.000
centered log(alk)      -0.235820  0.059786  -3.9444  0.000  1.88764
centered log(chlor)    -0.144974  0.082940  -1.7479  0.087  3.96951
acidic                  0.084728  0.064288   1.3179  0.194  2.00161
cent log(chlor)*acidic  0.174205  0.101250   1.7205  0.092  2.97181

Ho: B(cent log(chlor)*acidic) = 0
Ha: B(cent log(chlor)*acidic) ≠ 0
Where B(cent log(chlor)*acidic) represents the true difference of the effect of
chlorophyll level on the average mercury content for acidic versus non-acid lakes.

Our observed slope of 0.174 led to a t-statistic of 1.721, which follows a
t-distribution with n - p - 1 = 53 - 4 - 1 = 48 df. This yielded a p-value of 0.092,
which is not significant at the alpha equals 0.05 level. From the graphs in the
Descriptive Statistics section I noticed that the interaction between
log(chlorophyll) and the acidic indicator seemed to be the strongest, but I did not
think it would be significant, and it is not at the alpha equals 0.05 level.
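
The t-statistic is simply the coefficient divided by its standard error, and for a single term its square equals the term's F value in the Analysis of Variance table. A short Python check using the reported values (variable names are illustrative):

```python
# Interaction term from the Coefficients table.
coef_interaction = 0.174205
se_interaction = 0.101250

n_lakes, n_predictors = 53, 4
df_error = n_lakes - n_predictors - 1   # 48 error degrees of freedom

t_stat = coef_interaction / se_interaction
f_equivalent = t_stat ** 2              # matches the term's F in the ANOVA table

print(f"t = {t_stat:.4f} on {df_error} df, t^2 = {f_equivalent:.4f}")
```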

To test if the categorical variable significantly improved the model we must conduct
a partial F-Test.
(It wouldn't make sense to just test if B(acidic) = 0 because acidic would still be
in the interaction term)

Source                    DF   Seq SS   Adj SS   Adj MS        F         P
Regression                 4  21.3636  21.3636  5.34089  16.0624  0.000000
  centered log(alk)        1  18.7219   5.1732  5.17324  15.5582  0.000260
  centered log(chlor)      1   0.4548   1.0159  1.01591   3.0553  0.086869
  acidic                   1   1.2026   0.5776  0.57755   1.7370  0.193780
  cent log(chlor)*acidic   1   0.9843   0.9843  0.98432   2.9603  0.091775
Error                     48  15.9604  15.9604  0.33251
Total                     52  37.3240

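Ho: B(acidic) = B(cent log(chlor)*acidic) = 0
(whether or not a lake is acidic did not significantly improve the model)
Ha: at least one of the population slope coefficients in the null hypothesis is
non-zero
(whether or not a lake is acidic significantly improves the model)
-Where B(centered log(chlor)) represents the true change in the average mercury
content with every one unit increase in (centered log(chlor)).
-Where B(acidic) represents the true difference in average mercury content
between acidic and non-acidic lakes with the same values of the other explanatory
variables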

F = [Seq SS(added terms) / df(added terms)] / MSE(full)
  = [(1.2026 + 0.9843) / 2] / 0.33251
  = 3.288, df = (2,48)

F distribution with 2 DF in numerator and 48 DF in denominator

    x  P( X <= x )
3.288     0.954107

The output above gives us the probability of an F value less than 3.288.
Therefore the p-value, the probability of an F value greater than or equal to 3.288,
equals 1 - 0.954 = 0.046.
With a p-value less than 0.05 we can reject the null hypothesis. Whether or not a
lake is acidic is information that significantly improves the model. In other words,
the relationship between the alkalinity and chlorophyll levels and the response is
affected by whether or not the lake is acidic.
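
When the numerator has exactly 2 degrees of freedom, the upper tail of the F distribution has a simple closed form, P(F >= f) = (1 + 2f/d)^(-d/2) for d denominator degrees of freedom, so this p-value can be verified without statistical software. The sketch below is an independent check rather than the Minitab calculation; the function name `f_upper_tail_2df` is hypothetical, and the shortcut does not apply to other numerator degrees of freedom.

```python
def f_upper_tail_2df(f_stat: float, df_denominator: int) -> float:
    """P(F >= f_stat) for an F distribution with exactly 2 numerator df."""
    return (1.0 + 2.0 * f_stat / df_denominator) ** (-df_denominator / 2.0)


# Seq SS for acidic and cent log(chlor)*acidic, divided by their 2 df, over MSE(full).
f_acidic = ((1.2026 + 0.9843) / 2) / 0.33251
p_acidic = f_upper_tail_2df(f_acidic, df_denominator=48)

print(f"F = {f_acidic:.3f} on (2, 48) df, p-value = {p_acidic:.3f}")
```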

The confidence interval tells us that the true mean square root of average mercury
concentration for all acidic lakes with centered log(alkalinity) = 0.5 and centered
log(chlorophyll) = 0.5, where 6 fish were sampled, is between 0.538 and 0.776.
The prediction interval tells us that the square root of average mercury
concentration for an individual acidic lake with centered log(alkalinity) = 0.5 and
centered log(chlorophyll) = 0.5, where 6 fish were sampled, is between 0.168 and
1.145. Both intervals are on the square root of parts per million scale. There
was no reason in particular why I chose to create intervals for an acidic lake where
6 fish were sampled and centered log(alkalinity) = 0.5 and centered log(chlorophyll)
= 0.5. I looked at the distribution of these variables and picked arbitrary
values within the range of the values that I observed.
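
Both intervals are centered on the fitted value from the full model. The sketch below evaluates the fitted regression equation at the chosen point; the function name `fitted_sqrt_mercury` is hypothetical, and squaring the fit to return to parts per million is shown only as a rough back-transformation.

```python
# Coefficients from the full-model regression output.
INTERCEPT = 0.675107
B_LOG_ALK = -0.235820        # centered log(alk)
B_LOG_CHLOR = -0.144974      # centered log(chlor)
B_ACIDIC = 0.084728          # acidic indicator (1 = acidic, 0 = non-acidic)
B_INTERACTION = 0.174205     # cent log(chlor)*acidic


def fitted_sqrt_mercury(centered_log_alk: float, centered_log_chlor: float, acidic: int) -> float:
    """Predicted sqrt(mercury) from the full multiple regression model."""
    return (INTERCEPT
            + B_LOG_ALK * centered_log_alk
            + B_LOG_CHLOR * centered_log_chlor
            + B_ACIDIC * acidic
            + B_INTERACTION * centered_log_chlor * acidic)


fit = fitted_sqrt_mercury(0.5, 0.5, acidic=1)
print(f"fitted sqrt(mercury) = {fit:.4f}")
print(f"rough back-transform = {fit ** 2:.3f} ppm")
print(f"interval midpoints   = {(0.538 + 0.776) / 2:.4f}, {(0.168 + 1.145) / 2:.4f}")
```

The fitted value agrees with the midpoints of both reported intervals, which is a useful consistency check.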

PART 2: Model Refinement

Fits and Diagnostics for Unusual Observations
Obs  sqrt(mercury)       Fit     SE Fit  Residual  St Resid
 14        1.07703  0.858042  0.0306985  0.218991   2.65754  R
 17        0.43589  0.378840  0.0533835  0.057050   0.77187  X
 40        1.04881  0.487801  0.0401344  0.561008   3.15392  R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Observations 14 and 40, Lake East Tohopekaliga and Lake Puzzle respectively, have
large standardized residuals. I stored the deleted t residuals in my worksheet so I
could run tests of significance on the outliers. Lake East Tohopekaliga has a deleted
t residual of 2.848, which follows a t-distribution with n - p - 2 = 53 - 4 - 2 = 47
degrees of freedom. We can test the null hypothesis that Lake East Tohopekaliga is
not an extreme outlier against the alternative hypothesis that it is.

Student's t distribution with 47 DF

    x  P( X <= x )
2.848     0.996747

Our deleted t-residual of 2.848 leads to a two-sided p-value of (1 - 0.9967)*2 =
0.0033*2 = 0.0066, but we should adjust this p-value because we chose to test the
most extreme residuals rather than picking an observation in advance. Multiplying
the p-value by our sample size (a Bonferroni adjustment) will correct for this.
Therefore our adjusted p-value = 0.0066*53 = 0.35. We do not have statistically
significant evidence that Lake East Tohopekaliga is an extreme outlier.

Lake Puzzle has a deleted t residual of 3.505, which follows a t-distribution with
n - p - 2 = 53 - 4 - 2 = 47 degrees of freedom. We can test the null hypothesis that
Lake Puzzle is not an extreme outlier against the alternative hypothesis that it is.

Student's t distribution with 47 DF

    x  P( X <= x )
3.505     0.999493

Our deleted t-residual of 3.505 leads to a two-sided p-value of (1 - 0.9995)*2 =
0.0005*2 = 0.001, but we should adjust this p-value because we chose to test the
most extreme residuals rather than picking an observation in advance. Multiplying
the p-value by our sample size (a Bonferroni adjustment) will correct for this.
Therefore our adjusted p-value = 0.001*53 = 0.053. We do not have statistically
significant evidence at the 5% significance level. However, at the 10% significance
level there is evidence that Lake Puzzle is an extreme outlier.
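
Both outlier tests follow the same recipe: convert the reported cumulative probability into a two-sided p-value, then multiply by the sample size. A minimal Python sketch of that adjustment, using the unrounded cumulative probabilities from the output above (the helper name `bonferroni_outlier_p` is hypothetical):

```python
def bonferroni_outlier_p(cdf_at_abs_t: float, n_obs: int) -> float:
    """Two-sided p-value for a deleted t residual, multiplied by n and capped at 1."""
    two_sided = 2.0 * (1.0 - cdf_at_abs_t)
    return min(1.0, two_sided * n_obs)


n_lakes = 53

# P(X <= x) values reported for the t distribution with 47 df.
p_tohopekaliga = bonferroni_outlier_p(0.996747, n_lakes)  # deleted t = 2.848
p_puzzle = bonferroni_outlier_p(0.999493, n_lakes)        # deleted t = 3.505

print(f"Lake East Tohopekaliga: adjusted p = {p_tohopekaliga:.3f}")
print(f"Lake Puzzle:            adjusted p = {p_puzzle:.3f}")
```

Using the unrounded probabilities gives values very close to the hand calculations above, so the conclusions do not depend on the rounding.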

Because the conclusions were not significant at the alpha equals 0.05 level I would
not remove either of these lakes from the data set.

Referring to the dotplot of the Cook's distances above, there is no evidence of
highly influential observations. No Cook's distances are even close to 0.5.

Based on what I have learned in this report I would reduce my model to the simple
linear regression of the square root of average mercury on the log base 10 alkalinity.
There wasn't much of a difference in the R-Square adjusted values between the
simple linear regression and the multiple regression. This indicates that the extra
variables are not explaining enough of the unexplained variability in average mercury
concentration to be worth the extra complexity and loss in degrees of freedom. To
further support this decision, the partial F-Test I ran in the previous section to
test if the variables I added to the model were significant yielded a p-value of
0.059, which does not allow us to reject the null hypothesis at the 5% significance
level. In other words, we do not have evidence that the full model is significantly
better than the reduced model at the 5% significance level. I believe the simple
linear regression model is adequate; it has an R-Square of about 50%, which is decent.

PART 2: Conclusion

General Regression Analysis: sqrt(mercury) versus log(alkalinity)

Regression Equation
sqrt(mercury) = 1.13217 - 0.326523 log(alkalinity)

Coefficients
Term                 Coef    SE Coef        T      P  VIF
Constant          1.13217  0.0644895  17.5558  0.000
log(alkalinity)  -0.32652  0.0455759  -7.1644  0.000    1

Summary of Model
S = 0.603943      R-Sq = 50.16%        R-Sq(adj) = 49.18%
PRESS = 20.3895   R-Sq(pred) = 45.37%

After adding the additional variables to our model the accuracy did not change
much, which is why I chose to use the simple linear regression model of the square
root of average mercury on log base 10 alkalinity. The reduced model is valid and
certainly significant at the 5% significance level, with a p-value of about 0. Using
the model is significantly better than using the mean average mercury concentration
to predict the mercury concentrations of Largemouth Bass in other lakes. Our model
tells us that Florida lakes with higher alkalinity levels tend to have Largemouth Bass
with lower concentrations of mercury in their muscle tissues. Overall I think the
final model is doing a reasonable job, but it can be much improved. There is still
about 50% of the variability in the average mercury concentration that is left
unexplained. I tried to bring in variables in an attempt to explain this variability
but there were problems such as multicollinearity. If I were to analyze this data
again in the future I would try to bring in additional explanatory variables that
were orthogonal to alkalinity, so they would explain a completely different part of
the variability in the average mercury concentration.


In Part I of this project I focused on the relationship between the alkalinity and the
average mercury concentration in the sample of Largemouth Bass. Alkalinity is the
name given to the quantitative capacity of an aqueous solution to neutralize an acid.
Measuring alkalinity is important in determining a stream's ability to neutralize
acidic pollution from rainfall or wastewater
(http://en.wikipedia.org/wiki/Alkalinity). I was able to conclude that there is
statistically significant evidence that there exists a genuine relationship between
alkalinity and mercury concentrations in Largemouth Bass, for Florida Lakes.

In Part II of this project I added additional predictor variables to the model such as
calcium, chlorophyll, and whether the lake was acidic. Unfortunately there was
strong correlation between alkalinity and both calcium and chlorophyll. After
adjusting for alkalinity, calcium and chlorophyll were not explaining much more
variability in the average mercury concentration. My binary predictor variable,
whether or not a lake was acidic, was not individually significant, nor was its
interaction with chlorophyll. The more complex model I built in Part II was not
significantly better than the simple linear regression model with alkalinity. Thus I
reduced my model to the square root of average mercury regressed on the log base 10
of alkalinity.

Regression Equation
sqrt(mercury) = 1.13217 - 0.326523 log(alkalinity)

Summary of Model
S = 0.603943      R-Sq = 50.16%        R-Sq(adj) = 49.18%
PRESS = 20.3895   R-Sq(pred) = 45.37%

In this part of the project I will use predictor variables to predict whether the
average mercury concentration of a lake will be at a poisonous level. After much
research, I could not find a straightforward answer to how much mercury in parts
per million in the muscle tissue of Largemouth Bass is poisonous for humans. In fact
the level of mercury that would poison someone depends on other factors such as
the body weight of that individual. In order to create a binary response variable
from the average mercury concentration variable I have to pick a cutoff value for
an acceptable level of mercury. I will use the median, or 50th percentile, of my
average mercury concentrations as this cutoff value for an acceptable mercury level.

Histogram of Average Mercury

The distribution of average mercury is skewed to the right, and the median average
mercury value is 0.48. For the purposes of this project I will consider an average
mercury concentration greater than 0.48 parts per million dangerous. This is not a
value that truly separates poisonous from non-poisonous Largemouth Bass; I repeat,
I am only using this cutoff for the purposes of this project so I can demonstrate
logistic regression. I believe that as the alkalinity level of a lake increases, the
probability that the fish from that lake will have a poisonous level of mercury
decreases.


The graph above illustrates the distribution of chlorophyll between the two levels of
the binary response variable, poison. Harmful indicates an average mercury
concentration above the median of 0.48 parts per million, and harmless indicates an
average mercury equal to or less than 0.48 parts per million. The observations have
dot sizes proportional to the number of fish that were sampled to calculate the given
average mercury concentration. We can see some discrimination between successes
(harmless) and failures (harmful); however, there doesn't seem to be much. All the
observations with chlorophyll levels greater than the blue line (about 40) have
harmless mercury levels, but below this line there are many harmless and harmful
observations. Although the mean chlorophyll for the harmless group is likely larger
than for the harmful group, I would like to see more discrimination in the data;
perhaps chlorophyll is not a good predictor of whether or not the Largemouth Bass
from a Florida lake will have a harmless level of mercury. I will now investigate the
distribution of calcium between the harmless and harmful levels of average
mercury.


The graph above illustrates the distribution of calcium between the two levels of the
binary response variable, poison. We can see very little discrimination between
successes (harmless) and failures (harmful). Although there are more harmless
observations at higher calcium levels, there are still a few harmful lakes with high
calcium levels. To the left of the blue line (calcium about 30) there are many harmful
and harmless observations. Calcium levels are not doing a good job of separating
harmful from harmless lakes. Chlorophyll, from the previous graph, seems like a
better predictor than calcium of whether a lake is harmless. I will now investigate
the distribution of alkalinity between the harmless and harmful levels of average
mercury.


The graph above illustrates the distribution of alkalinity between the two levels of
the binary response variable, poison. We can see decent discrimination between
successes (harmless) and failures (harmful). To the right of the blue line (alkalinity
of about 35) there are many harmless, and only two harmful, lakes. To the left of this
line there are still many harmless and harmful lakes, but lakes where the
average mercury was calculated from larger sample sizes, and is likely more
accurate, tended to be harmful. Alkalinity seems to be the best quantitative variable
I have to discriminate between harmful and harmless levels of mercury
concentrations in the Largemouth Bass in Florida lakes. I will further investigate
this discrimination with some numerical summaries.


The Florida lakes with Largemouth Bass that contained a harmless average level of
mercury had an average alkalinity of 58.01 mg/L with a standard deviation of 40.69.
The Florida lakes with Largemouth Bass that contained a harmful average level of
mercury had an average alkalinity of 16.26 mg/L with a standard deviation of 19.76.
The alkalinity levels in lakes with harmless fish varied almost twice as much on
average as the alkalinity levels in lakes with harmful fish. This difference in
deviation can be seen in the previous plot of Poison vs. Alkalinity; the spread of the
alkalinity levels for harmless lakes is much larger than for harmful lakes. However,
despite the differences in deviation, there seems to be a significant difference in the
alkalinity levels between the lakes with harmless and harmful levels of average
mercury.

We can test this observed difference in alkalinity by running a two sample t-test
with the following hypotheses.

H0: μ(harmful) - μ(harmless) = 0,

Ha: μ(harmful) - μ(harmless) ≠ 0

Our observed difference of -41.75 led to a t-statistic of -4.78, which follows a
t-distribution with 37.92 degrees of freedom and led to a p-value of less than 0.0001.
There is overwhelming evidence that there is truly a difference in the alkalinity
levels between Florida lakes containing Largemouth Bass with harmless vs harmful
levels of average mercury.
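
The fractional degrees of freedom suggest an unequal-variance (Welch) t-test. The sketch below recomputes the statistic from the group summaries; the group sizes of 26 harmful and 27 harmless lakes are not stated in the text and are an assumption inferred from the percentages of harmless lakes reported for acidic and non-acidic lakes, and the function name `welch_t` is hypothetical.

```python
import math


def welch_t(mean1: float, sd1: float, n1: int, mean2: float, sd2: float, n2: int):
    """Welch two-sample t statistic and Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2          # squared standard errors
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Alkalinity summaries (mg/L); group sizes are ASSUMED, not given in the report.
t_stat, df = welch_t(16.26, 19.76, 26,    # harmful lakes
                     58.01, 40.69, 27)    # harmless lakes

print(f"t = {t_stat:.2f}, df = {df:.2f}")
```

Under the assumed group sizes the result agrees closely with the reported t-statistic and degrees of freedom.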

The graph above is a mosaic plot and contingency table showing the differences in
the proportion of harmless lakes depending on whether the lake was acidic. As a
reminder, my original data contained a pH variable that I recoded as
acidic/non-acidic. There does seem to be a difference in the proportion of harmless
lakes between these two groups; 77.27% of non-acidic lakes were harmless while
32.26% of acidic lakes were harmless. To test if this difference is statistically
significant I will conduct a two sample z-test with the following hypotheses, where
π is the true proportion of harmless lakes.

H0: π(non-acidic) - π(acidic) = 0,

Ha: π(non-acidic) - π(acidic) ≠ 0

Our observed difference led to a z-statistic that yielded a p-value of 0.002.
There is overwhelming evidence that there is truly a difference in the proportion of
lakes containing Largemouth Bass with harmless average mercury levels between
acidic and non-acidic Florida lakes.

We observe an odds ratio of (17/5) / (10/21) = 7.14, matching the
JMP output to the left.

This tells us that the odds of a non-acidic lake containing Largemouth Bass with a
harmless average mercury level were 7.14 times larger than for acidic lakes.
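
The odds ratio can be recomputed from the 2x2 table of lake counts. The counts used below (17 harmless and 5 harmful non-acidic lakes, 10 harmless and 21 harmful acidic lakes) are an assumption reconstructed from the reported percentages and the total of 53 lakes; the variable names are illustrative.

```python
# ASSUMED cell counts, reconstructed from 77.27%, 32.26%, and n = 53 lakes.
harmless_non_acidic, harmful_non_acidic = 17, 5
harmless_acidic, harmful_acidic = 10, 21

# Sanity check: the counts reproduce the reported percentages of harmless lakes.
pct_non_acidic = 100 * harmless_non_acidic / (harmless_non_acidic + harmful_non_acidic)
pct_acidic = 100 * harmless_acidic / (harmless_acidic + harmful_acidic)

odds_non_acidic = harmless_non_acidic / harmful_non_acidic   # odds of harmless, non-acidic
odds_acidic = harmless_acidic / harmful_acidic               # odds of harmless, acidic
odds_ratio = odds_non_acidic / odds_acidic

print(f"{pct_non_acidic:.2f}% vs {pct_acidic:.2f}% harmless")
print(f"odds ratio = {odds_ratio:.2f}")
```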

Although there were significant differences between harmful and harmless lakes for
both alkalinity and the acidic binary indicator, there is likely going to be much
correlation between these variables because alkalinity is a measure of the capacity
of an aqueous solution to neutralize an acid.

PART 3: Single Predictor Variable

Poison is defined as the level of average mercury in the muscle tissue of Largemouth
Bass in a Florida lake, either a harmless or harmful level. Above is the weighted
logistic regression of this poison variable on alkalinity, where a harmless lake is
treated as a success and observations are weighted proportional to the size of the
sample of bass that was used to calculate the average mercury level.

Fitted model equation: ln odds ~hat = -1.772 + 0.050(alkalinity)

This equation can be used to find the predicted odds, and probability, of a particular
lake being harmless, as far as mercury concentration in Largemouth Bass, based on
the alkalinity of the lake. For example, we could predict the probability that a lake
with an alkalinity of 50 mg/L will contain Largemouth Bass with harmless mercury
levels, on average (there was no reason in particular why I chose 50 mg/L; I simply
picked an arbitrary value within the range of my alkalinity values).

ln(odds)~hat = -1.772 + 0.050(50) = 0.728
(odds)~hat = e^0.728 = 2.071
probability~hat = 2.071 / (1 + 2.071) = 0.674
We predict that about 67.4% of all Florida lakes with an alkalinity of 50 mg/L will
be harmless. So if we select one lake with an alkalinity of 50 mg/L, we would predict
that there is a 67.4% chance that the average mercury concentration of the
Largemouth Bass in the lake is at a harmless level.
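
The conversion from log odds to odds to probability can be sketched in a few lines of Python. This is an illustration using the rounded coefficients from the fitted model equation, not the JMP calculation itself; the function name `harmless_prediction` is hypothetical.

```python
import math


def harmless_prediction(alkalinity: float, intercept: float = -1.772, slope: float = 0.050):
    """Return (log odds, odds, probability) that a lake is harmless."""
    log_odds = intercept + slope * alkalinity
    odds = math.exp(log_odds)
    probability = odds / (1.0 + odds)
    return log_odds, odds, probability


log_odds, odds, probability = harmless_prediction(50)
print(f"ln(odds) = {log_odds:.3f}, odds = {odds:.3f}, probability = {probability:.3f}")
```

Because the probability is odds / (1 + odds), every prediction from this model falls strictly between 0 and 1.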

Above is a graph of the estimated probabilities of harmless vs. alkalinity. It is
apparent that as alkalinity increases the probability that the lake contains
Largemouth Bass with a harmless level of mercury increases. There is a sharp
increase from about alkalinity 0 to 80 mg/L, and then we see the rate of change in
the probability with respect to alkalinity level off. As a sanity check all probabilities
are between 0 and 1, which is a good sign. It may be worth noting that it appears
that observations which come from larger sample sizes, marked by their
proportional dot sizes, tend to be on the lower end of the range of alkalinity levels.

JMP reports that the odds of a lake being harmless increase by a multiplicative
factor of 1.051 for every 1 mg/L increase in alkalinity. I can verify this output by
calculating e^0.05 = 1.051, which matches the odds ratio in the table above.

This tells us that every 1 mg/L increase in alkalinity is associated with a 1.051
multiplicative increase in the odds that the lakes contain Largemouth Bass with a
harmless level of mercury. I am 95% confident that the true multiplicative change in
the odds that a lake contains harmless Largemouth Bass associated with a 1 mg/L
increase in alkalinity is between 1.043 and 1.06. This is of course a small change
in the odds because a 1-unit change in alkalinity isn't going to have much of an
effect on the odds that a lake contains harmless Largemouth Bass. However, we can
look at the effect that a 20 mg/L increase in alkalinity would have on the odds that
the lakes contain harmless bass by calculating e^(20*0.05) = e^1 = 2.718. This tells
us that every 20 mg/L increase in alkalinity is associated with a 2.718
multiplicative increase in the odds that the lakes contain Largemouth Bass with a
harmless level of mercury.
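
The odds multiplier for any size of alkalinity increase follows the same pattern, e raised to (slope times increase). A small Python sketch using the rounded slope of 0.050 (the function name `odds_multiplier` is hypothetical):

```python
import math


def odds_multiplier(slope: float, increase: float) -> float:
    """Multiplicative change in the odds for a given increase in the predictor."""
    return math.exp(slope * increase)


slope_alkalinity = 0.050  # log odds per mg/L, from the fitted model equation

per_1_mg = odds_multiplier(slope_alkalinity, 1)
per_20_mg = odds_multiplier(slope_alkalinity, 20)

print(f"1 mg/L increase:  odds x {per_1_mg:.3f}")
print(f"20 mg/L increase: odds x {per_20_mg:.3f}")
```

The multipliers compound rather than add: twenty successive 1 mg/L increases give 1.051 raised to the 20th power, which is about 2.7.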

Fitted model equation: ln odds ~hat = -1.772 + 0.050(alkalinity)

The intercept of this model tells us that lakes with 0 mg/L alkalinity have odds of
e^(-1.772) = 0.170, or a probability of 0.17/1.17 = 0.145, of containing Largemouth
Bass with harmless average levels of mercury.

To test the significance of alkalinity as a predictor variable I will run a test with
the following hypotheses:

H0 : β(alkalinity) = 0
In other words the true change in log odds of a harmless lake with respect to
alkalinity is 0. This means each increase in alkalinity does not affect the odds
of lakes containing harmless Largemouth Bass (odds ratio = e^0 = a multiplicative
change of 1).

Ha : β(alkalinity) ≠ 0
In other words the true change in log odds of a harmless lake with respect to
alkalinity is not 0. This means each increase in alkalinity has some effect on
the odds of lakes containing harmless Largemouth Bass (odds ratio = e^k ≠ 1,
where k is a non-zero constant).

From the parameter estimates output above, alkalinity's observed slope of 0.050
had a standard error of 0.004; this led to a chi-square statistic of 138.93, which
follows a chi-square distribution with 1 degree of freedom. The test statistic yields
a p-value < 0.0001, so we can reject the null hypothesis at the 5% significance level.
There is very strong evidence that alkalinity has an effect on the true odds, and
probability, that the lakes will contain bass with a harmless level of average
mercury. This test conclusion is consistent with the confidence interval for the true
odds ratio I found above because 1 is not inside the interval; thus we are 95%
confident that alkalinity has an effect on the odds.
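
For 1 degree of freedom the chi-square upper tail can be written with the complementary error function, P(X >= x) = erfc(sqrt(x/2)), so the size of this p-value can be checked with the Python standard library. This is an independent check rather than the JMP calculation, the function name `chi2_upper_tail_1df` is hypothetical, and the shortcut applies only when there is 1 degree of freedom.

```python
import math


def chi2_upper_tail_1df(statistic: float) -> float:
    """P(X >= statistic) for a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


p_alkalinity = chi2_upper_tail_1df(138.93)   # reported chi-square for alkalinity
p_reference = chi2_upper_tail_1df(3.841)     # familiar 5% critical value, as a sanity check

print(f"p-value for alkalinity: {p_alkalinity:.3g}")
print(f"p-value at 3.841:       {p_reference:.3f}")
```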


From the output above we can see that my model's misclassification rate is about 20%.
This can be verified from the confusion matrix. Initially I was shocked to see such
large numbers in the confusion matrix because my sample size was only 53.
However, I believe it has to do with running the weighted logistic regression. I think
that a correct prediction for a lake where the average mercury was calculated based
on a sample of 20 bass, for example, counts as 20 correct predictions in the confusion
matrix. The misclassification rate is the ratio of incorrect predictions to total
predictions, which can be calculated as
(sum of the two incorrect cells) / (sum of all four cells) = 0.2023, or 20.23%.
Obviously in a perfect world this rate would be close to zero, which 0.2023 is not,
but we should expect some prediction errors. Overall I think 0.2023 is a subpar
misclassification rate; it means I am making about 1 wrong prediction for every 5
predictions I make.

The whole model test in this case should be the same as testing the slope of
alkalinity because it was the only predictor in the model, but I will run it anyway.

H0 : The model is not useful in predicting the odds of lakes containing harmless bass.
Ha : The model is useful in predicting the odds of lakes containing harmless bass.

Our observed chi-square statistic is 297.192, which follows a chi-square distribution
with 1 degree of freedom, and led to a p-value of < 0.0001 (which matches the
p-value from the significance test of the slope of alkalinity). We can conclude with
near certainty that the model is useful.

Next we will look at the Deviance and Pearson residuals to investigate unusual
observations.

Above are graphs of the standardized Deviance, and standardized Pearson residuals
by the predicted probabilities that lakes will be harmless. I have annotated blue
horizontal lines at residual values -2 and 2. Both these graphs look awful; the
majority of the residuals in both graphs are outside of the blue lines meaning they
have a residual with absolute value greater than 2. It may be worth noting that it
seems that there are more positive than negative residuals, so the logistic model is
underestimating the probability of success more often than it overestimates. I was
not expecting the residuals to behave in this manner, and after much thought I could
not find a reason to explain why nearly all the residuals are greater than 2 in
absolute value.

Lake Puzzle, the most unusual observation, is circled and contains Largemouth Bass
with a harmful level of average mercury despite the lake's high alkalinity level of
87.6 mg/L. After doing some research I found the lake is called Lake Puzzle because
the navigable portions of the lake change seasonally depending on the amount of
rainfall. When the waters recede, previously known boat routes can be hindered by
new, submersed sandbars and deep-water channels that are completely different
from the year before. This could explain its large residual; perhaps fish were
collected from a new portion of the lake where they were recently exposed to some
high levels of mercury despite the lake's overall high level of alkalinity.

PART 3: Multiple Predictors Model

Below are coded plots comparing Alkalinity and Calcium (the best quantitative
predictors determined from the descriptive statistics section) by the level of the
binary response variable (harmless/harmful).

From these graphs we can tell that alkalinity is a better predictor than calcium
(which was established in the descriptive statistics section) because it discriminates
more between harmless and harmful lakes. From the multiple boxplot graph of
Poison vs. Alkalinity & Calcium, the difference between harmless and harmful lakes
is larger for median alkalinity than for median calcium. It is safe to assume that
this relationship is consistent for the respective difference in means as well.

In an effort to find the best balance between fit and model simplicity I will use an
informal selective backward elimination.

Above is relevant output for a weighted logistic regression with all the explanatory
variables I can make use of. Alkalinity, calcium, and chlorophyll are all measured
from the water of a lake. Acidic(char) is a categorical variable, made from an
originally quantitative pH variable, which has value yes if the lake is acidic and no
otherwise. From the Parameter Estimates output chlorophyll has the largest p-value
of 0.218. This tells us that after adjusting for the other explanatory variables
chlorophyll doesn't help us explain a significant amount of the unexplained
variability in the odds, or probability, that lakes will contain Largemouth Bass with a
harmful level of average mercury. Therefore I will rerun the logistic model without
chlorophyll (assume that unless stated otherwise all logistic models are weighted by
the number of bass that was used to calculate the lake's average mercury).

After removing chlorophyll from the model all other explanatory variables stayed
significant. In fact the misclassification rate of 0.1315 did not change at all, which
tells us that we had the same prediction accuracy without chlorophyll in the model. I
will select this model as the one that best balances fit and simplicity, as at the 5%
significance level all the explanatory variables are significant predictors of the
odds, or the probability, of lakes having bass with a harmless level of average
mercury. I will use this model to predict the odds, and probability, that non-acidic
lakes with 50 mg/L alkalinity and 20 mg/L calcium will contain Largemouth Bass with a
harmless level of average mercury.

Fitted model equation:
predicted ln(odds) = -1.536 + 0.104(alk) - 0.09(calc) + 0.634(acidic[no])
predicted ln(odds) = (-1.536 + 0.634) + 0.104(50) - 0.09(20)
predicted ln(odds) = 2.498
predicted odds = e^2.498 = 12.158
predicted probability = 12.158/13.158 = 0.924

I predict that non-acidic lakes with 50 mg/L alkalinity and 20 mg/L calcium have
odds of 12.158, or a probability of 0.924, of containing Largemouth Bass with a
harmless average level of mercury.
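
The arithmetic in this prediction can be checked with a short sketch. It assumes the rounded coefficients quoted in the fitted equation, and the function name is hypothetical. It covers only non-acidic lakes, because this section does not state how the acidic level enters the equation.

```python
import math

# Rounded coefficients copied from the fitted model equation in the report.
INTERCEPT = -1.536
B_ALK = 0.104        # per mg/L of alkalinity
B_CALC = -0.09       # per mg/L of calcium
B_ACIDIC_NO = 0.634  # term added for non-acidic lakes


def predict_harmless_non_acidic(alk, calc):
    """Return (log-odds, odds, probability) of a harmless level of average
    mercury for a non-acidic lake. Hypothetical helper for illustration."""
    log_odds = INTERCEPT + B_ACIDIC_NO + B_ALK * alk + B_CALC * calc
    odds = math.exp(log_odds)
    probability = odds / (1.0 + odds)
    return log_odds, odds, probability


log_odds, odds, probability = predict_harmless_non_acidic(alk=50, calc=20)
print(round(log_odds, 3), round(odds, 3), round(probability, 3))
```

Because the inputs are rounded, the result should agree with the hand calculation to about three decimal places.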
To test if the overall model is useful I will run a whole model test with the following
hypotheses:
H0 : β(alkalinity) = 0 and β(calcium) = 0 and β(acidic[no]) = 0
Ha : at least one of the population slopes in the null hypothesis does not equal 0

Our observed chi-square statistic is 382.048, which follows a chi-square distribution
with 3 degrees of freedom, and yielded a p-value of < 0.0001. At the 5% significance
level, we have overwhelming evidence that at least one of the population slopes is
not equal to zero. In other words the overall model is useful.

To measure the effectiveness of each predictor in the model, I will run 3 separate
significance tests. Before I run these tests I should adjust my significance level so
that I stay at an overall 5% significance level. Using the Bonferroni adjustment I
will test each individual parameter at the 5%/3 = 1.667% significance level.

First I will test Alkalinity with the following hypotheses:
H0 : β(alkalinity) = 0, after adjusting for calcium and the acidic indicator
Ha : β(alkalinity) ≠ 0, after adjusting for calcium and the acidic indicator

We observed a slope of 0.104 and a standard error of 0.013. This led to a chi-square
statistic of 67.20, which follows a chi-square distribution with 1 degree of freedom.
This statistic led to a very small p-value of < 0.0001. At the 1.667% significance level
we have overwhelming evidence that the population slope of alkalinity is not equal
to zero. Therefore alkalinity is an effective predictor of the odds, or probability, that
lakes will contain bass with harmless levels of average mercury.

Next I will test Calcium with the following hypotheses:
H0 : β(calcium) = 0, after adjusting for alkalinity and the acidic indicator
Ha : β(calcium) ≠ 0, after adjusting for alkalinity and the acidic indicator

We observed a slope of -0.090 and a standard error of 0.014. This led to a chi-square
statistic with 1 degree of freedom and a p-value below the adjusted threshold. At the
1.667% significance level we have overwhelming evidence that the population slope
of calcium is not equal to zero. Therefore calcium is an effective predictor of the
odds, or probability, that lakes will contain bass with harmless levels of average
mercury.

Finally I will test the acidic indicator with the following hypotheses:
H0 : β(acidic[no]) = 0, after adjusting for alkalinity and calcium
Ha : β(acidic[no]) ≠ 0, after adjusting for alkalinity and calcium

We observed a slope of 0.634 and a standard error of 0.122. This led to a chi-square
statistic with 1 degree of freedom and a p-value below the adjusted threshold. At the
1.667% significance level we have overwhelming evidence that the population slope
of acidic[no] is not equal to zero. Therefore whether or not lakes are acidic is an
effective predictor of the odds, or probability, that lakes will contain bass with
harmless levels of average mercury.
I will now interpret the confidence intervals for these parameters to give us an idea
of how much each of them affects the odds that lakes will be harmless.

After adjusting for calcium and the acidic indicator, I am 95% confident that a 1
mg/L increase in alkalinity is associated with the odds that the lakes will contain
Largemouth Bass with a harmless level of average mercury being multiplied by
between 1.083 and 1.139 (the exponentiated endpoints of the interval for the slope).
After adjusting for alkalinity and the acidic indicator, I am 95% confident that a 1
mg/L increase in calcium is associated with those odds being multiplied by between
0.889 and 0.939, which is a decrease. After adjusting for alkalinity and calcium,
non-acidic lakes are associated with between 2.208 and 5.743 times the odds of
containing Largemouth Bass with a harmless level of average mercury compared
with acidic lakes.
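
As a rough cross-check of the alkalinity and calcium intervals, the sketch below exponentiates a Wald-style interval (slope ± 1.96 standard errors). It assumes the rounded slopes and standard errors quoted in the significance tests. The software may have used unrounded values or a different interval method, so agreement is only expected to about two decimal places. The function name is hypothetical.

```python
import math

Z_95 = 1.96  # standard normal critical value for a 95% interval


def odds_ratio_interval(slope, std_err, z=Z_95):
    """Exponentiate slope +/- z * std_err to get an interval for the
    multiplicative change in odds per one-unit increase in the predictor."""
    return math.exp(slope - z * std_err), math.exp(slope + z * std_err)


# Rounded estimates as quoted in the report.
alk_low, alk_high = odds_ratio_interval(0.104, 0.013)
calc_low, calc_high = odds_ratio_interval(-0.090, 0.014)

print("alkalinity:", round(alk_low, 3), round(alk_high, 3))
print("calcium:   ", round(calc_low, 3), round(calc_high, 3))
```

The acidic indicator is left out on purpose: its reported interval (2.208 to 5.743) compares the two levels of a categorical variable, and how that comparison relates to the 0.634 coefficient depends on the coding the software used, which this section does not state.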

Below is relevant output for a drop in deviance test for the quadratic alkalinity term:

H0 : β(alk^2) = 0, after adjusting for calcium, alkalinity, and the acidic indicator
Ha : β(alk^2) ≠ 0, after adjusting for calcium, alkalinity, and the acidic indicator

We observe a drop in deviance of 573.356 - 556.812 = 16.544. This drop in deviance
is our chi-square statistic, which follows a chi-square distribution with 1 degree of
freedom. I used an online chi-square distribution probability calculator at
https://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html.

With a p-value of about 0 (below 0.0001) we can reject the null hypothesis at the
alpha = 0.05 level. In other words the quadratic term is significant and is helpful in
predicting the odds of success. The sign of the term is positive, which tells us that at
higher alkalinity levels, alkalinity has a greater effect on the odds, and probability,
that lakes will contain fish whose average mercury concentration is at a harmless
level. Also, between the two models we see a drop in the AICc of 582.189 - 568.088 =
14.101. Therefore I will adopt the quadratic term into my model.

Next I will carry out a drop in deviance test for the interaction of alkalinity and the
acidic indicator.

H0 : β(acidic[no]*cent. alk) = 0, after adj. for calcium, the acidic indicator, and alk
Ha : β(acidic[no]*cent. alk) ≠ 0, after adj. for calcium, the acidic indicator, and alk

We observe a drop in deviance of 573.356 - 573.248 = 0.108. This drop in deviance
is our chi-square statistic, which follows a chi-square distribution with 1 degree of
freedom.

With a p-value of about 0.742 we cannot reject the null hypothesis at any reasonable
level of significance. In other words the interaction term is not a significant
predictor of the odds of success. The sign of the term is positive, so if it had been
significant it would have told us that each increase in alkalinity has a larger effect
on the odds of success for non-acidic lakes than for acidic lakes.
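
The two drop-in-deviance tests above can also be reproduced without an online calculator. The sketch below is one way to do it using only the Python standard library. It relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper-tail probability is erfc(sqrt(x/2)). The function names are hypothetical.

```python
import math


def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def drop_in_deviance_test(deviance_reduced, deviance_full):
    """Return (statistic, p-value) when the full model adds one parameter."""
    statistic = deviance_reduced - deviance_full
    return statistic, chi2_upper_tail_1df(statistic)


# Deviances quoted in the report.
quad_stat, quad_p = drop_in_deviance_test(573.356, 556.812)    # alk^2 term
inter_stat, inter_p = drop_in_deviance_test(573.356, 573.248)  # interaction

print(round(quad_stat, 3), quad_p)
print(round(inter_stat, 3), round(inter_p, 3))
```

This shortcut is only valid for a single added parameter. A test that adds several terms at once would need a chi-square tail with more degrees of freedom.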

Output for my final model is below:

The misclassification rate of my final model is 0.146, which can be confirmed from
the confusion matrix: (79 + 22)/(241 + 350 + 79 + 22) = 0.146. This is an
improvement of about 0.06 (6 percentage points) over the misclassification rate of
the model with only alkalinity.
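
The confusion-matrix arithmetic can be written out as a small sketch. The split into correctly and incorrectly classified counts follows the formula above, where 79 and 22 appear in the numerator. The variable names are hypothetical.

```python
# Counts taken from the confusion matrix quoted in the report.
correctly_classified = [241, 350]
misclassified = [79, 22]

total = sum(correctly_classified) + sum(misclassified)
misclassification_rate = sum(misclassified) / total

print(round(misclassification_rate, 3))
```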

Next we will look at plots of residuals for my final model.

Once again, as in the single predictor model, the plots of both the standardized
deviance and the standardized Pearson residuals look bad. The majority of the
residuals in both plots are beyond 2 in absolute value (indicated by the blue
horizontal lines). The most unusual observation, Lake Sampson, is circled in blue.
On inspection I realized Lake Sampson had an unusually high calcium level
relative to its alkalinity level, which could explain why my model greatly
underestimated the odds of it containing harmless Largemouth Bass.

PART 3: Conclusion

In this report I wanted to find the best model to predict the odds, or probability, of
lakes containing Largemouth Bass with a harmless level of average mercury in their
muscle tissue. Again I could not find the cutoff value for a safe amount of mercury in
Largemouth Bass, so I used the median average mercury value as the cutoff between
a harmless and harmful level. I would recommend using the multiple logistic model
because, although it is a bit more complicated, it reduced the misclassification rate
by about 6 percentage points and dropped the AICc by about 14. I am still puzzled
why there were so many observations with extremely high residuals in my model,
when it seemed to be fairly accurate. I believe this is the weakest point of my
model. For future analysis I would like to bring in more variables to make a more
accurate model and perhaps reduce the number of observations with such large
residuals. A larger sample size is always nice, and if I could collect data on lakes
outside of Florida I might be able to create a more generalizable model.

PART 3: Ordinal Logistic Regression Model

To demonstrate ordinal logistic regression I had to split my response, average
mercury, into three categories.

Once again I used quantiles to do this. I recoded average mercury as low if it was
below 0.339 (33.33rd percentile), med if it was between 0.339 and 0.629 (66.66th
percentile), and high if it was larger than 0.629.
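
The recoding rule can be sketched as a small function. The cutoffs come from the text. The function name is hypothetical, and assigning a value exactly equal to a cutoff to the med category is an assumption, since the report does not say how ties were handled.

```python
LOW_CUTOFF = 0.339   # 33.33rd percentile of average mercury
HIGH_CUTOFF = 0.629  # 66.66th percentile of average mercury


def mercury_category(avg_mercury):
    """Recode a lake's average mercury into low / med / high."""
    if avg_mercury < LOW_CUTOFF:
        return "low"
    if avg_mercury <= HIGH_CUTOFF:  # boundary values go to "med" (assumption)
        return "med"
    return "high"


print([mercury_category(x) for x in (0.20, 0.50, 0.90)])
```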

The first model I ran was with alkalinity, calcium, the acidic indicator, and alkalinity-
squared. The output is below:

Calcium has a p-value of 0.1198 after adjusting for the other predictors, so I will
remove it and re-run the model. The misclassification rate is about 0.561, which is
very high. This means we correctly predict a lake's mercury category for fewer than
half of the observations.
The second model I ran was with alkalinity, the acidic indicator, and alkalinity-
squared. The output is below:

Now alkalinity became insignificant at the 5% significance level, which was
surprising to me because I found that it distinguished the best between harmless and
harmful lakes at the beginning of my project. However, because the p-value of 0.081
is not too far from 0.05 and the quadratic term (alkalinity-squared) is significant,
I decided to keep it in the model. (Perhaps alkalinity was not significant because of
its obvious correlation with alkalinity-squared.) Nevertheless, removing calcium did
not improve the model's misclassification rate, which is still a large 0.561. I would
not recommend this model over the multiple logistic model that I decided to use in
my conclusion.

The fitted model equations are as follows:

ln(high/med) = -0.495 - 0.015(alk) + 0.0002(alk^2) - 0.473(acidic[no])
ln(low/med) = 0.852 - 0.015(alk) + 0.0002(alk^2) - 0.473(acidic[no])
