
Consequences of the Log Transformation

Estimating the typical value of a single population

Example: Mercury Concentrations of Minnesota Walleyes


Data File: Walleyes (1990-1998) Major Waterways

These data come from walleyes sampled from major waterways in Minnesota during the
years 1990-1998. One of the major characteristics of interest to fishery biologists is the
mercury contamination (in parts per million or ppm) found in the tissues of walleyes.

We begin by examining a histogram and summary statistics for the mercury
contaminations found in the sampled walleyes.

Clearly the distribution of mercury concentrations is extremely skewed to the right. For
variables with very skewed distributions the median is generally a better measure of
typical value than the mean because the mean is inflated by the extreme cases in the tail
of the distribution. The median mercury contamination found in the sampled walleyes
is .25 ppm while the mean is .365 ppm. When working with an extremely right-skewed
distribution it is common practice to work with the characteristic of interest in the
logarithmic scale. The base of the logarithm is unimportant, because changing the base
only multiplies every logged value by the same constant.
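
The two claims in the paragraph above (the mean is pulled toward the tail, and the base of the logarithm does not matter) can be checked with a small sketch. The values in `hg_ppm` are hypothetical right-skewed numbers made up for illustration; they are not the walleye data.

```python
import math
import statistics

# Hypothetical right-skewed concentrations in ppm (NOT the walleye data).
hg_ppm = [0.08, 0.12, 0.15, 0.19, 0.25, 0.31, 0.42, 0.70, 1.60]

mean_ppm = statistics.mean(hg_ppm)
median_ppm = statistics.median(hg_ppm)
# The single large value in the right tail inflates the mean but not the median.
print(f"mean = {mean_ppm:.3f} ppm, median = {median_ppm:.3f} ppm")

# Base 10 and natural logs of the same values differ only by a constant factor.
log10_vals = [math.log10(x) for x in hg_ppm]
ln_vals = [math.log(x) for x in hg_ppm]
factors = [ln / l10 for ln, l10 in zip(ln_vals, log10_vals)]
print(all(math.isclose(f, math.log(10)) for f in factors))
```

Because every natural log equals the base 10 log times ln(10), the shape of the logged distribution is the same in either base; only the axis labels change.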

To transform a variable in JMP you must use the JMP Calculator which allows you to
perform a variety of data transformations and manipulations. To create a new column
containing a function of another column double-click to the right of the last column to
add a new column to the spreadsheet. Next double-click at the top of the column to obtain
the Column Info window. In the window change the name of the new column to
log10(Hg) and select Formula from the New Property pull-down menu and click Edit
Formula. 

The Column Info box in JMP

The JMP Calculator should then appear on the screen. To take the base 10 logarithm of
the HGPPM variable, first select Transcendental from the menu to the right of the
calculator keypad because the logarithm is a transcendental (non-algebraic) function. In
the list that appears in the rightmost menu select base 10 logarithm (i.e. log10). In
the formula window you should see log10. Now you need to supply the name of the variable
you wish to take the logarithm of, which is HGPPM in this case, by selecting it from the
variable list on the left of the calculator window.

The JMP Calculator

When finished the formula window will then look like:


Log10(HGPPM)

Finally click Apply and close the calculator window. The new column you created
should now contain the base 10 logarithm of the mercury concentrations. The histogram
and summary statistics for the log 10 Hg readings are shown below. We can clearly see
approximate normality has been achieved through the log transformation.
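
For readers working outside JMP, the formula column can be mimicked in a few lines. This is only a sketch: the rows below are hypothetical stand-ins for the data table, and only the column name HGPPM comes from the text.

```python
import math

# Hypothetical rows standing in for the JMP data table (values are made up).
rows = [{"HGPPM": 0.12}, {"HGPPM": 0.25}, {"HGPPM": 0.90}]

# Equivalent of the formula Log10(HGPPM): add a derived column to every row.
for row in rows:
    row["log10(Hg)"] = math.log10(row["HGPPM"])

for row in rows:
    print(row)
```

Note that `math.log10` raises a `ValueError` for zero or negative inputs, so the transformation only applies to strictly positive concentrations.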

Histogram, Boxplot, and Normal Quantile Plot for log10(Hg)

Summary Statistics for log10(Hg)

Here we see that both the median and mean are approximately -.600 in the log base 10
scale (i.e. in units of log base 10 ppm).

Back-Transforming the Mean and Median to the Original Scale


We can back-transform the mean and median values for the log base 10 mercury level as
follows:

Median back-transformed to the original scale = $10^{-.602} = .250$, which is exactly the
median we found when looking at the data in the original scale above. This is an
extremely important observation.

Mean back-transformed to the original scale = $10^{-.599831} = .2513$, which is well below
the sample mean of .365 ppm in the original scale above. This is also an extremely
important observation.

What we have seen is that the median of the data in the original scale is the same as the
back-transformed median of the data in the log scale. Put another way, we see that the
log base 10 of the sample median in the original scale is the same as the sample median
of the data in the log base 10 scale.

However, the mean in the original scale is NOT the same as the back-transformed mean
of the data in the log scale. In other words, we see that the log base 10 of the sample
mean in the original scale is NOT the same as the sample mean of the data in the log base
10 scale.
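
The contrast between the median and the mean can be reproduced on any positive sample. The sketch below uses hypothetical values (not the walleye data) with an odd sample size, so that the median is a single observed value.

```python
import math
import statistics

# Hypothetical right-skewed sample in ppm with an odd number of observations.
hg_ppm = [0.08, 0.12, 0.15, 0.19, 0.25, 0.31, 0.42, 0.70, 1.60]
log_hg = [math.log10(x) for x in hg_ppm]

median_log = statistics.median(log_hg)
mean_log = statistics.mean(log_hg)

back_median = 10 ** median_log  # matches the median in the original scale
back_mean = 10 ** mean_log      # this is the geometric mean, not the sample mean

print(f"median (original) = {statistics.median(hg_ppm):.4f}, back-transformed = {back_median:.4f}")
print(f"mean (original)   = {statistics.mean(hg_ppm):.4f}, back-transformed = {back_mean:.4f}")
```

With an even sample size the median is the average of the two middle values, so the back-transformed median agrees with the original-scale median only approximately, although the two are usually very close.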

If we define the following:


$\bar{X}$ = sample mean (original scale)        $\log_{10}(\bar{X})$ = log of the sample mean
$Med$ = sample median (original scale)          $\log_{10}(Med)$ = log of the sample median
$\bar{X}_{\log}$ = sample mean (log scale)      $Med_{\log}$ = sample median (log scale)

For the median we have:


$\log_{10}(Med) = Med_{\log}$
or equivalently,
$Med = 10^{Med_{\log}}$

In contrast for the mean we have:


$\log_{10}(\bar{X}) \neq \bar{X}_{\log}$
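
The direction of the discrepancy for the mean is not specific to this data set. As a short supporting derivation (standard algebra, not part of the JMP output), write $x_1, \dots, x_n$ for the positive observations in the original scale. Back-transforming the log-scale mean gives the geometric mean, which by the arithmetic-geometric mean inequality can never exceed the sample mean:

```latex
10^{\bar{X}_{\log}}
  = 10^{\frac{1}{n}\sum_{i=1}^{n}\log_{10} x_i}
  = \Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n}
  \;\le\; \frac{1}{n}\sum_{i=1}^{n} x_i
  = \bar{X}
\qquad\Longrightarrow\qquad
\bar{X}_{\log} \le \log_{10}(\bar{X})
```

Equality holds only when all observations are equal, so for right-skewed data the back-transformed mean falls below the sample mean, as it does for the walleyes (.2513 versus .365).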

Furthermore, because the distribution in the log scale is approximately normal, and hence
symmetric, the median in the log scale is approximately the same as the mean in the log
scale. Thus any inferences (e.g. CIs and hypothesis tests) made for the mean in the log
scale can be thought of as inferences for the median in the log scale as well.

Using the notation above we have:


$\bar{X}_{\log} \approx Med_{\log}$

A 95% CI for the population mean Hg concentration in the log scale, and hence the
population median, is given by (-.633, -.567).

Back-transforming the endpoints of this interval to the original scale gives the following
interval (.233 ppm, .271 ppm).
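
The back-transformation of the endpoints is simple arithmetic and can be verified directly. The sketch below uses only the log-scale interval quoted above.

```python
# 95% CI endpoints in the log base 10 scale, as reported in the text.
log_ci = (-0.633, -0.567)

# Back-transform each endpoint: raise 10 to the endpoint.
median_ci_ppm = tuple(10 ** endpoint for endpoint in log_ci)

print(tuple(round(v, 3) for v in median_ci_ppm))  # → (0.233, 0.271)
```

Because raising 10 to a power is an increasing function, the lower endpoint stays the lower endpoint and the interval keeps its 95% confidence level.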

THIS IS A CONFIDENCE INTERVAL FOR THE MEDIAN IN THE ORIGINAL SCALE!
(Again, this is because $10^{Med_{\log}} = Med$.)

Hypothesis Testing Example
Suppose we wish to test:
$H_0$: The typical mercury level of walleyes in MN $\le$ .20 ppm
$H_a$: The typical mercury level of walleyes in MN > .20 ppm
Because our data are so right-skewed, the typical mercury level is best measured by the
population median. To make an inference for the median for right-skewed data we can
use the log transformation again. Restating our hypotheses in the log scale we have:
(Note: $\log_{10}(.20) = -.699$)
$H_0$: The typical log mercury level of walleyes in MN $\le$ -.699 log base 10 ppm
$H_a$: The typical log mercury level of walleyes in MN > -.699 log base 10 ppm
Using the Test Mean... option from the log10(Hg) pull-down menu we obtain the
following results.

We have extremely strong evidence against the null hypothesis in favor of the alternative
hypothesis. Hence we would conclude that the median Hg concentration (original scale)
found in Minnesota walleyes exceeds .20 ppm.
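
The mechanics of the log-scale test can be sketched as follows. The data are hypothetical (the walleye measurements and the JMP output are not reproduced in the text), so the resulting statistic illustrates the calculation only and says nothing about the actual test result.

```python
import math
import statistics

# Hypothetical right-skewed sample in ppm (NOT the walleye data).
hg_ppm = [0.08, 0.12, 0.15, 0.19, 0.25, 0.31, 0.42, 0.70, 1.60]
log_hg = [math.log10(x) for x in hg_ppm]

# Null value restated in the log scale: log10(.20) is about -.699.
mu0_log = math.log10(0.20)

n = len(log_hg)
mean_log = statistics.mean(log_hg)
se_log = statistics.stdev(log_hg) / math.sqrt(n)

# One-sample t statistic; large positive values favor Ha (median > .20 ppm).
t_stat = (mean_log - mu0_log) / se_log
print(f"log10(.20) = {mu0_log:.3f}, t = {t_stat:.3f}, df = {n - 1}")
```

The one-sided p-value is the upper-tail area of a t distribution with n - 1 degrees of freedom beyond `t_stat`. If SciPy is available, `scipy.stats.ttest_1samp(log_hg, mu0_log, alternative="greater")` should return the same statistic along with that p-value.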

Comparative Analyses in the Log Scale
We have seen that the consequence of the log transformation for single population
inference is that our inferences are made about the median in the original scale rather
than the mean. When comparing two (or more) populations where the variable of interest
has a right-skewed distribution, the log transformation is again frequently used. The
consequences of the log transformation for comparative analyses are similar in nature to
the single population case discussed above: our inferences will be about how the
population medians compare in the original scale.

Example: Mercury Levels in Walleyes from Fish Lake vs. Island Lake
Data File: Walleyes Fish vs. Island

The key property of logarithms we will be using in our discussion is as follows:

$\log(x) - \log(y) = \log\left(\frac{x}{y}\right)$

i.e. the difference of two variables, x and y, in the log scale is equivalent to the log of
their ratio.
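
A quick numerical check of this property, using two hypothetical concentrations chosen only for illustration:

```python
import math

# Hypothetical median concentrations in ppm for two lakes (made-up values).
x, y = 0.90, 0.20

diff_of_logs = math.log10(x) - math.log10(y)
log_of_ratio = math.log10(x / y)

# The two quantities agree up to floating-point rounding.
print(math.isclose(diff_of_logs, log_of_ratio))

# Back-transforming the difference therefore recovers the ratio x / y.
print(round(10 ** diff_of_logs, 6), round(x / y, 6))
```

This is why a difference estimated in the log scale turns into a multiplicative comparison once it is back-transformed.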

Comparative Analyses in JMP

Comparative Display and Summary Statistics in the Original Scale

Both distributions appear to be right-skewed. For both lakes the sample mean exceeds the
sample median. It also appears that the mercury levels in Island Lake are more spread
out, i.e. the population variance/standard deviation appears to be larger.

Comparative Display and Summary Statistics in Log 10 Scale

Both distributions in the log scale appear to be approximately normal; however, the Fish
Lake Flowage distribution shows evidence of kurtosis. The means and medians are closer
in value in the log scale.

Comparing the Population Variances/Standard Deviations (log 10 scale)

We have strong evidence that the population variances/standard deviations are not equal.

Independent Samples Test for Comparing Means/Medians (log 10 scale)

We have strong evidence that the population means/medians in the log 10 scale differ
(p < .0001).

A 95% CI for $(\bar{X}_{\log}^{Island} - \bar{X}_{\log}^{Fish})$ or equivalently
$(Med_{\log}^{Island} - Med_{\log}^{Fish})$ is given by (.490, .722). Using the fact that
$\log_{10}(Med) = Med_{\log}$ and the difference of logarithms property above, we can say
this is also a confidence interval for the following:

$\log_{10}(Med_{Island}) - \log_{10}(Med_{Fish}) = \log_{10}\left(\frac{Med_{Island}}{Med_{Fish}}\right)$


So (.490, .722) is a confidence interval for the log base 10 of the ratio of the population
median Hg level for Island Lake to the population median Hg level for Fish Lake
Flowage. If we back-transform the endpoints of this interval we will obtain a confidence
interval for the ratio of medians in the original scale, i.e. $Med_{Island} / Med_{Fish}$.
Doing this we obtain:

$(10^{.490}, 10^{.722}) = (3.09, 5.27)$

Therefore we estimate with 95% confidence that the median Hg level found in walleyes
from Island Lake is between 3.09 and 5.27 times the median Hg level found in walleyes
from Fish Lake Flowage.
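
The ratio interval can be checked with the same endpoint arithmetic. The sketch also shows how the "no difference" value changes between scales: a difference of 0 in the log scale corresponds to a ratio of 1 in the original scale. Only the reported interval (.490, .722) is taken from the text.

```python
# 95% CI for the difference in log base 10 scale (Island minus Fish), from the text.
log_diff_ci = (0.490, 0.722)

# Back-transforming a difference of logs gives a ratio of medians.
ratio_ci = tuple(10 ** endpoint for endpoint in log_diff_ci)
print(tuple(round(v, 2) for v in ratio_ci))  # → (3.09, 5.27)

# "No difference" is 0 in the log scale and 1 on the ratio scale.
excludes_zero = not (log_diff_ci[0] <= 0 <= log_diff_ci[1])
excludes_one = not (ratio_ci[0] <= 1 <= ratio_ci[1])
print(excludes_zero, excludes_one)  # → True True
```

Both checks agree, which is expected: an interval that excludes 0 in the log scale always excludes 1 after back-transformation.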

We will see this type of comparative analysis of data in the logarithmic scale when we
examine pair-wise comparisons in ANOVA later in the course.
