* When the distribution is skewed to the left, the mean gets pulled to the left, and again pulls the
median with it, while again the mode stays right where it was. Isn't that interesting?
* If we want a measure of center, and the distribution is SYMMETRICAL (it's not skewed in either
direction), then the MEAN makes sense to use. It's a common measure that everyone can
understand. In fact, ALL THREE MEASURES of center will actually be REALLY CLOSE to
EACH OTHER (in the case of a symmetrical distribution).
* But if the distribution is SKEWED, then the MEAN turns out to be a BAD IDEA. With a
POSITIVE (RIGHT) skew, the MEAN OVERESTIMATES the measure of CENTER.
* The same thing happens with a SKEW ON THE LEFT; the MEAN UNDERESTIMATES the
central value of the distribution.
* So which measure of center should we use? The way the measures act informs this decision.
For a symmetrical distribution we can use the mean; it's a good measure that people
understand. If the distribution of the variable is skewed, then the median is the measure of
choice. It's NOT the most common value (that's the mode); it's the measure of center that
describes the middle of the distribution without being as affected by the skew as the mean is.
* This same rationale holds for the measure of spread. When the distribution is symmetrical, the
Standard Deviation is a great measure of spread. It gives us a measure of variability around the
mean.
* But, there's a hiccup - the Standard Deviation is actually describing a particular space in the
distribution as long as it's symmetrical. On a bell-shaped curve it sits right about where the
curve changes, that is, at the inflection point where the curve switches from bending downward
to bending upward. But what happens when the distribution is skewed?
* Where that curve changes is different on one side of the mean vs. the other. And remember, with a
skewed distribution, the mean is actually a poor choice to use. And the mean is used in the
calculation of Standard Deviation. So we turn to the Interquartile Range (or IQR).
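To see the points above in action, here is a minimal sketch in R. The values in x are made up for illustration (they are not from the course data); they form a right-skewed variable with one large value in the tail. The sketch compares the mean with the median, and the Standard Deviation with the IQR.

```r
# Hypothetical right-skewed values (made up for illustration only)
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 17)

mean(x)     # pulled toward the long right tail
median(x)   # stays near the bulk of the data

# Reflect the values to get a left-skewed version of the same data
y <- max(x) + min(x) - x
mean(y)     # now pulled to the left, below the median
median(y)

# Spread: the Standard Deviation reacts strongly to the extreme value,
# while the IQR only looks at the middle 50% of the data
sd(x)
IQR(x)

# Drop the extreme value and compare how much each measure moves
x_trim <- x[x < 17]
sd(x_trim)
IQR(x_trim)
```

Running this, the mean moves with the tail in both directions while the median barely shifts, and removing the single extreme value changes the Standard Deviation far more than it changes the IQR.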
WEEK 2
Z-scores
First, "Can we compare two students from two classes, who took an exam on the same
material?" In general, the answer is yes - as long as we were confident that the exams were
close in material and difficulty. It would be even better if the exams were exactly the same.
Then we could simply compare the actual values: John got an 83 while Jayne got an 89, so
Jayne performed better on the exam.
But what if the exams were NOT all that similar? What if one was harder than the other in
terms of material? Could we compare John and Jayne then? The answer is yes.
We can do it with the help of z-scores. You see, z-scores are a way to give everyone
within a single distribution - regardless of what the values are for that distribution - the
same "ruler."
And here's how: First, we start with the difference between a single score and the mean
of the distribution that the score came from. John scored 83 and the mean of his distribution
(or his class) was 74, while Jayne scored 89 and her class mean was 80. Both John and Jayne
scored 9 points above their class mean. But that's not the whole story: each distribution has a
measure of center and a measure of SPREAD.
In our case, because we're assuming a normal, symmetrical distribution, the Standard
Deviation is the measure of choice for spread. So let's use it. We take the difference that
we've already found, and we divide by the Standard Deviation of each distribution:
z = (score - mean) / Standard Deviation.
So, if we solve for both z-scores, we find that John's z-score for exam performance is 2.25,
while Jayne's is 1.5. So, who performed better?
Well Jayne got a HIGHER raw score, but her class did better overall (maybe it was just
an easier exam). And more importantly, Jayne's class had a larger Standard Deviation.
It had more spread around the mean, more variability to each one's score.
So on the "ruler" of performance: John was 2.25 units better than average, while
Jayne was only 1.5 units better than average. And we call these particular units z-scores.
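The calculation can be sketched in R as follows. The notes don't state the two Standard Deviations; the values 4 and 6 used here are inferred from the stated z-scores (9 / 2.25 = 4 and 9 / 1.5 = 6), so treat them as an assumption. The helper name z_score is also just an illustrative choice.

```r
# z-score: how many Standard Deviations a score sits above or below its mean
z_score <- function(score, class_mean, class_sd) {
  (score - class_mean) / class_sd
}

# Scores and means are from the notes; the SDs (4 and 6) are inferred
# from the stated z-scores and are an assumption, not given in the text
john_z  <- z_score(83, 74, 4)
jayne_z <- z_score(89, 80, 6)

john_z    # 2.25
jayne_z   # 1.5

john_z > jayne_z   # TRUE: John did better relative to his own class
```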
The right proportion under the curve: Here's a typical "bell shape" curve:
One peak, symmetrical, most cases fairly close to the center or mean, and fewer
cases going out in both directions from the mean.
We know that for a normal distribution the mean is right in the middle, and the
Standard Deviation occurs at the point at which the "curves" sort of change - about
here and here. Now it turns out that the normal distribution, this wonderful curve, has some
really stable properties that allow us to find out the percentage of cases that fall
above, below, inside or outside particular z-scores.
Here's the empirical rule: it's a great rough guideline to start with when it comes to area
under the curve.
Here's where the first Standard Deviation falls (right about here - where the curve
changes). The empirical rule says that 68% of the cases in this distribution will fall
between these first Standard Deviations, of -1 and +1, for a z-score (one Standard
Deviation below and 1 Standard Deviation above the mean).
95% of the cases will fall between -2 and +2 , in terms of z scores or Standard
Deviations.
99.7% will fall between -3 and +3.
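The empirical rule can be checked against the normal curve itself. This is a small sketch using R's pnorm function, which returns the proportion of a standard normal distribution that falls below a given z-score; the helper name within_sd is just an illustrative choice.

```r
# Proportion of a normal distribution within k Standard Deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

# Percentages for 1, 2 and 3 Standard Deviations
pct <- round(100 * within_sd(1:3), 1)
pct   # 68.3 95.4 99.7
```

So the 68 / 95 / 99.7 figures are rounded versions of the exact areas under the curve.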
In actuality, we can find the proportion under the curve for ANY z-score we wish.
Using a z-table or a function in R, we can find that roughly 7% of Jayne's class performed
better than she did (fig. above).
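One way to get that proportion in R, assuming the class scores are normally distributed, is to ask pnorm for the area above Jayne's z-score of 1.5:

```r
# Jayne's z-score from the notes
jayne_z <- 1.5

# Area under the normal curve ABOVE that z-score
above <- pnorm(jayne_z, lower.tail = FALSE)
round(100 * above, 1)   # 6.7, i.e. roughly 7% of the class scored higher
```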
What's great about this is that z-scores will work on any distribution scale. So, if we
wanted to see who performed better on an exam where John's exam was actually out of 100
and Jayne's exam was out of 75 points, we could STILL do it. We simply convert to z-scores first.
So, to answer our question of who did better on the exam, and get the added bonus of
finding the percent of distribution that reflects that performance, we can use z-scores.
We can also give it a y label. And we'll just call that Frequency.
So here, with our plot function, all I'm doing is adding a couple different options. It
doesn't matter what order I add these options in. As long as I separate each one by a
comma, it'll work. So here's our bar chart now. It looks a little bit nicer. We've got some
nice titles and labels for our axes. So that was how we want to visualize a single
categorical variable.
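As a rough sketch of what that looks like, here is a bar chart with a title and axis labels. The sex vector and the title text are made up for illustration; the real tutorial uses a variable from the animal shelter data set, whose exact name isn't given in these notes.

```r
# Hypothetical stand-in for a categorical variable (values made up)
sex <- factor(c("Female", "Male", "Male", "Female", "Male", "Female", "Female"))

# Frequency of each category
counts <- table(sex)
counts

# plot() on a factor draws a bar chart; options are separated by commas
plot(sex, main = "Animals by Sex", xlab = "Sex", ylab = "Frequency")

# Same chart: the order of the named options doesn't matter
plot(sex, ylab = "Frequency", xlab = "Sex", main = "Animals by Sex")
```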
Histograms by Groups
And we're going to actually see how we can create a histogram for different groups. So
again using our animal shelter data set, we are going to create two histograms of the age of
intake variable by splitting up the data set into male animals and female animals using
the gender variable.
So I'll create an object that contains the age of intake values for our animals where the
sex variable is equal to female and then I can do the same thing for the males.
So let's call this femaleage, and then we'll use the assignment key and, say, I want to take the
variable age of intake, and then I'm going to use those brackets to index this variable. And
I only want to pull the rows under the condition that the sex variable == female.
Now let's do the same thing for male. I'm going to call my object maleage and again take the
age of intake variable only where the sex is now male.
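Here is a minimal sketch of that indexing step. The data frame below is a tiny made-up stand-in for the animal shelter data, and the column names Sex and AgeIntake are assumptions (the notes don't give the real names); only femaleage and maleage come from the transcript.

```r
# Tiny hypothetical stand-in for the shelter data (values and column names made up)
animaldata <- data.frame(
  Sex       = c("Female", "Male", "Female", "Male", "Male"),
  AgeIntake = c(1, 3, 0.5, 2, 7)
)

# Keep the age values only for rows where the condition in the brackets is TRUE
femaleage <- animaldata$AgeIntake[animaldata$Sex == "Female"]
maleage   <- animaldata$AgeIntake[animaldata$Sex == "Male"]

# One histogram per group
hist(femaleage, main = "Female Age at Intake", xlab = "Age at Intake")
hist(maleage,   main = "Male Age at Intake",   xlab = "Age at Intake")
```

Note the double equals sign: == tests for equality, while a single = would be an assignment.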
One final option that I'd like to mention for the histogram is that you can change the
number of bins that R uses to make its histogram. So in general you'd like to have between
5 and 15 bins. Usually R will give you a pretty good number of bins just by default, but
you can adjust it. So let's go back up here in our male histogram of ages and add one more
option to our line. So I'm going to add another comma, and the option here is called
breaks, and I'm going to say give me 5 breaks and see what happens. Now we see
fewer bins than we did before. So there aren't exactly 5 bins here, but R will give you the
closest approximation it can while still capturing all the data. So changing the number of
bins in your histogram can change the shape of the distribution that you see with the
histogram. Although in this case we still see this general skew, we do see less definition.
And on the flip side, what if we were to change the breaks to, say, 15? If we ran that,
now we'd see a lot of definition. We would see that there is one case way far away from
the other cases that we might want to go investigate, which we would have missed if we
had just stuck with our original default histogram. So sometimes it's a good idea to
adjust the number of breaks in your histogram just to see if there's anything that jumps out.
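A small sketch of the breaks option, using made-up ages rather than the real shelter data. When breaks is a single number, hist treats it as a suggestion, so the number of bins you get back may not match it exactly.

```r
# Hypothetical ages at intake (made up), with one case far from the rest
maleage <- c(0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 17)

# Ask for roughly 5 bins, then roughly 15
h5  <- hist(maleage, breaks = 5,  main = "About 5 breaks",  xlab = "Age at Intake")
h15 <- hist(maleage, breaks = 15, main = "About 15 breaks", xlab = "Age at Intake")

# hist() also returns the bins it actually used
length(h5$counts)    # number of bins with breaks = 5
length(h15$counts)   # number of bins with breaks = 15
h15$breaks           # the bin edges R settled on
```

With more bins, the isolated case at the far right shows up as its own bar, separated from the rest by empty bins.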
In this tutorial video, we're going to learn how to run various descriptive statistics on a
single numerical variable. So just like in the histogram tutorial videos, we're going to be
using the animal shelter data set. And we're going to be specifically looking at the
age-at-intake variable, which is how old these animals were when they arrived at the shelter.
Another way to describe the amount of spread in a variable is to report the five-number
summary, which consists of the minimum, the maximum, the median or halfway point,
and then also the first and third quartiles, which are what you use to calculate the
interquartile range, which is sometimes reported as a measure of spread.
R also has a really simple function that will give you the five-number summary, and it's
just called fivenum. If we ask for fivenum of our age-at-intake variable, we're going to see
the output of the five numbers in our five-number summary, starting with the min, then the
first quartile, then the median which we see is 1 again, the third quartile which is 3, and then
finally the maximum age which is 17.
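A quick sketch of fivenum on made-up ages (these are not the shelter values, so the output will differ from the numbers quoted above). One detail worth knowing: fivenum reports Tukey's hinges, which can differ slightly from the quartiles that quantile() or IQR() compute by default.

```r
# Hypothetical ages at intake (made up for illustration)
age <- c(0, 0.5, 1, 1, 1, 2, 3, 3, 5, 17)

# Five-number summary: min, lower hinge, median, upper hinge, max
f <- fivenum(age)
f

# Spread of the middle half of the data, based on the hinges
f[4] - f[2]
```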
Pre-lab week 2
Primary Research Question: How long do animals stay in the shelter before they are
adopted?
Let's break this analysis into its required steps:
1. Determine which animals in the dataset were adopted.
2. Generate a histogram for the length of time these adopted animals were in the
shelter.
Important for knowing the shape of the variable's distribution: symmetrical or
skewed, standard, normal, etc.
3. Select the appropriate measures of center and spread to describe the distribution.
4. Identify which animal was an outlier on this particular variable.
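The four steps above might be sketched like this. Everything in the data frame is hypothetical: the values are made up and the column names AnimalID, Outcome and DaysShelter are assumptions, since the notes don't show the real data set. Picking the largest value with which.max is also a simplification of "identify the outlier".

```r
# Tiny hypothetical stand-in for the shelter data (values and names made up)
animaldata <- data.frame(
  AnimalID    = c("A1", "A2", "A3", "A4", "A5", "A6"),
  Outcome     = c("Adoption", "Transfer", "Adoption", "Adoption", "Adoption", "Euthanasia"),
  DaysShelter = c(5, 2, 12, 8, 210, 1)
)

# 1. Keep only the animals that were adopted
adopted <- animaldata[animaldata$Outcome == "Adoption", ]

# 2. Histogram of time in the shelter, to judge the shape of the distribution
hist(adopted$DaysShelter, main = "Days in Shelter (Adopted)", xlab = "Days")

# 3. The toy data are right-skewed, so report the median and IQR
median(adopted$DaysShelter)
IQR(adopted$DaysShelter)

# 4. Find the animal with the most extreme (longest) stay
longest <- adopted[which.max(adopted$DaysShelter), ]
longest
```

If the histogram had come out symmetrical instead, step 3 would use mean() and sd().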