
WEEK 1

Categorical variable. Ex.: choice to drive in inclement weather (action question)


table of frequency: table of counts ----> visualize: bar chart, that is:
1) raw data -> 2) table of counts -> 3) bar chart (vertical: the counts; horizontal: the variable value)
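In R this pipeline is just two function calls; a minimal sketch with made-up data (the variable name and values below are hypothetical, not from the course dataset):

drive_choice <- c("Yes", "No", "Yes", "Yes", "No")   # step 1: raw data (made up)
counts <- table(drive_choice)                        # step 2: table of counts
barplot(counts,                                      # step 3: bar chart
        xlab = "Choice to drive", ylab = "Frequency")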
Numerical variable: Ex.: accidents THOUGHT to be due to driving in bad weather (belief question)
* Well sure - with another table. Since this new variable has a LARGE SPAN (from 0 to 100), we'll
want to use what's called a GROUPED FREQUENCY TABLE.
* GROUPED because the variable's RANGE is so LARGE, and we want to CHUNK or BIN the
data into smaller, more manageable ranges. After all, since we KNOW we want to SUMMARIZE,
does it make sense for us to make a table for each row or possible percentage value?
* Most of the time we want to shoot for ABOUT 10 BINS in our frequency tables - or 10 ROWS. So
let's chunk our data by 10. Just like we did with categorical data, we want to count the number of
rows that fit a certain criterion.
The number of rows that fit the 0 to 10% BIN IS 3; the number of rows that fit the 10 to 20% BIN
IS 2; the number of rows that fit the 20 to 30% BIN IS 9; and so on.
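A quick sketch of how this binning could be done in R with cut() and table() (the percentage values here are invented for illustration, not the course data):

pct <- c(5, 8, 17, 22, 23, 25, 26, 27, 28, 28, 29, 44, 67, 88)   # hypothetical belief percentages
bins <- cut(pct, breaks = seq(0, 100, by = 10), right = FALSE)   # chunk into 10-point bins
table(bins)                                                      # grouped frequency table: counts per bin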
* And now, we can make what's called a HISTOGRAM. At first glance, it looks like a BAR
GRAPH, and in a way it is. Both the bar graph and the histogram COUNT the number of rows (or
FREQUENCY) for which a variable took a certain value. For the BAR GRAPH, that VALUE is a
CATEGORY, and we show that by the SEPARATION of the bars in the graph.
But for a HISTOGRAM, that VALUE is NUMERICAL, and it's USUALLY CONTINUOUS.
* And we show this by REMOVING THAT SEPARATION BETWEEN THE BARS. And there you
have it. Two of the most common, and useful, graphs in UNIVARIATE STATISTICS: the BAR
GRAPH and the HISTOGRAM. Both of which are visualizations of our data that come from
aggregate summaries through tables.
* When we NUMERICALLY SUMMARIZE a variable from our dataset in statistics, we want to
give our reader TWO MAIN THINGS. We want to give them a measure of the variable's CENTER,
and we want to give them a measure of the variable's SPREAD around that center.
* Now we've got a couple of candidates to choose from when it comes to the measure of CENTER.
We've got the MEAN, the MEDIAN, and the MODE.
* For our measure of SPREAD, on the other hand, we have the candidates of STANDARD
DEVIATION or the INTERQUARTILE RANGE (or IQR).
* But what about these other measures of center: the median and the mode? How are these alike or
how are these different from the mean? More importantly, when should we use each of them? Under
what conditions?
* When the distribution is skewed to the right (or a positive skew), the mean gets pulled to the right
and brings along with it the median. The mode stays right where it is, at the peak of the data, at the
most commonly occurring value.

* When the distribution is skewed to the left, the mean gets pulled to the left, and again pulls the
median with it, while again the mode stays right where it was. Isn't that interesting?

* If we want a measure of center, and the distribution is SYMMETRICAL (it's not skewed in either
direction), then the MEAN makes sense to use. It's a common measure that everyone can
understand. In fact, ALL THREE MEASURES of center will actually be REALLY CLOSE to
EACH OTHER (in the case of a symmetric distribution).
*But if the distribution is SKEWED, then the MEAN turns out to be a BAD IDEA. With a
POSITIVE (RIGHT) skew, the MEAN OVERESTIMATES the measure of CENTER.
* The same thing happens with a SKEW ON THE LEFT; the MEAN UNDERESTIMATES the
central value of the distribution.
* So which measure of center should we use? The way the measures act informs this decision.
A symmetrical variable distribution can use the mean; it's a good measure that people
understand. If the distribution of the variable is skewed, then the median is the measure of
choice. It's NOT the most common value (or the mode); it's the measure of center that
informs the center of the distribution without being too affected by the skew as the mean is.
* This same rationale holds for the measure of spread. When the distribution is symmetrical, the
Standard Deviation is a great measure of spread. It gives us a measure of variability around the
mean.
* But, there's a hiccup - the Standard Deviation only describes a particular place in the
distribution as long as it's symmetrical: it falls right where the curve changes (the inflection
point). But what happens when the distribution is skewed?
* Where that curve changes is different on one side of the mean vs. the other. And remember, with a
skewed distribution, the mean is actually a poor choice to use. And the mean is used in the
calculation of Standard Deviation. So we turn to the Interquartile Range (or IQR).

*Symmetric distribution: (mean; standard deviation)


*Skewed distribution: (median; IQR)
* This measure of spread (IQR) is the difference between the first and the third quartiles,
which are informed by the median - the preferred measure of center for a skewed distribution.
*So, to answer our question: because a visual inspection of the distribution for both groups shows
a mix of symmetrical and skewed data, we use the median for both - the stable measure. Private
universities have a higher out-of-state tuition than Public schools: the median cost for a Public
school is $16,500, while for a Private school that cost is around $23,000.
In symmetric distributions, the mean, mode, and median are very close to one another. In skewed
distributions, the median is the best option for representing the center (without being too affected
by the "deformation" of the skew).
Summary so far:

Symmetric distributions: mean and standard deviation

Skewed distributions: median and IQR (interquartile range: Q3 - Q1)
Mixed distributions: median and IQR (?)
in symmetric distributions, the mean, the median, and the mode are very close to each other
in skewed distributions, the median is better for representing the central tendency, since it is
the least distorted by the skew.
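A tiny R check of this rule of thumb - a sketch with simulated right-skewed data (the exponential sample below is made up, not from the course):

set.seed(1)
x <- rexp(1000, rate = 0.5)   # strongly right-skewed sample
mean(x)                       # pulled toward the long right tail
median(x)                     # stays closer to the bulk of the data
hist(x)                       # visual check of the positive skew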

WEEK 2
Z-scores

First, "Can we compare two students from two classes, who took an exam on the same
material?" In general, the answer is yes - as long as we were confident that the exams were
close in material and difficulty. It would be even better if the exams were exactly the same.
Then we could simply compare the actual values: John got an 83 while Jayne got an 89, so
Jayne performed better on the exam.
But what if the exams were NOT all that similar? What if one was harder than the other in
terms of material? Could we compare John and Jayne then? The answer is yes.

We can do it with the help of z-scores. You see, z-scores are a way to give everyone
within a single distribution - regardless of what the values are for that distribution - the
same "ruler."
And here's how: First, we start with the difference between a single score and the mean
of the distribution that that score came from. For John, the mean of his distribution (or his
class) was 74 [83 pts], while for Jayne [89 pts] her mean was 80. Both John and Jayne show
a difference to the mean of 9 points. But that's not the whole story: each distribution has a
measure of center and a measure of SPREAD.
In our case, because we're assuming a normal, symmetrical, distribution, the Standard
Deviation is the measure of choice for spread. So let's use it. We take the difference that
we've already found, and we divide by the Standard Deviation of each distribution.
So, if we solve for both z-scores, we find that John's z-score of exam performance is 2.25,
while for Jayne, her z-score for her performance is 1.5. So, who performed better?

Well Jayne got a HIGHER raw score, but her class did better overall (maybe it was just
an easier exam). And more importantly, Jayne's class had a larger Standard Deviation.
It had more spread around the mean, more variability in each student's score.
So on the "ruler" of performance: John was 2.25 units better than average, while Jayne
was only 1.5 units better than average. And we call these particular units z-scores.
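The arithmetic in R - a minimal sketch; the class standard deviations of 4 and 6 are not stated in the notes but are inferred from the z-scores (9/4 = 2.25 and 9/6 = 1.5):

z_john  <- (83 - 74) / 4   # = 2.25 (score minus class mean, divided by class SD)
z_jayne <- (89 - 80) / 6   # = 1.5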
The right proportion under the curve: Here's a typical "bell-shaped" curve:
One peak, symmetrical, most cases fairly close to the center or mean, and fewer
cases going out in both directions from the mean.
We know that for a normal distribution that mean is right in the middle, and the
Standard Deviation occurs at the points where the curve changes (the inflection points on either
side of the mean). Now it turns out that the normal distribution, this wonderful curve, has some
really stable properties that allow us to find out the percentage of cases that fall both
above or below or inside or outside particular z-scores.

Here's the empirical rule: it's a great rough guideline to start with when it comes to area
under the curve.
Here's where the first Standard Deviation falls (right where the curve changes - the inflection
point). The empirical rule says that 68% of the cases in this distribution will fall
between these first Standard Deviations, of -1 and +1, for a z-score (one Standard
Deviation below and one Standard Deviation above the mean).
95% of the cases will fall between -2 and +2 , in terms of z scores or Standard
Deviations.
99.7% will fall between -3 and +3.
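The same figures can be checked with R's normal-distribution function pnorm, which returns the area under the curve to the left of a z-score:

pnorm(1) - pnorm(-1)   # ~0.683, within one Standard Deviation
pnorm(2) - pnorm(-2)   # ~0.954, within two
pnorm(3) - pnorm(-3)   # ~0.997, within three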

In actuality, we can find the proportion under the curve for ANY z-score we wish.

Using a z-table or a function in R, we can find that roughly 7% of her class performed
better than her (see the figure above).
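For Jayne's z-score of 1.5, that upper-tail proportion comes straight from pnorm:

pnorm(1.5, lower.tail = FALSE)   # ~0.067, i.e. roughly 7% of her class scored above her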
What's great about this is that z-scores will work on any distribution scale. So, if we
wanted to see who performed better on an exam where John's exam was actually out of 100
and Jayne's exam was out of 75 points, we could STILL do it. We simply convert to z-scores first.
So, to answer our question of who did better on the exam, and get the added bonus of
finding the percent of distribution that reflects that performance, we can use z-scores.

Visualizing Univariate Data

Visualizing a categorical variable: frequency table plus bar chart:


Now we can use R to visualize single variables, and that's what we're going to be
learning in this video. So first, whenever I import a new data set, I like to take a look at
it in the spreadsheet view. But another way you can get a snapshot of what your data
set looks like is with a function called head. And that basically gives you a listing of
the first few cases, across every single variable, in a data frame.
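For example (the data frame name animaldata is a guess at how the imported dataset was named; adjust to whatever name was used at import):

head(animaldata)         # first six rows, across every variable
head(animaldata, n = 3)  # or just the first three rows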
We can start listing options [for the plot] and separate each one with a comma. So for the
main title of our overall graph, we're going to give it a main option. And then we're
going to specify what we want that title to be. So we do that by just hitting main=' '
and then putting whatever text we want to appear at the top of the graph in quotes. So we
could call this a Bar Chart of Animal Genders. Now we can include another option to
label our different axes. So our x-axis label is the horizontal axis. We can give an
option called xlab =, and then that will be the label for our x-axis. So this could be
"Animal Gender," and this will just go below where the female and male labels are. And

we can also give it a y label. And we'll just call that Frequency.
So here, with our plot function, all I'm doing is adding a couple different options. It
doesn't matter what order I add these options in. As long as I separate each one by a
comma, it'll work. So here's our bar chart now. It looks a little bit nicer. We've got some
nice titles and labels for our axes. So that was how we want to visualize a single
categorical variable.
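Putting those options together - a sketch assuming the data frame is called animaldata and the gender column is Sex (both names are assumptions); calling plot() on a factor in R draws a bar chart:

plot(animaldata$Sex,
     main = "Bar Chart of Animal Genders",  # title at the top of the graph
     xlab = "Animal Gender",                # label under the female/male categories
     ylab = "Frequency")                    # label for the counts axis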

Visualizing a numeric variable:


But we can also visualize a numeric variable, and we're going to do that with a
histogram. So let's look at this Age at Intake variable here, which is how old each
animal was when it was brought into the shelter. So we see there is an old female dog that was
about 10 years old when it entered the shelter.
So if we want to visualize the distribution of this numeric variable, we can do that
with a histogram. We can use the hist function and just give it, again, the data frame
name, dollar sign, and then the Age at Intake variable (just as we did with the plot function).
Now looking at our Age of Intakes, we can see that there are a lot of animals that come
into the shelter very, very young. And then it rapidly tapers off into very few animals
coming in over the age of 5 or over the age of 10. So this distribution is what we
would call a right skew or positive skew distribution.
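A sketch of that call, again assuming the names animaldata and Age.Intake (the actual column name may differ):

hist(animaldata$Age.Intake,
     main = "Histogram of Age at Intake",
     xlab = "Age at intake (years)")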

Histograms by Groups

And we're going to actually see how we can create a histogram for different groups. So
again using our animal shelter data set, we are going to create two histograms of the age of
intake variable by splitting up the data set into male animals and female animals using
the gender variable.
So I'll create an object that contains the age of intake values for our animals where the
sex variable is equal to female and then I can do the same thing for the males.
So let's call this femaleage, and then we'll use the assignment operator and say I want to take the
variable age of intake, and then I'm going to use those brackets to index this variable. And
I only want to pull the rows under the condition that the sex variable == female.
Now let's do the same thing for male. I'm going to call my object maleage and again take the
age of intake variable only where the sex is now male.
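In code, that subsetting could look like this (the column names Age.Intake and Sex, and the values "Female"/"Male", are assumptions about the dataset):

femaleage <- animaldata$Age.Intake[animaldata$Sex == "Female"]  # ages of female animals only
maleage   <- animaldata$Age.Intake[animaldata$Sex == "Male"]    # ages of male animals only
hist(femaleage)   # one histogram per group
hist(maleage)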

One final option that I'd like to mention for the histogram is that you can change the
number of bins that R uses to make its histogram. So in general you'd like to have between
5 and 15 bins. Usually R will give you a pretty good number of bins just by the default but
you can adjust it. So let's go back up here in our male histogram of ages and add one more
option to our line. So I'm going to add another comma-- and the option here is called
breaks-- and I'm going to say let's give me 5 breaks and see what happens. Now we see
fewer bins than we did before. So there aren't exactly 5 bins here but R will give you the
closest approximation it can while still capturing all the data. So changing the number of
bins in your histogram can change the shape of the distribution that you see with the
histogram. Although in this case we still see this general skew, we do see less definition.
And on the flip side, what if we were to change the breaks to say 15? If we ran that,
now we'd see a lot of definition. We would see that there is one case way far away from
the other cases that we might want to go investigate, which we would have missed if we
had just stuck with our original default histogram. So sometimes it's a good idea to
adjust the number of breaks in your histogram just to see if there's anything that jumps out

at you with more or fewer bins.
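Both variants as a sketch (using the maleage vector created above):

hist(maleage, breaks = 5)    # fewer, wider bins - less definition
hist(maleage, breaks = 15)   # more, narrower bins - the far-off old animal stands out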


If we wanted to go find out the information from this animal here that was much older
than any of the other animals when it entered the shelter, we can do that with a function
called "which", which will pull out the row number of a case in your data frame that meets a
certain condition.
So first let's see what the maximum age of intake was. We can do that with the max
function if we ask for the max of maleage that's just going to pull out the value that is the
largest from that vector. So 17 is the oldest age for the male animals.
Let's just make sure that that's larger than any of our female animals. So here we see the
max age of the females was only 15. So overall-- across all animals in our data set-- 17 is
the oldest animal and it happened to be a male. But let's go in and see if we can find out
the other characteristics of this animal.
Like whether it was a dog or cat, and whether it was sick, injured, or healthy. So what we
can do for that is call the which function and say give me which record in my animal
data set where the age dot intake equals 17.
Now if I run this I'm going to get a number that might not make sense right away. It just
gives me the number 415. So what this is actually telling you is that in row number-- or
record number-- 415 this age dot intake variable is equal to 17, so that's the animal that I
want.
Now in order to call that, I can ask for animal data-- and remember to index it, I can
give it my square brackets and then give it a [row, column]. So the row I want is
literally row 415.
And let's say I want to know every single variable associated with that animal, so I'm just
going to-- again-- leave the column space blank and that will return all columns. So this will
display the animal who was 17 years old when he entered the shelter and we can see that it
was a dog. And that he actually was injured or sick when he came into the shelter.
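The whole lookup as a short sketch (again assuming the names animaldata and Age.Intake):

max(maleage)                                        # 17, the oldest male animal
max(femaleage)                                      # 15, so the overall oldest is a male
which(animaldata$Age.Intake == 17)                  # row number of that record (415 in the lecture)
animaldata[which(animaldata$Age.Intake == 17), ]    # that row, all columns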

Univariate Descriptive Statistics

In this tutorial video, we're going to learn how to run various descriptive statistics on a
single numerical variable. So just like in the histogram tutorial videos, we're going to be
using the animal shelter data set. And we're going to be specifically looking at the age-at-intake variable, which is how old these animals were when they arrived at the shelter.
Another way to describe the amount of spread in a variable is to report the five-number
summary, which consists of the minimum, the maximum, the median or halfway point,
and then also the first and third quartiles, which are what you use to calculate the
interquartile range, which is sometimes reported as a measure of spread.
R also has a really simple function that will give you the five-number summary, and it's
just called fivenum. If we ask for fivenum of our age-at-intake variable, we're going to see
the output of the five numbers in our five-number summary, starting with the min, then the
first quartile, then the median which we see is 1 again, the third quartile which is 3, and then
finally the maximum age which is 17.
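For example, assuming the column is animaldata$Age.Intake:

fivenum(animaldata$Age.Intake)   # min, Q1, median, Q3, max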

Pre-lab week 2

Primary Research Question: How long do animals stay in the shelter before they are
adopted?
Let's break this analysis into its required steps:
1. Determine which animals in the dataset were adopted.
2. Generate a histogram for the length of time these adopted animals were in the
shelter.
Important for knowing the shape of the variable's distribution: symmetric or
skewed, standard, normal, etc.
3. Select the appropriate measures of center and spread to describe the distribution.
4. Identify which animal was an outlier on this particular variable.
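A rough R sketch of these four steps; the column names (Outcome.Type, Days.Shelter) and the value "Adoption" are assumptions and need to be checked against the actual dataset:

adopted_days <- animaldata$Days.Shelter[animaldata$Outcome.Type == "Adoption"]  # step 1: adopted animals only
hist(adopted_days, main = "Days in Shelter Before Adoption", xlab = "Days")     # step 2: shape of the distribution
median(adopted_days); IQR(adopted_days)                                         # step 3: if skewed, report median and IQR
which(animaldata$Days.Shelter == max(adopted_days))                             # step 4: locate the outlier's row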
