Learning Objectives

1. Understand and calculate three ways that the center of a distribution can be defined

2. Understand and calculate four ways that the amount of variability in a distribution can be determined

3. Understand how skew and level of measurement can help determine which measures of central tendency and variability are most appropriate for a given distribution

Key Terms

Measures of central tendency: categories or scores that describe what is "average" or

"typical" of a given distribution. These include the mode, median and mean.

Percentile: a score below which a specific percentage of a given distribution falls.

Positively skewed distribution: a distribution with a handful of extremely large values.

Negatively skewed distribution: a distribution with a handful of extremely low values.

Measures of variability: numbers that describe the diversity or dispersion in the

distribution of a given variable.

Box plot: a graphic representation of the range, interquartile range and median of a given

variable.

The Mode

The mode is the category with the greatest frequency (or percentage). It is not the

frequency itself. In other words, if someone asks you for the mode of the distribution shown

below, the answer would be coconut, NOT 22. It is possible to have more than one mode in

a distribution. Such distributions are considered bimodal (if there are two modes) or multimodal (if there are more than two modes). Distributions without a clear mode are said to be

uniform. The mode is not particularly useful, but it is the only measure of central tendency

we can use with nominal variables. You will find out why it is the only appropriate measure

for nominal variables as we learn about the median and mean next.

Coconut = 22

Chocolate = 15

Vanilla = 7

Strawberry = 9
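
As a rough illustration of the point above, the following Python sketch finds the mode of the flavor frequencies. The function name `modes` and the variable `flavor_counts` are made up for this example; it returns the category (or categories, if there is a tie), never the frequency itself.

```python
from collections import Counter

# Frequencies copied from the flavor list above.
flavor_counts = Counter({"Coconut": 22, "Chocolate": 15, "Vanilla": 7, "Strawberry": 9})

def modes(counts):
    """Return every category tied for the greatest frequency."""
    top = max(counts.values())
    return [category for category, freq in counts.items() if freq == top]

print(modes(flavor_counts))  # ['Coconut'] (the category, not 22)
```

Because the function returns a list, a bimodal or multimodal distribution simply produces more than one category.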

The Median

The median is the middlemost number. In other words, it's the number that divides the

distribution exactly in half such that half the cases are above the median, and half are

below. It's also known as the 50th percentile, and it can be calculated for ordinal and

interval/ratio variables. Conceptually, finding the median is fairly simple and entails only

putting all of your observations in order from least to greatest and then finding whichever

number falls in the middle. Note that finding the median requires first ordering all of the

observations from least to greatest. This is why the median is not an appropriate measure

of central tendency for nominal variables, as nominal variables have no inherent order. (In

practice, finding the median can be a bit more involved, especially if you have a large

number of observations; see your textbook for an explanation of how to find the median in

such situations).

Some of you are probably already wondering, "What happens if you have an even number

of cases? There won't be a middle number then, right?" That's a very astute observation,

and I'm glad you asked. If your dataset has an even number of cases, the median is the

average of the two middlemost numbers. For example, for the numbers 18, 14, 12, 8, 6 and

4, the median is 10 (12 + 8 = 20; 20/2 = 10).
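
The procedure just described (sort, then take the middle value or average the two middle values) can be sketched in Python as follows. The function name `median` is chosen for this example only; it assumes a non-empty list of numbers.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)          # put the observations in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value exists
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values

print(median([18, 14, 12, 8, 6, 4]))  # 10.0
```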

One of the median's advantages is that it is not sensitive to outliers. An outlier is an

observation that lies an abnormal distance from other values in a sample. Observations that

are significantly larger or smaller than the others in a sample can impact some statistical

measures in such a way as to make them highly misleading, but the median is immune to

them. In other words, it doesn't matter if the biggest number is 20 or 20,000; it still only

counts as one number. Consider the following:

Distribution 1: 1, 3, 5, 7, 20

Distribution 2: 1, 3, 5, 7, 20,000

These two distributions have identical medians even though Distribution 2 has a very large

outlier, which would end up skewing the mean pretty significantly, as we'll see in just a

moment.
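
To make the comparison concrete, here is a small Python sketch using the standard library's `statistics` module on the two distributions above. The variable names are illustrative only.

```python
from statistics import mean, median

distribution_1 = [1, 3, 5, 7, 20]
distribution_2 = [1, 3, 5, 7, 20_000]

# The median ignores how large the biggest value is.
print(median(distribution_1), median(distribution_2))  # 5 5

# The mean does not: the outlier drags it far above the other values.
print(mean(distribution_1), mean(distribution_2))      # 7.2 4003.2
```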

The Mean

The mean is what people typically refer to as "the average". It is the highest measure of

central tendency, by which I mean it is available for use only with interval/ratio variables.

The mean takes into account the value of every observation and thus provides the most

information of any measure of central tendency. Unlike the median, however, the mean is

sensitive to outliers. In other words, one extraordinarily high (or low) value in your dataset

can dramatically raise (or lower) the mean. The mean, often shown as an x or a y variable

with a line over it (pronounced either "x-bar" or "y-bar"), is the sum of all the scores divided

by the total number of scores. In statistical notation, we would write it out as follows:

\bar{X} = \frac{\sum X}{N}

In that equation, \bar{X} (x-bar) is the mean, X represents the value of each case and N is the total number

of cases. The sigma (Σ) is just telling us to add all the scores together. The fact that

calculating the mean requires addition and division is the very reason it can't be used with

either nominal or ordinal variables. We can't calculate a mean for race ([white + white +

black]/3 = ?) any more than we can calculate a mean for year in school ([freshman +

freshman + senior]/3 = ?).
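
As a minimal sketch of the formula (sum of the scores divided by N), the Python below defines a hypothetical helper called `arithmetic_mean`. It also shows, under the assumption that category labels are stored as text, that Python itself refuses to add them up.

```python
def arithmetic_mean(scores):
    """Sum of all the scores divided by the total number of scores."""
    return sum(scores) / len(scores)

print(arithmetic_mean([18, 14, 12, 8, 6, 4]))  # 62 / 6, a little over 10.3

# Nominal categories cannot be added, so no mean exists for them.
try:
    arithmetic_mean(["white", "white", "black"])
    nominal_mean_possible = True
except TypeError:
    nominal_mean_possible = False

print(nominal_mean_possible)  # False
```

Note that if the same categories were coded as numbers (1, 2, 3), the arithmetic would run without complaint even though the result would be meaningless.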

Percentiles

A percentile is a number below which a certain percent of the distribution falls. For example,

if you score in the 90th percentile on a test, 90 percent of the students who took the test

scored below you. If you score in the 72nd percentile on a test, 72 percent of the students

who took the test scored below you. If you score in the 5th percentile on a test, maybe that

subject isn't for you. The median, you recall, falls at the 50th percentile. Fifty percent of the

observations fall below it.
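
The definition above can be sketched in Python as the percentage of scores that fall strictly below a given score. The function name `percent_below` and the ten test scores are invented for illustration; they do not come from any real test.

```python
def percent_below(scores, score):
    """Percentage of observations that fall strictly below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Hypothetical scores for ten students (made up for this example).
test_scores = [55, 60, 62, 68, 70, 75, 80, 85, 90, 95]

print(percent_below(test_scores, 95))  # 90.0, so a score of 95 sits at the 90th percentile here
print(percent_below(test_scores, 75))  # 50.0
```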

A symmetrical distribution is a distribution where the mean, median and mode are the same.

A skewed distribution, on the other hand, is a distribution with extreme values on one side

or the other that pull the mean away from the median in one direction or the other. If the

mean is greater than the median, the distribution is said to be positively skewed. In other

words, there is an extremely large value that is "pulling" the mean toward the upper end of

the distribution. If the mean is smaller than the median, the distribution is said to be

negatively skewed. In other words, there is an extremely small value that is "pulling" the

mean toward the lower end of the distribution. Distributions of income are usually positively

skewed thanks to the small number of people who make ungodly amounts of money.

Consider the (admittedly dated) case of Major League Soccer players as an extreme

example. The mean annual salary for an MLS player in 2010 was approximately $138,000,

but the median annual salary was only about $53,000. The mean was almost three times

larger than the median, thanks in no small part to David Beckham's then $12 million salary.
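
The mean-versus-median rule described above can be written as a short Python sketch. The function name `skew_direction` is hypothetical, and the rule is only the rule of thumb from this section, not a formal skewness statistic.

```python
def skew_direction(mean, median):
    """Classify skew by comparing the mean with the median (rule of thumb from the text)."""
    if mean > median:
        return "positive"   # large values pull the mean upward
    if mean < median:
        return "negative"   # small values pull the mean downward
    return "none apparent"

# Approximate 2010 MLS salary figures quoted above.
print(skew_direction(138_000, 53_000))  # positive
```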

When trying to decide which measure of central tendency to use, you must consider both

level of measurement and skew. This is not so much the case for nominal and ordinal

variables. If the variable is nominal, obviously the mode is the only measure of central

tendency to use. If the variable is ordinal, the median is probably your best bet because it

provides more information about the sample than the mode does. But if the variable is

interval/ratio, you'll need to determine if the distribution is symmetrical or skewed. If the

distribution is symmetrical, the mean is the best measure of central tendency. If the

distribution is skewed either positively or negatively, the median is more accurate. As an

example of why the mean might not be the best measure of central tendency for a skewed

distribution, consider the following passage from Charles Wheelan's Naked Statistics:

Stripping the Dread from the Data (2013):

"The mean, or average, turns out to have some problems, namely, that it is prone to

distortion by "outliers," which are observations that lie farther from the center. To get your

mind around this concept, imagine that ten guys are sitting on bar stools in a middle-class drinking establishment in Seattle; each of these guys earns $35,000 a year, which

makes the mean annual income for the group $35,000. Bill Gates walks into the bar with

a talking parrot perched on his shoulder. (The parrot has nothing to do with the example,

but it kind of spices things up.) Let's assume for the sake of the example that Bill Gates

has an annual income of $1 billion. When Bill sits down on the eleventh bar stool, the

mean annual income for the bar patrons rises to about $91 million. Obviously none of the

original ten drinkers is any richer (though it might be reasonable to expect Bill Gates to

buy a round or two). If I were to describe the patrons of this bar as having an average

annual income of $91 million, the statement would be both statistically correct and

grossly misleading [Note: the median would remain unchanged]. This isn't a bar where

multimillionaires hang out; it's a bar where a bunch of guys with relatively low incomes

happen to be sitting next to Bill Gates and his talking parrot."
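
The arithmetic in Wheelan's example can be checked with a few lines of Python. The variable names are illustrative, and the $1 billion income is the figure assumed in the passage.

```python
from statistics import mean, median

patrons = [35_000] * 10                  # ten guys earning $35,000 each
with_gates = patrons + [1_000_000_000]   # Bill Gates takes the eleventh stool

print(mean(patrons), median(patrons))    # 35000 35000
print(round(mean(with_gates)))           # 90940909, i.e. about $91 million
print(median(with_gates))                # 35000, unchanged by the outlier
```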

Measures of Variability

In addition to figuring out the measures of central tendency, we may need to summarize the

amount of variability we have in our distribution. In other words, we need to determine if the

observations tend to cluster together or if they tend to be spread out. Consider the following

example:

Sample 1: {0, 0, 0, 0, 25}

Sample 2: {5, 5, 5, 5, 5}

Both of these samples have identical means (5) and an identical number of observations (n

= 5), but the amount of variation between the two samples differs considerably. Sample 2

has no variability (all scores are exactly the same), whereas Sample 1 has relatively more

(one case varies substantially from the other four). In this course, we will be going over four

measures of variability: the range, the inter-quartile range (IQR), the variance and the

standard deviation.

The Range

The range is the difference between the highest and lowest scores in a data set and is the

simplest measure of spread. We calculate range by subtracting the smallest value from the

largest value. As an example, let us consider the following data set:

23 56 45 65 69 55 62 54 85 25

The maximum value is 85 and the minimum value is 23. This gives us a range of 62

(85 - 23 = 62). Whilst using the range as a measure of variability doesn't tell us much, it does

give us some information about how far apart the lowest and highest scores are.
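
As a quick check of the calculation above, the range can be computed in Python as the largest value minus the smallest. The variable names are chosen for this example only.

```python
scores = [23, 56, 45, 65, 69, 55, 62, 54, 85, 25]

# Range = largest value minus smallest value.
value_range = max(scores) - min(scores)

print(max(scores), min(scores), value_range)  # 85 23 62
```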

"Quartile" is yet another word that stats geeks use to make themselves feel important. It

basically means "quarter" or "fourth." A football game has four quartiles, as does a king-size

Twix. Finding the quartiles of a distribution is as simple as breaking it up into fourths. Each

fourth contains 25 percent of the total number of observations.

Quartiles divide a rank-ordered data set into four equal parts. The values that divide each

part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and

Q3, respectively.

Q1 is the "middle" value in the first half of the rank-ordered data set.

Q2 is the median value of the data set

Q3 is the "middle" value of the second half of the rank-ordered data set

Q4 would technically be the largest value in the dataset, but we ignore it when calculating

the IQR (we already dealt with it when we calculated the range).

Thus, the interquartile range is equal to Q3 minus Q1 (or the 75th percentile minus the 25th

percentile, if you prefer to think of it that way). As an example, consider the following

numbers: 1, 3, 4, 5, 5, 6, 7, 11. Q1 is the middle value in the first half of the data set. Since

there are an even number of data points in the first half of the data set, the middle value is

the average of the two middle values; that is, Q1 = (3 + 4)/2 or Q1 = 3.5. Q3 is the middle

value in the second half of the data set. Again, since the second half of the data set has an

even number of observations, the middle value is the average of the two middle values; that

is, Q3 = (6 + 7)/2 or Q3 = 6.5. The interquartile range is Q3 minus Q1, so the IQR = 6.5 - 3.5 = 3.
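
The "median of each half" approach used above can be sketched in Python as follows. The function name `quartiles` is made up for this example. The text only covers an even number of observations, so leaving the middle value out of both halves when the count is odd is an assumption here; statistical software often uses other conventions and may report slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method; needs at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # first half of the rank-ordered data
    upper = ordered[-half:]    # second half (middle value excluded if the count is odd: an assumption)
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([1, 3, 4, 5, 5, 6, 7, 11])
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 3.5 5.0 6.5 3.0
```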

Boxplots

A box plot (also known as a box and whisker plot) splits the dataset into quartiles. The body

of the boxplot consists of a "box" (hence, the name), which goes from the first quartile (Q1)

to the third quartile (Q3). Within the box, a horizontal line is drawn at Q2, which denotes the

median of the data set. Two vertical lines, known as whiskers, extend from the top and

bottom of the box. The bottom whisker goes from Q1 to the smallest value in the data set,

and the top whisker goes from Q3 to the largest value. Below is an example of a positively

skewed box plot with the various components labeled.

Outliers are extreme values that for one reason or another are excluded from the

dataset. If the data set includes one or more outliers, they are plotted separately as points

on the chart. The above diagram has a few outliers at the bottom.

The horizontal line that runs across the center of the box indicates where the median falls.

Additionally, boxplots display two common measures of the variability or spread in a data

set: the range and the IQR. If you are interested in the spread of all the data, it is

represented on a boxplot by the vertical distance between the smallest value and the

largest value, including any outliers. The middle half of a data set falls within the

interquartile range. In a boxplot, the interquartile range is represented by the height of the

box (Q3 minus Q1).

The Variance

The variance is a measure of variability that is based on how far each observation falls

from the mean of the distribution. For this example, we'll be using the following five

numbers, which represent my total monthly comic book purchases over the last five months:

2, 3, 5, 6, 9

The formula for calculating a variance is usually written out like this:

s_x^2 = \frac{\sum (x - \bar{x})^2}{N - 1}

This equation looks intimidating, but it's not that bad once you break it down into its

component parts. s_x^2 is the notation used to denote the variance of a sample. That giant

sigma (Σ) is a summation sign; it just means we're going to be adding things together. The x

represents each of our observations, and the x with a line over it (often called "x-bar")

represents the mean of our distribution. The capital "N" on the bottom is the total number of

observations. Basically, this formula is telling us to subtract the mean from each of our

observations, square the difference, add them all together and divide by N-1. Let's do an

example using the above numbers.

1. The first step in calculating the variance is finding the mean of the distribution. In this

case, the mean is 5 (2+3+5+6+9 = 25; 25/5 = 5).

2. The second step is to subtract the mean (5) from each of the observations:

2-5 = -3

3-5 = -2

5-5 = 0

6-5 = 1

9-5 = 4

Please note: we can check our work after this step by adding all of our values together. If

they sum to zero, we know we're on the right track. If they add up to something besides

zero, we should probably check our math again (-3+-2+0+1+4 = 0, we're golden).

3. Third, we square each of those answers to get rid of the negative numbers:

(-3)^2 = 9

(-2)^2 = 4

(0)^2 = 0

(1)^2 = 1

(4)^2 = 16

4. Fourth, we add them all together:

9+4+0+1+16=30

5. Finally, we divide by N-1 (the total number of observations is 5, so 5-1=4)

30/4 = 7.5

After all those rather tedious calculations, we're left with a single number that quickly and

succinctly summarizes the amount of variability in our distribution. The bigger the number,

the more variability we have in our distribution. Please note: a variance can never be

negative. If you come up with a variance that's less than zero, you've done something

wrong.
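
The five steps above can be followed line by line in a short Python sketch. The function name `sample_variance` and the variable `purchases` are chosen for illustration; the result is compared with the standard library's `statistics.variance`, which also divides by N-1.

```python
from statistics import variance

purchases = [2, 3, 5, 6, 9]  # monthly comic book purchases from the example

def sample_variance(values):
    """Sample variance, following the five steps in the text."""
    n = len(values)
    mean = sum(values) / n                    # step 1: find the mean
    deviations = [x - mean for x in values]   # step 2: subtract the mean from each observation
    squared = [d ** 2 for d in deviations]    # step 3: square each difference
    total = sum(squared)                      # step 4: add them all together
    return total / (n - 1)                    # step 5: divide by N-1

print(sample_variance(purchases))  # 7.5
print(variance(purchases))         # 7.5
```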

There is, however, one limitation to using the variance as our only measure of variability.

When we square the numbers to get rid of the negatives (step 3), we also inadvertently

square our unit of measurement. In other words, if we were talking about miles, we

accidentally turned our unit of measurement into miles squared. If we were talking about

comic books, we accidentally turned our unit of measurement into comic books squared

(which, needless to say, doesn't always make a lot of sense). In order to solve that problem,

we calculate the standard deviation. The formula for the standard deviation looks like this:

s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}

In other words, calculating the standard deviation is as simple as taking the square root of

the variance, reversing the squaring we did in the calculation of the variance. In our

example, the standard deviation is equal to the square root of 7.5, or 2.74. The

interpretation doesn't change; a large standard deviation is indicative of greater variability,

whereas a small standard deviation is indicative of a relatively small amount of variability. As

is the case with the variance, the standard deviation can never be negative.

Remember: the key difference between the variance and the standard deviation is the unit

of measurement. We calculate the standard deviation in order to put our variable back into

its original metric. "Miles squared" goes back to being just miles, and "comic books

squared" goes back to being just comic books.
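
As a minimal sketch, the standard deviation for the comic book example can be computed in Python by taking the square root of the variance. The variable names are illustrative; `statistics.stdev` is the standard library's sample standard deviation (N-1 in the denominator).

```python
import math
from statistics import stdev, variance

purchases = [2, 3, 5, 6, 9]

# Standard deviation = square root of the variance (7.5 for these data).
standard_deviation = math.sqrt(variance(purchases))

print(round(standard_deviation, 2))  # 2.74
print(round(stdev(purchases), 2))    # 2.74
```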

Main Points

Measures of central tendency tell us what is common or typical about our variable.

Three measures of central tendency are the mode, the median and the mean.

The mode is used almost exclusively with nominal-level data, as it is the only

measure of central tendency available for such variables. The median is used with

ordinal-level data or when an interval/ratio-level variable is skewed (think of the Bill

Gates example). The mean can only be used with interval/ratio level data.

Measures of variability are numbers that describe how much variation or diversity

there is in a distribution.

Four measures of variability are the range (the difference between the largest and

smallest observations), the interquartile range (the difference between the 75th and

25th percentiles), the variance and the standard deviation.

The variance and standard deviation are two closely related measures of variability

for interval/ratio-level variables that increase or decrease depending on how closely

the observations are clustered around the mean.

To have SPSS calculate measures of central tendency and variability for you, click

"Analyze," "Descriptive Statistics," then "Frequencies." Measures of central tendency and

variability can also be calculated by clicking on either "Descriptives" or "Explore," but

"Frequencies" gives you more control and has the most helpful options to choose from. The

dialog box that opens should be pretty familiar to you by now. As you did when calculating

frequency tables, move the variables for which you would like to calculate measures of

central tendency and variability into the right side of the box. You can uncheck the box

marked "Display frequency tables" if you'd rather not see any tables and would prefer to see

only the statistics. Then click the button on the right labeled "Statistics." From the dialog box

that opens you may select as many statistics as you would like (Note: SPSS uses the term

"Dispersion" rather than "Variability," but the two words are synonymous). Also, please be

aware that SPSS will calculate statistics for any variable regardless of level of

measurement. It will, for example, calculate a mean for race or gender even though that

makes no sense whatsoever. Male + male + female/3 = 0.66? Totally illogical. This is one of

the many circumstances in which you will have to be smarter than the data analysis

package you are using. Just because SPSS will let you do something doesn't necessarily

mean it's a good idea.

When calculating measures of variability, it is sometimes helpful to include a box plot. To do

so, click on "Graphs," then "Legacy Dialogs" and select "Box Plot." As was the case with the

graphs you created in the previous chapter, you'll have several options from which to

choose. Generally speaking, you'll want one boxplot for each variable, so choose

"Summaries of Separate Variables." Move the variables that you would like to see displayed

as box plots to the empty box on the right and click OK. Should you desire to edit your

boxplots, you can do so in much the same way you did the graphs in Chapter 2.
