## STATISTICALLY BASED REPORTS

NCEA Level 3 Statistics | Walkthrough Guide

Introduction

Key Terms and Causation
• Explanatory, response and confounding variables
• Observations vs Experiments
• Causation and Correlation
• Correlation
• Checking causation versus correlation in a problem
• Quick Questions

Samples
• Random sampling methods
• Non-random sampling methods
• Quick Questions

Sources of Error
• Sampling and non-sampling errors
• Confounding variables
• Blast through the graphs

Comparing Measures
• The margin of error and the 95% confidence interval
• Using the confidence interval for claims
• MoE backwards
• Dependent probabilities
• Independent probabilities
• Quick Questions

Key Terms
Statistically Based Reports | Introduction

## INTRODUCTION
Externals have never been easier. What could be simpler than sitting back, reading
an article, and pulling it to pieces? Not much! That’s the whole gist of good old 3.12.
You’ll be given three (very) short statistically based reports to read, and then you’ll
write an evaluation as to whether they’re legit or not.

Okay, that’s not all there is to it. We’ll also be doing some calculations and looking at
things like sampling methods, survey questions, confidence intervals and the margin
of error. Some of this might be familiar to you from Level 2, but some will be brand
new!

Buckle in!

## What we’ll be covering in this walkthrough guide

Before you even breathe in an article’s direction, you have to speak their language.
You need to know how they think. It’s all about their sources and sampling methods.
What kind of studies do they use? How do they get their data? And – most importantly – is
their data legit?

Then you have to check if they’ve gone wrong anywhere. Keep an eye out for lurking
causality.

We’ll round off this three-course meal with a healthy plate of comparing measures.
That’s mostly checking to see whether the stuff they claim is true lines up with the
maths. In short: We’ll be analysing articles, and then evaluating (weighing up) whether
they’re legit or not.

## A word about exam strategy

Yep, this is the wordiest of all the stats externals, but don’t let that intimidate you! The
trick here isn’t spewing out a hundred words a minute. It’s about the quality of what
you write, not the quantity.

The trick here is memory. The better you remember all the different sampling methods,
and sources of error, the easier it’ll be to tell whether an article’s a fraud or not. If a
survey’s badly sampled, you know it’s a spoof.

And, of course, when it comes to the calculations – take it slow, check your answers,
and show your working! You can still get part marks on your calculations even if you
got the final answer wrong, so why not lay it all out there?


Yes, we know reading articles may be boring, but that doesn’t mean you can skim read
them. Before you start answering questions make sure you read the articles properly
at least once, maybe even twice, to make sure you’ve picked up on all the hidden
information. Highlight the important parts such as survey size, survey group, and
sampling method – this will save you having to read through the articles again and
again looking for that important information you need.

Now, with all of this wording, things can get a little tricky to understand. That’s why
we’ve written this guide in plain English as much as we can. We’ve also included a glossary for some
of the key terms that you’ll need to master for your exam.

If learning key words first off scares you (or bores you), then focus on understanding
the concepts the first time around, and then memorise the definitions.

In fact, in this guide, we focus on helping you to understand the concepts first. We
use examples and analogies to help you understand statistics in a way that is fun, and
makes sense in the real world.

However, the language we use isn’t always something you can directly write in your
exam! When this is the case, we offer a more scientific definition or explanation (in
a handy blue box) underneath! These boxes are trickier to understand on your first
read through, but contain language you are allowed to write in your exam! Look out
for them to make sure you stay on target!

Statistically Based Reports | Key Terms & Sampling Methods

## KEY TERMS AND CAUSATION

This standard is all about articles and surveys!

a) You question the articles
b) You evaluate the articles

Evaluate means you find whether they’re true, fair, and the conclusions are legitimate.

But, before we even look at any actual reports, we need to get some core concepts
sorted: the kinds of words statistically based reports use, the goals of those reports,
and the ways studies and surveys are run.

• Types of variables – Explanatory, Response and Confounding
• Types of studies – Experiments vs Observations
• Causation vs Correlation
• Samples – what they are and what a good one looks like
• Sampling methods

## Explanatory, Response and Confounding Variables

Don’t worry, even though this assessment is pretty wordy, we’ll run you through all
the specific terms you need to know – in fact, we’re going to steal some important
words from our regular maths lessons.

A ‘variable’ is a feature that is able to change.

In this case, variables are the features your survey or experiment are investigating.
For this external, you’ll have to be able to define what explanatory and response
variables are.

The explanatory variable is the one that’s often controlled and changed.
• It’s usually the cause or the explanation of the other variable. It’s supposed to tell
the other one what to do. For that reason it’s also called the independent variable.
• It’s always put on the x axis.


The response or outcome variable is the main focus. It’s what we want to measure
and compare.
• We measure how much it changes when the explanatory variable changes.
Because it depends on the explanatory variable, it’s also called the dependent
variable.
• It’s always put on the y axis.

For example, an article might test whether people walk faster if they listen to music.
The explanatory variable would be ‘music’ and the response ‘walking speed’ because
music is the thing we think might cause a change in walking speed.


Of course, maybe it’s hit you that there might be problems with some of these
‘variables’. I mean, what kind of music are the people in the study listening to? What if
heavy metal makes people walk faster and jazz makes them walk slower? That would
totally change the results of the article.

Those are the sorts of things you have to ponder on with this assessment. Be critical
– question everything!

As well as explanatory and response variables, which we actually want to focus on
when we do a study, there are also confounding variables. These are variables which
can change the response variable and make our results unreliable. We also call them
lurking variables because they just lurk around ruining our study.

So if the people from our music and walking-speed study got to choose the music
they were listening to, that could be a confounding variable – maybe only the people
who chose to listen to up-beat music would walk faster. If the researchers running the
study didn’t get any information on this, they’re not going to be able to make good
conclusions from their data.

Confounding variables are any outside factors that change the experiment results.


## STOP AND CHECK

Which are the explanatory and response variables in these situations? Can you think
of any confounding variables?

An ice cream store tends to sell more ice cream on weekends than on weekdays.
People who eat more kale, quinoa and chia seeds are healthier.
Men with moustaches earn more money than men who don’t have moustaches.

## Observations vs Experiments
If we want to question a report, we need to know where they got their info from in the
first place. How did they test their groups? Was it using:

A statistical experiment?
A randomised statistical experiment?
An observational study?

Hold up! What are those? Let’s look a bit more deeply at the different types of
testing we can do:

A statistical experiment/experimental study is when two or more groups are
compared. People are divided up into groups, the groups are given different
treatments (versions or levels of the explanatory variable) and then their response
variable is measured and compared.

A control group is a group that doesn’t have any treatment applied to it – the
researchers leave this group’s explanatory variable alone.


Think of one group being fed cheese before bed, and another getting nothing. Then
you sit back and see whether the group who ate cheese get weird cheese dreams.



A randomised experiment is just like a regular statistical experiment, but instead
of the researcher simply dividing people into the different groups and testing
them, the people are randomly shuffled into each group. This is a really important
difference, and we’ll explain why soon!

An observational study is stuff like surveys and polls. We want to know the effect
of something by observing it occurring naturally (by itself) or without researcher
intervention (no one has their dirty mitts on the explanatory variable).

Maybe we’re looking at whether binge-drinking affects students’ grades. We’re not
going to force one group of students to start binge-drinking just to see, are we? No
way! That’s suuuper unethical. So we can’t change the explanatory variable. What
we’d do is observe students, find out about their drinking habits – the explanatory
variable - and compare their grades – the response variable.


## Causal claims

So why does it matter whether something was an observational study, an experiment
or a randomised experiment?

Well, often the whole reason a researcher bothers doing a study is to see whether
something directly affects something else – something we call a causal claim.

A causal claim involves stating that one event directly affects another event.

In other words, we’re saying the explanatory variable does make the response change.
Like the more pieces of gum you eat, the mintier your breath is.



Thing is, we can’t always justify the researcher’s causal claim – like in an observational
study, they have pretty much no control over the explanatory variable. They just watch
it happening. In cases like those, how can we ever be sure that something else isn’t
affecting the response variable?

So, if we want to make a causal claim, we have to control everything except the response
variable, as much as we can. That means we have to control the independent variable
(so we can’t use an observational study) and we also have to control confounding
variables to make sure nothing else is affecting our results.

We also have to make sure the participants in our experiment are randomly divided
into groups, so that there aren’t big differences between groups based on who they
contain. We can only compare two groups if they’re similar apart from the explanatory
variable, the thing we’re changing on purpose.

For example, you wouldn’t want all the girls in one group and all the boys in another –
those groups would have natural differences, so you couldn’t say that the differences
in the response variable were because of the explanatory variable.
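What 'randomly divided into groups' looks like in practice can be sketched in a few lines of Python. This is a minimal, made-up illustration (the participant names are invented), not part of any particular study:

```python
import random

# Hypothetical participants for a two-group experiment (names made up)
participants = ["Ana", "Ben", "Cara", "Dev", "Eru", "Fay", "Gus", "Hine"]

random.shuffle(participants)           # random order removes researcher choice
half = len(participants) // 2
treatment_group = participants[:half]  # this group gets the treatment
control_group = participants[half:]    # this group is left alone, for comparison

print(treatment_group, control_group)
```

Because the shuffle is random, differences between people (age, gender, and so on) should spread roughly evenly across both groups.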

This sounds like a lot to go through, but it all adds up to one simple rule: only a
randomised experiment can justify a causal claim. To see why, imagine a study on
whether school uniforms affect bullying levels.

If we went with:

An observational study – this would mean we just took schools, observed
whether they had uniform or not, and then looked at their bullying problem.
We did not control the explanatory variable. There would be other differences
between those schools than just uniform. For example, primary schools often
don’t have uniforms, while most high schools do, and there are probably different
levels of bullying at high school versus at primary school (like primary schoolers
are probably less likely to care what clothes you wear than high schoolers).
So we can’t compare the two groups based only on their uniforms, because there
are other differences affecting bullying levels.


Because we can’t control other factors, we can’t make a causal claim based on an
observational study.

A statistical experiment – The two groups – uniform and no uniform – are assigned
by the researcher, but not at random. There might be differences between the
groups depending on how they were chosen – like if more girls’ schools were
chosen to wear uniform, and more boys’ schools were in the non-uniform group.
Maybe girls are more likely to judge each other based on clothes than boys, or the
other way around, which would affect bullying levels. If the groups are different
to start with, you can’t prove that any difference between them at the end was
because of the explanatory variable.


If the groups are similar, then the causal claim can be justified.

A randomised experiment – This helps reduce the problem of different groups.
It shuffles them up so that we’d have a mix of ages, sizes, and genders for the non-
uniform group and the uniform group. If these affect bullying levels, this means they
will hopefully affect both groups evenly. Basically there will always be some difference
between groups, but randomisation minimises it.


This means that randomised experiments are the best study for justifying causal
claims.

To sum up, when a report makes a causal claim, we need to look at two things:

• Whether there are any lurking variables that might have an influence (like age),
and how they would affect the results.
• Whether the results are from a randomised experiment, not an observational study.


## STOP AND CHECK

When can a causal claim be justified?
Why is it important to randomly divide people between the groups?

## Correlation
Okay, so we can’t always say that the explanatory variable causes the response
variable – so what do we do when we can’t? What’s the point of a study then?

Instead of causation, we can measure correlation. So how is that different from
causality, exactly?

Correlation is an association between two variables. As one variable changes, so does
the other, but this doesn’t mean they make each other change.

Basically, correlation means the two variables are related, but we don’t know if one is
causing the other to change, or if something else is causing both of them to change.

This is really important because often reports say their explanatory and response
variables have causality, when really, they only have correlation.

For an example of variables that have correlation but not causation, think of
‘temperature outside’ and ‘number of people wearing sunglasses’. On a warmer day,
you would expect to see more people wearing sunglasses, right? But they’re not
wearing sunglasses because it’s warm – they’re wearing them because it’s sunny, and
sunny days just tend to be warmer.


The most important thing to remember here is that correlation does not mean
causation. Just because both variables are changing, it doesn’t mean that one is
making the other change! There might be causation, but we would have to do a
randomised study to know that for sure.

We’ll get into seeing what correlation looks like on a graph later, when we go through
all the graph types you’re likely to see. But for now, remember that it’s when two
variables change together, without one necessarily making the other change.


## STOP AND CHECK

What’s the difference between causality and correlation?
Can you think of any more correlated variables?

## Checking Causation vs Correlation in a Problem

Let’s do an example – that’s right, we actually get to look at a question! Here’s a study:

‘Researchers conduct a study on 100 random beachgoers on a sunny day. They ask
the beachgoers how many hours they spent in the sun and then measure the severity
of their sunburn on a scale of 1 to 10 – 10 being ‘fried to a crisp’. The researchers
found that the longer the beachgoers spent in the sun, the more severe their sunburn.
They concluded that staying out in the sun longer causes more severe sunburn.’

Now, we’d probably get asked something like:

‘Explain whether there is statistical evidence to support the claim and identify and
discuss the effect of a possible confounding variable.’

First of all, we’d go through and write down the explanatory and response variables,
the type of study, and then a possible confounding/lurking variable. That’s a good
place to start with any report you get given, so let’s do it now.

Explanatory variable: Hours spent in sun
Response variable: Severity of sunburn
Type of study: Observational – the researchers didn’t change how long the people
stayed in the sun
Confounding variable: Whether they slip-slop-slapped n’ wrapped or not.

Time to ‘explain’. You would probably guess that the more time you spend in the sun,
the more burnt you’ll get. However, just because the claim makes sense, it doesn’t
mean the report is valid! It may just happen to have that result.

You’re not meant to be asking whether the report got the right answer, you’re meant
to be asking if they got it in the right way - is there statistical evidence to support the
claim?

Because it’s an observational study, not an experiment, it can’t support a causal
relationship claim.

Now, you need to go a step further, and explain why it’s an observational study.
We know it is because the number of hours the beachgoers spent in the sun wasn’t
controlled by the researchers. And then there’s the confounding variable. We’ve
identified it, now we need to discuss its effect.

Some people may have worn sunscreen, hats, or sat in the shade while on the
beach, which would have significantly impacted whether they got sunburn or not.
Maybe people who stayed out longer were also less likely to wear sunscreen, so
they would have gotten more burnt. The beachgoers were not asked to identify
these variables to the researchers, so there is no knowledge of what effect they
may have had on the results of the study.

We do all this to dismiss the report’s causal claim. Basically, we can say that the
evidence in the report does not support the researchers’ causal claim, it can only be
used to show a correlation between time spent in the sun and how badly someone
gets sunburnt.

Statistical evidence gives a degree of certainty towards certain results or claims from
a statistical survey or study. This can be in the form of statistical data used to support
a claim about a population and inferences drawn from this data.

## STOP AND CHECK

What do we need to do when a report makes a causal claim?

## Quick Questions

Explain whether or not there is statistical evidence to support the claims. Identify and
discuss the effect of a possible confounding variable, and whether or not there may
be causation.

Statistically Based Reports | Samples

As part of a new positive-attitude campaign, researchers make one group of
50 randomly-selected students fill out a reflection sheet each week. The other
control group of 50 randomly-selected students do not fill out a worksheet
each week. The researchers found that students who filled out the reflection
sheet participated more in class and graded higher.

Researchers investigate caffeine’s effect on energy levels by measuring the
speed at which students complete a statistics test. Students were asked
how many cups of coffee, tea, or energy drink they had that morning once
the test was complete. Researchers found that students who had one or more
servings of caffeine that morning completed the test faster.

## SAMPLES

Okay, so we’ve had a look at the way we can get data, and whether we can make causal
claims about it. But before we can run a survey, or an experiment, or whatever, to
collect our data, we have to decide who we’re going to get it from – our sample.

So what are samples, and why do we use them in the first place?

Usually when someone runs a survey, it’s because they want to find information that
applies to everyone in a certain group – for example, all teenagers or all men with
moustaches.

A population is the entire group of people or things we want information about.

But we can’t get information on absolutely everyone! That would take ages, and it
would be really inconvenient and expensive. In fact, that’s called a census – and it’s
pretty rare to have the chance to do one.

A census is when every member of the population is surveyed or tested. It is not
usually practical to do a census.

That’s why the people who write these articles or run these surveys choose a group
of people – a sample - from the wider population, then ask them certain questions
or get them to undertake an experiment to gather data. For example, we might
select only a few men with moustaches, and then use the results to make claims
about all men with moustaches.


A sample is a group taken from the overall population, which we use to make estimates
about the whole population.

## What does a good sample look like?

So, if we want to make generalisations about the population, our sample has to be as
similar to the population as possible. In fact, it has to represent the population. This is
the golden rule of sampling:

“The sample has to be representative of the population.”

To get a representative sample, we have to first make sure we’re sampling enough
people (or things)! For a sample to be big enough to make claims about the population,
it has to have a sample size of at least 30.

If the method of sampling we use gives us an unrepresentative sample, the results
won’t be the true population value; they’ll just be a guess from whatever sample
there was – and that means we can say no deal to any claim the report is making.

The true population value is the statistic we’d get if we could test the whole population.

In general, if the report is going to be legit, whatever sampling method the researcher
used has to be unbiased.

Bias in general means that because of some problem with our sample or our test, we
tend to get a particular type of answer or result that doesn’t fit the actual population.

Bias is where, because of some fault in our method, the responses we get tend to be
of a particular kind that does not match the truth.

For a sample, bias is where the sample isn’t representative, like if it has too many
people from a particular group within the population, or completely excludes a group.
This causes us to get results that don’t match with the population in general.


For a good unbiased sample, everyone should have the same chance of getting
selected for our sample – that way every group can be represented!

Sampling bias is where there is a specific preference towards one group over others
being selected for the sample.

An unbiased sample means samples are taken at random, with no preference over
certain groups in the population, and everyone has a fair chance of being chosen for
the study.

Researchers can be biased when sampling a certain group of people to survey from
the population, because they might have a particular outcome they want.

Like, for example, if the researchers wanted to get a result for their report that shows
that New Zealanders are bad at spelling, they could deliberately fill their sample with
six-year-olds.

Sure, they’re probably going to get the result they wanted – but their survey will
be biased, because six-year-olds usually are worse at spelling than the general
population, and most people are not six years old. So their result won’t be reliable.


They could also be biased accidentally, by not realising they’re only choosing
from a certain group – like if they just survey their friends, who probably all have
stuff in common.

So to see where all this bias can come from and how to avoid it, let’s run through
the possible sampling methods. As we meet each one, we’ll tell you how legit it is,
and why.

## STOP AND CHECK

How big does a sample of the population have to be?
Why do we need a representative sample?
What is a biased sample? Try to explain it in your own words.


## Random Sampling Methods

All the good sampling methods involve some randomness – like we said, everyone has
to have a chance of being included in the sample for it to be fair and representative!

To take a random sample, we need some list or way of seeing all the things or
people in the population. Like if I wanted to take a sample of kids at school, for
everyone to have a chance of being in my sample, I’d need the whole school roll,
or some people would get missed out. The list you choose members from for your
sample is called a sample frame, which just means “this is a list of all the people that
I chose from to get a sample”.

## Simple random sampling

Simple random sampling involves numbering off your whole list of the population
and then using a random number generator to find which members of the population
will be included in your sample.

It’s super legit because it’s so unbiased (since it’s completely random) and makes sure
we don’t get any wild cards. Everyone gets a fair chance.

Say we were looking at the conditions of flats in Dunedin and we needed a simple
random sample of flats – we’d number them off and randomly choose them using a
random number generator. If we got the number 11, say, house number 11 would be
included in our sample.

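A simple random sample takes only a couple of lines of Python. Here's a sketch using a made-up population of 12 Dunedin flats, represented just by their numbers:

```python
import random

# A made-up population: 12 Dunedin flats, numbered 1 to 12
flats = list(range(1, 13))

# Every flat has an equal chance of selection, so the sample is unbiased
sample = random.sample(flats, k=4)

print(sorted(sample))  # a different set of 4 flats every run
```

`random.sample` never picks the same flat twice, which matches numbering off the population and drawing with a random number generator.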

## Systematic random sampling

Systematic sampling is where the parts of a population are ordered, and then every
nth one is picked, starting from a random point.

This means you need to order your sample frame somehow. You could order them
alphabetically, by age, by seating order – and then pick every 2nd or 3rd, or whatever,
part – depending on how many you want in your sample and how many there are on
the list. With our Dunedin flats, we could order them according to street numbers and
pick every 4th house.

Wait – if we do that, won’t some houses not ever get picked? Like if we start at house
4, we’ll always get house 8 next, and then house 12. Didn’t we say everyone needs a
chance of being in the sample?

Luckily there’s a smart way around this problem. Basically, if we’re picking every 2nd, every
3rd, every 4th, every kth house – we just use a random number generator to pick a number
between 1 and k. If we’re picking every 4th house, we’d generate a number between 1
and 4, and start there. Say we got the number 3, then we’d pick the 3rd house, the 7th
house, the 11th house, etc… so any house could get picked to be in the sample!

Although this sounds pretty random, BEWARE of natural patterns. For example, if we
start at house 4, every 4th house will have an even street number. That means they’ll
all be houses from the same side of the street, and that’s biased!

A better way would be to pick every 3rd house, so that we’d get a mix of even and odd
numbers. Systematic sampling can therefore involve a little extra thinking to ensure
the sample is actually random. Because of this, it’s often pretty legit, but not as good
as other random sampling methods!

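Here's a Python sketch of systematic sampling with a random start, again using 12 made-up flats ordered by street number:

```python
import random

# Made-up example: 12 flats ordered by street number
flats = list(range(1, 13))
k = 3  # pick every 3rd flat

# Random start between 1 and k, so every flat could end up in the sample
start = random.randint(1, k)
sample = flats[start - 1::k]

print(sample)  # e.g. [2, 5, 8, 11] if start is 2
```

Notice how picking every 3rd flat gives a mix of even and odd street numbers, whereas every 4th would lock the sample onto one side of the street.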

## Stratified random sampling

Stratified sampling is when the researcher tries to make the sample look as much like
the population as possible, by grouping it using some other variable.

For example, say the researcher knew that 1/3 of Dunedin houses were red, 1/4 were
blue, 1/4 were yellow, and 1/6 were green. Then he would want to make 1/3 of his
sample houses red, and so on.



Basically you divide your sample frame up based on some other feature, like house
colour, and then sample within each group based on how big the group is.

This method is super legit, because it deliberately makes sure groups get represented
to avoid bias, and the sample is still random because within each group you take a
random sample (e.g., by simple random sampling).
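Here's what that proportional division could look like in Python. The house counts are made up to match the fractions above (1/3 red, 1/4 blue, 1/4 yellow, 1/6 green, out of an imaginary 120 houses):

```python
import random

# Made-up sampling frame, split by house colour (the stratifying variable)
strata = {
    "red":    list(range(1, 41)),     # 1/3 of 120 houses
    "blue":   list(range(41, 71)),    # 1/4
    "yellow": list(range(71, 101)),   # 1/4
    "green":  list(range(101, 121)),  # 1/6
}
total = sum(len(houses) for houses in strata.values())
sample_size = 24

sample = {}
for colour, houses in strata.items():
    n = round(sample_size * len(houses) / total)  # proportional share of the sample
    sample[colour] = random.sample(houses, n)     # simple random sample within the stratum

print({colour: len(chosen) for colour, chosen in sample.items()})
# {'red': 8, 'blue': 6, 'yellow': 6, 'green': 4}
```

Each stratum contributes in proportion to its size, and the choice within each stratum is still random.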

## Cluster sampling

Cluster sampling is where you grab a whole group and use that as your sample. You
take one ‘cluster’ to act as a representative – a cluster being a group in the population
that’s close together.

You might randomly choose a few streets in Dunedin and then grab a bunch of
houses all along those streets to be your sample. Not all that credible, is it? Very
biased. It’s actually quite a scandal, in fact, because the group isn’t typical of the
whole population.

Imagine a cluster sample has been taken, BUT – gasp – there are no green houses
in the sample! The poor green houses are going to be horribly underrepresented.
No one’s going to even know they exist!

This is the worst of the random sampling methods, but it can still be okay if you take
quite a few clusters and make sure you’re choosing clusters at random.
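Here's a Python sketch of cluster sampling, with made-up streets standing in for clusters. Note the key difference from the other methods: the randomness is in which clusters get picked, and then every house in a chosen cluster is included.

```python
import random

# Made-up clusters: each street is a cluster of nearby houses
streets = {
    "George St": ["G1", "G2", "G3", "G4"],
    "Castle St": ["C1", "C2", "C3"],
    "Leith St":  ["L1", "L2", "L3", "L4", "L5"],
    "Clyde St":  ["Y1", "Y2", "Y3"],
}

# Choose the clusters at random, then take EVERY house in each chosen cluster
chosen_streets = random.sample(list(streets), k=2)
sample = [house for street in chosen_streets for house in streets[street]]

print(chosen_streets, sample)
```

Taking more clusters (a bigger `k`) makes the sample more representative, but any houses on unchosen streets are left out entirely.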

## STOP AND CHECK

Use your own words to explain the random sampling methods!


## Non-random Sampling Methods

As well as random sampling methods, there are non-random sampling methods. All
of these have problems, so if you see them used in a report, that’s something to
discuss!

## Quota sampling

Quota sampling is where the researcher has to have at least a certain number of
people from particular groups in their sample, so they’re not underrepresented or
ignored.

This method has good intentions, as it looks to ensure the survey represents ALL of the
population. It especially tries to make sure minority groups get represented, because
a random sample could just by chance not contain people from small minorities.

These groups are often based on race, age, or gender. Maybe for one survey, 25% of
their sample needs to be Asian, 25% under 18, and 25% female.


Although this sounds good in theory, there are some tricks pulled by researchers
which can undermine how legit these surveys end up being.

What the sneaky old researcher will do is find people who have as many of those traits
as possible to fill their quota super-fast – like they might try to find people who are Asian
and under 18 and female. If a quarter of the sample are all teenage Asian women, that’s
not fair, is it? Although it may efficiently fill the quota, it doesn’t accurately represent
the population - it doesn’t represent any women who are over 18, for example.

The other problem is that this method isn’t usually random – maybe the researcher is
just trying to spot as many females as possible that walk past, and asking them the
questions. That’s also a source of bias, because only females who walk past can get
picked into the sample!

Basically, if you see the words quota sampling, you should be suspicious of that report!


## Person in the Street sample

Person in the Street has a cool name, but it’s not a cool sampling method.

This is where people are accosted by TV crews and such to get their opinion. It’s a
classic news gimmick – ‘We went out on the streets to see what people really think
about the new flag referendum!’ – but there’s so much that can go wrong.

Certain groups of people, like professionals or schoolkids, are usually at one location
at a particular time. Like if the sample is taken on the street during school time, anyone
who’s in class (as they should be) won’t be able to be represented!

The interviewer can also introduce bias by only asking people who look like they’ll give
a good answer – and come on, that’s not going to reflect the true population, is it?

## Self-selected sample

A self-selected sample is basically what it sounds like – where people choose whether
they want to be in the sample or not.

This is often the case in web research, text-votes, phone polls, or voluntary surveys.

It sounds okay, but the thing is, the people who actually bother to text in and vote or
fill out surveys are those with strong opinions, lots of free time, or who’re invested in
the survey somehow.

Think of all the email links to surveys you’ve probably been sent at school – how many
of those did you actually answer?

In other words, the people who participate in the survey probably don’t reflect the
population as a whole.

Usually, the report you’re given won’t use the words “self-selected” – you’ll have to
figure it out, like if it says people had to phone in to give their answers, you’ll know it
was probably self-selected.

For example, remember how on American Idol (what a throwback) people used to
vote for their favourite singer to get through? Then they’d say, ‘America has chosen our
next idol!’ and the glitter cannons would burst and everyone would start screaming
and crying and partying and whatever.


(Diagram: voters are a small circle inside the circle of Americans who watch American Idol, which sits inside the circle of all Americans.)

Thing is, America as a whole hadn’t chosen the winner – only the Americans who
bothered to watch the show, and of those, only the Americans who bothered to text
in and vote. They had to really care about the show to bother doing that. It would be
more accurate to say ‘The fans of the singers on American idol have chosen our next
winner!’ than ‘America has chosen!’.

? Quick Questions
Identify the (implied) population and the sampling method used by each of the
following surveys. Use your own words to explain them!

1. A school would like to know whether all students would like the option of
movies being played in the auditorium during rainy lunchtimes. They survey
the two Year 11 classes in D Block, period 2 on Friday.
2. The local Mayor wanted to know the opinion of his electorate on banning 1080.
He rang up all of the people he personally knew from the electorate to ask them.
3. The local council is conducting a survey to see how many people drink and
drive over Queen’s Birthday weekend. The police breath-test every 5th car until
their sample size of 30 is met.
4. A TV programme asked viewers the question, ‘Do you like Pina Coladas,
OR getting caught in the rain?’ Viewers were asked to text or email in their
response.


! SOURCES OF ERROR
Now that we’ve covered the basics of what samples are, how we get them, and the
different types of studies – we can start looking at all the things that can go wrong,
and cause error in a report.

These are the kinds of things you’re gonna be discussing when you evaluate a report –
did it have lots of error, or was it pretty good?

Here, we are going to look deeply into the eyes of the variables, find out their inner
secrets and the type of relationship they share. We’ll also teach you how to read
between the lines in those graphs, figure out whether or not the data is misleading,
and if so, where the fibs are.

What we’ll be covering:

Sampling and non-sampling errors
A closer look at confounding variables – stuff that’s on the lurk to wreck the sample!
Blast through the graphs

## Sampling and Non-sampling Errors

You have to know how to spot errors in a survey, because if the data’s screwed up,
it won’t accurately represent the whole population. More importantly, you can’t just
state that there’s something wrong with the data – you need to be able to explain
why it isn’t accurate. That’s the guts of this assessment. No guts, no glory.

Errors are the reasons why the sample isn’t like the true population value. They come
in two forms:

Sampling errors happen because data is collected from a sample rather than the
whole population – and one sample will never perfectly reflect the population.
Non-sampling errors are basically everything else that can make a sample different
from the population – like bias, non-random sampling methods, or bad survey questions.

There’s nothing you can do about sampling errors. Sampling errors can’t be avoided,
unless you somehow study the entire population rather than just a sample – do a
census. This does actually happen in some cases – every five years or so, the NZ census
sends out a survey to everyone in the country to get data of the whole population!


Like we said though, a census is usually pretty expensive and inconvenient – so if you
can’t do one, your sample is always going to be a bit different from the population
because of sampling variation.

Sampling variation means that every sample is different, so every sample will give
slightly different results. No sample will exactly match the population.

The best thing you can do to reduce sampling error is to make your sample as large as
you can, and opt for one of the more ‘legit’ sampling methods we discussed. This will
at least keep your sampling errors to a minimum, because it’ll make the sample more
representative.
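To see why bigger samples help, here’s a quick Python sketch using the 1/√n rule-of-thumb margin of error (covered properly later in this guide – the sample sizes below are just made up for illustration):

```python
# Rule-of-thumb margin of error: MoE ~ 1/sqrt(n).
# Notice that quadrupling the sample size only HALVES the margin of error.
for n in (25, 100, 400, 2500):
    moe = 1 / n ** 0.5
    print(f"n = {n:>4}: MoE = ±{moe:.0%}")
```

So a sample of 25 gives a whopping ±20% margin of error, while a sample of 2500 gets it down to ±2% – bigger samples shrink (but never remove) sampling error.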

Ahhh, but non-sampling errors, that’s where the real problem lies, because they’re
harder to see. You can never get rid of non-sampling errors, but you can minimise
them. Errors could be:

1. When groups/minorities are excluded or underrepresented, on purpose or by
accident – leading to bias
2. When people select themselves – only those with strong opinions or with stakes in
the results will bother to respond – also leading to bias
3. The wording or order of questions asked
4. False answers given due to social pressure
5. People not responding at all – skipping questions or skipping the whole survey.

Let’s break it down a bit more.

1. Excluded Groups

Think of the sampling methods we just used. In some cases, groups were excluded by
accident and sometimes on purpose. For example,

Cluster sampling, where none of the green houses were in the sample
Person-on-the-street sampling, where TV interviewers might not bother to interview,
say, little kids or elderly people
Online surveys, where only those who have computers can respond. What about
all the others who don’t have that stuff?


2. Self-Selection

Self-selection errors only really apply to self-selected sampling. Think of American
Idol again – only those people who really, really want their favourite singer to win will
text in and vote. Or maybe the owner of a flag-making company really, really wants
the NZ flag to change so he gets more business – so he votes for change at every
opportunity. People who aren’t that invested might not get around to casting their
vote.

This will make a sample unrepresentative because it will only represent the people
who went out of their way to be represented!

3. Iffy Questions

The wording or order of questions may not seem important at first glance, but it has a
lot of influence on how people in the sample respond. If a question isn’t simple yes or
no, or it takes more than a few seconds to answer – get suspicious. Examples include:

Double-barrelled questions that ask two things at once.

‘Do you like Pina Coladas and getting caught in the rain?’

These are hard to answer because people don’t know which part they’re
answering! Do I say yes if I like both? What if I like Pina Coladas but not getting
caught in the rain?

Questions that make a judgement by phrasing something in a positive or negative
way. People will feel they need to answer in a particular way.

‘Do you abuse alcoholic substances on the weekend?’

Most people aren’t going to say yes to that, even if they do – especially when the
word ‘abuse’ makes the question sound so judgemental.

Leading questions that make a statement before asking, which can sway people
towards thinking in a particular way.

‘Pina Coladas are considered delicious – do you like them?’

People will think, ‘Well, they’re considered delicious, so I probably like them’.


Double negative questions. These are tricky because they ask things in a no-no
way.

‘Do you not like Pina Coladas?’

To say yes, they do like Pina Coladas, the person would have to say no. Weird.

Questions that ask information a person wouldn’t be expected to know.

‘How many times did you drink a Pina Colada and get caught in the rain in your whole
life?’

No one keeps track of their whole life that carefully, so they might give a wrong
answer by accident or just have a wild guess.

Questions with vague or overlapping answer options.

‘Do you like Pina Coladas?
a) Yes
b) Maybe
c) Possibly
d) No’

What’s the difference between ‘Maybe’ and ‘Possibly’? How do I know which one
to pick?

Each of these questions has the potential to influence someone to answer in a
particular way. Even if just a small number of people are guided by the way questions
are asked instead of the question itself, we have a pretty major problem.

4. Peer/Social Pressure

Social pressure is pretty much what it sounds like. If researchers were interviewing a
group of students on how many hours of homework they do each week, some of the
students might lie to sound more impressive to the interviewer – especially if their
classmates can hear their answers.
It sounds stupid, but come on, if you were being asked, ‘How many hours of Statistics
homework do you do per week?’ would you really be inclined to say, ‘None at all’ ?

This kind of error creates bias, because it means our answers tend to be more ‘socially
acceptable’ – even if they don’t match the truth!

5. Non-response error

This one’s pretty much what it sounds like – when people don’t answer, either by
missing some questions or just deciding not to do the survey.


If the survey says ‘In your opinion, what was the most important event in the Vietnam
War?’ a lot of people are going to think ‘well, I don’t know’ and leave the question
completely blank. People can also decide not to answer personal questions, like ‘do you
sleep naked?’ – they might not want to give away that information! That means you’re
only going to get answers from people who knew enough and were comfortable
enough to respond – and they may not reflect everyone else.

## STOP AND CHECK

What’s the difference between sampling and non-sampling errors?
What are five types of non-sampling errors?

## Confounding Variables
Remember confounding variables from our first section? Well, they’re back, causing
error in our results!

Just a refresher on what exactly they are:

Confounding variables are any outside factors that change the experiment results.

Confounding variables need to be controlled or measured so they don’t get in the
way. Otherwise, because there’s all these external factors lurking around changing
our results, we get error because we aren’t sure if what we see in the results is really
because of our explanatory variable!

Let’s look at an exam-ish example. Say we had to evaluate this study:

‘A random sample of 100 college students were tested to find the relationship between
the students’ weight and their parents’ weight. Researchers were able to show that
there was a relationship between student and parent weight.’

Let’s say the researchers wanted to claim that change in student weight was caused
by their parents’ genes (that’s right, a causal claim).

What are some lurking variables here? Think about it – what are some factors that
might affect student weight, apart from their parents’ genes?


Some families may have grown up in rural areas, while others may be from the city
… so their kids might have access to different food, which could influence their
weight. Not too hard, yeah? It’s pretty common sense. When looking for confounding
variables, just ask yourself:

‘What are some factors that might affect the response variable, not including the
explanatory variable?’

## STOP AND CHECK

What should you ask yourself when looking for confounding variables?
Why do they cause error?

## Blast Through the Graphs

Reports aren’t always words – sometimes they have pretty pictures too! On the off
chance you do get some graphs in your exam, you have to know how they’re read.
You would have come across most of these in previous years, but here’s a refresher
just in case.

Box-and-whisker plot

(Diagram: a box-and-whisker plot on a 0–9 number line, with the minimum, LQ, median, UQ and maximum marked.)

The Upper Quartile, or UQ, is the value the upper 25% of the data lies above.
In the same way, the Lower Quartile, or LQ, is the value the lower 25% of the data
lies below.
The Interquartile Range, or IQR, is the range covered by the middle 50% of the
data, between the LQ and UQ (IQR = UQ – LQ). This is the rectangular box in the middle!

If you had two box-and-whisker plots, it’d be a simple matter to compare whether one has
a higher maximum, higher median, and so on.
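If you want to check quartile values yourself, Python’s statistics module can compute them – a rough sketch (note that different quartile conventions give slightly different numbers; this one uses the ‘inclusive’ method):

```python
import statistics

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) splits the data into quarters, returning LQ, median, UQ
lq, median, uq = statistics.quantiles(data, n=4, method="inclusive")

print("min =", min(data), " LQ =", lq, " median =", median, " UQ =", uq, " max =", max(data))
print("IQR =", uq - lq)  # the width of the box in a box-and-whisker plot
```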



## Scatter Plots and Correlation

We promised earlier we’d show you how correlation can be measured on a graph,
and here we go!

Scatter plots are used to display the values of a pair of variables measured from the
same source, or bi-variate data.

For example, you might say the more technology before bed, the less sleep. You’d
have measured ‘amount of technology before bed’ and ‘amount of sleep’ for each
person, and plotted them. That’s a negative relationship, and looks like this:

A linear regression is the fancy name for the trend line fitted through these graphs.
Basically, it shows the relationship between the variables.

If you can spot a pattern, there is likely to be a relationship (either a causation or a
correlation) between the two variables. Just as scatter plots can show negative
relationships, they can also show positive ones, like this lot:

(Four scatter plots, labelled: Strong, Moderately strong, Weak, No relationship.)

Positive relationships (e.g. more exercise = more sleep) slope upwards towards the
right. Negative relationships (e.g. more technology = less sleep) slope downwards
towards the right.


Depending on how spread out the dots are, we say the variables have a strong,
moderately strong, weak or no relationship, like the graphs show.

So if a report gave you a scatter plot and said there was a strong positive relationship
between the two variables, you’d expect to see quite a straight line of dots sloping
upwards – otherwise you could disagree with the report.

Bivariate data is data which includes measurements of two different variables of the
same population. The name comes from the fact that ‘bi-‘ means ‘two’ and ‘variate’
refers to ‘variables’.

## STOP AND CHECK

What do you call the values the upper and lower 25% of data lie beyond?
If a relationship between two variables is positive, which way will its graph slope?

No doubt it’s pretty exciting to see some pictures in your exam. Finally! A graph!
Unless, of course, that graph is misleading. This is when graphs don’t represent
their data fairly, or are purposely biased towards a view. These can be used
to make the results appear more favourable than they actually are. So watch out!

Some examples of misleading graphs (and their effects on the validity of the study!)
include…

Incorrect Proportions

(Bar graph: No. accidents per Year, 2015 vs 2016, on a 0–50 scale. Each bar is drawn as a picture that grows wider as well as taller.)

In this graph, the pictures used to represent each year get larger horizontally as well
as vertically. The problem with this is that we’re only supposed to be looking at the
fact that 2016’s value is twice 2015’s – measuring vertically – but the 2016 picture looks
HUGE because it’s also twice as wide. This leads the viewer to think the increase is
much larger than it really is.

## Vertical axis (y) doesn’t start at 0

(Bar graph: No. accidents per Year, 2015 vs 2016, with the vertical axis starting at 10 instead of 0.)

This isn’t a true representation because the y-axis doesn’t start at zero. Therefore, the
proportions are skewed, and the viewer gets the wrong idea about the relationship
between these groups. This is what the graph should look like – notice how the
difference between the two bars looks smaller?

(The same bar graph redrawn with the vertical axis starting at 0.)

Incomplete data

(Line graph titled ‘Global Warming on the decline’, showing just six months of data trending downwards on an unlabelled 0–16 vertical axis.)


It looks like global warming’s set permanently on the decline – or not! This is only a
tiny fraction of the overall data, and you can’t draw such a massive conclusion
from such a small piece of information! It only appears as though global warming’s on
the decline because only 6 months of data are shown! If you looked at a bigger graph,
you’d see a much different overall trend!

As well as that, the y-axis is really confusing – what does it represent? What’s the unit?
We can’t read a graph if we don’t know the units!

## Two Vertical Axes

(Line graph: ‘Crime rate vs Unemployment rate’, 1999–2010, with two different vertical axes – one running 0–40, the other 0–800.)

The biggest effect of this graph is that it’s extremely confusing to read! Each y-axis is
meant to apply to one group – so basically, this is two very different graphs in one. The
huge problem here is they’re not being compared on the same scale! That means what
appears to be a relationship might really only be due to the mixed-up scales. Sneaky!

3D Skewed

(3D pie graph of game popularity: PokemonGo, League of Legends, Runescape, Candy Crush, GTAV, Fallout 4, Battlefield 3, Skyrim, Flappy Bird, The Sims 4, Farmville, Super Mario 64.)

Most graphs, especially pie graphs, are misleading when they’re presented in a 3D
format. This skews the shape of the graph so that some sections appear larger or
smaller than they really are! Here, the Skyrim section looks huge compared to the
others – but it actually isn’t.

Pay attention to the link between the article’s claim and what the graph’s showing.
For example, a study might use a graph on theft rates to prove that crime is on the
decline – but this is a very poor link to make, since the graph only represents one kind
of crime. Look out for those sorts of bogus matches!

## STOP AND CHECK

Name three ways a graph could be misleading.

COMPARING MEASURES
So far you’ve been thrown into the deep end with definitions, but now we can get to
the fun stuff, calculations!

The previous sections have gone through all the definitions you need to interpret and
analyse studies, to prove whether or not the data is reliable and the claims from the
study are justified.

In this section we’ll look at how to calculate statistical measures (basically, just numbers)
and use them to evaluate statements, so you can really prove those misleading surveys
are fibbing!

## The margin of error and 95% confidence intervals

MoE backwards
MoE for dependent probabilities
MoE for independent probabilities

## The Margin of Error and the 95% Confidence Interval

In a perfect survey, our sample statistic should match the population statistic.

Because of sampling errors, the value in the sample will always be a little different
to the population value – like we said, samples don’t exactly match the population.


## We want our sample to accurately represent the population, don’t we?

Well, it’ll never be perfect, but how do we tell whether it is ‘close enough’? How do
we see how much sampling error we have?

## The margin of error

The margin of error is a measure of how uncertain we are about where the truth is –
basically it’s how we say, yeah, we know we have sampling error, but here’s how close
we think we are. We can say:

‘Well, I’m 95% sure the population proportion is within this distance from the sample
proportion.’

A margin of error is a small allowance which is made to allow for normal variability or
small unavoidable errors. It tells us how far the true value may be from the sample value.

## We might want to translate the margin of error into a confidence interval:

‘We can be pretty sure the true population value lies between ____ and ____’

The 95% confidence interval is the sample’s statistic plus or minus the margin of error.

## And what is it for?

To give a range that we’re 95% confident the population statistic will be in,
because if we were to repeat more samples, 95% of those would be in this range.

So that’s easy, then! If we have the margin of error, we can find the confidence interval.
Take your original p – the sample proportion – and add and subtract the margin of error.

It’s called a 95% confidence interval because we’re confident that if we were to take
hundreds more samples, 95% of them would have their proportions fall within the
margin of error, so we’re also pretty sure the population proportion is in that range.

A confidence interval describes the values we’re 95% confident the true population
value lies between.


When you calculate the margin of error and confidence intervals in this standard,
you’ll be working with proportions or percentages. We’ll show you how to do this with
an example:

## How to find the margin of error and confidence interval

Let’s look at the sort of question you might need to find a margin of error for.

Say a sample of high school kids ended up telling us that 0.65 or 65% of students have
iPhones. Does this mean that 65% of the whole population of high school students
have iPhones?

Not necessarily! We’d make a margin of error and say that the real statistic (the
population’s) is 95% likely to be a small amount below or above 0.65 or 65%.

But how do we find that small amount? What’s the range above and below 65% of
students that we are 95% sure the population mean falls within?

We need to find the margin of error, and we usually do that using what’s called the
rule of thumb.

Margin of Error = ± 1/√n

n is the sample size – for example, 100 students

The ± is plus or minus – remember, our confidence interval is (sample proportion
+ margin of error) as well as (sample proportion – margin of error)

Now let’s plug in our numbers and go. There were 100 students surveyed.

Margin of Error = ± 1/√100

Margin of Error = ± 0.1 or 10%

So our two end points for the confidence interval are 65 – 10 = 55%,
and 65 + 10 = 75%.

In other words, we can say that, based on the sample, ‘We are 95% certain that
between 55% and 75% of the people at this school have an iPhone’.
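The worked example above can be bundled into a few lines of Python – a helper sketch (the function names are ours, not anything official):

```python
import math

def rule_of_thumb_moe(n):
    """Rule-of-thumb margin of error for a sample of size n, as a proportion."""
    return 1 / math.sqrt(n)

def confidence_interval(p, n):
    """95% confidence interval: sample proportion plus/minus the margin of error."""
    moe = rule_of_thumb_moe(n)
    return p - moe, p + moe

# 100 students surveyed, 65% have iPhones
low, high = confidence_interval(0.65, 100)
print(f"MoE = ±{rule_of_thumb_moe(100):.0%}")   # ±10%
print(f"95% CI: {low:.0%} to {high:.0%}")       # 55% to 75%
```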


This rule of thumb is actually a simplification of a more complicated formula, and
because of that, it only works well for proportions between 0.3 and 0.7. For proportions
outside that range, it overestimates the MoE. But reports often use it anyway, so it’s
also called the ‘reported margin of error’.

So, if the question asks you if the reported margin of error was accurate, you need to
check if the sample probability is between 30% and 70% or not!

## STOP AND CHECK

What is the margin of error?
How does a confidence interval relate to the margin of error?
When SHOULDN’T you use the rule of thumb method?

## Using the confidence interval for claims

Calculating is easy, but then come the questions. Examiners like to ask you things like,

‘Is the claim that the majority of high school students have an iPhone justified?’

What do you say? Well, first of all – what defines majority? Obviously, the majority is
going to be anything over half – if more than half your friends want to eat at Nando’s,
then you go to Nando’s. It’s common sense.

So if the sample percentage is over 50% AND the whole confidence interval is over
50% too – then yes, the claim is justified!

## For us, we’d say:

‘The claim that ‘the majority of high school students have an iPhone’ is justified as the
95% confidence interval is between 55% and 75%, which does not include anything
less than 50%.’
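That ‘does the whole interval stay above 50%?’ check can be sketched like this (again, the function name is just ours):

```python
def majority_claim_justified(p, n):
    """A 'majority' claim is justified when the ENTIRE 95% confidence
    interval sits above 50% - even the lower end beats one half."""
    moe = 1 / n ** 0.5
    return p - moe > 0.5

# 65% of a sample of 100 -> CI is 55% to 75%, all above 50%
print(majority_claim_justified(0.65, 100))  # True

# 55% of a sample of 100 -> CI is 45% to 65%, which dips below 50%
print(majority_claim_justified(0.55, 100))  # False
```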

## STOP AND CHECK

What do we need to have for a majority claim to be justified?


## MoE backwards
Sometimes the examiners won’t ask you to find the MoE. Sometimes they’ll give you
the margin of error, and tell you to find the number of people in the study!

‘In an ink blot test, ink blots on cards are held up and the individual is asked to identify
the shape. 35% identified an L&P bottle. The test has a margin of error of 3.0%. How
many people were tested?’

So what do we do?? We have the margin of error, we have some statistics – but no
number of people in the study! Fear not – let’s just go back to our MoE equation!

Margin of Error = ± 1/√n

We simply plug our number into this equation, rearrange, and solve to find n!

The Margin of Error is 0.03.

0.03 = 1/√n

First thing we’ll do is times both sides by √n to get rid of that nasty fraction!

0.03√n = 1

## Now, let’s divide both sides by 0.03.

√n = 1/0.03

√n = 33.33

Finally, we square both sides, because squaring is the opposite of square rooting!

n = 33.33…² ≈ 1111.1, or 1,111 people

Deep sigh of relief – we made it! We know there were about 1,111 people in the ink
blot study – you can’t have 0.1 of a person, so we round to a whole number. Let’s
bullet point those steps!


Plug your MoE into the MoE equation
Bring the √n up by timesing it on both sides
Divide both sides by the MoE
Square both sides
Throw a party because you’re on the golden road to excellence!
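Those bullet-point steps boil down to a one-line rearrangement – here’s a sketch (the helper name is our own invention):

```python
def sample_size_from_moe(moe):
    """Rearrange MoE = 1/sqrt(n) to n = (1/MoE)^2, rounded to whole people."""
    return round((1 / moe) ** 2)

print(sample_size_from_moe(0.03))  # 1111 - the ink blot study
print(sample_size_from_moe(0.10))  # 100 - a ±10% MoE needs 100 people
```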

## STOP AND CHECK

How do you find the backwards MoE?

## Dependent Probabilities

Cool, so we’ve covered how to find margins of error and use them to answer questions.

These next two sections are all about comparing statistics.

Dependent probabilities are when we compare two groups that are connected
under one question. Basically, we compare the proportions of two options for the
same question.

• Do men with moustaches earn more money than men without moustaches?
• Did people who used ‘This’ brand of sunscreen get less burns than people who
used ‘That’ brand?

Say we conducted a study on sunscreen brands, where our margin of error is something
like ± 2%. In the study, 30% of people used Burn-Be-Gone, and 60% used Sayonara-
Sunburn. All the percentage left over is just no sunscreen, or other brands.

We might want to compare how many people use those two brands. We claim,

‘The percentage of people who use Sayonara-Sunburn is 30% greater than the
percentage of people who use Burn-Be-Gone.’

Now for the margin of error. If it’s ±2%, that gives us a confidence interval of:

28 - 32% for Burn-Be-Gone
58 - 62% for Sayonara-Sunburn.

THAT means Burn-Be-Gone usage could boost up to 32% (30% + 2%). And if it
does, then that’s likely to result in the usage of Sayonara-Sunburn dropping down to
58% (60% - 2%).



They’re dependent on each other. If one goes up, the other might drop down, which
means the difference between them is going to vary.

Since the difference changes we can’t make an exact claim on the difference.

• We can’t claim, ‘The percentage of people who use Sayonara-Sunburn is exactly
30% greater than the percentage of people who use Burn-Be-Gone.’

So what do we do? We find a confidence interval for the difference.

If there are dependent probabilities, then the margin of error of the difference
between these two answers is around 2 x the margin of error for the entire study.

Margin of Error of Difference = 2 x Total Margin of Error

Example? We know our difference is 30%. We know our total margin of error is ±2%.
So our margin of error of that difference is:

Margin of Error of Difference = 2 x ±2%

Margin of Error of Difference = ±4%

Aha! So our confidence interval for the difference between the two percentages is:

Confidence interval = difference ± Margin of Error of Difference

Confidence interval = (30% - 4%) to (30% + 4%)
Confidence interval = 26 – 34%



This means the confidence interval for the actual difference between the number of
people who use Burn-Be-Gone or Sayonara-Sunburn is between 26 – 34%!

Makes sense, right? We said if Burn-Be-Gone went down to 28% then Sayonara-
Sunburn would go up to 62%, so the difference would be 62% - 28% = 34% – and if Burn-
Be-Gone went up to 32% then Sayonara-Sunburn would go down to 58%, so the
difference would be 58% - 32% = 26%.
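The sunscreen example can be sketched in Python, using the 2 x MoE rule above (the function itself is our own illustration):

```python
def dependent_difference_ci(p1, p2, moe):
    """CI for the difference between two proportions from the SAME question:
    the MoE of the difference is roughly 2 x the survey's margin of error."""
    diff = abs(p2 - p1)
    moe_diff = 2 * moe
    return diff - moe_diff, diff + moe_diff

# 30% Burn-Be-Gone vs 60% Sayonara-Sunburn, survey MoE ±2%
low, high = dependent_difference_ci(0.30, 0.60, 0.02)
print(f"difference CI: {low:.0%} to {high:.0%}")  # 26% to 34%
```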

## STOP AND CHECK

How do you know if two stats are dependent?
Why do we need to find a confidence interval for the difference between the stats?

## Independent Probabilities
If you got two different questions and want to compare their stats – you’ve got
independent probabilities! The two proportions aren’t going to affect each other if
they’re two separate surveys or questions.

Say you conducted a survey way back in 2005 to see if employers are more likely to
hire people with tattoos. 45% of employers showed support, with a MoE of 3.5%. Ten
years later, you conduct another survey looking at the same thing, and this time 48.5%
of employers showed support, with a MoE of 2.0%. Is support for tattoos rising?

For starters, the surveys are independent. They happen ten years apart! So how do
we compare their stats? How do we know if support for tattoos is going up or down?
We can’t use the same rule as for dependent probabilities, because these things don’t
affect each other!

We have a shiny new rule of thumb method for this exact dilemma.

Find the margin of error of the DIFFERENCE between the two surveys, and see if
it is entirely above 0, entirely below 0, or crosses 0!

At the moment, the difference between the two sample statistics are:

48.5 – 45 = 3.5

## Difference in support = 3.5%



Our aim is now to find the margin of error of this 3.5%. The two surveys don’t have the
same MOE – so what do we do? Use this:

MoE of difference = ±1.5 × (MoE1 + MoE2) / 2

That fraction on the end finds the average of the two margins of error we’ve been
given. Let’s take this equation for a spin – oh, hold up, what are we plugging into it?

MoE1 = 3.5%, because this is the margin of error for the 2005 study
MoE2 = 2.0%, because this is the margin of error for the 2015 study

MoE of difference = ±1.5 × (3.5 + 2.0) / 2

MoE of difference = ±4.125%

Now find the confidence range/interval of this difference. If the range is fully positive,
or fully negative, then we’re pretty confident there’s a real difference between the two
statistics. If the range includes zero, we know there might not be any difference at all
– and so we wouldn’t be able to confirm that support for people with tattoos is going
up in this case.

(48.5 – 45) ± 4.125

3.5 ± 4.125

–0.625% to 7.625%

Look at that range. One end of it is a negative number and on the other end is a
positive number! That means it includes 0 – it includes the possibility that there might
be no difference between the surveys.


All up, this range means that the second statistic might actually have been up to
7.625% higher than the first, but it also might have been up to 0.625% lower – so we
cannot support the claim that support for people with tattoos has risen over ten years!
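The tattoo-survey comparison can also be sketched, using the ±1.5 × average-MoE rule of thumb above (again, the helper name is our own):

```python
def independent_difference_ci(p1, moe1, p2, moe2):
    """CI for the difference between proportions from two SEPARATE surveys:
    MoE of the difference ~ 1.5 x the average of the two surveys' MoEs."""
    moe_diff = 1.5 * (moe1 + moe2) / 2
    diff = p2 - p1
    return diff - moe_diff, diff + moe_diff

# 2005: 45% support (MoE 3.5%); 2015: 48.5% support (MoE 2.0%)
low, high = independent_difference_ci(0.45, 0.035, 0.485, 0.020)
print(f"difference CI: {low:.3%} to {high:.3%}")
print("interval crosses zero:", low < 0 < high)  # True - can't confirm a change
```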

## What’s the point of comparing statistics?

You might be asking by now, What’s the point of learning all this? or how am I gonna
use this in my exam? or why am I still reading this walkthrough guide?

So what is the point of comparing statistics? You’ll use it when:

An article compares two statistics – something like, ‘silver cars are less likely to get in a
serious crash than white cars’. We’d calculate the margin of error of their difference,
then use this as evidence to support or not support the claim made!
These two statistics need to be CLOSE – like we just had 45% and 48.5%. If there’s
a huge gap – say, 20% and 70% – there’s obviously a difference there that doesn’t
need a bunch of calculations to back it up.

Always keep the purpose of these things in mind – that’s the reason you’ll be using
all this comparing statistics stuff – you’re looking at whether the people in the articles
compared their statistics correctly! Do your numbers match up with their numbers?
Can any of their claims be justified?

## What does comparing statistics give evidence for?

How do you find a MoE for the difference between two dependent proportions?

How do you find a MoE for the difference between two independent proportions?

? Quick Questions:
1. A street survey found that 41% of the 500 people surveyed have both a body
piercing and a tattoo. Find the margin of error and the confidence interval for
this poll.
2. Out of the 59% who do not have both a body piercing and a tattoo, 45% say
it makes a person less attractive. What is the margin of error and confidence
interval of this 45% statistic?
3. Is it likely that the majority of people without tattoos and piercings feel a tattoo
makes a person less attractive?

Statistically Based Reports | Key Terms

KEY TERMS
Variable:
A feature that’s able to change.

Explanatory Variable:
The thing that’s controlled and changed. It’s usually the cause or the explanation
of the other variable.

Response/outcome Variable:
The focus. We measure how much it changes when the explanatory variable changes.

Confounding variables:
Outside factors that affect the study results and aren’t controlled.

Evaluate:
This means you find whether surveys are true, fair, unbiased and well-represented,
or not.

Causal claim:
Saying that the explanatory variable directly causes the response variable to change.

Sampling:
When we take a small group from the population and use it to represent the entire
population.

Representative:
A sample is representative if it has the same kind of mix of people as the population,
so it can be used to make claims about the population.

Sample frame:
A list of all the members of the population, used to choose a sample.

Bias:
When a sample tends to get a particular kind of answer that is different from the
true population value.

Sampling bias:
When a sample overrepresents or underrepresents certain groups in the population,
so the sample is not representative.

True Population Value:
The real statistics of the real population. This is what we want to have a guess at
using our sample.


Sampling Errors:
These happen because data is collected from a sample rather than the whole
population – and one sample will never perfectly reflect the population.

Non-sampling errors:
These happen if a sample has bias or doesn’t accurately represent the population.

Dependent Probabilities:
When we compare two groups that are connected under one question.

Independent Probabilities:
When you have two different questions and want to compare their stats. They’re not
going to affect each other if they’re two separate surveys or questions – which is
what makes them independent.

Majority:
Anything over half.

95% Confidence Interval:
The sample’s statistic plus or minus the margin of error. This gives a range that we’re
95% confident the true population statistic is in, because if we took lots more samples
and built an interval from each, about 95% of those intervals would contain the
population statistic.

Margin of Error:
A distance from the sample statistic that we’re 95% sure the population
statistic falls within.

Causality:
When one variable causes another. As one changes, it makes the other change.

Correlation:
When both variables change. As one variable changes, the other one tends to, but
this doesn’t mean they make each other change.

Sampling Methods
Random sampling:
Where each bit of the population is numbered off and has an equal chance of
being selected.

Systematic sampling:
Where the parts of a population are ordered, and then every nth one is picked from
a random starting point.


Stratified sampling:
When the researcher tries to make the sample look as much like the population as
possible by grouping it using a different variable and then sampling proportionately
within each group.

Cluster sampling:
When you grab whole groups and use them as your sample.

Quota sampling:
Where the researcher has to have a certain amount of minority groups in their
sample so they’re not underrepresented or ignored.

Person in the Street:
This is where people are accosted by TV crews and such to get their opinion.

Self-selected sample:
Think web research, text-votes, phone polls, or voluntary surveys. Where people
choose to be part of the sample.

Equations
The Rule of Thumb for a Margin of Error:
Margin of Error = ± 1/√n
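You can check a rule-of-thumb margin of error with a calculator or a couple of lines of Python – it’s just one over the square root of the sample size (the function name here is ours, purely for illustration):

```python
import math

def margin_of_error(n):
    """Rule-of-thumb 95% margin of error for a sample of size n."""
    return 1 / math.sqrt(n)

# A poll of n = 500 people:
print(round(margin_of_error(500) * 100, 1))  # about 4.5 (%)
```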
Dependent Probabilities:

Margin of Error of Difference = 2 × Total Margin of Error

Independent Probabilities:

MoE of difference = ±1.5 × (MoE1 + MoE2) / 2
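Here are both rules side by side as a quick Python sketch for checking answers (the function names are ours, not official notation):

```python
def moe_dependent(moe):
    """MoE of the difference between two dependent proportions
    (two connected groups under one question): 2 x the poll's MoE."""
    return 2 * moe

def moe_independent(moe1, moe2):
    """MoE of the difference between two independent proportions
    (separate questions/surveys): 1.5 x the average of the two MoEs."""
    return 1.5 * (moe1 + moe2) / 2

print(moe_dependent(4.0))         # 8.0
print(moe_independent(3.5, 2.0))  # 4.125
```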