
Statistics for Data Science

- Sonal Ghanshani
Statistics
Why Statistics?
▪ Data are everywhere
▪ Statistical techniques are used to make many decisions that affect our lives
▪ No matter what your career, you will make professional decisions that involve data. An
understanding of statistical methods will help you make these decisions effectively
What is Statistics?
▪ Statistics is the science of assigning a probability to an event based on experiments. It is the
application of quantitative principles to the collection, analysis, and interpretation of numerical
data.

▪ Statistics presents a rigorous scientific method for gaining insight into data.

▪ For example, suppose we are studying the ages of engineering graduates in a research study.

▪ With so many measurements, simply looking at the data fails to provide an informative account.

▪ However, statistics can give an instant overall picture of the data through graphical presentation or
numerical summarization, irrespective of the number of data points.

▪ Besides data summarization, another important task of statistics is to make inferences and predict
relationships between variables.
How Does Statistics Work?
▪ Statistics utilizes data from a population to analyse and draw conclusions. Based on these
conclusions, a decision is taken.
▪ The population may be a community, an organization, sales details, weather details, etc.
▪ Statisticians determine the quantitative model that suits a given type of problem. Then
they decide the kind of data that should be collected and examined.
Types of Statistics
Statistical Methods: Descriptive and Inferential
▪ Descriptive statistics
• Methods of organizing, summarizing, and presenting data in an informative way

▪ Inferential statistics
• The methods used to determine something about a population on the basis of a sample
o Population – The entire set of individuals or objects of interest, or the measurements obtained from all
individuals or objects of interest
o Sample – A portion, or part, of the population of interest
Descriptive statistics

Collect Data

Summarize Data

Present Data
Population and Sample
▪ A population is the set of all measurements of interest to the study.
▪ A sample is a selected subset of measurements of a population to represent the
population.
Population and Sample
▪ Market Share of a Product
• Suppose you need to estimate the market share of a specific detergent product, say, Tide
• The population here is the entire population
• The sample is a set of supermarkets/shops
• Market share is calculated on the sample, not the population
Sources of Data
▪ Primary Data
• Surveys
o Mail: Lowest rate of response, usually the lowest cost
o Web: Faster response and inexpensive
o Telephone: Fastest response
o Personal Interview: Usually focus groups. Most costly. Interviewer effects can be seen

▪ Secondary Data
• This is the data that has been compiled or published elsewhere
• Example: Census Data
• Advantages: It can be gathered quickly and inexpensively
• Disadvantages: May be outdated. May not be accurate
Errors
▪ Response Errors
• Subject lies
• Subject makes a mistake
• Interviewer makes a mistake
• Interviewer effects

▪ Non-Response Errors
• If the rate of response is low, the sample may not be representative
• Might get a biased view of the population
Which is better?
Sample 1
▪ N = 2000
▪ Response rate = 90%

Sample 2
▪ N = 1,000,000
▪ Response rate = 20%
Which is better?
▪ A small but representative sample can be useful in making inferences
▪ A large sample that is unrepresentative is biased and therefore useless. There is no
way to correct for it
▪ Therefore, sample 1 is better than sample 2
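
The comparison above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population (half of it holding some opinion), the sample sizes, and the response rates by group are all hypothetical, chosen so that the large sample has an overall response rate of about 20% and the small one about 90%.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = holds the opinion, 0 = does not (true share is 50%).
population = [1] * 100_000 + [0] * 100_000
true_share = sum(population) / len(population)

# Sample 1: small sample, 90% response rate, nonresponse unrelated to opinion.
small = random.sample(population, 2_000)
small_responses = [x for x in small if random.random() < 0.9]

# Sample 2: large sample, about 20% response overall, but people holding the
# opinion respond far more often (assumed 30% versus 10%).
large = random.sample(population, 100_000)
large_responses = [x for x in large if random.random() < (0.3 if x == 1 else 0.1)]

small_estimate = sum(small_responses) / len(small_responses)
large_estimate = sum(large_responses) / len(large_responses)

print(f"true share:            {true_share:.3f}")
print(f"small, representative: {small_estimate:.3f}")
print(f"large, biased:         {large_estimate:.3f}")
```

Under these assumptions the small sample lands close to the true share, while the large sample stays far from it no matter how many responses it collects, because the extra responses come from the same skewed group.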
Example
▪ Television Ratings
• Nielsen publishes TRP (Television Rating Points) ratings for media content
• Ratings are based on a sample, not the population
• Population size is around 400 million and sample size is usually 50,000
• If a show has a 15.2 rating, it means 15.2% of the sample were watching the show

▪ Broadcast Audience Research Council (BARC)


• Broadcasters, advertisers, and advertising and media agencies need to have credible information about television
viewing habits in a country with an estimated television audience of 183 million homes and growing
• Doordarshan emerged as the most watched Hindi channel in terms of time spent per viewer ahead of top General
Entertainment Channels
Selecting a Sample
▪ In real-world situations, it is often impractical to study the entire population.
▪ For example, a company that manufactures mobile phones cannot interact with every
customer who purchased the product. In such cases, the company selects a sample of the
population.
▪ In order to apply statistics and study the population, the sample must be random.
▪ A random sample is one in which each member of the population has an equal chance or
probability of being selected. It reduces the chances of human bias, making the sample more
likely to be representative of the population.
Selecting a Sample
▪ Non-probability Samples
• Convenience Sample
o Students in a class, people in a mall, people in a neighborhood
• Judgement Sample
o Based on the researcher's judgement as to what constitutes “representativeness”
o Example: He/she might say these 20 stores are representative of the whole chain
• Quota Sample
o Quotas based on demographics, for instance
o 100 subject interviews – 50 male and 50 female. Of the 50, 10 nonwhite and 40 white

▪ The problem here is that we don’t know how representative our sample is of the population
Selecting a Sample
▪ Probability Samples

▪ A sample collected in such a way that each point in the population has a known “chance” of getting selected
• Simple Random Sample (SRS)
o Every population element has an equal chance of getting selected
• Systematic Random Sample
o Choose the first element randomly and then select every kth element, where k = N/n (N = population size, n = sample size)
• Stratified Sample
o The population is subdivided based on a characteristic and an SRS is conducted within each stratum
• Cluster Sample
o First take a random sample of clusters from the population of clusters. Then take an SRS within each cluster. Example:
election districts, orchards
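
The four probability sampling schemes above can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the function names, the toy population of 100 numbered units, the even/odd strata, and the ten clusters of ten units are all assumptions made for the example.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, rng):
    """Every element has the same chance of selection (no replacement)."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Random start, then every k-th element, with k = N // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Split the population into strata, then take an SRS within each stratum."""
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    return [x for members in strata.values()
            for x in rng.sample(members, n_per_stratum)]

def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Pick clusters at random, then take an SRS within each chosen cluster."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [x for name in chosen
            for x in rng.sample(clusters[name], n_per_cluster)]

rng = random.Random(42)                 # seeded generator for repeatability
population = list(range(100))           # hypothetical units numbered 0..99
clusters = {d: list(range(d * 10, d * 10 + 10)) for d in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, lambda x: x % 2, 5, rng)
clustered = cluster_sample(clusters, 2, 3, rng)
print(srs, systematic, stratified, clustered, sep="\n")
```

In the systematic sample, consecutive selected units are always k = 10 apart, which makes the "every kth element" rule easy to see in the output.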
Types of Data
Categorical Data
▪ This refers to data that can be classified into separate groups.
▪ It is also called qualitative data.
▪ This data represents characteristics.
▪ For example, the gender of a person can be male or female. It can also be coded numerically,
such as 1 for male and 0 for female, but the codes are labels, not quantities.
▪ Categorical data can be further classified as nominal or ordinal.
Types of Data
Numerical Data
▪ Data that can be measured is called numerical data.
▪ It is also called quantitative data.
▪ Discrete data
• If the values can be clearly separated from each other, then it is discrete data.
• Example: number of children

▪ Continuous data
• If the values can fall anywhere within a range, then it is continuous data.
• Example: height of a person
Types of Data
Numerical Data
▪ One simple way to check whether the data is continuous or discrete is to ask whether we can
meaningfully add more decimal places to the data
▪ You might say you are 5’11’’ tall. But in actuality you may be 5’11.23432” tall
▪ If you say you have 2 children, you cannot have 2.234545 children
Types of Data
Scales of Measurement
Scales of Measurement - Nominal
▪ All we can say is that one value is different from another
▪ Gender: Male, Female, Transgender
▪ Eye color: Blue, Green, Brown, Hazel
▪ Type of house: Bungalow, Duplex, Ranch
▪ Type of pet: Dog, Cat, Rodent, Fish, Bird
Scales of Measurement - Ordinal
▪ Ordinal scale of measurement refers to ordered series of relationships or rank order.
▪ The ordinal scale contains data that can be placed in order.
▪ Ordinal scales do not represent a measurable quantity. It is difficult to measure the
interval between the values.

▪ High school class rankings: 1st, 2nd, 3rd, etc.
▪ Socioeconomic class: working, middle, upper
▪ The Likert Scale: disagree, agree, strongly agree
Scales of Measurement - Interval
▪ An interval scale has measurements where the differences between values are equal.
▪ The Fahrenheit scale - the difference between 100 degrees and 80 degrees is the same
as the difference between 70 degrees and 50 degrees.
▪ Interval variables do not have a meaningful zero-point.
▪ For example, zero degrees does not mean that there is no temperature at all.
▪ If the zero becomes meaningful, then it is termed ratio scale.
Scales of Measurement - Ratio
▪ In ratio scale, zero is meaningful.
▪ In this scale, no numbers below zero exist, i.e., it has absolute zero.
▪ Arithmetic operations can be performed on a ratio scale.
▪ If the length of a piece of cloth is measured in inches, then the measurement
cannot be less than zero. A negative length is not possible.
Scales of Measurement
▪ The differences between the four scales of measurement can be easily understood from the table:

▪ It is clear from the table that ratio scale satisfies all the four properties of scales of measurements
What do we do with the data?
Descriptive statistics

Collect Data

Summarize Data

Present Data
Taxonomy of Statistics
Statistical Methods: Descriptive and Inferential
Univariate measures: Measure of Central Tendency, Measure of Dispersion, Measure of Shape
Measure of Central Tendency
Measures of Central Tendency
▪ A measure of central tendency is a summary measure that attempts to describe a whole
set of data with a single value that represents the middle or centre of its distribution.
▪ There are three main measures of central tendency: the mean, median, and mode. Each
of these measures describes a different indication of the typical or central value in the
distribution.
Measures of Central Tendency
Mean
▪ The arithmetic mean is the most widely used average.
▪ For any set of data on the variable x, the mean is denoted by $\bar{x}$ and is obtained by
dividing the sum of the observations by their number:
▪ $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Example
Calculate the mean for the following dataset:

1 2 2 4 5 10

Solution:
$\mu = \frac{1 + 2 + 2 + 4 + 5 + 10}{6} = \frac{24}{6} = 4$

Now change just one data point in the dataset

1 2 2 4 5 70

Solution:
$\mu = \frac{1 + 2 + 2 + 4 + 5 + 70}{6} = \frac{84}{6} = 14$
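
The two calculations above can be reproduced directly from the definition of the mean. The sketch below uses plain Python; the helper name `arithmetic_mean` is only illustrative.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)

original = [1, 2, 2, 4, 5, 10]
with_outlier = [1, 2, 2, 4, 5, 70]   # only the last data point has changed

mean_original = arithmetic_mean(original)
mean_outlier = arithmetic_mean(with_outlier)

print(mean_original)  # 4.0
print(mean_outlier)   # 14.0
```

Changing a single observation from 10 to 70 moves the mean from 4 to 14, which is larger than five of the six values in the dataset.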
Properties of Mean
▪ It is greatly affected by extreme values

▪ Sum of differences about the mean is zero

$\sum_{i=1}^{n} (x_i - \mu) = 0$

▪ Mean is unique for a given dataset
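
A short derivation shows why the differences about the mean always sum to zero. It uses only the definition of the mean, written here with $\mu$ for a dataset of $n$ observations:

```latex
\sum_{i=1}^{n} (x_i - \mu)
  = \sum_{i=1}^{n} x_i - n\mu
  = n\mu - n\mu
  = 0,
\qquad \text{since } \mu = \frac{1}{n} \sum_{i=1}^{n} x_i .
```

For the dataset 1, 2, 2, 4, 5, 10 with $\mu = 4$, the differences are $-3, -2, -2, 0, 1, 6$, and they add up to 0.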


Measures of Central Tendency
Median
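▪ The middlemost value of the set when the elements are arranged in either ascending or
descending order is called the median of a set of observations on a variable x.
▪ When the number of observations is even, $n = 2m$, the median is the mean of the two middle values of the ordered data:
$\tilde{x} = \frac{x_m + x_{m+1}}{2}$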
Measures of Central Tendency
Median
▪ It is the “middle” of the data
▪ The median is the number such that at least half of the dataset is less than or equal to it
and at least half is greater than or equal to it
▪ To get the median, we must first rearrange the data into an ordered array
▪ If 𝑛 is odd, the median is the middle observation of the ordered array.
▪ If 𝑛 is even, it is midway between the two central observations
Measures of Central Tendency
Median
▪ Robert hit 11 balls at Grimsby driving range. The recorded distances of his drives,
measured in yards, are given below. Find the median distance for his drives.
85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70

▪ Ordered data
50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 140

▪ Single middle value (the 6th of 11)

▪ Median drive = 85 yards

• In golf stroke mechanics, a drive, also known as a tee shot, is a long-distance shot played from the tee box, intended to move the ball a great
distance down the fairway towards the green.
• A tee is a stand used to support a stationary ball so that the player can strike it
Measures of Central Tendency
Median
▪ Robert hit 12 balls at Grimsby driving range. The recorded distances of his drives,
measured in yards, are given below. Find the median distance for his drives.
85, 125, 130, 65, 100, 70, 75, 50, 140, 135, 95, 70

▪ Ordered data
50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 135, 140

▪ Two middle values (85 and 95), so take the mean

▪ Median drive = 90 yards

Properties of Median
▪ Median is unique for a dataset
▪ Median is not affected by extreme values
▪ Any observation selected at random is just as likely to be greater than the median as less
than the median
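
The odd/even rule for the median and its resistance to extreme values can be checked with a short sketch. The function name `median_of` is illustrative; the data are the driving-range distances for 11 and 12 drives, and the 1400-yard drive is a made-up extreme value used only to test robustness.

```python
def median_of(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd n: single middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: midway between two

eleven_drives = [85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70]
twelve_drives = eleven_drives + [135]

median_eleven = median_of(eleven_drives)
median_twelve = median_of(twelve_drives)

# Replace the longest drive with an implausible 1400 yards (hypothetical value).
extreme = [1400 if d == 140 else d for d in eleven_drives]
median_extreme = median_of(extreme)

print(median_eleven, median_twelve, median_extreme)  # 85 90.0 85
```

The extreme value changes the mean of the 11 drives substantially but leaves the median at 85 yards.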
Measures of Central Tendency
Mode
▪ The mode is the value that occurs most frequently in a distribution.

▪ It might not be unique. A dataset can have more than one mode.

▪ Example:
1 1 1 2 3 4 5

Mode = 1

5 5 5 6 8 10 10 10

Mode = 5, 10

▪ This is a “bimodal” dataset
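
Because a dataset can have more than one mode, a mode function should return every value that attains the highest frequency. The sketch below does this with `collections.Counter`; the helper name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

unimodal = [1, 1, 1, 2, 3, 4, 5]
bimodal = [5, 5, 5, 6, 8, 10, 10, 10]

modes_unimodal = modes(unimodal)
modes_bimodal = modes(bimodal)

print(modes_unimodal)  # [1]
print(modes_bimodal)   # [5, 10]
```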


Comparisons of Measures of Central Tendency
Comparison
▪ The “Hotshot” Sales Executive
▪ Kurt works as a sales manager at vsellhomes.com. In the monthly sales review, Kurt reports that he will
achieve his quarterly target of $1M.
▪ Kurt claims his average deal size is $100,000 and he has 10 deals in his pipeline.
▪ At the end of the quarter, even after closing 8 deals, Kurt fails to meet his target and falls short by
more than $500,000.
Comparison
▪ The Reality of the “Hotshot” Sales Executive
▪ Average deal size in pipeline = $100,000
▪ Deal #10 is of significantly higher value than all the other deals and impacts the average calculation
▪ Median = $55,000 is a more realistic measure

Deal #   Deal Value   Deal Status
1        70,000       Open
2        50,000       Closed
3        55,000       Closed
4        60,000       Closed
5        55,000       Closed
6        50,000       Closed
7        50,000       Closed
8        60,000       Closed
9        50,000       Closed
10       500,000      Open
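
The figures in this example can be verified from the deal table. The sketch below uses Python's standard `statistics` module; the variable names and the tuple layout are illustrative, and the $1M target is the quarterly target stated earlier.

```python
from statistics import mean, median

# (deal number, deal value in dollars, status) taken from the table
deals = [
    (1, 70_000, "Open"), (2, 50_000, "Closed"), (3, 55_000, "Closed"),
    (4, 60_000, "Closed"), (5, 55_000, "Closed"), (6, 50_000, "Closed"),
    (7, 50_000, "Closed"), (8, 60_000, "Closed"), (9, 50_000, "Closed"),
    (10, 500_000, "Open"),
]

values = [value for _, value, _ in deals]
average_deal = mean(values)      # pulled up by deal #10
median_deal = median(values)     # typical deal size

closed_total = sum(value for _, value, status in deals if status == "Closed")
shortfall = 1_000_000 - closed_total

print(average_deal, median_deal, closed_total, shortfall)
```

The eight closed deals add up to $430,000, so the shortfall against the $1M target is $570,000, consistent with "more than $500,000".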
Comparison
▪ A report says that “the median credit card debt of American households is zero.” We know that many
households have large amounts of credit card debt. In fact, the mean household credit card debt is close to
$8000.

▪ Is the report correct?


Comparison
▪ The mean and the median are rigidly defined. The position of the mode is somewhat similar to that of the
median.

▪ All the measures are easy to interpret and not too difficult to compute.

▪ Only the mean directly depends on all the observations. A change in any one of the observations influences
the value of the mean. The median and mode are not so sensitive.

▪ The mean is, generally, the best measure of central tendency. When there are extreme values, the median is
a better measure of central tendency.
Scale of Measurement and Measure of Central Tendency

Nominal    Mode
Ordinal    Mode and Median
Interval   Mode, Median and Mean
Ratio      Mode, Median and Mean
Example
▪ For studying smoking habits:
• A. Do you smoke? Yes or No
• B. How many cigarettes did you smoke in the last 3 days?

▪ A is nominal, and hence we can get frequencies
▪ B is ratio data, so we can get the mean, median and mode
▪ Always select the highest level of measurement possible
Example
Data Type - Taste test

▪ A. Please rank the taste of the following soft drinks:
• Coke
• Pepsi
• Thumsup
• Sprite

▪ B. Please rate the taste of the following soft drinks:
• Coke
• Pepsi
• Thumsup
• Sprite

▪ A is ordinal data
▪ B is interval data, which is better than ordinal
Measures Of Non-central Tendency
Quartiles
▪ Splits the dataset into four equal quarters

▪ To cut a dataset into 4 parts we need 3 markers


▪ 𝑄1 or the First Quartile
▪ 𝑄2 or the Second Quartile
▪ 𝑄3 or the Third Quartile

▪ The quartiles, like the median, either take the value of one of the observations, or the value
halfway between two observations
Quartiles
▪ 𝑄1 - 25% of the observations are smaller than 𝑄1 and 75% of the observations are greater than 𝑄1

▪ 𝑄2 - 50% of the observations are smaller than 𝑄2 and 50% of the observations are greater than 𝑄2

▪ 𝑄3 - 75% of the observations are smaller than 𝑄3 and 25% of the observations are greater than 𝑄3

▪ If n/4 is an integer, the first quartile has the value halfway between n/4th observation and the next
observation

▪ If n/4 is not an integer, the first quartile has the value of the observation whose position
corresponds to the next higher integer
Quartiles
▪ Score of students in a test
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10

▪ No. of observations = 11; 11/4 = 2.75 is not an integer, so we use the 3rd value

▪ The 3rd value from the left and the 3rd value from the right in the ordered data will be 𝑄1 and 𝑄3

▪ Ordered Data
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

▪ 𝑸𝟏 = 4; 𝑸𝟐 = 8; 𝑸𝟑 = 10
Quartiles
▪ Score of students in a test
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10, 8

▪ No. of observations = 12; 12/4 = 3 is an integer, so we use the average of the 3rd & 4th values

▪ The average of the 3rd & 4th value from the left and the right in the ordered data will be 𝑄1
and 𝑄3

▪ Ordered Data
3, 4, 4, 6, 8, 8, 8, 8, 9, 10, 10, 15

▪ 𝑸𝟏 = 5; 𝑸𝟐 = 8; 𝑸𝟑 = 9.5
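
The quartile rule used in the two examples above can be written as a short function. This is a sketch of that specific rule (count n/4 positions from the left for Q1 and from the right for Q3); the function names are illustrative. Statistical libraries often use other interpolation conventions, so their quartiles may differ slightly from these values.

```python
import math
from statistics import median

def quartile_from_end(ordered):
    """Apply the n/4 rule counting from the start of the given sequence."""
    n = len(ordered)
    if n % 4 == 0:                       # n/4 is an integer: average two values
        k = n // 4
        return (ordered[k - 1] + ordered[k]) / 2
    k = math.ceil(n / 4)                 # otherwise: next higher position
    return ordered[k - 1]

def quartiles(values):
    ordered = sorted(values)
    q1 = quartile_from_end(ordered)          # counted from the left
    q2 = median(ordered)                     # second quartile is the median
    q3 = quartile_from_end(ordered[::-1])    # same rule counted from the right
    return q1, q2, q3

scores_11 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]
scores_12 = scores_11 + [8]

result_11 = quartiles(scores_11)
result_12 = quartiles(scores_12)

print(result_11)  # (4, 8, 10)
print(result_12)  # (5.0, 8.0, 9.5)
```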