
Jul 23, 2019


Data Science

© All Rights Reserved


- Sonal Ghanshani

Statistics

Why Statistics?

▪ Data are everywhere

▪ Statistical techniques are used to make many decisions that affect our lives

▪ No matter what your career, you will make professional decisions that involve data. An

understanding of statistical methods will help you make these decisions effectively

What is Statistics?

▪ Statistics is the science of assigning a probability to an event based on experiments. It is the application of quantitative principles to the collection, analysis, and interpretation of numerical data.

▪ Statistics presents a rigorous scientific method for gaining insight into data.

▪ For example, suppose we are studying the ages of engineering graduates in a research study.

▪ With so many measurements, simply looking at the data fails to provide an informative account.

▪ However, statistics can give an instant overall picture of the data through graphical presentation or numerical summarization, irrespective of the number of data points.

▪ Besides data summarization, another important task of statistics is to make inferences and predict relationships between variables.

How Does Statistics Work?

▪ Statistics utilizes data from a population to analyse and draw conclusions. Based on these

conclusions, a decision is taken.

▪ The population may be a community, an organization, sales details, weather details, etc.

▪ Statisticians determine the quantitative model that suits a given type of problem. Then

they decide the kind of data that should be collected and examined.

Types of Statistics

Statistical Methods: Descriptive and Inferential

▪ Descriptive statistics

• Methods of organizing, summarizing, and presenting data in an informative way

▪ Inferential statistics

• The methods used to determine something about a population on the basis of a sample

o Population – The entire set of individuals or objects of interest, or the measurements obtained from all individuals or objects of interest

o Sample – A portion, or part, of the population of interest

Descriptive statistics

Collect Data

Summarize Data

Present Data


Population and Sample


▪ A population is the set of all measurements of interest to the study.

▪ A sample is a subset of measurements selected from a population to represent that population.

Population and Sample

▪ Market Share of a Product

• For example, suppose you need to estimate the market share of a specific detergent product, say, Tide

• The population here is the entire market

• The sample is a set of supermarkets/shops

• Market share is calculated on the sample, not the population

Sources of Data

▪ Primary Data

• Surveys

o Mail: Lowest rate of response, usually the lowest cost

o Web: Faster response and inexpensive

o Telephone: Fastest response

o Personal Interview: Usually focus groups. Most costly. Interviewer effects can be seen

▪ Secondary Data

• This is the data that has been compiled or published elsewhere

• Example: Census Data

• Advantages: It can be gathered quickly and inexpensively

• Disadvantages: May be outdated. May not be accurate

Errors

▪ Response Errors

▪ Subject lies

▪ Subject makes a mistake

▪ Interviewer makes a mistake

▪ Interviewer effects

▪ If the rate of response is low, the sample may not be representative

▪ We might then get a biased view of the population

Which is better?

Sample 1

▪ N = 2000

▪ Response rate = 90%

Sample 2

▪ N = 1,000,000

▪ Response rate = 20%

Which is better?

▪ A small but representative sample can be useful in making inferences

▪ A large but unrepresentative sample is biased and can be misleading. Collecting more responses of the same kind does not correct the bias

▪ Therefore, sample 1 is better than sample 2
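The comparison above can be illustrated with a small simulation sketch. Everything in it is a hypothetical assumption chosen for illustration, not a figure from the slides: a yes/no question with a true "yes" share of 50%, and a second survey in which "yes" people respond at 30% and "no" people at 10% (20% overall).

```python
import random


def survey(n_contacted, respond_prob_yes, respond_prob_no, true_share, rng):
    """Contact n people and return the share of 'yes' among those who respond."""
    yes = no = 0
    for _ in range(n_contacted):
        is_yes = rng.random() < true_share
        p_respond = respond_prob_yes if is_yes else respond_prob_no
        if rng.random() < p_respond:
            if is_yes:
                yes += 1
            else:
                no += 1
    return yes / (yes + no)


rng = random.Random(0)
TRUE_SHARE = 0.50  # hypothetical true proportion of 'yes' in the population

# Sample 1: 2,000 contacted, 90% respond regardless of their answer.
est_small = survey(2_000, 0.90, 0.90, TRUE_SHARE, rng)

# Sample 2: 1,000,000 contacted, 20% respond overall, but (by assumption)
# 'yes' people respond at 30% and 'no' people at only 10%.
est_large = survey(1_000_000, 0.30, 0.10, TRUE_SHARE, rng)

print(f"small representative sample:   {est_small:.3f}")
print(f"large unrepresentative sample: {est_large:.3f}")
```

Under these assumptions the small sample lands close to the true 50%, while the large one settles near 75% however many people are contacted.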

Example

▪ Television Ratings

• Nielsen publishes TRP (Television Rating Points) ratings for media content

• Ratings are based on a sample, not the population

• Population size is around 400 million and sample size is usually 50,000

• If a show has a 15.2 rating, it means 15.2% of the sample were watching the show

• Broadcasters, advertisers, and advertising and media agencies need to have credible information about television

viewing habits in a country with an estimated television audience of 183 million homes and growing

• Doordarshan emerged as the most watched Hindi channel in terms of time spent per viewer ahead of top General

Entertainment Channels

Selecting a Sample

▪ In real-world situations, it is often impractical to study the entire population.

▪ For example, a company that manufactures mobile phones cannot interact with every

customer who purchased the product. In such cases, the company selects a sample of the

population.

▪ In order to apply statistics and study the population, the sample must be random.

▪ A random sample is one in which each member of a population has an equal chance or probability of being selected. Random selection reduces the chance of human bias, making the sample more likely to be representative of the population.

Selecting a Sample

▪ Non-probability Samples

▪ Convenience Sample

▪ Students in a class, people in a mall, people in a neighborhood

▪ Judgement Sample

▪ Based on the researcher's judgement of what constitutes “representativeness”

▪ Example: He/she might say these 20 stores are representative of the whole chain

▪ Quota Sample

▪ Quotas based on demographics for instance

▪ 100 subject interviews – 50 male and 50 female. Of the 50, 10 nonwhite and 40 white

▪ Problem here is we don’t know how representative our sample is of the population

Selecting a Sample

▪ Probability Samples

▪ A sample collected in such a way that each point in the population has a known “chance” of getting selected

• Simple Random Sample(SRS)

o Every population element has equal chance of getting selected

• Systematic Random Sample

o Choose the first element randomly from among the first k elements, then select every kth element, where k = N/n (N is the population size and n is the sample size)

• Stratified Sample

o The population is sub-divided based on a characteristic and a SRS is conducted within each stratum

• Cluster Sample

o First take a random sample of clusters from the population of clusters. Then, SRS within each cluster. Example:

Election district, Orchard
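The four probability sampling schemes above can be sketched with Python's standard `random` module. The population of 100 numbered units, the two strata, and the clusters of 20 are hypothetical choices made only for illustration.

```python
import random

rng = random.Random(42)
population = list(range(1, 101))  # hypothetical population of N = 100 units
n = 10                            # desired sample size

# Simple random sample: every unit has the same chance of selection.
srs = rng.sample(population, n)

# Systematic sample: random start within the first k units, then every kth.
k = len(population) // n
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: split on a characteristic, then SRS within each stratum.
strata = {
    "low": [u for u in population if u <= 50],
    "high": [u for u in population if u > 50],
}
stratified = [u for units in strata.values() for u in rng.sample(units, n // 2)]

# Cluster sample: SRS of clusters first, then SRS within each chosen cluster.
clusters = [population[i:i + 20] for i in range(0, len(population), 20)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [u for c in chosen_clusters for u in rng.sample(c, n // 2)]

print(srs, systematic, stratified, cluster_sample, sep="\n")
```

Note the difference between the last two: a stratified sample draws from every stratum, whereas a cluster sample draws only from the clusters that were selected.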

Types of Data


Categorical Data

▪ This refers to data that can be classified into separate groups.

▪ It is also called qualitative data.

▪ This data represents characteristics.

▪ For example, the gender of a person can be male or female. It can also be coded numerically, such as 1 for male and 0 for female, but these codes are labels, not quantities.

▪ Categorical data can be further classified as nominal or ordinal.

Types of Data

Numerical Data

▪ Data that can be measured is called numerical data.

▪ It is also called quantitative data.

▪ Discrete Data:

▪ If the values can be clearly separated from each other, then it is discrete data.

▪ Example: Number of children

▪ Continuous data

▪ Example: height of a person

Types of Data

Numerical Data

▪ One simple way to check whether data is continuous or discrete is to ask whether we can add more decimal places to a value

▪ You might say you are 5’11’’ tall. But in actuality you may be 5’11.23432” tall

▪ If you say you have 2 children, you cannot have 2.234545 children

Types of Data

Scales of Measurement

Scales of Measurement - Nominal

▪ All we can say is that one value is different from another

▪ Gender: Male, Female, Transgender

▪ Eye color: Blue, Green, Brown, Hazel

▪ Type of house: Bungalow, Duplex, Ranch

▪ Type of pet: Dog, Cat, Rodent, Fish, Bird

Scales of Measurement - Ordinal

▪ Ordinal scale of measurement refers to ordered series of relationships or rank order.

▪ The ordinal scale contains data that can be placed in order.

▪ Ordinal scales do not represent a measurable quantity. It is difficult to measure the

interval between the values.

▪ Socioeconomic class: working, middle, upper

▪ The Likert scale: strongly agree, agree, disagree

Scales of Measurement - Interval

▪ An interval scale has measurements where the differences between values are equal and meaningful.

▪ The Fahrenheit scale: the difference between 100 degrees and 80 degrees is the same as the difference between 70 degrees and 50 degrees.

▪ Interval variables do not have a meaningful zero-point.

▪ For example, zero degrees does not mean that there is no temperature at all.

▪ If the zero becomes meaningful, then it is termed ratio scale.

Scales of Measurement - Ratio

▪ In ratio scale, zero is meaningful.

▪ In this scale, no numbers below zero exist, i.e., it has absolute zero.

▪ Arithmetic operations, including ratios, can be performed on a ratio scale. For example, 10 inches is twice as long as 5 inches.

▪ If the length of a piece of cloth is measured in inches, the measurement cannot be less than zero. A negative length is not possible.

Scales of Measurement

▪ The differences between the four scales of measurement can be easily understood from the table:

▪ It is clear from the table that ratio scale satisfies all the four properties of scales of measurements

What do we do with the data?

Descriptive statistics

Collect Data

Summarize Data

Present Data

Taxonomy of Statistics

▪ Statistical Methods

• Descriptive

o Univariate: Measure of Central Tendency, Measure of Dispersion, Measure of Shape

• Inferential

Measures of Central Tendency

▪ A measure of central tendency is a summary measure that attempts to describe a whole

set of data with a single value that represents the middle or centre of its distribution.

▪ There are three main measures of central tendency: the mean, median, and mode. Each

of these measures describes a different indication of the typical or central value in the

distribution.

Measures of Central Tendency

Mean

▪ The arithmetic mean is the most widely used average.

▪ For any set of data on the variable x, the mean is denoted by $\bar{x}$ and is obtained by dividing the sum of the observations by their number:

▪ $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$

Example

Calculate the mean for the following dataset:

1 2 2 4 5 10

Solution:

$\mu = \dfrac{1 + 2 + 2 + 4 + 5 + 10}{6} = \dfrac{24}{6} = 4$

1 2 2 4 5 70

Solution:

$\mu = \dfrac{1 + 2 + 2 + 4 + 5 + 70}{6} = \dfrac{84}{6} = 14$

Properties of Mean

▪ It is greatly affected by extreme values

▪ The sum of the deviations of the observations from the mean is zero: $\sum_{i=1}^{n} (x_i - \mu) = 0$
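Both properties of the mean can be checked with a short sketch using Python's standard `statistics` module on the two example datasets. The variable names are illustrative only.

```python
from statistics import mean

data = [1, 2, 2, 4, 5, 10]
with_outlier = [1, 2, 2, 4, 5, 70]  # same data with one extreme value

mu = mean(data)                  # 24 / 6
mu_outlier = mean(with_outlier)  # 84 / 6

# Deviations from the mean always sum to zero.
deviations = [x - mu for x in data]

print(mu, mu_outlier, sum(deviations))
```

Changing a single observation from 10 to 70 moves the mean from 4 to 14, even though the other five values are unchanged.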

Measures of Central Tendency

Median

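▪ The middlemost value of the set when the elements are arranged in either ascending or descending order is called the median of a set of observations on a variable x.

▪ When the number of observations is even, with $x_{(m)}$ and $x_{(m+1)}$ the two central values of the ordered data: $\tilde{x} = \dfrac{x_{(m)} + x_{(m+1)}}{2}$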

Measures of Central Tendency

Median

▪ It is the “middle” of the data

▪ The median is the number such that exactly half of the dataset is less than or equal to it

and exactly half is greater than or equal to it

▪ To get the median, we must first rearrange the data into an ordered array

▪ If 𝑛 is odd, the median is the middle observation of the ordered array.

▪ If 𝑛 is even, it is midway between the two central observations


Measures of Central Tendency

Median

▪ Robert hit 11 balls at Grimsby driving range. The recorded distances of his drives,

measured in yards, are given below. Find the median distance for his drives.

85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70

▪ Ordered data

50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 140

▪ Median drive = 85 yards

• In golf stroke mechanics, a drive, also known as a tee shot, is a long-distance shot played from the tee box, intended to move the ball a great

distance down the fairway towards the green.

• A tee is a stand used to support a stationary ball so that the player can strike it

Measures of Central Tendency

Median

▪ Robert hit 12 balls at Grimsby driving range. The recorded distances of his drives,

measured in yards, are given below. Find the median distance for his drives.

85, 125, 130, 65, 100, 70, 75, 50, 140, 135, 95, 70

▪ Ordered data

50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 135, 140

▪ Median drive = 90 yards


Properties of Median

▪ Median is unique for a dataset

▪ Median is not affected by extreme values

▪ Any observation selected at random is just as likely to be greater than the median as less

than the median
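The ordered-array rule for the median can be sketched as a small function and checked against `statistics.median`. The function name and the driving-range lists below are illustrative. The eleven-value list has an odd count and the twelve-value list an even count.

```python
from statistics import median


def median_by_hand(values):
    """Median via the ordered-array rule: sort, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: midway between the two central ones


eleven_drives = [85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70]
twelve_drives = eleven_drives + [135]

print(median_by_hand(eleven_drives))  # 85
print(median_by_hand(twelve_drives))  # 90.0

# Replacing the longest drive with an extreme value leaves the median unchanged.
stretched = [1400 if d == 140 else d for d in eleven_drives]
print(median_by_hand(stretched))      # 85
```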

Measures of Central Tendency

Mode

▪ The mode is the most commonly occurring value in a distribution.

▪ Example:

1 1 1 2 3 4 5

Mode = 1

5 5 5 6 8 10 10 10

Mode = 5, 10
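A dataset can have more than one mode, as the second example shows. A minimal sketch with `statistics.multimode` from the standard library (available from Python 3.8), which returns every most-frequent value:

```python
from statistics import multimode

single = [1, 1, 1, 2, 3, 4, 5]
double = [5, 5, 5, 6, 8, 10, 10, 10]

print(multimode(single))  # [1]
print(multimode(double))  # [5, 10]
```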

Comparisons of Measures of Central Tendency

Comparison

▪ The “Hotshot” Sales Executive

▪ Kurt works as a sales manager at vsellhomes.com. In the monthly sales review, Kurt reports that he will

achieve his quarterly target of $1M.

▪ Kurt claims his average deal size is $100,000 and he has 10 deals in his pipeline.

▪ At the end of the quarter, even after closing 8 deals, Kurt fails to meet his target number and falls short by more than $500,000.

Comparison

▪ The Reality of the “Hotshot” Sales Executive

▪ Deal #10 is of significantly higher value than all the other deals and impacts the average calculation

▪ Median = $55,000 is a more realistic measure

Deal #   Deal Value   Deal Status
1        70,000       Open
3        55,000       Closed
4        60,000       Closed
5        55,000       Closed
6        50,000       Closed
7        50,000       Closed
8        60,000       Closed
9        50,000       Closed
10       500,000      Open
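The gap between Kurt's claimed average and the typical deal can be reproduced with a short sketch. Deal #2 is not listed in the table, so the code assumes a value of 50,000 with status Closed. That is the value implied by the story (ten deals averaging $100,000, eight of them closed), but it is an assumption, not data from the source.

```python
from statistics import mean, median

# (deal number, value in dollars, status)
deals = [
    (1, 70_000, "Open"),
    (2, 50_000, "Closed"),  # ASSUMED: not listed in the table, inferred from the stated average
    (3, 55_000, "Closed"),
    (4, 60_000, "Closed"),
    (5, 55_000, "Closed"),
    (6, 50_000, "Closed"),
    (7, 50_000, "Closed"),
    (8, 60_000, "Closed"),
    (9, 50_000, "Closed"),
    (10, 500_000, "Open"),
]

values = [value for _, value, _ in deals]
closed_total = sum(value for _, value, status in deals if status == "Closed")
target = 1_000_000

print("mean deal size:  ", mean(values))
print("median deal size:", median(values))
print("closed revenue:  ", closed_total)
print("shortfall:       ", target - closed_total)
```

Under that assumption the mean is $100,000 but the median is $55,000, and the eight closed deals total $430,000, leaving Kurt $570,000 short of his target.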

Comparison

▪ A report says that “the median credit card debt of American households is zero.’’ We know that many

households have large amounts of credit card debt. In fact, the mean household credit card debt is close to

$8000.

Comparison

▪ The mean and the median are rigidly defined. The position of the mode is somewhat similar to that of the

median.

▪ All the measures are easy to interpret and not too difficult to compute.

▪ Only the mean directly depends on all the observations. A change in any one of the observations influences

the value of the mean. The median and mode are not so sensitive.

▪ The mean is, generally, the best measure of central tendency. In the case of extreme values, the median is a better measure of central tendency.

Scale of Measurement and Measure of Central Tendency

Scale of Measurement   Measures of Central Tendency
Nominal                Mode
Ordinal                Mode and Median
Interval               Mode, Median and Mean
Ratio                  Mode, Median and Mean

Example

▪ For studying smoking habits,

• A: Do you smoke? Yes or No

• B: How many cigarettes did you smoke in the last 3 days?

▪ B is ratio data, so we can get the mean, median and mode

▪ Always select the highest level of measurement possible

Example

Data Type - Taste test

A:

▪ Coke

▪ Pepsi

▪ Thumsup

▪ Sprite

B:

▪ Coke

▪ Pepsi

▪ Thumsup

▪ Sprite

A is ordinal data

B is interval data, which is better than ordinal

Measures Of Non-central Tendency

Quartiles

▪ Quartiles split the ordered dataset into four equal parts

▪ 𝑄1 or the First Quartile

▪ 𝑄2 or the Second Quartile

▪ 𝑄3 or the Third Quartile

▪ The quartiles, like the median, either take the value of one of the observations, or the value

halfway between two observations

Quartiles

▪ 𝑄1 - 25% of the observations are smaller than 𝑄1 and 75% of the observations are greater than 𝑄1

▪ 𝑄2 - 50% of the observations are smaller than 𝑄2 and 50% of the observations are greater than 𝑄2

▪ 𝑄3 - 75% of the observations are smaller than 𝑄3 and 25% of the observations are greater than 𝑄3

▪ If n/4 is an integer, the first quartile has the value halfway between the (n/4)th observation and the next observation

▪ If n/4 is not an integer, the first quartile has the value of the observation whose position is n/4 rounded up to the next integer

Quartiles

▪ Score of students in a test

6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10

▪ No. of observations = 11; 11/4 = 2.75 is not an integer, so we round up and use the 3rd value

▪ The 3rd value from the left and the 3rd value from the right in the ordered data will be 𝑄1 and 𝑄3

▪ Ordered Data

3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

▪ 𝑸𝟏 = 4; 𝑸𝟐 = 8; 𝑸𝟑 = 10

Quartiles

▪ Score of students in a test

6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10, 8

▪ No. of observations = 12; 12/4 = 3 is an integer, so we use the average of the 3rd and 4th values

▪ The average of the 3rd and 4th values from the left, and from the right, in the ordered data will be 𝑄1 and 𝑄3

▪ Ordered Data

3, 4, 4, 6, 8, 8, 8, 8, 9, 10, 10, 15

▪ 𝑸𝟏 = 5; 𝑸𝟐 = 8; 𝑸𝟑 = 9.5
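The quartile rule used in these two examples can be sketched as follows. The function names are illustrative, and the code implements only the convention described in these slides. Other conventions, including the interpolation methods used by many software packages, can return slightly different values.

```python
import math
from statistics import median


def first_quartile(ordered):
    """Q1 by the slide rule; `ordered` must already be sorted."""
    n = len(ordered)
    if n % 4 == 0:
        pos = n // 4  # 1-based position of the (n/4)th observation
        return (ordered[pos - 1] + ordered[pos]) / 2  # halfway to the next one
    pos = math.ceil(n / 4)  # n/4 rounded up to the next integer
    return ordered[pos - 1]


def quartiles(values):
    ordered = sorted(values)
    q1 = first_quartile(ordered)
    q3 = first_quartile(ordered[::-1])  # the same rule, counted from the right
    return q1, median(ordered), q3


scores_11 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]
scores_12 = scores_11 + [8]

print(quartiles(scores_11))  # (4, 8, 10)
print(quartiles(scores_12))  # (5.0, 8.0, 9.5)
```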
