
Data Science with Python

Module 1: Introduction to Statistics and Analytics

1
Course Plan
Module Titles
Current Focus: Module 1 – Introduction to Statistics and Analytics
Module 2 – Databases and SQL
Module 3 – Introduction to Python and Numpy
Module 4 – Pandas and Matplotlib
Module 5 – Storytelling with Data

2
Class Introductions

• Name
• Title / Function / Industry
• What do you hope to learn in this class?

3
Reading and Resources
• Recommended Reading
– From Data to Insight
• A. Scott and K. Rogers

4
Course Evaluations
• Assignment 1 – 30%
• Assignment 2 – 30%
• Assignment 3 – 30%
• Participation – 10%

5
Topics for this Module

• 1.1 Introduction to Statistics
• 1.2 Summary Statistics
• 1.3 Introduction to Regression
• 1.4 Introduction to Analytics
• 1.5 The Analytics Methodology

6
Learning Outcomes for this Module

• Calculate and interpret descriptive statistics about
a data set
• Understand and explain types of analytics
• Apply the analytics methodology to plan an
analytics project

7
Module 1 – Section 1

Introduction to Statistics

8
What is statistics?
Statistics is the process of deriving an understanding from
data.

[Figure: Data Collection → Analysis → Interpretation → Presentation]

9
Descriptive vs. Inferential Statistics
Descriptive statistics summarize information about a
dataset to provide basic insights. They include:
• Measures of Central Tendency (Mean, Median, Mode)
• Measures of Dispersion (Variance, Standard Deviation,
Range)
• Measures of Relationship (Correlation)

Inferential statistics uses data from a sample to estimate
the characteristics of a population, so that one can make a
statement about the population with a well-defined level of
confidence.

10
Populations and Samples
A sample is a group of data points drawn from, and intended to
represent, a broader population. We use samples to compute
statistics, which help us estimate the parameters of a
population. Generally, the larger the sample size, the more
closely the statistics represent the population.

Parameters:
• Population Mean
• Population Size

Statistics:
• Sample Mean
• Sample Size
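The link between sample size and accuracy can be sketched with a small simulation. This is only an illustration under stated assumptions: the population below is simulated (10,000 hypothetical values), and the helper name `sample_mean` is not from the course material.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated values (not real data).
population = [random.gauss(50, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)  # a parameter


def sample_mean(size):
    """Mean of a simple random sample of the given size (a statistic)."""
    return statistics.mean(random.sample(population, size))


# Larger samples tend to give estimates closer to the population mean.
for size in (10, 100, 1_000, len(population)):
    estimate = sample_mean(size)
    error = abs(estimate - population_mean)
    print(f"n={size:>6}: sample mean={estimate:.2f}, error={error:.2f}")
```

When the sample contains every member of the population, the statistic equals the parameter.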

11
Sampling Techniques
Simple random sampling selects a fixed number of members
of a population where each member has an equal probability
of being selected.

Stratified sampling divides a population into distinct groups
(strata), then samples randomly from each group.

Cluster sampling groups the population into clusters based on
characteristics that are easy to sample, such as location or
time, then randomly selects whole clusters.

Convenience sampling draws a sample based on those
members of the population which are easiest to access.
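The four techniques above could be sketched as follows. The population, the `region` field, and all function names are hypothetical and chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 members, each tagged with a region.
population = [{"id": i, "region": "East" if i < 60 else "West"}
              for i in range(100)]


def group_by(members, key):
    """Group members into a dict keyed by the given field."""
    groups = {}
    for member in members:
        groups.setdefault(member[key], []).append(member)
    return groups


def simple_random_sample(members, k):
    """Every member has an equal chance of being selected."""
    return random.sample(members, k)


def stratified_sample(members, key, k_per_group):
    """Draw a random sample from within every group."""
    sample = []
    for group in group_by(members, key).values():
        sample.extend(random.sample(group, k_per_group))
    return sample


def cluster_sample(members, key, n_clusters):
    """Randomly pick whole clusters and keep all of their members."""
    clusters = group_by(members, key)
    chosen = random.sample(sorted(clusters), n_clusters)
    return [m for name in chosen for m in clusters[name]]


def convenience_sample(members, k):
    """Take whichever members are easiest to reach (here, the first k)."""
    return members[:k]
```

For example, `stratified_sample(population, "region", 5)` returns five members from each region, while `cluster_sample(population, "region", 1)` returns every member of one randomly chosen region.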

12
Types of Data

Data Type: Categorical
Definition: Data where numeric values represent category membership.
Example: Type of apple: Granny Smith = 1, Red Delicious = 2, Fuji = 3

Data Type: Ordinal
Definition: Data which captures a meaningful sequence, but the difference
between values does not have a numerical interpretation.
Example: Level of difficulty: Easy = 1, Medium = 2, Hard = 3

Data Type: Interval
Definition: Data which captures a meaningful sequence, and where the
distance between values in a scale is meaningful, such that numeric
operations can be performed.
Example: Celsius scale: 15C, 20C, 25C

13
Types of Data

Data Type: Ratio
Definition: Data which has an absolute 0. Also considered continuous, it
allows for the most meaningful analysis.
Example: Weight, Currency, Length

Data Type: Cross-Sectional
Definition: Data where no meaningful time dimension is associated with the
observations (e.g. because all data is collected at the same point in
time, or because time is irrelevant).
Example: Earnings in 2018, Gas consumed in one month

Data Type: Time Series
Definition: Data which has a meaningful time pattern associated with the
observations.
Example: Closing stock prices, Earnings by year, Deaths per month
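A short sketch of the practical difference between interval and ratio data. The temperatures and weights are made-up values. On an interval scale such as Celsius, differences are meaningful but ratios are not, because 0C is not an absolute zero; on a ratio scale such as weight, both are meaningful.

```python
# Interval data: Celsius temperatures (hypothetical readings).
cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference_c = warm_c - cold_c  # 10 degrees warmer

# Ratios are not: 20C is not "twice as hot" as 10C.
naive_ratio = warm_c / cold_c  # 2.0, but this number has no physical meaning

# Converting to Kelvin (which has an absolute zero) shows the real ratio.
cold_k, warm_k = cold_c + 273.15, warm_c + 273.15
kelvin_ratio = warm_k / cold_k  # roughly 1.035

# Ratio data: weight has an absolute zero, so ratios are meaningful.
light_kg, heavy_kg = 10.0, 20.0
weight_ratio = heavy_kg / light_kg  # 20 kg really is twice 10 kg

print(difference_c, naive_ratio, round(kelvin_ratio, 3), weight_ratio)
```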

14
Group Activity
In your groups, determine the type of data for the following:
Data Type
Car makes and models
Total monthly expenses
Credit card transactions
Relationship status
Customer address
Monthly product P&L

Time: 10 minutes

15
Module 1 – Section 2

Summary Statistics

16
Measures of Central Tendency
The mean (or average) represents a central, or typical value
for the dataset. To calculate the mean, add all observations for
a single variable and divide by the number of observations.

Example: Consider the following set of observations and
calculate the mean.

19, 18, 10, 9, 7, 12, 13, 19, 20

Mean = (19+18+10+9+7+12+13+19+20) / 9
= 127 / 9
≈ 14.1
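The same calculation as a minimal Python sketch, using the observations from the example. The variable names are illustrative only; `statistics.mean` is part of the Python standard library.

```python
import statistics

observations = [19, 18, 10, 9, 7, 12, 13, 19, 20]

# By hand: add all observations and divide by how many there are.
total = sum(observations)
count = len(observations)
mean_by_hand = total / count

# The standard library gives the same result.
mean_stdlib = statistics.mean(observations)

print(total, count, round(mean_by_hand, 1))  # 127 9 14.1
```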

17
Measures of Central Tendency
The median finds the middle value of a set of observations,
when ordered from lowest to highest. It separates the higher
half of data from the lower half.

Example: Consider the following set of observations and
identify the median.

19, 18, 10, 9, 7, 12, 13, 19, 20

To find the median, re-order the data from smallest to largest
and find the middle number.
7, 9, 10, 12, 13, 18, 19, 19, 20

With 9 observations, the middle number is the 5th value, so the
median is 13.

18
Measures of Central Tendency
The mode is the value that occurs most frequently in the data
set. It is the value most likely to be drawn when sampling from
the data.

Example: Consider the following set of observations and
identify the mode.

19, 18, 10, 9, 7, 12, 13, 19, 20

To find the mode, re-order the data from smallest to largest
and find the number which occurs most frequently.
7, 9, 10, 12, 13, 18, 19, 19, 20

The value 19 occurs twice and every other value occurs once, so
the mode is 19.
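A minimal sketch of finding the median and mode of the same observations in Python. It uses only the standard library; the variable names are illustrative.

```python
import statistics
from collections import Counter

observations = [19, 18, 10, 9, 7, 12, 13, 19, 20]

# Median: sort the data and take the middle value.
ordered = sorted(observations)
median_by_hand = ordered[len(ordered) // 2]  # valid here because n is odd
median_stdlib = statistics.median(observations)

# Mode: count how often each value occurs and take the most frequent.
counts = Counter(observations)
mode_by_hand, frequency = counts.most_common(1)[0]
mode_stdlib = statistics.mode(observations)

print(median_stdlib)  # 13
print(mode_stdlib)    # 19
```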

19
Measures of Dispersion
Variance measures how far a set of numbers is spread from
its mean. It is represented by the symbol σ² for population
variance and s² for sample variance.

To calculate variance, sum the squared distance between
each point and the mean of the dataset, and divide by the
number of observations (for population variance) or by the
number of observations minus one (for sample variance).

20
Example: Variance Calculation
Consider the following set of observations in a sample and
calculate the variance.

19, 18, 10, 9, 7, 12, 13, 19, 20

To calculate the sample variance (mean ≈ 14.1, 9 observations):

$$
s^2 = \frac{(19-14.1)^2+(18-14.1)^2+(10-14.1)^2+(9-14.1)^2+(7-14.1)^2+(12-14.1)^2+(13-14.1)^2+(19-14.1)^2+(20-14.1)^2}{9-1} \approx 24.61
$$
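The calculation can be checked with a short Python sketch. `statistics.variance` divides by n - 1 (sample variance) and `statistics.pvariance` divides by n (population variance); the other names are illustrative.

```python
import statistics

observations = [19, 18, 10, 9, 7, 12, 13, 19, 20]

# By hand: sum of squared distances from the mean, divided by n - 1.
n = len(observations)
mean = sum(observations) / n
squared_distances = [(x - mean) ** 2 for x in observations]
sample_variance_by_hand = sum(squared_distances) / (n - 1)

# Standard library equivalents.
sample_variance = statistics.variance(observations)       # divides by n - 1
population_variance = statistics.pvariance(observations)  # divides by n

print(round(sample_variance, 2))      # 24.61
print(round(population_variance, 2))  # 21.88
```

The sample variance is larger than the population variance here because it divides the same sum by 8 instead of 9.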

21
Measures of Dispersion
Standard deviation is the square root of the variance. It is
often expressed as σ for population standard deviation and s for
sample standard deviation. It is similar in nature to variance,
but is expressed in the same units as the data rather than in
squared units.

From our previous example, we can calculate the standard
deviation by taking the square root of 24.61.

$$
s = \sqrt{24.61} \approx 4.96
$$
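A minimal Python check of this step, using the same observations. `statistics.stdev` returns the sample standard deviation; the variable names are illustrative.

```python
import math
import statistics

observations = [19, 18, 10, 9, 7, 12, 13, 19, 20]

# Standard deviation is the square root of the variance.
sample_variance = statistics.variance(observations)
stdev_by_hand = math.sqrt(sample_variance)

# Standard library equivalent (sample standard deviation).
sample_stdev = statistics.stdev(observations)

print(round(sample_stdev, 2))  # 4.96
```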

22
Measures of Dispersion
Range, maximum and minimum are numerical ways to
understand how the data is dispersed. To calculate the range,
subtract the minimum value of the data set from the maximum
value.

Quartiles are a more detailed way to evaluate the dispersion of
data. The three quartile values (Q1, Q2 and Q3) split the ordered
data into four groups of roughly equal size; Q2 is the median.
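A short sketch of these measures on the observations used earlier in this section. `statistics.quantiles` is in the standard library (Python 3.8+). Different tools use slightly different interpolation rules, so Q1 and Q3 may differ a little between packages.

```python
import statistics

observations = [19, 18, 10, 9, 7, 12, 13, 19, 20]

# Range: maximum minus minimum.
minimum = min(observations)
maximum = max(observations)
data_range = maximum - minimum

# Quartiles: three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(observations, n=4)

# The interquartile range covers the middle half of the data.
interquartile_range = q3 - q1

print(minimum, maximum, data_range)  # 7 20 13
print(q1, q2, q3)
```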

23
Measures of Relationship
Correlation refers to the extent to which two variables have a
linear relationship. For example, as children grow taller, they
also tend to grow heavier.

The correlation coefficient, denoted by ρ for a population (or r
for a sample), helps us determine the strength and direction of
the relationship. It always lies between -1 and 1.

-1 = perfect negative linear relationship
0 = no linear relationship
1 = perfect positive linear relationship
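A minimal sketch of the Pearson correlation coefficient, written out by hand so that each step is visible. The height and weight values are made up for illustration, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


# Hypothetical data: heights (cm) and weights (kg) of five children.
heights_cm = [100, 110, 120, 130, 140]
weights_kg = [16, 19, 23, 27, 32]

r = pearson_r(heights_cm, weights_kg)
print(round(r, 3))  # close to 1: a strong positive linear relationship
```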

24
Module 1 – Section 3

Introduction to Regression

25
What is regression modeling?
Regression modeling allows us to understand and model the
relationships between two or more variables in order to predict
potential outcomes. Typically the modeling process begins
with a theory. For example: sales will increase with a greater
advertising budget.

Sales = β0 + β1·Ad_Budget + β2·Price + Other Factors (i.e. the error term)

Upon greater understanding of the data, further factors can be considered.
For example: competitor price.

Sales = β0 + β1·Ad_Budget + β2·Price + β3·Competitor_Price + Other Factors
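To make the idea concrete, here is a minimal sketch that estimates β0 and β1 for a simplified model with a single predictor, Sales = β0 + β1·Ad_Budget, using ordinary least squares. The data values and the function name `fit_simple_ols` are hypothetical; a model with several predictors would normally be fitted with a library rather than by hand.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-movement of x and y divided by the spread of x.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising budget and resulting sales.
ad_budget = [10, 20, 30, 40, 50]
sales = [27, 44, 66, 83, 105]

beta0, beta1 = fit_simple_ols(ad_budget, sales)

# Use the fitted line to predict sales for a new budget.
predicted_sales = beta0 + beta1 * 60
print(round(beta0, 2), round(beta1, 2), round(predicted_sales, 2))
```

A positive estimate for β1 would be consistent with the theory that sales increase with a greater advertising budget.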

26
Types of Regression Models
Common regression models used by data scientists:

Type: Linear Regression
Description: Establishes a relationship between a dependent (Y) and
independent (X) variable using a best fit straight line.

Type: Logistic Regression
Description: Finds the probability of event = success or event = failure.
Used when a dependent variable is binary in nature (e.g. categorical data).

Type: Ridge Regression
Description: Used when data suffers from multicollinearity (independent
variables are highly correlated).

Type: Lasso Regression
Description: Least Absolute Shrinkage and Selection Operator is used to
reduce the variability and improve the accuracy of a linear model.
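As an illustration of how logistic regression turns a linear score into a probability, here is a minimal sketch of the logistic function. The coefficient values are made up rather than fitted to any data, and `predict_probability` is a hypothetical name.

```python
import math


def predict_probability(x, beta0, beta1):
    """P(event = success) under a logistic model with one predictor."""
    linear_score = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-linear_score))


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -4.0, 0.1

for x in (0, 20, 40, 60, 80):
    p = predict_probability(x, beta0, beta1)
    print(f"x={x:>2}: probability of success = {p:.3f}")
```

Whatever the value of x, the output stays between 0 and 1, which is why this form suits a binary dependent variable.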

27
Module 1 – Section 4

Introduction to Analytics

28
What is analytics?
Analytics is the identification and use of meaningful patterns in
data which inform decision-making.

[Figure: diagram relating Analytics to Business Decisions, People, Process, Tools and Statistics]

29
Analytics in Business
Analytics can help organizations learn more about their
business, their clients, and the environment in which they
operate.

• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What could happen?
• Prescriptive: How to make it happen?

30
Group Activity
In your groups, brainstorm 1-2 applications of analytics that
can help achieve the following organizational objectives:

• Increase revenue
• Reduce costs
• Manage risk

Discussion: What type of analytics is required for the
brainstormed applications? How could you evolve the model to
use predictive or prescriptive analytics?

Time: 15 minutes

31
What is Artificial Intelligence?
Artificial intelligence is a broad term which encompasses all
machine-based learning and insights.

Artificial Intelligence
The use of computers to mimic the cognitive function of humans.

Machine Learning
A subset of AI focused on the ability of machines to receive a set of
data and learn for themselves.

Deep Learning
A subset of Machine Learning involving multiple layers of
neural networks to achieve an output.

32
Module 1 – Section 5

Analytics Methodology

33
The Methodology
The analytics methodology is a problem-solving approach to
guide your thinking in deriving insights from data in a business
setting.

Define → Discover → Explore → Analyze → Communicate → Operationalize

Define
• Understand the business.
• Define the scope of business problem.
• Determine the data required for the problem.
• Align on the desired outcomes.
• Create project plan.

Discover
• Assess data required to solve the problem.
• Determine how the data will be sourced.
• Collect the data.
• Clean the data.

Explore
• Understand the data set with summary statistics.
• Validate that data requirements are met.
• Transform the data to suit your analysis.

Analyze
• Determine the type of analytics required.
• Identify meaningful patterns and relationships.
• Build the model.
• Evaluate the model.

Communicate
• Communicate insights to key stakeholders.
• Determine business outcome based on the data.

Operationalize
• Assess operational requirements to support model deployment.
• Embed the model in business processes.
• Maintain and support the model.
• Improve the model.

34
Define
Organizations use analytics to better understand and address
current or foreseeable business problems.

Phase: Define
• Understand the business.
• Define the scope of business problem.
• Determine the desired outcomes.
• Identify the stakeholders.
• Create the project plan.

Considerations:
o What are the business priorities (e.g. revenue generation, cost
reduction, risk management)?
o What is the business problem? Can the problem be quantified?
o What will the outcomes and corresponding benefits be from the
analytics project?
o Who are the stakeholders?
o What are the timelines?
o What is the scope of analysis?
o How will project success be measured?

35
Discover

Phase: Discover
• Assess data required to solve the problem.
• Determine how the data will be sourced.
• Collect the data.
• Clean the data.

Considerations:
o What data is required to solve this problem?
o Is the data available?
o Can it be used in accordance with legal requirements (e.g.
privacy)?
o How will the data be collected (e.g. from which systems, with
which methods)?
o How will it be cleaned?
o How will missing data be handled?
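As one illustration of cleaning data and handling missing values in this phase, the sketch below shows two common options: dropping incomplete observations, or filling gaps with the mean of the observed values. The readings are made up, and which option is appropriate depends on the project.

```python
import statistics

# Hypothetical readings; None marks a missing value.
raw_values = [12.0, None, 15.0, 11.0, None, 14.0]

# Option 1: drop the missing observations.
observed = [value for value in raw_values if value is not None]

# Option 2: fill each gap with the mean of the observed values.
fill_value = statistics.mean(observed)
imputed = [fill_value if value is None else value for value in raw_values]

print(observed)  # [12.0, 15.0, 11.0, 14.0]
print(imputed)   # [12.0, 13.0, 15.0, 11.0, 13.0, 14.0]
```

Dropping reduces the sample size, while mean imputation keeps the size but understates the spread of the data.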

36
Explore

Phase: Explore
• Understand the data set with summary statistics.
• Validate that data requirements are met.
• Transform the data to suit your analysis.

Considerations:
o What can be determined about the data set (e.g. central
tendency, dispersion, relationship)?
o Does the data require normalization? Any other
transformations?
o Are business stakeholders aligned to the approach and
available data?
o What type of analytics would help solve the business problem
(e.g. descriptive, diagnostic, predictive or prescriptive)?
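One common transformation in this phase is normalization. The sketch below shows min-max scaling, which maps values onto the range 0 to 1; it is one of several possible approaches, and the function name `min_max_scale` is illustrative.

```python
def min_max_scale(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


observations = [19, 18, 10, 9, 7, 12, 13, 19, 20]
scaled = min_max_scale(observations)

print([round(value, 2) for value in scaled])
```

Scaling keeps the order and relative spacing of the observations, so variables measured in different units can be compared on a common scale.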

37
Analyze

Phase: Analyze
• Determine the type of analytics required.
• Identify meaningful patterns and relationships.
• Build the model.
• Evaluate the model.

Considerations:
o What insights and relationships can be identified between the
variables in the data set? Are these meaningful?
o How will the insights or relationships contribute to solving the
business problem?
o Which model is most appropriate for this problem (e.g. linear
model, logistic regression, OLS)?
o What data and tools are required to build the model?
o How will the model be structured?
o What train/test ratio is appropriate?
o Against which other models should this model be evaluated?
o What reviews and approvals are required on the final model?
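The train/test ratio mentioned above can be illustrated with a minimal sketch that shuffles the data and holds back a share for evaluation. The 80/20 split, the seed, and the function name `train_test_split` are assumptions for illustration, not a recommendation from the course.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    rng = random.Random(seed)  # local generator so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    test_size = round(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical data set of 100 row identifiers.
rows = list(range(100))
train_rows, test_rows = train_test_split(rows, test_ratio=0.2)

print(len(train_rows), len(test_rows))  # 80 20
```

The model is built on the training rows and evaluated on the test rows, which it has not seen.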

38
Communicate

Phase: Communicate
• Communicate insights to key stakeholders.
• Determine business outcome based on the data.

Considerations:
o How will model insights be communicated?
o Who is the critical audience?
o How can insights be visualized?
o Which storytelling method will support the presentation best?
o What business decisions will be made as a result of these
insights?

39
Operationalize

Phase: Operationalize
• Assess operational requirements to support model deployment.
• Embed the model in business processes.
• Maintain and support the model.
• Improve the model.

Considerations:
o In which process or workflow will this model be implemented?
o Are there training requirements for the required resources?
o Will any other cycles, processes or projects be impacted?
o Who will maintain the model and how frequently?
o How will new data be integrated in the model? Will access
requirements be met?
o When and by whom will the model be reviewed and improved?

40
Group Activity
In your groups, select one (1) of the business problems
brainstormed during the first exercise. Now that you have
learned more about statistics and analytics, determine the
following:
1. What data will be required to solve your business problem?
2. What descriptive statistical analysis will you need and what
will it tell you about your data?
3. What other considerations are there for this project (e.g.
availability of data, privacy considerations, etc.)?
4. How will you measure success?

Time: 25 minutes

42
Any questions?

43
Thank You
Thank you for choosing the University of Toronto
School of Continuing Studies

44
