
Statistics Reviewer

CHAPTER 1

Data – Are facts and figures collected, analyzed and summarized for presentation and interpretation.

Data Set – All the data collected in a particular study

Elements – Are the entities on which data are collected

Variable – Is a characteristic of interest for the elements.

Observation – Set of measurements obtained for a particular element (the number of observations is
always the same as the number of elements)

Scale of Measurement – Determines the amount of information contained in the data and indicates the most
appropriate data summarization and statistical analyses

- Nominal
o Consists of labels or names used to identify an attribute.
- Ordinal
o The order or rank of the data is meaningful
- Interval
o Values are expressed in terms of a fixed unit of measure
o Has no true zero
- Ratio
o The ratio of two values is meaningful
o A true zero exists, which indicates that nothing exists for the variable at the zero point

Categorical and Quantitative Data

- Categorical
o Uses a nominal or ordinal scale of measurement
o Data that can be grouped into specific categories
- Quantitative
o Uses an interval or ratio scale of measurement
o Data referred to as how much or how many
o Can be discrete (measures how many) or continuous (measures how much)

Cross-Sectional and Time Series Data

- Cross-Sectional
o Are data collected at the same or approximately the same point in time
- Time series
o Are data collected over several time periods

Data Sources - Obtained from existing sources, by conducting an observational study or by conducting
an experiment

- Existing sources
o Data commonly available from internal company records
- Observational Study
o Simply to observe what is happening in a particular situation
o Records one or more variables of interest
- Experiment
o Study that is conducted under controlled conditions

Time and Cost issues


- Time and cost are required to obtain the data
- Existing sources are desirable when data are needed right away
Errors from Data acquisition
- Occur whenever the data value obtained is not equal to the true or actual value that would be
obtained with a correct procedure
Descriptive Statistics
- Summaries of data which may be tabular, graphical, or numerical.
Statistical Inference – Uses data from a sample to make estimates and test hypotheses about the
characteristics of a population.
- Population
o The set of all elements of interest in a particular study
- Sample
o A subset of the population
Analytics – Scientific process of transforming data into insight for making better decisions. It is used for
data-driven or fact-based decision making, which is often seen as more objective than alternative
approaches to decision making
- Descriptive Analytics
o Includes the set of analytical techniques that describe what has happened in the past.
- Predictive Analytics
o Uses models constructed from past data to predict the future or to assess the impact of
one variable on another
- Prescriptive Analytics
o Set of analytical techniques that yield a best course of action
Big Data and Data Mining
- Big Data
o Large and complex data sets
- Data Warehousing
o Refers to the process of capturing, storing, and maintaining data
- Data Mining
o Deals with methods for developing decision-making information from large databases.
o “Automated extraction of predictive information from large databases”
o Relies heavily on statistical methodology
Computer and Statistical Analysis
- Statisticians use computer software to perform statistical computations

CHAPTER 2
Summarizing Data for Categorical Variables
- Frequency Distribution
o Tabular summary of data showing the number (frequency) of observations in each of
several nonoverlapping categories or classes
- Relative Frequency (rf)
o Fraction or proportion of observations belonging to a class
o Relative Frequency Distribution – tabular summary of the rf for each class
- Percent Frequency
o Summarizes the percent of the data for each class
- Bar Chart
o Graphical display for depicting categorical data summarized in a frequency (f), relative
frequency (rf), or percent frequency (prf) distribution
- Pie Chart
o A circular chart in which each class is shown as a sector proportional to its relative frequency
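The three tabular summaries above can be sketched in a few lines of Python. This is only an illustration: the `purchases` list is hypothetical data invented for the example, and the variable names are not from the reviewer.

```python
from collections import Counter

# Hypothetical sample: soft drink chosen in 10 purchases (illustrative data only)
purchases = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
             "Pepsi", "Coke", "Sprite", "Pepsi", "Coke"]

n = len(purchases)

# Frequency distribution: number of observations in each nonoverlapping class
frequency = Counter(purchases)

# Relative frequency: rf = frequency of the class / n
relative_frequency = {c: f / n for c, f in frequency.items()}

# Percent frequency: prf = rf * 100
percent_frequency = {c: rf * 100 for c, rf in relative_frequency.items()}

for category in frequency:
    print(category, frequency[category],
          relative_frequency[category], percent_frequency[category])
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check that no observation was dropped or counted twice.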
Summarizing Data for Quantitative Variables
- Frequency distribution
o With quantitative data we must be careful with the nonoverlapping classes to be used
▪ Number of Classes – formed by specifying ranges that will be used to group the
data.
▪ Class width – approximately (Largest data value – Smallest data value) / Number of classes
▪ Class Limits
- Relative frequency and Percent Frequency Distributions
o Rf = Freq. of the class / n
o Prf = rf * 100
- Dot Plot
o A horizontal axis showing the range of the data
o Each value is represented by a dot
- Histogram
o Constructed by placing the variable of interest on the horizontal axis and the f, rf, or prf
on the vertical axis
o Provides the information about the shape, or form of a distribution

- Cumulative Distributions
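o Uses the same number of classes, class widths, and class limits as the frequency distribution, but
shows the number of data values less than or equal to the upper class limit of each class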
- Stem-and-leaf-display
o Used to show simultaneously the rank order and shape of a distribution of data
o Offers much the same information as a histogram, but is easier to construct by hand and provides
more information because it shows the actual data
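As a rough sketch of how the class width, frequency, relative frequency, percent frequency, and cumulative frequency fit together, the following builds a table for a small quantitative data set. The data values, the choice of five classes, and the starting lower limit of 10 are all assumptions made for the example; the class limits are written for integer data.

```python
import math

# Hypothetical integer data (illustrative only)
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

num_classes = 5                                        # assumed number of classes
approx_width = (max(data) - min(data)) / num_classes   # (33 - 12) / 5 = 4.2
class_width = math.ceil(approx_width)                  # rounded up to a convenient width
first_lower_limit = 10                                 # assumed convenient limit at or below the minimum

# Class limits for integer data: 10-14, 15-19, 20-24, 25-29, 30-34
classes = [(first_lower_limit + i * class_width,
            first_lower_limit + (i + 1) * class_width - 1)
           for i in range(num_classes)]

n = len(data)
cumulative = 0
table = []
for lower, upper in classes:
    f = sum(lower <= x <= upper for x in data)   # frequency of the class
    cumulative += f                              # cumulative frequency
    rf = f / n                                   # relative frequency
    table.append((lower, upper, f, rf, rf * 100, cumulative))

for row in table:
    print(row)
```

The last cumulative frequency equals n, which confirms that every data value fell into exactly one class.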

Summarizing Data for Two Variables Using Tables


- Cross Tabulation
o Is a tabular summary of data for two variables.
- Simpson’s Paradox
o Conclusions drawn from two or more separate cross tabulations can be reversed
when the data are aggregated into a single cross tabulation
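A small numeric sketch can make Simpson's paradox concrete. The counts below are hypothetical, chosen only so that the reversal appears: method A has the higher success rate within each group, yet method B has the higher rate once the two groups are aggregated.

```python
# Hypothetical counts: (successes, trials) for two methods within two groups
counts = {
    "Group 1": {"A": (8, 10), "B": (70, 100)},
    "Group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Proportion of successes."""
    return successes / trials

# Separate cross tabulations: success rate of each method within each group
for group, row in counts.items():
    print(group, {method: round(rate(*st), 3) for method, st in row.items()})

# Aggregated cross tabulation: combine the groups for each method
aggregated = {}
for method in ("A", "B"):
    successes = sum(counts[group][method][0] for group in counts)
    trials = sum(counts[group][method][1] for group in counts)
    aggregated[method] = (successes, trials)

print("Aggregated", {method: round(rate(*st), 3) for method, st in aggregated.items()})
```

The reversal happens because the two methods were applied in very different proportions across the groups, so the aggregated table hides the effect of the grouping variable.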
Summarizing Data for Two Variables Using Graphical Displays
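- Scatter Diagram / Trendline
o Scatter diagram – graphical display of the relationship between two quantitative variables
o Trendline – line that provides an approximation of the relationship between the two variables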
- Side-by-side Bar Chart
o Depicts multiple bar charts on the same display

CHAPTER 3
Measures of Location
- Mean
o Provides measure of central location
o Can be influenced by extreme values
- Weighted Mean
o Giving each observation a weight that reflects importance
- Median
o Value in the middle when arranged in ascending order (smallest to largest)
o When the number of observations is even, take the average of the two middle observations
- Geometric Mean
o Calculated by finding the nth root of the product of n values

- Mode
o Value that has the greatest frequency
o Bimodal – has 2 modes
o Multimodal – has >2 modes
- Percentiles
o Provides information about how data are spread over the interval from smallest value to
the largest value

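o If the computed location of the percentile has a decimal part, interpolate between the values at
positions k and k + 1, where k is the integer part of the location and d is the decimal part:

pth percentile = (kth value) + d[(k + 1)th value − (kth value)]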

- Quartile
o Same as percentiles, but divides the data into four parts: Q1 = 25th percentile, Q2 = 50th percentile
(also the median), Q3 = 75th percentile
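The measures of location above can be sketched with Python's standard library. All data values here are hypothetical. The `percentile` helper is not from the reviewer: it assumes the percentile location is computed as (p/100)(n + 1), which is one common textbook convention, and then applies the interpolation step described above.

```python
import math
import statistics

# Hypothetical sample (illustrative values only)
data = [3, 5, 5, 7, 10, 12]

mean = statistics.mean(data)        # sum of the values divided by n
median = statistics.median(data)    # n is even, so the two middle values are averaged
mode = statistics.mode(data)        # value with the greatest frequency

# Weighted mean: hypothetical unit costs weighted by quantity purchased
costs = [3.0, 4.0, 5.0]
weights = [1, 2, 2]
weighted_mean = sum(w * x for w, x in zip(weights, costs)) / sum(weights)

# Geometric mean: nth root of the product of n (hypothetical) growth factors
growth_factors = [1.10, 0.90, 1.20]
geometric_mean = math.prod(growth_factors) ** (1 / len(growth_factors))

def percentile(values, p):
    """pth percentile, ASSUMING location = (p / 100) * (n + 1) in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    location = p / 100 * (n + 1)
    k = int(location)          # integer part of the location (1-based position)
    d = location - k           # decimal part of the location
    if k < 1:                  # location falls before the first value
        return ordered[0]
    if k >= n:                 # location falls at or after the last value
        return ordered[-1]
    # kth value + d * ((k + 1)th value - kth value)
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])

q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))

print(mean, median, mode, weighted_mean, round(geometric_mean, 4))
print(q1, q2, q3)
```

Other software may use a different location rule and so return slightly different percentiles for small data sets; `statistics.quantiles` with its default "exclusive" method follows the same (n + 1) convention assumed here.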
Measures of Variability
- Range
o Simplest measure of variability
o Seldom used as the only measure because it is based on only two values and is highly influenced by
extreme values
- IQR Interquartile Range
o Is the difference between the third and first quartiles: IQR = Q3 − Q1
- Variance
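o Measure of variability that utilizes all the data
o Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
o Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
o The sum of all deviations about the mean is zero: $\sum (x_i - \bar{x}) = 0$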
- Standard Deviation
o Square root of the variance
o $s = \sqrt{s^2}$
- Coefficient of Variation
o Relative measure of variability: expresses the standard deviation relative to the mean
o $\left( \frac{\text{Standard deviation}}{\text{Mean}} \right) \times 100$
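The measures of variability above can be worked through on a small sample. This is a sketch on hypothetical data; the quartiles come from `statistics.quantiles`, whose default "exclusive" method is an assumption about which percentile convention is wanted.

```python
import math
import statistics

# Hypothetical sample (illustrative values only)
data = [3, 5, 5, 7, 10, 12]
n = len(data)
mean = sum(data) / n

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Interquartile range: Q3 - Q1 (quartiles from the default "exclusive" method)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Deviations about the mean always sum to zero
deviations = [x - mean for x in data]
sum_of_squares = sum(d ** 2 for d in deviations)

sample_variance = sum_of_squares / (n - 1)    # divide by n - 1 for a sample
population_variance = sum_of_squares / n      # divide by N if the data were the whole population
sample_sd = math.sqrt(sample_variance)        # standard deviation = square root of the variance

# Coefficient of variation: standard deviation as a percentage of the mean
coefficient_of_variation = sample_sd / mean * 100

print(data_range, iqr, sample_variance, round(sample_sd, 4),
      round(coefficient_of_variation, 2))
```

The hand computation agrees with `statistics.variance` and `statistics.pvariance`, which use the same n − 1 and N divisors respectively.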

Measures of Distribution Shape, Relative Location, and Detecting Outliers


- Distribution Shape
o Skewness
▪ For a symmetric distribution, the mean and the median are equal.
▪ For a positively skewed distribution, the mean will usually be greater than the median
▪ For a negatively skewed distribution, the mean will usually be less than the median
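- z-Scores
o Often called the standardized value: $z_i = \frac{x_i - \bar{x}}{s}$
o Indicates the number of standard deviations $x_i$ is from the mean; a z-score of zero indicates that
the value of the observation is equal to the mean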
- Chebyshev’s Theorem
o Makes statements about the proportion of data values that must be within a specified
number of standard deviations (sd) of the mean: at least $1 - 1/z^2$ of the data values, for any z
greater than 1
o At least .75, or 75%, of the data values must be within z = 2 standard deviations of the mean.
o At least .89, or 89%, of the data values must be within z = 3 standard deviations of the mean.
o At least .94, or 94%, of the data values must be within z = 4 standard deviations of the mean.

- Empirical Rule
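o For data with a bell-shaped distribution, can be used to determine the percentage of data values
that are within a specified number of sd of the mean
o Approximately 68% of the data values are within 1 sd of the mean, approximately 95% are within 2 sd,
and almost all are within 3 sd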

- Detecting Outliers
o Sometimes a data set will have one or more observations with unusually large or
unusually small values. These extreme values are called outliers.
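The following sketch ties z-scores, Chebyshev's theorem, and outlier detection together on one hypothetical data set. Flagging observations with a z-score below −3 or above +3 is a common rule of thumb and is an assumption here, since the notes above do not state a cutoff.

```python
import statistics

# Hypothetical sample with one unusually large value (illustrative only)
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 60]

mean = statistics.mean(data)
s = statistics.stdev(data)          # sample standard deviation

# z-score: number of standard deviations each observation lies from the mean
z_scores = [(x - mean) / s for x in data]

def chebyshev_bound(z):
    """Minimum proportion of data within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

def proportion_within(z):
    """Observed proportion of this data set within z standard deviations of the mean."""
    return sum(abs(score) <= z for score in z_scores) / len(z_scores)

for z in (2, 3, 4):
    print(z, round(chebyshev_bound(z), 4), round(proportion_within(z), 4))

# Assumed rule of thumb: treat |z| > 3 as a possible outlier
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]
print(outliers)
```

Chebyshev's bound holds for any distribution shape, so the observed proportions can never fall below it. Note that in very small samples no observation can reach a z-score of 3, so the z-score rule only becomes useful with more data.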
Five-Number Summaries and Boxplots
- 5 Numbers to summarize data
o Smallest Value
o Q1
o Q2
o Q3
o Largest Value
- Boxplot
o Display of data based on five number summary
o Lower Limit = Q1 – 1.5IQR
o Upper Limit = Q3 +1.5IQR
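A minimal sketch of the five-number summary and the boxplot limits, using hypothetical data. The quartiles are taken from `statistics.quantiles` with its default "exclusive" method, which is an assumption about the percentile convention; values outside the limits are the ones a boxplot would mark as outliers.

```python
import statistics

# Hypothetical sample (illustrative values only)
data = [3, 5, 5, 7, 10, 12, 40]

# Quartiles: Q1 = 25th, Q2 = 50th (median), Q3 = 75th percentile
q1, q2, q3 = statistics.quantiles(data, n=4)

# Five-number summary: smallest value, Q1, Q2, Q3, largest value
five_number_summary = (min(data), q1, q2, q3, max(data))

# Boxplot limits based on the interquartile range
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations outside the limits are flagged as outliers
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(five_number_summary)
print(lower_limit, upper_limit, outliers)
```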

Measures of Association between two variables
- Covariance

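o Sample covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
o On a calculator, the covariance can be obtained from the correlation coefficient: $s_{xy} = r_{xy} s_x s_y$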
- Interpreting covariance

o A positive value for $s_{xy}$ indicates a positive linear association between x and y;
that is, as the value of x increases, the value of y increases.
o If the value of $s_{xy}$ is negative, the points with the greatest influence on $s_{xy}$ are in
quadrants II and IV of the scatter diagram (divided at $\bar{x}$ and $\bar{y}$). Hence, a negative
value for $s_{xy}$ indicates a negative linear association between x and y; that is, as the value of x
increases, the value of y decreases.
o If the points are evenly distributed across all four quadrants, the value of $s_{xy}$ will be close
to zero, indicating no linear association between x and y.
- Correlation Coefficient

o Sample correlation coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
- Interpretation of the Correlation Coefficient
o If all the points in a data set fall on a positively sloped straight line, the value of the
sample correlation coefficient is +1; that is, a sample correlation coefficient of +1
corresponds to a perfect positive linear relationship between x and y. Moreover, if the
points in the data set fall on a straight line having negative slope, the value of the
sample correlation coefficient is −1; that is, a sample correlation coefficient of −1
corresponds to a perfect negative linear relationship between x and y.
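The covariance and correlation coefficient can be computed by hand as follows. This is a sketch on hypothetical data; the function names are illustrative, and the formulas are the standard sample versions with n − 1 in the denominator.

```python
import math

def sample_covariance(x, y):
    """Sample covariance: sum of (xi - x_bar)(yi - y_bar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_sd(values):
    """Sample standard deviation (the covariance of a variable with itself is its variance)."""
    return math.sqrt(sample_covariance(values, values))

def correlation(x, y):
    """Sample correlation coefficient: covariance divided by the product of the sds."""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))

# Hypothetical paired observations (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r_xy = correlation(x, y)
print(s_xy, round(r_xy, 4))

# Points on a positively sloped line give +1; on a negatively sloped line, -1
r_positive_line = correlation(x, [2 * v + 1 for v in x])
r_negative_line = correlation(x, [10 - 3 * v for v in x])
print(round(r_positive_line, 4), round(r_negative_line, 4))
```

Unlike the covariance, the correlation coefficient does not depend on the units of x and y, which is why it is easier to interpret as a measure of the strength of the linear relationship.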
