
Parameter: a number summarizing some feature of the population (usually an average or a proportion).
Statistic: the corresponding number summarizing the feature of interest for the sample.
Data: pieces of information about individuals, organized by variables.
Individuals: the people or objects in the dataset.
Variables: characteristics that vary from individual to individual (usually in the columns).
Distribution: what values the variable takes and how often it takes them.
EDA with one variable: CATEGORICAL (ex. body image)
  Graphical display: pie chart, bar graph, or pictogram (can be misleading)
  Numerical summaries: category counts and percentages

EDA with one variable: QUANTITATIVE (ex. actress age and Oscar)
  Graphical display: histogram (also stem plot, dot plot), box plot
  Numerical summaries: standard deviation, IQR

Measures of center:
Median: not sensitive to outliers
  o Isn't as mathematically useful: it doesn't use all the values, only their relative order
  o Used if the distribution is highly skewed
Skewed right: mean > median
Skewed left: mean < median
Mean: sensitive to outliers
  o Used with a symmetric or mildly skewed distribution
  o Used when the distribution has only a few distinct values
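The difference in outlier sensitivity can be checked on a small example. The sketch below uses a made-up, right-skewed sample (the `incomes` values are hypothetical, not from these notes) and Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are small, one is very large.
incomes = [30, 32, 35, 38, 40, 42, 45, 250]

full_mean = mean(incomes)      # pulled upward by the outlier 250
full_median = median(incomes)  # depends only on the middle of the ordered data

# Drop the outlier and recompute both measures of center.
trimmed = incomes[:-1]
trimmed_mean = mean(trimmed)
trimmed_median = median(trimmed)

print("with outlier:   ", full_mean, full_median)
print("without outlier:", trimmed_mean, trimmed_median)
```

With the outlier included, the mean sits well above the median (skewed right: mean > median). Removing the outlier moves the mean a lot and the median only slightly.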

Spread about the mean
Average deviation from the mean (ADM): typical distance between a data point and the mean
  o Ex. Standard deviation (s): measure of spread about the mean (roughly the average distance of the data from the mean)
  o Larger standard deviation = larger spread in the data; s = 0 means no spread (all values are the same)
Empirical Rule: for roughly bell-shaped data, about 95% of values fall within 2 standard deviations of the mean, so the range of typical values is mean ± 2s

Choosing center and spread:
  Skewed data or data with outliers: Center: Median; Spread: IQR
  Roughly symmetrical data with no strong outliers: Center: Mean; Spread: Standard Deviation

Small variability = more consistent (ex. a small IQR = Q3 - Q1); the center of the distribution is more meaningful when there is little variability
Large variability = less consistent

Explanatory variable (x): claims to explain, predict, or affect the response
Response variable (y): outcome of the study
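As a rough illustration of these measures of spread, the sketch below computes the sample standard deviation, the IQR, and the share of values within 2 standard deviations of the mean. The `data` values are hypothetical, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the quartile rule used in a given course or software package.

```python
from statistics import mean, stdev, quantiles

# Hypothetical, roughly symmetric sample.
data = [2, 4, 4, 5, 6, 7, 7, 8, 9, 10]

m = mean(data)
s = stdev(data)  # sample standard deviation (divides by n - 1)

# Quartiles; quantiles() sorts the data itself and returns [Q1, Q2, Q3] for n=4.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Range of typical values: within 2 standard deviations of the mean.
low, high = m - 2 * s, m + 2 * s
share_within_2s = sum(low <= x <= high for x in data) / len(data)

# s = 0 exactly when every value is the same.
constant_spread = stdev([5, 5, 5, 5])

print(f"mean={m:.2f}, s={s:.2f}, IQR={iqr:.2f}")
print(f"share within mean ± 2s: {share_within_2s:.0%}")
print(f"spread of constant data: {constant_spread}")
```

On a sample this small the share within 2 standard deviations will not match 95% exactly; the rule describes large, roughly bell-shaped data sets.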
Type: C→Q
  Display: side-by-side box plots
  Numerical measure: descriptive statistics of the response for each level of the explanatory variable
Type: C→C (ex. gender and body image)
  Display: bar graph
  Numerical measure: two-way contingency table with conditional percentages of the response for each level of the explanatory variable SEPARATELY (column percentages if the explanatory variable defines the columns, and vice versa)
Type: Q→Q (ex. age and memory)
  Display: scatter plot
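For the C→C case, the conditional percentages can be sketched as follows. The counts are invented for illustration (they are not the gender and body image data the notes refer to); the point is that each percentage is computed within one level of the explanatory variable, so every group sums to 100% on its own.

```python
from collections import Counter

# Hypothetical (explanatory, response) pairs: (gender, body image).
observations = (
    [("female", "about right")] * 6
    + [("female", "overweight")] * 3
    + [("female", "underweight")] * 1
    + [("male", "about right")] * 10
    + [("male", "overweight")] * 4
    + [("male", "underweight")] * 6
)

# Cell counts of the two-way table, and totals per explanatory level.
cell_counts = Counter(observations)
group_totals = Counter(group for group, _ in observations)

# Conditional percentage of each response WITHIN each explanatory level.
conditional_pct = {
    (group, response): 100 * count / group_totals[group]
    for (group, response), count in cell_counts.items()
}

for (group, response), pct in sorted(conditional_pct.items()):
    print(f"{group:>6} | {response:<11} | {pct:5.1f}%")
```

Comparing raw counts would be misleading here because the two groups have different sizes; the conditional percentages put them on the same scale.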

Q→Q numerical measure: correlation coefficient r, a numerical measure of the direction and strength of the linear relationship between two quantitative variables
Value of r: measures the degree of linearity
  o Close to 0: plot is less linear (knowing one variable does not help in predicting the other)
  o Close to 1 or -1: plot is more linear (knowing one variable helps a lot in predicting the other)
  o Does not tell whether the data are linear! (The relationship could be curved)
  o r is not resistant to outliers!
Slope (b): the average change in the response variable (y) for every one-unit increase in the explanatory variable (x)
  o Ex. b = -1.73: for every one-unit increase in x there is, on average, a 1.73-unit decrease in y
Extrapolation: using the regression line to make a prediction for an x value that falls beyond the range of the data
Regression line: can be influenced by outliers
Standard Error of the Regression (S): measure of the size of the typical prediction error of the regression line, i.e. the average (typical) distance of the points about the line; outliers inflate it
R^2: measure of how good a predictor the explanatory variable is
  o Represents the proportion of variation in the response variable that can be explained by the linear relationship with the explanatory variable
  o Ex. How much of the variability in memory score can be explained by the negative relationship with age? Answer: R^2 = 61.2%
  o The larger the R^2, the stronger the predictor
Choosing a predictor: choose the variable with the larger R^2 or the smaller S

Sampling: the sample should be representative of the population (no sampling bias)
Simple Random Sampling (SRS): (1) equal likelihood of selection, (2) independence (of prior selections); avoid undercoverage (a sample that is not representative of the population)
Stratified Sampling: divide the population into separate groups called strata, then take a random sample from each stratum
Cluster Sampling: divide the population into a large number of clusters (ex. city blocks), then select a simple random sample of clusters (adv: no list of subjects needed, inexpensive)
Multistage Sampling: ex. cluster, then stratified

BIASED SAMPLING
Convenience sampling (ex. at a mall; some groups are overrepresented and others underrepresented)
Voluntary response sampling
Response bias (ex. the subject gives an incorrect response because the question is confusing)
  o Sensitive questions: use Randomized Response (toss a fair coin in private; heads, tell the truth; tails, say yes)
  o Wording of questions and order of questions
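The regression quantities described earlier in these notes (r, slope, R^2, and S) can be sketched with the usual least-squares formulas. The helper name `fit_line` and the age and memory values are hypothetical, and S is computed with the common n - 2 denominator, which is an assumption about the convention the notes intend.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares line y = a + b*x with r, R^2 and S (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx                 # b: average change in y per unit of x
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)         # direction and strength of linearity

    # Sum of squared residuals about the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))           # typical distance of points from the line

    return {"slope": slope, "intercept": intercept,
            "r": r, "r_squared": r ** 2, "s": s}

# Hypothetical data: memory score tends to fall as age rises.
age = [20, 30, 40, 50, 60, 70]
memory = [19, 18, 15, 14, 11, 10]

fit = fit_line(age, memory)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

A negative slope and an r close to -1 indicate a strong negative linear relationship. The fitted line should only be used for ages inside the observed range, since predicting beyond it is extrapolation.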
Controlled experiment: some treatment is deliberately imposed
  o Make sure treatments are randomly assigned, so that lurking variables are ruled out and causal conclusions are possible
Observational study: just observe; cannot influence the response
  o Can never conclude causation!

Random Selection v. Random Allocation
R.S.: to get a sample that is free of sampling bias, so that results can be generalized to the whole population
R.A.: to rule out the effect of any lurking variable, so that we can draw causal conclusions (ex. blind and double blind)
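The two kinds of randomness can be sketched with Python's `random` module. The population list, the sample size, and the group sizes are hypothetical; the fixed seed is there only to make the sketch reproducible.

```python
import random

rng = random.Random(42)  # fixed seed, used only for reproducibility

# Hypothetical sampling frame: a list of every individual in the population.
population = [f"person_{i}" for i in range(1000)]

# Random selection: a simple random sample drawn without replacement,
# so every individual has the same chance of being chosen.
sample = rng.sample(population, k=20)

# Random allocation: shuffle the sampled individuals, then split them
# into a treatment group and a control group.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:10], shuffled[10:]

print("treatment:", treatment)
print("control:  ", control)
```

Selection decides who is in the study (and so how far the results generalize), while allocation decides who receives the treatment (and so whether causal conclusions are possible).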
