1. The document discusses different types of variables in statistics including count, ordinal, nominal, and continuous variables. It also discusses different graphs that can be used to visualize data like scatterplots, histograms, and box plots.
2. Formulas are provided for measures of central tendency, such as the mean, and measures of variability, such as the variance. The normal distribution and its use in calculating probabilities are also covered.
3. The document explains how to calculate confidence intervals which provide a range of values that is likely to contain the population parameter being estimated.
BASIC STATS

Types of variables:
Count: number of bedrooms/children, manufacture year
Ordinal: categories, income brackets
Nominal: gender, yes/no, manufacture model
Continuous: distance, time, age, best cruising speed

Graphs
Scatterplots: show whether there is a linear relationship (positive: increasing left to right, cov > 0); do not measure strength
Histogram: modality, skewness (positive: long tail to the right), modal class, symmetry
Box Plot: skewness (short/long whisker; short bottom whisker = positive skew), symmetry
Empirical CDFs

Measures of centre (continuous/interval data)
Arithmetic mean (average): if observations are labelled $X_1, X_2, \ldots, X_n$, then the sample average is called $\bar{X}$, calculated as
$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Location of percentiles
$L_P = (n+1)\frac{P}{100}$ th observation
n = number of data points; $L_P$ is the location of the Pth percentile.

Mean vs Median
If symmetric, mean ≈ median
If positive skew, mean > median
If negative skew, mean < median

Variance
Measures spread. Larger SD: ↑ risk, ↑ rate of return.
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right]$
E.g. $s^2 = \frac{1}{4}\left[(25 + 49 + 1 + 4 + 16) - 5(3.8)^2\right] = \frac{1}{4}\left[95 - 5(14.44)\right] = \frac{1}{4}(22.8) = 5.7$

Standard deviation
$s = \sqrt{s^2}$

Coefficient of variation
Measures spread relative to the mean: $cv = \frac{s}{\bar{X}}$

Covariance
Measures the strength (and direction) of the linear relationship between 2 variables.
$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{n-1}$
If cov > 0, then as X increases, Y increases (positive slope). If cov < 0, the opposite. If cov = 0, not linearly related.

Coefficient of correlation
$r = \frac{\mathrm{cov}(X, Y)}{s_X s_Y}$, with $-1 \le r \le 1$
If r = -1, perfect negative linear relationship
If r = +1, perfect positive linear relationship
If r = 0, no LINEAR relationship
From Minitab: |r| = √R-sq

Worked table (X mean = 3.5, Y mean = 4):

x_i    y_i    x_i - x̄    (x_i - x̄)^2    y_i - ȳ    (y_i - ȳ)^2    (x_i - x̄)(y_i - ȳ)
1      7      -2.5       6.25           3          9              -7.5
2      5      -1.5       2.25           1          1              -1.5
3      5      -0.5       0.25           1          1              -0.5
4      4      0.5        0.25           0          0              0
5      2      1.5        2.25           -2         4              -3
6      1      2.5        6.25           -3         9              -7.5
21     24     0          17.5           0          24             -20      (totals)

Minitab covariance output:
        x           y
x       3.50000
y      -4.00000     4.80000
The off-diagonal entry, -4, is the covariance of x and y (-20/5).

2. PROBABILITY

Outcomes must be mutually exclusive (no two outcomes can both occur on any one trial) and collectively exhaustive (each trial must result in one of the outcomes in the sample space).
$P(A) = P(A \cap B) + P(A \cap B^c)$
$P(A \text{ or } B) = P(A) + P(B) - P(A \cap B)$
$P(A \mid B) = P(A \cap B) / P(B)$ [Conditional probability]
$P(A) = 1 - P(\bar{A})$
If A and B are independent, then $P(A \cap B) = P(A) \times P(B)$, ∴ $P(A \mid B) = P(A)$
$P(A \text{ and } B) = P(A \mid B) \times P(B) = P(B \mid A) \times P(A)$ [Multiplication rule]

Expected Value / Mean
$\mu = E(X) = \sum_{\text{all } x_i} x_i \, p(x_i)$
E.g. $E(X) = 0 \cdot \frac{1}{8} + 1 \cdot \frac{3}{8} + 2 \cdot \frac{3}{8} + 3 \cdot \frac{1}{8} = \frac{12}{8} = 1.5$

Variance
$\sigma^2 = V(X) = E(X^2) - \mu^2 = \sum_{\text{all } x_i} x_i^2 \, p(x_i) - \mu^2$
E.g. $V(X) = 0^2 \cdot \frac{1}{8} + 1^2 \cdot \frac{3}{8} + 2^2 \cdot \frac{3}{8} + 3^2 \cdot \frac{1}{8} - 1.5^2 = 0.75$
Std Dev(X) = $\sqrt{0.75}$ = 0.866 (to 3dp)

Expected Values/Mean and Variance rules
E(c) = c                                V(c) = 0
E(cX) = cE(X)                           V(cX) = c²V(X)
E(X+Y) = E(X) + E(Y)                    V(X+c) = V(X)
E(X-Y) = E(X) - E(Y)                    V(X+Y) = V(X) + V(Y) [if X and Y are independent]
E(XY) = E(X) x E(Y) [if X and Y are independent]    V(X-Y) = V(X) + V(Y) [if X and Y are independent]

Marginal Distribution of X
$p_X(x) = P(X = x) = \sum_{\text{all } y_i} p(x, y_i)$

Covariance
$\sigma_{xy} = \mathrm{cov}(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = \sum_{i=1}^{m}\sum_{j=1}^{n} x_i y_j \, p(x_i, y_j) - \mu_x \mu_y$

Coefficient of Correlation
Measures the strength of the linear relationship between variables X and Y (-1 to +1).
$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y}$

Linear Combinations - Portfolio Variance
$E(aX + bY) = aE(X) + bE(Y)$
$V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab \, \mathrm{cov}(X, Y) = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\rho\,\sigma_x\sigma_y$
Expected Return
E(RP) = E(RA) x investment in A + E(RB) x investment in B
Variance of Return
V(RP) = (investment in A)² x V(RA) + (investment in B)² x V(RB) + 2 x (investment in A) x (investment in B) x cov(RA, RB)

3. CONTINUOUS PROBABILITY DISTRIBUTIONS

Types of random variables:
Uniform
- Parameters: end points
Normal: requires probabilities P(X<a) or P(a<X<b)
- Parameters: mean, variance
Binomial: the number of "successes" in the n trials
1. A fixed number of trials, n.
2. 2 possible outcomes for each trial: success and failure.
3. Probability of success is p, probability of failure is (1-p).
4. Trials are independent.

Binomial
Notation: if X is a binomial random variable with n trials and p is the probability of success in each trial, then we write X~Bin(n,p).
Mean and variance: if X~Bin(n,p), then
$\mu_X = E(X) = np$
$\sigma_X^2 = np(1-p)$
Binomial formula:
$P(X = k) = {}^{n}C_{k} \, p^{k} (1-p)^{n-k}$

Uniform Distribution
$f(x) = \frac{1}{b-a}, \quad a \le x \le b$
$E(X) = \frac{a+b}{2}$ and $V(X) = \frac{(b-a)^2}{12}$
Probability of an interval = interval width x f(x)

Normal Distribution
Different means: shift the curve up and down the x-axis. Different variances: the curve becomes more peaked or more squashed.
We require probabilities P(X<a) or P(a<X<b).
Standardising produces a Z-score. If X~N(μ,σ²), then
$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$
Suppose we require P(X<a). We know that $Z = \frac{X - \mu}{\sigma} \sim N(0,1)$, so
$P(X < a) = P\left(\frac{X - \mu}{\sigma} < \frac{a - \mu}{\sigma}\right) = P\left(Z < \frac{a - \mu}{\sigma}\right)$, where Z ~ N(0,1).
Or suppose we require P(a<X<b): standardise both end points in the same way.

When area is > something
E.g. P(Z>1.08): use 1 - P(Z<1.08)
When area is in the middle of the graph
E.g. P(1.08<Z<1.51) = P(Z<1.51) - P(Z<1.08)
Symmetry → P(Z<-a) = P(Z>a)
When asking for a max or min amount
E.g. rackets last an average of 16 months, with an SD of 3 months. The maker only wants to replace a max of 4% under warranty. How long should the warranty period be?

4. SAMPLING DISTRIBUTIONS

As sample size ↑, the distribution of the sample mean becomes closer to the normal distribution.
The SD of the sample mean is the model standard deviation σ (the theoretical SD) divided by √n, that is, σ/√n.
If $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, so $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$.

Central Limit Theorem
If X is a random variable with a mean µ and variance σ², then
$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$ and $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \to Z \sim N(0,1)$ as $n \to \infty$.

SE Mean → We usually don't know the theoretical value σ, therefore we estimate it using the sample SD, s; s/√n is called the standard error of the mean.

Question mentions 'sample'
E.g. the probability that a sample of 10 students will have an average mark over 78 if the mean is 72 and the SD is 9.
$\bar{X} \sim N\left(72, \frac{9^2}{10}\right)$
$P(\bar{X} > 78) = P\left(\frac{\bar{X} - 72}{9/\sqrt{10}} > \frac{78 - 72}{9/\sqrt{10}}\right) = P\left(Z > \frac{6\sqrt{10}}{9}\right) = P(Z > 2.11) = 1 - P(Z < 2.11) = 1 - 0.9826 = 0.0174$
E.g. 2: a question that says "will differ from the population mean by less than __ units".

5. ESTIMATION

Point estimate: a single value or point, i.e. the sample mean = 4 is a point estimate of the population mean, µ.
Interval estimate: draws inferences about a population by estimating a parameter using an interval (range).
E.g. we are 95% confident that the unknown mean score lies between 56 and 78.

Unbiased and consistent
$E(\bar{X}) = \mu$, so $\bar{X}$ is an unbiased estimator of µ.
An unbiased estimator is consistent if the difference between the estimator and the parameter gets smaller as the sample gets larger.

Confidence Interval
$P\left(\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 100(1-\alpha)\%$
100(1-α)% = what figure? Find the LCL and UCL.
100(1-α)% CI: $\bar{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
E.g. the average height of a sample of 25 men is found to be 178cm. The SD of male heights is 10cm, and heights follow a normal distribution. Find a 95% confidence interval.
$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
$P\left(178 - 1.96\frac{10}{\sqrt{25}} < \mu < 178 + 1.96\frac{10}{\sqrt{25}}\right) = 0.95$
$P(174.08 < \mu < 181.92) = 0.95$

Minimum sample size
Use the CLT to find the min sample size if the SD of the population is known.
E.g. estimating the length of a bus trip in the morning. The st. dev. of the trip length is 5 mins. Estimate the true population mean length to within 3 minutes, with 99% certainty.
Step 1: set up the equation needed.
$P(|\bar{X} - \mu| < 3) = 0.99$
Step 2: standardise.
$P\left(\left|\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right| < \frac{3}{\sigma/\sqrt{n}}\right) = 0.99$, that is, $P\left(|Z| < \frac{3\sqrt{n}}{5}\right) = 0.99$
Step 3: solve for n.
$P(|Z| < 2.575) = 0.99$, so $\frac{3\sqrt{n}}{5} = 2.575$ and $\sqrt{n} = \frac{2.575 \times 5}{3} \approx 4.29$, giving n ≈ 18.4. Round up to n = 19.

Example: Proportions
A random sample is taken over a week. 625 of the 945 users are found to be cyclists. Find a 90% confidence interval for the true proportion of users who are cyclists.
$P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) = 1 - \alpha$
$P(Z > 1.645) = 0.05$
$P\left(\hat{p} - 1.645\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + 1.645\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) = 0.90$
$\hat{p} = \frac{625}{945} = 0.6614$ (to 4dp)

E.g. with CI when σ is unknown: find a 99% confidence interval for the population mean noise level given the residents' data (that is, from a sample of size 18, an average of 72dB and a standard deviation of 10dB).
$\bar{X} = 72, \; s = 10, \; n = 18$
$t_{\alpha/2,\,n-1} = t_{0.005,\,17} = 2.8982$
CI: $\bar{X} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} = 72 \pm 2.8982 \times \frac{10}{\sqrt{18}} = 72 \pm 6.8311 = (65.1689, 78.8311)$

6. HYPOTHESIS TESTING

Null hypothesis: something WILL happen
Alternative hypothesis: something WILL NOT happen
Errors
Type 1 error: reject H0 when it is true
Type 2 error: accept (fail to reject) a false H0

Example: σ is unknown - t distribution
The government claims that the mean noise level is no more than 60dB. A test is conducted, measuring noise levels on 18 occasions, and obtains an average of 72dB with a standard deviation of 10dB.
$H_0: \mu = 60$
$H_A: \mu > 60$
Test statistic: $T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{72 - 60}{10/\sqrt{18}} = \frac{12}{2.357} = 5.091$
Compare to the t distribution with n-1 df.
For 5% error, we want the upper 5% tail, marked "one sided" (area to the left = 0.95). The rejection region is t > 1.7396.
Since 5.091 > 1.7396, reject H0.

Example: 2 proportions - z tables
E.g. a random sample of 80 voters was surveyed in one region, and 70 voters in another. 45% of the voters surveyed in the first region said they would vote for candidate A, and 40% of the voters surveyed in the second region said they would vote for A. Do these surveys indicate that the level of support for A is the same in both regions?
$H_0: p_1 = p_2$
$H_A: p_1 \ne p_2$
Test statistic:
$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{80 \times 0.45 + 70 \times 0.4}{80 + 70} = \frac{64}{150}$
$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.45 - 0.4}{\sqrt{\frac{64}{150}\left(1 - \frac{64}{150}\right)\left(\frac{1}{80} + \frac{1}{70}\right)}} = 0.6177$ to 4dp.
Decision Rule
P-value: P(|Z|>0.6177) = 0.5352 (from tables, using z = 0.62).
Rejection Region: for 5% significance and a two-sided alternative, reject the null hypothesis if the test statistic is less than -1.96 or greater than +1.96.
Conclusion
The p-value is large and the test statistic does not lie in the rejection region, so we fail to reject H0.

E.g. in the same data set as the cyclists example, 204 of the 945 people are found to be walking. It is supposed that overall, 20% of users are walkers. Test this supposition using the data.
$H_0: p = 0.2$
$H_A: p \ne 0.2$
Test statistic: $Z = \frac{\hat{p} - c}{\sqrt{c(1-c)/n}} = \frac{\frac{204}{945} - 0.2}{\sqrt{0.2(1 - 0.2)/945}} = 1.2199$ to 4 decimal places.
Decision Rule
Rejection Region: for 5% significance and a two-tailed test, reject H0 if the test statistic is less than -1.96 or greater than +1.96.
OR
P-value = P(|Z|>1.2199) = 2 x P(Z>1.22) = 2 x [0.5 - P(0<Z<1.22)] = 0.2224
Conclusion
We fail to reject the null hypothesis: the test statistic did not lie in the rejection region, and the p-value is large. That is, the data support the claim that 20% of users are walking.
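The z-based examples above (the 95% interval for men's heights, the bus-trip sample size, and the probability that 10 students average over 78) can be checked numerically. The following is a minimal sketch using only Python's standard library; the helper names `z_critical`, `z_interval`, and `min_sample_size` are illustrative and do not come from the sheet.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)


def z_critical(level):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    return Z.inv_cdf(1 - (1 - level) / 2)


def z_interval(xbar, sigma, n, level):
    """Confidence interval for the mean when the population SD sigma is known."""
    margin = z_critical(level) * sigma / sqrt(n)
    return xbar - margin, xbar + margin


def min_sample_size(sigma, margin, level):
    """Smallest n for which z_{alpha/2} * sigma / sqrt(n) <= margin."""
    return ceil((z_critical(level) * sigma / margin) ** 2)


# Heights: n = 25, sample mean 178 cm, sigma = 10 cm, 95% confidence.
lcl, ucl = z_interval(178, 10, 25, 0.95)
print(round(lcl, 2), round(ucl, 2))

# Bus trip: sigma = 5 minutes, within 3 minutes, 99% certainty.
n_bus = min_sample_size(5, 3, 0.99)
print(n_bus)

# Marks: P(mean of 10 students > 78) when mu = 72 and sigma = 9.
p_over_78 = 1 - NormalDist(72, 9 / sqrt(10)).cdf(78)
print(round(p_over_78, 4))
```

The script uses exact normal quantiles rather than table values such as 1.96 and 2.575, so results can differ from the hand calculations in the last decimal place (for instance, the marks probability is computed at z ≈ 2.108 rather than the rounded 2.11).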
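The proportion examples (the 90% interval for cyclists, the test of 20% walkers, and the two-region voter comparison) follow the same pattern, sketched below with the standard library. The function names are hypothetical, and the p-values are two-sided, matching the rejection regions used in the examples.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)


def proportion_interval(successes, n, level):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    z = Z.inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def one_proportion_test(successes, n, p0):
    """z statistic and two-sided p-value for H0: p = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return z, 2 * (1 - Z.cdf(abs(z)))


def two_proportion_test(p1, n1, p2, n2):
    """z statistic and two-sided p-value for H0: p1 = p2, using the pooled proportion."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * (1 - Z.cdf(abs(z)))


# Cyclists: 625 of 945 users, 90% confidence.
cyc_low, cyc_high = proportion_interval(625, 945, 0.90)
print(round(cyc_low, 4), round(cyc_high, 4))

# Walkers: 204 of 945 users, H0: p = 0.2.
z_walk, p_walk = one_proportion_test(204, 945, 0.2)
print(round(z_walk, 4), round(p_walk, 4))

# Voters: 45% of 80 in one region versus 40% of 70 in the other.
z_vote, p_vote = two_proportion_test(0.45, 80, 0.40, 70)
print(round(z_vote, 4), round(p_vote, 4))
```

Both test statistics fall well inside (-1.96, +1.96), which agrees with the fail-to-reject conclusions above. The computed p-values differ slightly from the table-based figures because the tables round z to two decimal places first.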
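The covariance table (x = 1 to 6 against y = 7, 5, 5, 4, 2, 1) can be reproduced directly from the sample formulas. This sketch assumes the n - 1 divisor used throughout the sheet; the helper names are illustrative.

```python
from math import sqrt

x = [1, 2, 3, 4, 5, 6]
y = [7, 5, 5, 4, 2, 1]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_cov(a, b):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    a_bar, b_bar = mean(a), mean(b)
    products = ((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return sum(products) / (len(a) - 1)


cov_xy = sample_cov(x, y)  # off-diagonal entry of the covariance output
var_x = sample_cov(x, x)   # the covariance of x with itself is its sample variance
var_y = sample_cov(y, y)
r = cov_xy / sqrt(var_x * var_y)  # coefficient of correlation

print(mean(x), mean(y))
print(var_x, cov_xy, var_y)
print(round(r, 4))
```

The three values on the second printed line correspond to the entries of the Minitab covariance output (3.5, -4, 4.8), and r close to -1 indicates a strong negative linear relationship.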