
Practice for Final - Stats 12 Fall 2008

Ryan Rosario - Sections 1AB


1. According to a 2007 data mining project performed by your TA, 4900 UCLA Facebook profiles and friend lists were downloaded to a hard disk using a Python crawler. The crawler starts by choosing a random UCLA Facebook profile, downloads it to disk, and then accesses the list of that user's UCLA Facebook friends. The crawler then visits the profiles of all of those friends, moves on to their friends, and so on. One aspect of the project was to determine whether third-party (non-Facebook-developed) Wall applications would bias an analysis of standard Wall posts. The number of profiles containing an active third-party Wall box is listed below.

Application Name          Profiles
SuperWall                 559
FunWall                   206
SuperWall and FunWall     84

(a) Draw an appropriate Venn diagram to represent this context. Label the diagram with probabilities. Let F represent FunWall and let S represent SuperWall.

(b) Find the probability that a randomly selected UCLA Facebook profile in this sample does not use either of these third-party Wall applications.

(c) Suppose my ultimate goal was to determine the proportion of all Facebook users that do not have either of these Wall applications active. Suppose I use a confidence interval. Provide two reasons why using a confidence interval for this analysis is not valid. You do not need to know anything about Facebook to answer this question. Hint: Reread this entire problem very carefully, and read the bolded statements.

(d) Consider the following table, which breaks down the UCLA Facebook users in the sample by gender and by whether their Wall was displayed or hidden (hidden from the crawler).

                  Male    Female    Total
Wall Displayed    2164    1835      3999
Wall Hidden       188     713       901
Total             2352    2548      4900

i. Among those with hidden Walls, what is the probability that the user is female?

ii. Are gender and Wall privacy independent? Why or why not? Use a probabilistic argument!

(e) Suppose the crawler flips a coin to determine whether or not to download the current profile to disk. If the coin shows heads, the profile is downloaded to disk; otherwise it is skipped without saving any of its data. If the crawler downloads the profile, it will visit that profile's friends with probability 0.75. If it does not download the profile, it will visit that profile's friends with probability 0.1.

i. Draw the tree or contingency table that corresponds to the situation described above. Annotate all events and probabilities for each branch.

ii. Find the probability that the crawler will visit the profiles of the current user's friends.

iii. Suppose we know that the crawler will visit the current user's friends. What is the probability that the current user's profile was downloaded to disk?
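
As a rough check on the arithmetic in Problem 1, the following Python sketch works through parts (b), (d) and (e) with exact fractions. It assumes that the SuperWall count (559) and the FunWall count (206) each include the 84 profiles that have both boxes, and that the coin in part (e) is fair. Neither assumption is stated outright in the problem, and the variable names are illustrative only.

```python
from fractions import Fraction

N = 4900  # profiles in the sample

# Part (b): inclusion-exclusion.
# Assumption: 559 and 206 each include the 84 profiles with both boxes.
superwall, funwall, both = 559, 206, 84
p_either = Fraction(superwall + funwall - both, N)
p_neither = 1 - p_either

# Part (d): conditional probability read off the two-way table.
hidden_female, hidden_total = 713, 901
female_total = 2548
p_female_given_hidden = Fraction(hidden_female, hidden_total)
p_female = Fraction(female_total, N)
# Independence would require P(female | hidden) == P(female).
independent = p_female_given_hidden == p_female

# Part (e): law of total probability, then Bayes' rule.
# Assumption: the coin is fair, so P(download) = 1/2.
p_download = Fraction(1, 2)
p_visit_given_download = Fraction(3, 4)
p_visit_given_skip = Fraction(1, 10)
p_visit = (p_download * p_visit_given_download
           + (1 - p_download) * p_visit_given_skip)
p_download_given_visit = p_download * p_visit_given_download / p_visit

print("P(neither app)       =", p_neither, "~", round(float(p_neither), 4))
print("P(female | hidden)   =", p_female_given_hidden)
print("P(female)            =", p_female)
print("independent?         =", independent)
print("P(visit friends)     =", p_visit)
print("P(download | visit)  =", p_download_given_visit)
```

Exact fractions are used so that the independence check compares the two probabilities without any rounding error.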

Reach for the STARs. In 1998, Gray Davis approved the Standardized Testing and Reporting Program, which mandated the use of the Stanford Achievement Test, Ninth Edition as the sole norm-referenced measure of educational outcomes in the state. In 2003, the California Department of Education approved replacing Stanford 9 with another test, the California Achievement Tests, Sixth Edition, but required only 3rd and 7th graders to take it. Everybody else had to take a different battery of tests referred to as the Content Standards Test. In this problem, we will analyze a couple of facets of this decision using the material we have studied this quarter.

2. During the research phase leading to this decision, a sample of school districts was selected (and paid) to administer both Stanford 9 and CAT/6 to all of their students. The table below displays 10 pairs of scaled scores for the Language subtest of each battery. Scaled scores take into account the difficulty of items on the test as a way of normalizing among different forms. Scaled scores on Stanford 9 range from 200 to 999. Scaled scores on CAT/6 Language range from 0 to 999.

Stanford 9      200   310   450   520   600   650   732   810   900   960
CAT/6 Survey    50    125   500   600   670   690   780   820   950   999

(a) Make a scatterplot of these test scores. Clearly denote what x and y represent.

(b) Compute the sample correlation coefficient r between Stanford 9 Language and CAT/6 Language. To maximize partial credit, make sure to show all of your work, including s_x, s_y, x̄, ȳ as well as the formula and the numbers you have plugged in!

(c) For brevity, the sample in part (a) has only 10 pairs. The actual correlation coefficient is r = 0.89. Compute the regression model equation using the following statistics: x̄ = 500, ȳ = 500, s_x = 200, s_y = 165. To maximize partial credit, show all steps, including your computation of b_0 and b_1.

(d) Interpret the slope and intercept of your regression model. Comment on anything strange you may notice, what may have caused it, and how you may be able to resolve it.

(e) Using your regression model from the previous part, predict the scaled score that a student who received a 610 on Stanford 9 Language would receive on CAT/6 Language. Suppose the student's true CAT/6 Language score is 570. Compute the residual. Make a comment about the model's prediction.

(f) Compute r^2 and interpret it.

(g) Suppose the psychometrician who did this study forgot to include a pair of scores. This particular student scored 700 on Stanford 9 Language and 360 on CAT/6 Language. Select the correct statement. r will:
increase
decrease
remain about the same
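
One way to check the hand calculations in Problem 2 is the short Python sketch below. It uses the usual textbook formulas: r as the average product of z-scores with n - 1 in the denominator, slope b_1 = r s_y / s_x, and intercept b_0 = ȳ - b_1 x̄. The helper names (`mean`, `sample_sd`, `correlation`) are made up for this illustration.

```python
from math import sqrt

stanford9 = [200, 310, 450, 520, 600, 650, 732, 810, 900, 960]
cat6 = [50, 125, 500, 600, 670, 690, 780, 820, 950, 999]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = ((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

# Part (b): r for the 10 pairs in the table.
r_sample = correlation(stanford9, cat6)

# Part (c): regression line from the summary statistics given in the problem.
r, x_bar, y_bar, s_x, s_y = 0.89, 500, 500, 200, 165
b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept

# Part (e): prediction at x = 610 and residual (observed minus predicted).
predicted = b0 + b1 * 610
residual = 570 - predicted

# Part (f): proportion of variation in y explained by the linear model.
r_squared = r ** 2

# Part (g): recompute r after adding the forgotten pair (700, 360).
r_with_extra = correlation(stanford9 + [700], cat6 + [360])

print(f"r (10 pairs)        = {r_sample:.4f}")
print(f"b1 = {b1:.5f}, b0 = {b0:.3f}")
print(f"prediction at 610   = {predicted:.2f}, residual = {residual:.2f}")
print(f"r^2                 = {r_squared:.4f}")
print(f"r with (700, 360)   = {r_with_extra:.4f}")
```

With the given summary statistics the slope works out to 0.73425 and the intercept to 132.875. Comparing `r_with_extra` with `r_sample` shows the direction of the change asked about in part (g).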

3. In addition to national percentiles, test publishers may choose to report other metrics such as stanines and grade equivalents. Stanines express test performance on a scale from 1 to 9, with scores of 1, 2, and 3 representing below average; 4, 5, and 6 representing average; and 7, 8, and 9 representing above average. Each stanine is 0.5 standard deviations in width, except for the first and ninth, which are larger (see the diagram below). Grade equivalents, on the other hand, are decimals ranging from 0.0 to 12.9 in 0.1 increments that express scaled scores as an approximate grade level and time of year. The digit before the decimal (0 to 12) represents grade level and the digit after the decimal (0 to 9) represents the month of the school year, assuming a 10-month school year. Grade equivalents allow educational administrators to gauge approximate grade level improvement over time. Suppose on CAT/6 Reading the mean scaled score is 500 with standard deviation 150. For parts (a), (b), and (c), use the following graphic (carefully) to help answer the questions.

[Figure: normal density curve divided into stanines; horizontal axis: z-score; vertical axis: density, 0.0 to 0.4.]

(a) On the graphic above, annotate each stanine with its area (percentage of scores falling in that stanine). Hint: Remember the normal distribution is symmetric!

(b) Suppose identical twins (with identical intelligence) Ted and Ned both take this test and receive scaled scores of 605 and 614, respectively. Compute Ted's and Ned's stanines.

Ted's stanine:

Ned's stanine:

(c) Based on your answer to part b, describe one criticism of the use of stanine scores.

(d) Convert Ted's score to a grade equivalent. Assume that Ted is a third grader and is tested during the seventh month of instruction (about March). An average third grader tested in March should receive a grade equivalent of 3.7. Assume the standard deviation to be 0.2 (2 months).
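
The stanine questions in Problem 3 can be sketched in Python as follows. The original diagram is not reproduced here, so the cut points are an assumption: stanine 5 is taken to be centered on the mean, which puts the boundaries at z = ±0.25, ±0.75, ±1.25 and ±1.75. Part (d) is handled by carrying Ted's z-score over to the grade-equivalent scale, which is one reasonable reading of the question rather than the only one.

```python
from bisect import bisect_right
from statistics import NormalDist

# Assumed stanine boundaries in z-score units (stanine 5 centered at z = 0).
CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
std_normal = NormalDist()  # mean 0, standard deviation 1

def stanine_areas():
    """Part (a): area under the normal curve for stanines 1 through 9."""
    edges = [0.0] + [std_normal.cdf(c) for c in CUTS] + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

def stanine(scaled, mean=500, sd=150):
    """Part (b): stanine for a scaled score on a test with the given mean and sd."""
    z = (scaled - mean) / sd
    return bisect_right(CUTS, z) + 1

areas = stanine_areas()
ted = stanine(605)
ned = stanine(614)

# Part (d): same z-score, placed on a scale with mean 3.7 and sd 0.2.
z_ted = (605 - 500) / 150
grade_equiv = 3.7 + z_ted * 0.2

for number, area in enumerate(areas, start=1):
    print(f"stanine {number}: {area:.4f}")
print("Ted:", ted, "Ned:", ned)
print(f"Ted's grade equivalent ~ {grade_equiv:.2f}")
```

Ted's z-score is 0.70 and Ned's is 0.76, so under the assumed cut points they fall on opposite sides of the 0.75 boundary, which is the situation part (c) asks about.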

4. The Mad Tea Party. When Alice walks through the looking glass in her quest to find out why the White Rabbit is in such a hurry, she stumbles upon something that will change her life forever - she meets the Mad Hatter and March Hare.

Mad Hatter: Statistics prove, prove that you've one birthday!
March Hare: Imagine, just one birthday every year.
Mad Hatter: Ah, but there are 364 UNbirthdays!
March Hare: Precisely why we're gathered here to cheer!

Apparently all that tea caused them to forget about leap year! Assume that a year consists of only 365 days.

(a) In a tea party consisting of Alice, Mad Hatter, March Hare, Cheshire Cat and Dormouse, find the probability that none of them has a birthday (they are all celebrating an unbirthday).

(b) In the same tea party, find the probability that at least one of them has a birthday and spoils the whole celebration.

(c) Suppose we hold a larger tea party. Find the probability that the 8th person to arrive is the first to spoil the party.

(d) Now suppose all of Wonderland (population 5,000) is invited: rabbits, flowers, brooms, people, etc. What is the probability that at least 5 attendees spoil the festivities?
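
A minimal Python sketch of the calculations in Problem 4 follows. It assumes that "has a birthday" means the guest's birthday is today, that each guest's birthday is equally likely to be any of the 365 days, and that guests are independent of one another. Part (c) is then a geometric probability and part (d) a binomial tail. The helper `binom_pmf` is written out here only for illustration.

```python
from math import comb

p = 1 / 365   # probability that today is a given guest's birthday
q = 1 - p     # probability of an unbirthday

# Part (a): all 5 guests are celebrating an unbirthday.
p_none = q ** 5

# Part (b): complement of part (a).
p_at_least_one = 1 - p_none

# Part (c): geometric model, 7 unbirthdays followed by the first birthday.
p_eighth_is_first = q ** 7 * p

def binom_pmf(k, n, prob):
    """P(X = k) for X ~ Binomial(n, prob)."""
    return comb(n, k) * prob ** k * (1 - prob) ** (n - k)

# Part (d): P(X >= 5) = 1 - P(X <= 4) with n = 5000 guests.
n = 5000
p_at_least_five = 1 - sum(binom_pmf(k, n, p) for k in range(5))

print(f"(a) P(no birthdays among 5)   = {p_none:.5f}")
print(f"(b) P(at least one birthday)  = {p_at_least_one:.5f}")
print(f"(c) P(8th guest is the first) = {p_eighth_is_first:.6f}")
print(f"(d) P(at least 5 of 5000)     = {p_at_least_five:.5f}")
```

The exact binomial sum is used in part (d); a normal or Poisson approximation would be another option, since the expected number of birthdays among 5,000 guests is about 13.7.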

The rest of the song for your entertainment:

Alice: Well then today's my unbirthday too!
Mad Hatter: It is???
March Hare: What a small world this is!
Mad Hatter: In THAT case...
Both: A very merry unbirthday to...
Alice: To me?
March Hare: To you!
Mad Hatter: a very merry unbirthday
Alice: for me?
Mad Hatter: for you!
March Hare: now blow the candle out my dear and make your wish come true! hoo hoo!
Both: A very merry unbirthday to you!!!!
Dormouse: Twinkle, twinkle little bat. How I wonder what you're at. Up above the world you fly, like a tea tray in the sky!


5. To Be or Not to Be. Problems involving the authenticity of historical and literary documents such as the Federalist Papers, Shakespeare's tragedies and Ronald Reagan's speeches have been analyzed using statistical methods. On average, Shakespeare writes 1,500 words per act. A researcher counts the number of words in each act of Hamlet, and finds that there are, on average, 1,600 per act with a standard deviation of 125, and there are 5 acts in Hamlet. The researcher found this odd and decided to perform a hypothesis test to determine whether or not Hamlet was unusually long. Test at 5%.

(a) Set up the proper null and alternate hypotheses.

(b) Calculate the test statistic and draw the appropriate picture.

(c) State your decision.

(d) What do you conclude?
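
The test statistic in Problem 5 can be checked with the sketch below. It assumes a one-sided alternative (mean words per act greater than 1,500, matching "unusually long") and a one-sample t test with n - 1 = 4 degrees of freedom. The critical value 2.132 is the upper 5% point of the t distribution with 4 degrees of freedom as found in a standard t table; Python's standard library has no t distribution, so the value is hard-coded here.

```python
from math import sqrt

mu0 = 1500                 # hypothesized mean words per act
x_bar, s, n = 1600, 125, 5 # sample mean, sample sd, number of acts

se = s / sqrt(n)           # standard error of the sample mean
t_stat = (x_bar - mu0) / se
df = n - 1

# Assumption: one-sided test at the 5% level; value taken from a t table.
T_CRIT = 2.132
reject = t_stat > T_CRIT

print(f"t = {t_stat:.4f} with {df} degrees of freedom")
print("reject H0" if reject else "fail to reject H0")
```

If a two-sided alternative is preferred, the table value for 4 degrees of freedom at the 5% level is 2.776, and the comparison would use the absolute value of `t_stat`.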


6. Carl Icahn Strikes Back. Time-Warner, in its effort to take control over the world, has suggested a pay-per-megabyte policy that requires users to pay for the amount of bandwidth they use on a monthly basis. Suppose Time-Warner wants to classify users that transfer (download) an average of more than 20 GB per month into its most expensive per-megabyte pricing plan. Ryan gets placed in this category because of his most recent usage and is not happy. Over the past 48 months, Ryan's router reports that he downloads an average of 20.5 GB per month with a standard deviation of 3 GB. Is it statistically reasonable to place Ryan in this group (assuming they value statistics)?

(a) Construct the proper null and alternate hypotheses.

(b) Calculate the appropriate test statistic, draw a picture, find the p-value and state your decision (reject/fail to reject).

(c) Make your conclusion in the context of the problem.

(d) Let pval represent the p-value. The p-value indicates that
a) the probability that the null hypothesis is true is pval.
b) the probability we reject the null hypothesis when the null hypothesis is true is pval.
c) the probability that we observe a sample mean as extreme or more extreme than the one observed is pval, if H0 is true.
d) the probability of making a Type II error is pval.

(e) Test the same hypothesis above using a confidence interval. Clearly state your confidence level, find the confidence interval and make your conclusion in terms of the confidence interval.

(f) Gigabytes downloaded per month follows a normal distribution with mean 10.0 and standard deviation 5.0. Find the probability that the mean gigabytes downloaded per month for a random sample of 100 Time-Warner customers from around the country is between 9.5 and 10.5 gigabytes.

(g) Sophisticated statistical analyses are being developed to classify Internet traffic without having to actually peek at what is being transmitted. Suppose Time-Warner already uses this technology and can actively profile BitTorrent users. They find that a sample of 41 BitTorrent users download an average of 11.0 GB per month with a standard deviation of 6.0 GB. On the other hand, they find that a sample of 60 normal users download an average of 9 GB per month with a standard deviation of 5 GB. The motive is that if the analyst can find a significant difference between the two groups of users, then Time-Warner can place these two groups of users into different pricing brackets.

i. Construct the null and alternate hypotheses.

ii. Calculate the appropriate test statistic and draw the corresponding picture.

iii. State your decision (reject/do not reject).

iv. Estimate the p-value.

v. Make your conclusion in the context of the stated problem.

vi. Construct a 95% confidence interval for the difference between the two means.

vii. Interpret your confidence interval in the context of the problem.
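
The numerical parts of Problem 6 can be sketched in Python as follows. All p-values and intervals below use the normal approximation, on the assumption that the samples (48, 41 and 60 observations) are large enough for it; a t-based analysis would give slightly wider intervals. The sketch also assumes a one-sided alternative in part (b), a two-sided alternative in part (g), and a 95% level in part (e). These are choices made for illustration, not requirements of the problem.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()            # standard normal
z95 = z.inv_cdf(0.975)      # about 1.96, for two-sided 95% intervals

# Part (b): H0: mu = 20 against Ha: mu > 20 for Ryan's monthly downloads.
se_ryan = 3 / sqrt(48)
stat_ryan = (20.5 - 20) / se_ryan
p_value_ryan = 1 - z.cdf(stat_ryan)   # upper-tail area

# Part (e): 95% confidence interval for Ryan's mean monthly download.
ci_ryan = (20.5 - z95 * se_ryan, 20.5 + z95 * se_ryan)

# Part (f): sampling distribution of the mean of 100 customers.
se_mean = 5.0 / sqrt(100)
p_between = z.cdf((10.5 - 10.0) / se_mean) - z.cdf((9.5 - 10.0) / se_mean)

# Part (g): BitTorrent users (n = 41) versus normal users (n = 60).
diff = 11.0 - 9.0
se_diff = sqrt(6.0 ** 2 / 41 + 5.0 ** 2 / 60)
stat_diff = diff / se_diff
p_value_diff = 2 * (1 - z.cdf(stat_diff))   # two-sided
ci_diff = (diff - z95 * se_diff, diff + z95 * se_diff)

print(f"(b) test statistic = {stat_ryan:.4f}, p-value ~ {p_value_ryan:.4f}")
print(f"(e) 95% CI for mu  = ({ci_ryan[0]:.3f}, {ci_ryan[1]:.3f})")
print(f"(f) P(9.5 < mean < 10.5) = {p_between:.4f}")
print(f"(g) test statistic = {stat_diff:.4f}, p-value ~ {p_value_diff:.4f}")
print(f"    95% CI for difference = ({ci_diff[0]:.3f}, {ci_diff[1]:.3f})")
```

Checking whether 20 lies inside `ci_ryan` and whether 0 lies inside `ci_diff` gives the confidence-interval versions of the two tests.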

