
Stat 101C - Int Reg Data Mining
J. Sanchez, UCLA Department of Statistics
Take-home project

NAME (Last, First):
UCLA ID:

Instructions

(a) Due Friday, June 7th at the beginning of class, in person. A quiz will follow. EVERYBODY must do this project. No other projects allowed.
(b) Use the project to prepare for the midterm. It will be more fun to review the material learned if you need to use it to answer some new questions on a new data set. You will probably remember the material better, and write a better cheat sheet and exam.
(c) No late projects accepted under any circumstances. No makeups.
(d) The typed report must be handed in in person to Prof. Sanchez on June 7th. No email, mailboxes, fax or other way of turning it in will be allowed.
(e) The first page should be a cover page with title, and your last name, first name, ID. Print double sided to avoid 5-pound reports.
(f) To get full credit, you must pay attention to the instructions and follow them. Points will be deducted for not following instructions. Models fitted must be justified with the appropriate diagnostics and descriptive analysis, and the methods used must be justified as well. (More on this below.)
(g) Must write report in the order given below.
(h) Must use notation used in lecture and textbook.
(i) The project must reflect individual work only. You may not ask anyone, or work with anyone. No help will be given by me or the TA, and you cannot use any resources other than what you learned during the quarter. You are expected to use all the things taught to you throughout the quarter and to use your code and the model approaches used in the homework. Go over all homework and select from it the code and procedures you need for this new problem. I just want you to show in the project that you have learned all the data analysis taught during the quarter and that you can do a data analysis on your own without any help. Doing this project while you prepare for the midterm will enhance your performance in both.
(j) The project must be typed and written neatly. Staple the pages.
(k) The answers must be written in your own words, extracting what you need from the output to support your narrative, but never just copy-pasting R output. Graphs must be in the section where they help answer a question. Your report must read as a nicely written report separated by sections as indicated below. Figures must be numbered. Sections must be numbered. Code and ALL output must always be attached as an Appendix, well documented with section headings and separations, but not dumped in the text (use the homework rules as an example for this).
NOTE: THIS REQUEST HAS BEEN PUT IN ALL HOMEWORK, AND SOME OF YOU STILL IGNORE IT. BE AWARE THAT IF I SEE COPY-PASTED OUTPUT THAT YOU DON'T EVEN TALK ABOUT INSIDE YOUR PAPER NARRATIVE, NO MATTER HOW COLORFUL AND PRETTY YOU PUT IT, YOU WILL GET 0 POINTS IN THE PROJECT. YOU MUST SELECT ONLY THE NUMBERS YOU TALK ABOUT. ALSO, IF I DO NOT SEE OUTPUT FOR ME TO CHECK YOUR RESULTS IN THE APPENDIX NEAR THE CODE YOU USED TO OBTAIN IT, YOUR PAPER WILL NOT BE READ AND WILL GET 0 (0) POINTS. NO EXCEPTIONS.
Most of you have done great work at this in the homework, but the few who have not need to change that bad habit.
(l) GRADING RUBRICS:
Presentation (20%): Typed neatly, separating the sections/questions with section headings, following a narrative while being technical, and adhering to the requirements given above. Appendix with output and code well organized and easy to find. For better effect, you may highlight in your code with a marker the numbers you select for the narrative, if you wish.
Use of homework, lecture material and code seen during the quarter (20%): Used appropriately and to the extent that is needed to give a full answer as is conventional in that method, like we have been doing in class and homework.
Technical content and narrative (60%): Good and well-justified selection of exploratory analysis, testing, modeling, interpretations.

May 30, 2013


I. The data

The data for this project is a subset of the Child Health and Development Studies. The full real data set includes all pregnancies that occurred between 1960 and 1967 among women in the Kaiser Foundation Health Plan in Oakland, California. The data you are going to download is from one year of the study; it includes male single births where the baby lived at least 28 days. The variables that are available for analysis are described below.
1. id - identification number
2. pluralty - 5 = single fetus (all values are the same, so not very helpful)
3. outcome - 1 = live birth that survived at least 28 days. All the same.
4. date - birth date, where 1096 = January 1, 1961
5. gestation - length of gestation in days
6. sex - infant's sex, 1 = male. All the same.
7. wt - birth weight in ounces
8. parity - total number of previous pregnancies, including fetal deaths and still births
9. race - mother's race: 0-5 = white, 6 = mex, 7 = black, 8 = asian, 9 = mixed
10. age - mother's age in years at termination of pregnancy
11. ed - mother's education: 0 = less than 8th grade, 1 = 8th-12th grade, did not graduate, 2 = HS graduate, no other schooling, 3 = HS + trade, 4 = HS + some college, 5 = college graduate, 6 & 7 = trade school, HS unclear
12. ht - mother's height in inches to the last completed inch
13. wt.1 - mother's prepregnancy weight in pounds
14. drace - father's race, coding same as mother's race
15. dage - father's age, coding same as mother's age
16. ded - father's education, coding same as mother's education
17. dht - father's height, coding same as for mother's height
18. dwt - father's weight, coding same as for mother's weight
19. marital - 1 = married, 2 = legally separated, 3 = divorced, 4 = widowed, 5 = never married
20. inc - family yearly income in $2500 increments: 0 = under 2500, 1 = 2500-4999, ..., 8 = 12,500-14,999, 9 = 15000+
21. smoke - does mother smoke? 0 = never, 1 = smokes now, 2 = until current pregnancy, 3 = once did, not now
22. time - if mother quit, how long ago? 0 = never smoked, 1 = still smokes, 2 = during current pregnancy, 3 = within 1 yr, 4 = 1 to 2 yrs ago, 5 = 2 to 3 yrs ago, 6 = 3 to 4 yrs ago, 7 = 5 to 9 yrs ago, 8 = 10+ yrs ago, 9 = quit and don't know, 98 = unknown
23. number - number of cigarettes smoked per day for past and current smokers: 0 = never, 1 = 1-4, 2 = 5-9, 3 = 10-14, 4 = 15-19, 5 = 20-29, 6 = 30-39, 7 = 40-60, 8 = 60+, 9 = smoke but don't know
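The codes listed above are integers, so R will treat them as numbers unless told otherwise. The following is only a minimal sketch of how the codebook labels for marital and smoke could be attached as factor levels. The data frame toy and its values are invented for illustration and are not taken from project.csv; the derived names marital.f, smoke.f and wt.lb are hypothetical.

```r
# Toy rows, invented for illustration (NOT the project data)
toy <- data.frame(
  marital = c(1, 5, 3, 1),
  smoke   = c(0, 1, 2, 3),
  wt      = c(120, 113, 128, 108)   # birth weight in ounces
)

# Attach the codebook labels to the integer codes
toy$marital.f <- factor(toy$marital, levels = 1:5,
  labels = c("married", "legally separated", "divorced",
             "widowed", "never married"))
toy$smoke.f <- factor(toy$smoke, levels = 0:3,
  labels = c("never", "smokes now", "until current pregnancy",
             "once did, not now"))

# Birth weight is recorded in ounces; 16 ounces = 1 pound
toy$wt.lb <- toy$wt / 16

# Frequency table using the labels instead of the raw codes
table(toy$smoke.f, useNA = "ifany")
```

Labelled factors make tables and model output easier to read, but any code outside the listed levels becomes NA, so it is worth checking the table for missing values afterwards.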

You may read the data set that you will use from

project.data <- read.csv("http://www.stat.ucla.edu/jsanchez/data/stat101c/project.csv", header=TRUE)
attach(project.data)
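Since the variable list says that pluralty, outcome and sex take a single value, a first check after reading the data could be to find such constant columns. This is a minimal sketch on an invented data frame called toy (a stand-in for project.data, so that it runs without the download); the names is.constant and toy.reduced are hypothetical.

```r
# Toy stand-in for project.data (values invented for illustration)
toy <- data.frame(
  id       = 1:5,
  pluralty = rep(5, 5),
  outcome  = rep(1, 5),
  sex      = rep(1, 5),
  wt       = c(120, 113, 128, 108, 136)
)

# A column with a single distinct value carries no information for modeling
is.constant <- sapply(toy, function(x) length(unique(x)) == 1)
names(toy)[is.constant]

# Keep only the informative columns
toy.reduced <- toy[, !is.constant, drop = FALSE]

# Columns can also be reached through the data frame rather than attach()
with(toy.reduced, mean(wt))
```

Using with() or the data= argument of modeling functions is shown only as an alternative; the attach() call given above works as well, as long as no other object masks the column names.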

II. Questions, organization of report and further instructions.

1. Introduction. Do a little research on the issue of birth weight and smoking of the mother. Write a short introduction. Put your references at the end of the report.
2. What affects smoking status of the mother? Using the variable smoke, create a variable that represents smoking status as a 1, 0 variable, where 1 means the mother smokes now or smoked at least once and 0 otherwise. Call this new variable newsmoke. Using the variable marital, create another variable for marital status that is 1 if married and 0 otherwise. Call it newmarried. What role does marital status thus defined have on smoking status of the mother thus defined, controlling for education, age, height and weight of the mother?
3. Learning more about what affects different smoking status. Repeat question 2, but without transforming the smoking and marital status variables. That is, now they will have more than two categories (use marital and smoke). How much more can you tell to answer the same questions?


4. Determinants of birth weight. What role does smoking status of the mother play on birth weight of the baby, controlling for mother's education, age, mother's height and weight, and gestational age? Is the effect of smoking altered by the marital status of the woman? Is the effect of gestational age for smokers altered by the number of cigarettes smoked (number)?
5. Other interrelations among mothers' characteristics. Is the relation between smoking status and education different for married and unmarried women? Are these three factors interrelated?
6. Other information. How would you set up the data set to study the effect of education on the expected number of women that smoked at least once? Write the table only and suggest what type of modeling you would do. No need to do this analysis.
7. Conclusion.
8. Appendix with code and output (do not forget the output), well separated by sections.

Further instructions.

Be concise and go to the point, selecting carefully your supporting evidence and not copy-pasting output that you do not talk about, or any output, period. Select. Copy and paste already formatted equations and text from your old homework files and just change them for the questions asked here. The following must appear in each question:

Explore and describe with simple summary statistics the data you use for each question, and investigate potential outliers and/or anomalies that you will point out in your description.

Select and mention the appropriate statistical modeling approach, among those seen this quarter, to answer the questions asked (a large amount of points is lost if you don't choose the right methods). Explain the method briefly. The method has to be appropriate for the type of dependent variable you have and the types of independent variables.

Do the data analysis needed to answer the question with R. Keep track of all your output, because it must all be in the Appendix with the code. Select what to put in the narrative of the paper.
Explain briefly what you did and write the equations of your fitted models, specifying dependent and independent variables, significance, and appropriate tests. All this is the technical part. Your answer has to be supported by your results, and you must explain why the result you select to put in your narrative supports your answer.

You must then write your final explanation in a way that a user not familiar with statistics will understand and can put in a report to publish to the general public. Pretty much what we have been doing in the homework all along. In this part, you must choose ratios, odds, whatever is conventional to use with each method.

GOOD LUCK and DO NOT FORGET TO PRINT YOUR PAPER DOUBLE SIDED.
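As an illustration of the recoding described in question 2 and of reporting odds, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data frame sim, its sample size, the sampling probabilities and the means and standard deviations are invented and are not the project data; logistic regression is shown only as one conventional choice for a 1, 0 response and is not necessarily the approach expected from lecture; treating ed as a factor is also an assumption.

```r
set.seed(101)  # reproducible simulated data, NOT the project data
n <- 300
sim <- data.frame(
  smoke   = sample(0:3, n, replace = TRUE),
  marital = sample(1:5, n, replace = TRUE,
                   prob = c(0.80, 0.05, 0.05, 0.02, 0.08)),
  ed      = sample(0:5, n, replace = TRUE),
  age     = sample(18:40, n, replace = TRUE),
  ht      = round(rnorm(n, mean = 64, sd = 2.5)),
  wt.1    = round(rnorm(n, mean = 128, sd = 20))
)

# Recode as described in question 2
sim$newsmoke   <- as.integer(sim$smoke %in% 1:3)  # smokes now or smoked at least once
sim$newmarried <- as.integer(sim$marital == 1)    # married vs. everything else
# Note: any code outside 0-3 would be mapped to 0 here ("0 otherwise");
# unknown or missing codes in a real file would need separate handling.

# Binary response: logistic regression (assumed choice of method)
fit <- glm(newsmoke ~ newmarried + factor(ed) + age + ht + wt.1,
           family = binomial, data = sim)

# Coefficient table on the log-odds scale
summary(fit)$coefficients

# Odds ratios: exponentiated coefficients
exp(coef(fit))
```

The exponentiated coefficient of newmarried reads as the multiplicative change in the odds of smoking for married versus not married women, holding the other variables fixed. Because the data here are simulated independently, the estimates themselves carry no meaning.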

