
DS 5110 - Fall 2018 Midterm Exam

Name:_________________________ Year/Program:______________________

Instructions

Answer each problem to the best of your ability. Fully read instructions for all sections. Justify or explain
your answers when appropriate. Partial credit will be given for answers that are partially correct. Points will
be deducted for incorrect statements even if all other parts of your answer are correct.
All data tables can be found at the end of the exam. You may remove that page from the exam for your
convenience when referencing them.
No notes are permitted for this exam. Use of computers, smartphones, or other unauthorized aids while
taking the exam is strictly prohibited.
You may ask the instructor or a TA to display the help/documentation page of any function from base R or
tidyverse packages.

Honor statement

I promise I will not cheat on this exam. I will neither give nor receive any unauthorized assistance. I will not
share information about the exam with anyone who may be taking it at a different time. I have not been told
anything about the exam by someone who has taken it earlier.

Signature:________________________ Date:___________________________

Project group members

Give the names of 3 to 5 people in this class (including yourself) who will be your project group members.

Group member 1: _______________________
Group member 2: _______________________
Group member 3: _______________________
Group member 4: _______________________
Group member 5: _______________________

Part     Points Possible     Points Received
A        68
B        36
C        96
Total    200

Part A

This section is multiple choice. For each question, circle the best answer.

1. (4 pts) What is the relationship between a bar plot and a histogram?
a. They use the same geometric object
b. They use the same statistical transformation
c. They are both used to visualize the distribution of categorical variables
d. All of the above
e. None of the above

2. (4 pts) What is the relationship between a histogram and frequency polygon?
a. They use the same geometric object
b. They use the same statistical transformation
c. They are both used to visualize the distribution of categorical variables
d. All of the above
e. None of the above

3. (4 pts) What is (always) true of plots with multiple layers?
a. All layers use the same dataset
b. All layers use the same aesthetic mappings
c. All layers use the same geometric objects
d. All layers use the same statistical transformations
e. None of the above

4. (4 pts) What kind of plot can be used to visualize the relationship between two continuous
variables?
a. geom = "bar", stat = "count"
b. geom = "bar", stat = "bin"
c. geom = "point", stat = "identity"
d. geom = "point", stat = "count"
e. None of the above

5. (4 pts) What kind of plot can be used to visualize the distribution of a single categorical
variable?
a. geom = "bar", stat = "count"
b. geom = "bar", stat = "bin"
c. geom = "point", stat = "identity"
d. geom = "point", stat = "count"
e. None of the above

6. (4 pts) What kind of plot can be used to visualize the distribution of a single continuous
variable?
a. geom = "bar", stat = "count"
b. geom = "bar", stat = "bin"
c. geom = "point", stat = "identity"
d. geom = "point", stat = "count"
e. None of the above

7. (4 pts) Which position adjustment can be used as a solution to overplotting in a scatter
plot?
a. position = "identity"
b. position = "stack"
c. position = "dodge"
d. position = "jitter"
e. None of the above

8. (4 pts) What is the relationship between a boxplot and a histogram?
a. They use the same geometric object
b. They use the same statistical transformation
c. They are both used to visualize the distribution of continuous variables
d. All of the above
e. None of the above

9. (4 pts) Consider two continuous variables A and B and a categorical variable C. How can
we visualize the relationship between all three?
a. A scatter plot of A and B, mapping C to “color”
b. A scatter plot of A and C, mapping B to “size”
c. A bar plot of A and B, mapping C to “color”
d. A bar plot of A and C, mapping B to “size”
e. All of the above

10. (4 pts) Consider two categorical variables A and B and a continuous variable C. How can
we visualize the relationship between all three?
a. ggplot(data) + geom_boxplot(aes(x=A, y=C)) + facet_wrap(~B)
b. ggplot(data) + geom_boxplot(aes(x=B, y=C)) + facet_wrap(~A)
c. ggplot(data) + geom_histogram(aes(x=C)) + facet_grid(A~B)
d. ggplot(data) + geom_freqpoly(aes(x=C, color=A)) + facet_wrap(~B)
e. All of the above

11. (4 pts) How can we visualize the distribution of a continuous variable?
a. A histogram
b. A boxplot
c. A frequency polygon
d. All of the above
e. None of the above

12. (4 pts) What is true of each aesthetic mapped to a variable in a plot?
a. It must be a continuous variable
b. It must be a categorical variable
c. It has a corresponding scale
d. It has a corresponding coordinate system
e. It has a corresponding facet specification

13. (4 pts) What is a rule of tidy data?
a. Each variable must have its own row
b. Each observation must have its own column
c. Each value must be a number
d. All of the above
e. None of the above

14. (4 pts) What is NOT an advantage of working with tidy data?
a. A consistent format that can be used with many tools
b. Vectorized operations on variables are intuitive and efficient
c. Tidy datasets are always the most compact and storage-efficient
d. All of the above
e. None of the above

15. (4 pts) What is a common problem when tidying a dataset?
a. Each observation is a single row
b. Column names are values rather than variables
c. All character vectors contain strings
d. All of the above
e. None of the above

16. (4 pts) What is another common problem when tidying a dataset?
a. An observation is scattered across multiple rows
b. Column names are variables rather than values
c. Each value is in a cell
d. All of the above
e. None of the above

17. (4 pts) What is NOT true of working with relational database management systems such
as MySQL or SQLite from R?
a. All computations take place in R
b. The data does not need to fit into memory
c. Multiple tables of data can be accessed from disk
d. All of the above
e. None of the above

Part B

In this section, provide the primary key and any foreign keys for each table in the set of relational data tables
(found at the end of this exam).
If a table does not have a given key, put “none”. For any foreign keys you list, also give the name of the table
for which it is a primary key.

18. (18 pts) The clients, titles, and sales tables:

clients:
• Primary key =
• Foreign key(s) =

titles:
• Primary key =
• Foreign key(s) =

sales:
• Primary key =
• Foreign key(s) =

19. (18 pts) The customers, inventory, and orders tables:

customers:
• Primary key =
• Foreign key(s) =

inventory:
• Primary key =
• Foreign key(s) =

orders:
• Primary key =
• Foreign key(s) =

Part C

In this section, provide a pseudocode strategy for solving each problem, using relational data concepts
such as group_by(), summarise(), and joins such as left_join() and right_join().
All datasets referenced can be found at the end of the exam. You do not need to account for missing data or
other special cases. You do not need to calculate anything.

20. (12 pts) Ms. Nelson and Ms. Paige are the two agents for a certain literary agency described
in Appendix A. How many books of each genre does each agent represent?

21. (12 pts) What is the average word count for books of each genre?

22. (12 pts) What is the total sum of money in advances (as given by the advance column in
the sales table) for deals made by each agent?

23. (12 pts) Rank the literary genres by the total amount of money made in advances on sales
of domestic first print rights.

24. (12 pts) Customer and transaction information for a certain online vendor is given in
Appendix B. Calculate the total price of each order in orders.

25. (12 pts) Calculate the average price of items from each department.

26. (12 pts) Calculate the total amount of revenue for each department across all orders.

27. (12 pts) Create a new table with all items that have not yet been ordered by anyone.

Appendix A

The following three data tables describe the authors, book titles, and book sales managed by a certain literary
agency:

clients

## # A tibble: 5 x 5
##   cid     first_name last_name sign_date  agent
##   <chr>   <chr>      <chr>     <chr>      <chr>
## 1 jsmith  Jane       Smith     2001-03-04 Nelson
## 2 adory   April      Dory      2001-03-04 Paige
## 3 shu     Simon      Hu        2003-01-29 Paige
## 4 jsmith2 Jane       Smith     2006-11-09 Nelson
## 5 lortiz  Lorena     Ortiz     2010-09-26 Nelson

titles

## # A tibble: 9 x 4
##   title                 author  genre        word_count
##   <chr>                 <chr>   <chr>             <dbl>
## 1 The House on the Hill jsmith  contemporary     106789
## 2 The Blue Diary        jsmith  contemporary      95019
## 3 Dragon Eaters         adory   fantasy          135501
## 4 Silent Wizards        adory   fantasy          126038
## 5 Forbidden Alchemy     adory   fantasy          111666
## 6 My Father's Piano     shu     memoir           101365
## 7 Blueberry Pastures    jsmith2 contemporary      95019
## 8 Sudden Confinement    jsmith2 horror            95134
## 9 Rubi Saves the World  lortiz  young adult       76045

sales

## # A tibble: 9 x 4
##   title                 rights               advance royalty
##   <chr>                 <chr>                  <dbl>   <dbl>
## 1 The House on the Hill domestic first print   15000   0.125
## 2 Dragon Eaters         domestic first print   12000   0.1
## 3 Dragon Eaters         foreign markets         5000   0.05
## 4 Dragon Eaters         audio                   4000   0.075
## 5 Blueberry Pastures    domestic first print   15000   0.125
## 6 My Father's Piano     domestic first print   14500   0.1
## 7 My Father's Piano     foreign markets        14500   0.1
## 8 Rubi Saves the World  domestic first print   13500   0.11
## 9 Rubi Saves the World  audio                   6000   0.06
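
For anyone who wants to reproduce the Appendix A tables in R, the following is a minimal sketch. It assumes the tibble package is installed and simply transcribes the printed rows with tribble(); the object names clients, titles, and sales are taken from the table labels above. It only rebuilds the data and is not part of any exam answer.

```r
library(tibble)

# Authors represented by the agency (transcribed from the printed table).
clients <- tribble(
  ~cid,      ~first_name, ~last_name, ~sign_date,   ~agent,
  "jsmith",  "Jane",      "Smith",    "2001-03-04", "Nelson",
  "adory",   "April",     "Dory",     "2001-03-04", "Paige",
  "shu",     "Simon",     "Hu",       "2003-01-29", "Paige",
  "jsmith2", "Jane",      "Smith",    "2006-11-09", "Nelson",
  "lortiz",  "Lorena",    "Ortiz",    "2010-09-26", "Nelson"
)

# Book titles with their author id, genre, and word count.
titles <- tribble(
  ~title,                  ~author,   ~genre,         ~word_count,
  "The House on the Hill", "jsmith",  "contemporary", 106789,
  "The Blue Diary",        "jsmith",  "contemporary", 95019,
  "Dragon Eaters",         "adory",   "fantasy",      135501,
  "Silent Wizards",        "adory",   "fantasy",      126038,
  "Forbidden Alchemy",     "adory",   "fantasy",      111666,
  "My Father's Piano",     "shu",     "memoir",       101365,
  "Blueberry Pastures",    "jsmith2", "contemporary", 95019,
  "Sudden Confinement",    "jsmith2", "horror",       95134,
  "Rubi Saves the World",  "lortiz",  "young adult",  76045
)

# Rights sales, one row per title and rights type.
sales <- tribble(
  ~title,                  ~rights,                ~advance, ~royalty,
  "The House on the Hill", "domestic first print", 15000,    0.125,
  "Dragon Eaters",         "domestic first print", 12000,    0.1,
  "Dragon Eaters",         "foreign markets",      5000,     0.05,
  "Dragon Eaters",         "audio",                4000,     0.075,
  "Blueberry Pastures",    "domestic first print", 15000,    0.125,
  "My Father's Piano",     "domestic first print", 14500,    0.1,
  "My Father's Piano",     "foreign markets",      14500,    0.1,
  "Rubi Saves the World",  "domestic first print", 13500,    0.11,
  "Rubi Saves the World",  "audio",                6000,     0.06
)
```

Printing any of these objects should give output in the same shape as the corresponding table above.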

Appendix B

The following three data tables describe the customer information, inventory items, and online orders for a
certain vendor:

customers

## # A tibble: 7 x 3
##   customer_id name           email
##         <dbl> <chr>          <chr>
## 1           1 John Smith     john@thedude.com
## 2           2 Kelly Shay     kt598@h0tmail.com
## 3           3 Simone Arnold  coolchick99@geemail.com
## 4           4 Denise Sanchez dsanchez@outlooook.org
## 5           5 Shirley Grace  noreply@somedomain.edu
## 6           6 John Smith     jsmith@harvrad.org
## 7           7 Aiden Shu      noreply@somedomain.edu

inventory

## # A tibble: 7 x 4
##   item_id description        department     price
##     <dbl> <chr>              <chr>          <dbl>
## 1       7 Black wood chair   Furniture        60.9
## 2       8 XE Laptop computer Electronics    2200.
## 3      11 Sandalwood desk    Furniture       111.
## 4      13 Shiny thing        Toys and Games 1000.
## 5     113 Mini screwdriver   Tools             5.76
## 6     213 Black wood chair   Furniture       161.
## 7     226 Deck playing cards Toys and Games    5.76

orders

## # A tibble: 10 x 4
##    order_id customer_id item_id date
##       <dbl>       <dbl>   <dbl> <chr>
##  1     1001           2       7 10/3/18
##  2     1001           2       7 10/3/18
##  3     1001           2      11 10/3/18
##  4     1004           4       8 10/3/18
##  5     1022           5     113 10/6/18
##  6     1022           5       8 10/6/18
##  7     1103           1     226 10/6/18
##  8     1103           1     213 10/6/18
##  9     1268           5     226 10/8/19
## 10     1299           4       7 10/8/18
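
The Appendix B tables can be rebuilt in R the same way. This is a minimal sketch, assuming the tibble package is installed; the object names customers, inventory, and orders come from the table labels above. One assumption to note: the printed price column shows values rounded to three significant digits (for example 2200. and 60.9), so the numbers transcribed here may differ slightly from the original data. Dates are kept as character strings exactly as printed.

```r
library(tibble)

# Customer records (transcribed from the printed table).
customers <- tribble(
  ~customer_id, ~name,            ~email,
  1,            "John Smith",     "john@thedude.com",
  2,            "Kelly Shay",     "kt598@h0tmail.com",
  3,            "Simone Arnold",  "coolchick99@geemail.com",
  4,            "Denise Sanchez", "dsanchez@outlooook.org",
  5,            "Shirley Grace",  "noreply@somedomain.edu",
  6,            "John Smith",     "jsmith@harvrad.org",
  7,            "Aiden Shu",      "noreply@somedomain.edu"
)

# Inventory items; prices are the rounded values shown in the printout
# (assumption: the underlying values may carry more digits).
inventory <- tribble(
  ~item_id, ~description,         ~department,      ~price,
  7,        "Black wood chair",   "Furniture",      60.9,
  8,        "XE Laptop computer", "Electronics",    2200,
  11,       "Sandalwood desk",    "Furniture",      111,
  13,       "Shiny thing",        "Toys and Games", 1000,
  113,      "Mini screwdriver",   "Tools",          5.76,
  213,      "Black wood chair",   "Furniture",      161,
  226,      "Deck playing cards", "Toys and Games", 5.76
)

# Order lines, one row per item in an order; dates kept as printed.
orders <- tribble(
  ~order_id, ~customer_id, ~item_id, ~date,
  1001,      2,            7,        "10/3/18",
  1001,      2,            7,        "10/3/18",
  1001,      2,            11,       "10/3/18",
  1004,      4,            8,        "10/3/18",
  1022,      5,            113,      "10/6/18",
  1022,      5,            8,        "10/6/18",
  1103,      1,            226,      "10/6/18",
  1103,      1,            213,      "10/6/18",
  1268,      5,            226,      "10/8/19",
  1299,      4,            7,        "10/8/18"
)
```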
