
Data Analysis and Data Interpretation

A Python Question Bank

Data Set 1: Users giving ratings to movies, IMDB Data (Scraped, Simpler Version)

Given three datasets:

Users contains user_id, gender, age, occupation and zipcode
Ratings contains user_id, movie_id, rating, timestamp
Movies contains movie_id, title, genre

Data Source: https://github.com/wesm/pydata-book/tree/master/ch02/movielens

Questions:

1. Combine the three datasets into one blob.


2. Get mean ratings by gender for each movie title.
3. Get the number of ratings given for each movie title.
4. Get the mean ratings for titles with at least 250 ratings.
5. Get the top movie titles rated highly by females and males.
6. Get the movie titles where there is a high rating disagreement between the two
genders.
7. Get the movies that generated most disagreement in ratings independent of gender.

Solutions:

1.
import pandas as pd

# The files use a two-character '::' separator, which needs the Python parsing engine
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('/Users/yesbabu/Desktop/users.dat',
                      sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('/Users/yesbabu/Desktop/ratings.dat',
                        sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('/Users/yesbabu/Desktop/movies.dat',
                       sep='::', header=None, names=mnames, engine='python')

# With no keys given, pd.merge joins on the column names the frames share
movie_data = pd.merge(pd.merge(ratings, users), movies)
movie_data.describe()
movie_data.info()

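
To make the nested pd.merge call easier to follow, here is a minimal sketch on small made-up frames. The rows are invented for illustration and are not taken from the data files; the join keys are spelled out with on=, which is an assumption about intent rather than something the solution above does.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are invented.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Naming the keys makes the join explicit. Omitting on= would pick the same
# columns here, because they are the only names the frames share.
merged = pd.merge(pd.merge(ratings, users, on="user_id"), movies, on="movie_id")
print(merged)
```

Each rating row picks up the gender of its user and the title of its movie, so the merged frame has one row per rating.
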
2.

mean_ratings = movie_data.pivot_table('rating', index='title',
                                      columns='gender', aggfunc='mean')
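
As a rough illustration of what the pivot table computes, the sketch below builds the same table in two ways on invented data: once with pivot_table and once with groupby followed by unstack. The frame contents are assumptions made up for the example.

```python
import pandas as pd

# Invented toy data standing in for movie_data.
movie_data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [5, 3, 4, 2, 4],
})

by_pivot = movie_data.pivot_table("rating", index="title",
                                  columns="gender", aggfunc="mean")

# The same table step by step: group, average, then move gender to columns.
by_groupby = (movie_data.groupby(["title", "gender"])["rating"]
                        .mean()
                        .unstack("gender"))

print(by_pivot)
```
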

3.
ratings_by_title = movie_data.groupby('title').size()

4.
active_titles = ratings_by_title.index[ratings_by_title >= 250]

# .ix has been removed from pandas; .loc does the label-based selection
mean_ratings_250 = mean_ratings.loc[active_titles]
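
The filtering step can be tried out on a tiny invented frame. In this sketch the threshold name min_ratings and its value of 2 are hypothetical, chosen only so the toy data has something to filter; the exercise itself uses 250.

```python
import pandas as pd

# Invented toy data standing in for movie_data.
movie_data = pd.DataFrame({
    "title": ["Movie A"] * 3 + ["Movie B"] * 1 + ["Movie C"] * 2,
    "rating": [5, 3, 4, 2, 4, 1],
})

min_ratings = 2  # hypothetical threshold; the exercise uses 250

# Count ratings per title, then keep the titles that meet the threshold.
counts = movie_data.groupby("title").size()
active_titles = counts.index[counts >= min_ratings]

mean_by_title = movie_data.groupby("title")["rating"].mean()
print(mean_by_title.loc[active_titles])
```
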

5.

# sort_index(by=...) has been removed from pandas; sort_values replaces it
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)

6.

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

# Ascending order: titles preferred by women first, titles preferred by men last
sorted_by_diff = mean_ratings.sort_values(by='diff')
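
Because diff is M minus F, its sign tells you which gender rated a title higher. The sketch below, on invented mean ratings, reads off both ends of the sorted frame and also ranks titles by the absolute gap; the absolute-gap ranking is an addition for illustration and is not part of the solution above.

```python
import pandas as pd

# Invented mean ratings per title and gender.
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 2.0], "M": [3.0, 3.1, 4.0]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="title"),
)

mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
sorted_by_diff = mean_ratings.sort_values(by="diff")

# Most negative diff: rated higher by women. Most positive: rated higher by men.
female_preferred = sorted_by_diff.index[0]
male_preferred = sorted_by_diff.index[-1]

# Largest gap regardless of direction.
abs_gap = mean_ratings["diff"].abs().sort_values(ascending=False)
print(female_preferred, male_preferred)
print(abs_gap)
```
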

7.

# Standard deviation of rating grouped by title
rating_std_by_title = movie_data.groupby('title')['rating'].std()

# Filter down to active_titles (.ix has been removed from pandas; use .loc)
rating_std_by_title = rating_std_by_title.loc[active_titles]

# Order Series by value in descending order (Series.order is now sort_values)
rating_std_by_title.sort_values(ascending=False)[:10]

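
To see why the standard deviation is a reasonable measure of disagreement, consider this small sketch with two invented titles that have the same mean rating but very different spreads. The titles and ratings are made up for the example.

```python
import pandas as pd

# Two invented titles with identical means but different spreads.
movie_data = pd.DataFrame({
    "title": ["Polarizing"] * 4 + ["Consensus"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 3, 3],
})

stats = movie_data.groupby("title")["rating"].agg(["mean", "std"])
print(stats.sort_values(by="std", ascending=False))
```

The mean alone cannot tell the two titles apart, while the standard deviation puts the divisive one at the top.
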

Data Set 2: US baby names from 1880 to 2010. This is a very rich dataset, well
suited to the following tasks:

Visualize the proportion of babies given a particular name (your own, or another name)
over time.

Determine the relative rank of a name.

Determine the most popular names in each year, or the names with the largest increases
or decreases.

Analyze trends in names: vowels, consonants, length, overall diversity, changes in
spelling, first and last letters.

Data Source: https://github.com/wesm/pydata-book/tree/master/ch02/names

Questions:

1. The data set is split into one file per year. Assemble all of the data into a
single DataFrame (blob) and add a year field to it.

2. Aggregate the data at the year and sex level.

3. Plot the number of births by year and gender.

4. Insert a column (a score value) which gives the fraction of babies given each name
relative to the total number of births. A column value of 0.02 would indicate that 2
out of every 100 babies were given a particular name.

5. Perform a sanity check to show that your scoring is right.

6. Extract the top 1000 names for each sex and year combination.

7. Analyze naming trends from the above set and plot a few names over the years.

Solutions:

1.

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

2.

# pivot_table now takes index= and columns= (formerly rows= and cols=)
total_births = names.pivot_table('births', index='year',
                                 columns='sex', aggfunc='sum')

3.

total_births.plot(title='Total births by sex and year')

4.
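def add_prop(group):
    # Cast to float so the division is not floored
    # (integer division floors in Python 2)
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group

# group_keys=False keeps the original row index instead of adding
# year and sex as index levels
names = names.groupby(['year', 'sex'], group_keys=False).apply(add_prop)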
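
As an alternative sketch (not the method used in the solution above), the same proportion can be computed without a helper function by using groupby(...).transform('sum'), which returns the group total aligned to every row. The rows below are invented for illustration, and the final check mirrors the sanity test asked for in question 5.

```python
import numpy as np
import pandas as pd

# Invented toy rows standing in for the names frame.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [60, 40, 100, 30, 70],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# transform("sum") repeats each (year, sex) total on every row of its group.
group_totals = names.groupby(["year", "sex"])["births"].transform("sum")
names["prop"] = names["births"] / group_totals

# Sanity check: proportions within each (year, sex) group add up to 1.
props_ok = np.allclose(names.groupby(["year", "sex"])["prop"].sum(), 1)
print(names)
print(props_ok)
```
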

5.

import numpy as np

# The proportions within each year/sex group should add up to 1
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)

6.
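def get_top1000(group):
    # sort_index(by=...) has been removed from pandas; use sort_values
    return group.sort_values(by='births', ascending=False)[:1000]

# group_keys=False avoids adding year and sex as index levels
grouped = names.groupby(['year', 'sex'], group_keys=False)

top1000 = grouped.apply(get_top1000)

Or, building the result piece by piece:

pieces = []
for (year, sex), group in names.groupby(['year', 'sex']):
    pieces.append(group.sort_values(by='births', ascending=False)[:1000])

top1000 = pd.concat(pieces, ignore_index=True)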
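
A third way to take the top rows of each group, offered here only as a sketch, is to sort the whole frame once and then call groupby(...).head(n). The data is invented and top_n is a hypothetical small value used in place of 1000.

```python
import pandas as pd

# Invented toy rows standing in for the names frame.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "William", "James"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [60, 40, 20, 100, 90, 50],
    "year": [1880] * 6,
})

top_n = 2  # hypothetical; the exercise uses 1000

# Sort once by births, then keep the first top_n rows of each (year, sex) group.
top = (names.sort_values(by="births", ascending=False)
            .groupby(["year", "sex"])
            .head(top_n)
            .reset_index(drop=True))
print(top)
```
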

7.

boys = top1000[top1000.sex == 'M']

girls = top1000[top1000.sex == 'F']

# pivot_table now takes index= and columns= (formerly rows= and cols=)
total_births = top1000.pivot_table('births', index='year',
                                   columns='name', aggfunc='sum')

subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]

subset.plot(subplots=True, figsize=(12, 10), grid=False,
            title="Number of births per year")
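
One of the tasks listed for this data set is determining the relative rank of a name. A minimal sketch of that idea, on invented rows, ranks births within each year and sex group and then follows one name over time. The column name rank_in_year is hypothetical.

```python
import pandas as pd

# Invented toy rows standing in for the names frame.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "Mary", "Anna", "Emma"],
    "sex": ["F"] * 6,
    "births": [60, 40, 20, 30, 50, 10],
    "year": [1880, 1880, 1880, 1881, 1881, 1881],
})

# Rank 1 is the most popular name within its (year, sex) group.
names["rank_in_year"] = (names.groupby(["year", "sex"])["births"]
                              .rank(method="min", ascending=False)
                              .astype(int))

# Follow a single name across the years.
mary_rank = names[names["name"] == "Mary"].set_index("year")["rank_in_year"]
print(mary_rank)
```
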

Minimise residual sum of squares


I start with an x-y data set, which I believe has a linear
relationship and therefore I'd like to fit y against x by
minimising the residual sum of squares.

dat=data.frame(x=c(1,2,3,4,5,6),
y=c(1,3,5,6,8,12))

Create a function that calculates the residual sum of squares of
my data against a linear model with two parameters.

Think of y = par[1] + par[2] * x.



min.RSS <- function(data, par) {
with(data, sum((par[1] + par[2] * x - y)^2))
}
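
For readers following along in Python, here is a rough equivalent of the function above, using the same six data points. To find the parameters that minimise it, this sketch swaps in NumPy's least-squares solver rather than a general-purpose optimiser; that choice, and the names min_rss and par_hat, are assumptions made for the example. Note that Python indexes from 0, so par[0] is the intercept and par[1] the slope.

```python
import numpy as np

# Same data as the data frame above.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 3, 5, 6, 8, 12], dtype=float)

def min_rss(par, x, y):
    """Residual sum of squares for the line y = par[0] + par[1] * x."""
    return np.sum((par[0] + par[1] * x - y) ** 2)

# Design matrix with a column of ones for the intercept.
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the (intercept, slope) pair that minimises the RSS.
par_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(par_hat, min_rss(par_hat, x, y))
```

Nudging either parameter away from par_hat should only increase the value returned by min_rss, which is a quick way to convince yourself the fit is at the minimum.
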
