Data Analysis and Interpretation

Data Analysis and Data Interpretation
A Python Question Bank
Data Set 1: Users giving ratings to movies, MovieLens data (scraped, simpler version)
Given three datasets:

Users contains user_id, gender, age, occupation and zipcode
Ratings contains user_id, movie_id, rating, timestamp
Movies contains movie_id, title, genre
Data Source: https://github.com/wesm/pydata-book/tree/master/ch02/movielens
Questions:
1. Combine the three datasets into one blob.

2. Get mean ratings by gender for each movie title.
3. Get the number of ratings given for each movie title.
4. Get the mean ratings for titles with at least 250 ratings.
5. Get the top movie titles rated highly by females and males.
6. Get the movie titles where there is a high rating disagreement between the two genders.
7. Get the movies that generated the most disagreement in ratings, independent of gender.
Solutions:
1.
import pandas as pd

# '::' is a multi-character separator, so the Python parsing engine is required
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('/Users/yesbabu/Desktop/users.dat', sep='::',
                      header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('/Users/yesbabu/Desktop/ratings.dat', sep='::',
                        header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('/Users/yesbabu/Desktop/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')

# merge joins on the shared column names: user_id first, then movie_id
movie_data = pd.merge(pd.merge(ratings, users), movies)
movie_data.describe()
movie_data.info()
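To see how the chained merge picks its join keys, here is a minimal sketch on a few made-up rows. The toy frames and the titles "Movie A" and "Movie B" are assumptions for illustration, not rows from the real files.

```python
import pandas as pd

# Hypothetical toy rows standing in for users.dat, ratings.dat and movies.dat
users = pd.DataFrame({"user_id": [1, 2],
                      "gender": ["F", "M"]})
ratings = pd.DataFrame({"user_id": [1, 1, 2],
                        "movie_id": [10, 20, 10],
                        "rating": [5, 3, 4]})
movies = pd.DataFrame({"movie_id": [10, 20],
                       "title": ["Movie A", "Movie B"]})

# With no `on=` argument, merge joins on the column names both frames share:
# user_id for ratings/users, then movie_id for that result and movies.
movie_data = pd.merge(pd.merge(ratings, users), movies)

print(movie_data)
```

Each rating row ends up with the rater's gender and the movie's title attached, which is the single "blob" the later questions work on.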
2.
mean_ratings = movie_data.pivot_table('rating', index='title',
                                      columns='gender', aggfunc='mean')
3.
ratings_by_title = movie_data.groupby('title').size()
4.
active_titles = ratings_by_title.index[ratings_by_title >= 250]
# .ix has been removed from pandas; .loc does the label-based selection
mean_ratings_250 = mean_ratings.loc[active_titles]
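Solutions 2 to 4 can be traced by hand on a tiny example. The sketch below uses made-up ratings and a cut-off of 2 instead of 250; the names `min_ratings` and `mean_ratings_active` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up merged data; the real movie_data has one row per rating
movie_data = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "C"],
    "gender": ["F", "M", "M", "F", "M", "F"],
    "rating": [5, 3, 4, 2, 4, 1],
})

# Mean rating per title, one column per gender
mean_ratings = movie_data.pivot_table("rating", index="title",
                                      columns="gender", aggfunc="mean")

# Number of ratings per title
ratings_by_title = movie_data.groupby("title").size()

# Keep only titles with enough ratings (hypothetical cut-off for the toy data)
min_ratings = 2
active_titles = ratings_by_title.index[ratings_by_title >= min_ratings]
mean_ratings_active = mean_ratings.loc[active_titles]

print(mean_ratings_active)
```

Title "C" has a single rating, so it drops out, and its missing male rating (NaN in the pivot table) never reaches the filtered result.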
5.
# sort_index(by=...) no longer exists; sort_values sorts by a column
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
6.
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
# Ascending order: titles preferred by women come first, by men last
sorted_by_diff = mean_ratings.sort_values(by='diff')
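A short sketch of how to read the sorted differences, using hypothetical mean ratings. The variable names `preferred_by_women`, `preferred_by_men` and `by_abs_gap` are assumptions introduced here.

```python
import pandas as pd

# Hypothetical mean ratings per title and gender
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 4.0], "M": [3.0, 4.2, 4.1]},
    index=pd.Index(["A", "B", "C"], name="title"),
)

mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
sorted_by_diff = mean_ratings.sort_values(by="diff")

preferred_by_women = sorted_by_diff.index[0]   # most negative diff
preferred_by_men = sorted_by_diff.index[-1]    # most positive diff

# Largest gap in either direction first
by_abs_gap = mean_ratings["diff"].abs().sort_values(ascending=False)
```

The two ends of `sorted_by_diff` answer "who liked it more", while the absolute value ranks titles by the size of the gap regardless of direction.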
7.
# Standard deviation of rating grouped by title
rating_std_by_title = movie_data.groupby('title')['rating'].std()
# Filter down to active_titles (.loc replaces the removed .ix)
rating_std_by_title = rating_std_by_title.loc[active_titles]
# Order Series by value in descending order (sort_values replaces the removed order)
rating_std_by_title.sort_values(ascending=False)[:10]
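Why standard deviation measures disagreement is easiest to see on two made-up titles, one with scattered ratings and one with identical ratings. The data and the name `most_divisive` are illustrative assumptions.

```python
import pandas as pd

# Title "A" splits opinion, title "B" gets the same rating every time
movie_data = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "B"],
    "rating": [1, 5, 3, 4, 4, 4],
})

# Sample standard deviation (ddof=1) of the ratings within each title
rating_std_by_title = movie_data.groupby("title")["rating"].std()

# Highest spread first
most_divisive = rating_std_by_title.sort_values(ascending=False)
print(most_divisive)
```

Ratings of 1, 5 and 3 have a sample standard deviation of 2.0, while three identical ratings give 0, so "A" ranks as the more divisive title.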
Data Set 2: US baby names from 1880 to 2010. A very rich dataset that can be used to:
Visualize the proportion of babies given a particular name (your own, or another name) over time.
Determine the relative rank of a name.
Determine the most popular names in each year, or the names with the largest increases or decreases.
Analyze trends in names: vowels, consonants, length, overall diversity, changes in spelling, first and last letters.
DataSource - https://github.com/wesm/pydata-book/tree/master/ch02/names
Questions:
1. Since the data set is split into files by year, assemble all of the data into a single DataFrame (blob) and add a year field to it.
2. Aggregate the data at the year and sex level.
3. Plot the number of births by year and gender.
4. Insert a column (a score value) which gives the fraction of babies given each name relative to the total number of births. A column value of 0.02 would indicate that 2 out of every 100 babies were given a particular name.
5. Perform a sanity check to show that your scoring is right.
6. Extract the top 1000 names for each sex and year combination.
7. Analyze naming trends from the above set and plot a few names by year.
Solutions:
1.
import pandas as pd

years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    # The files carry no year column, so add it before stacking
    frame['year'] = year
    pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
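The same read, tag and concatenate pattern can be tried without the data files. In this sketch the dictionary `fake_files` and its counts are invented stand-ins for the yearly `names/yobYYYY.txt` files, read from memory through `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two yearly files (invented counts)
fake_files = {
    1880: "Mary,F,100\nJohn,M,120\n",
    1881: "Mary,F,90\nJohn,M,110\n",
}

columns = ["name", "sex", "births"]
pieces = []
for year, text in fake_files.items():
    frame = pd.read_csv(io.StringIO(text), names=columns)
    frame["year"] = year          # tag every row with its source year
    pieces.append(frame)

# ignore_index=True discards each file's own 0..n row labels
names = pd.concat(pieces, ignore_index=True)
print(names)
```

Without `ignore_index=True` the row labels would restart at 0 for every year, which makes later label-based selection ambiguous.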
2.
# rows=/cols= were renamed to index=/columns= in later pandas versions
total_births = names.pivot_table('births', index='year',
                                 columns='sex', aggfunc='sum')
3.
total_births.plot(title='Total births by sex and year')
4.
def add_prop(births):
    # births holds the 'births' values of one (year, sex) group.
    # Cast to float first: under Python 2, integer division floors.
    births = births.astype(float)
    return births / births.sum()
names['prop'] = names.groupby(['year', 'sex'])['births'].transform(add_prop)
5.
import numpy as np
# Within every (year, sex) group the proportions should add up to 1
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
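The proportion column and its sanity check can be verified by hand on invented counts. This sketch computes the group totals with `transform("sum")`, which is one possible way to do it; the data and the names `group_totals`, `prop_sums` and `check` are assumptions.

```python
import numpy as np
import pandas as pd

# Invented counts: two years, both sexes
names = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "Mary", "John", "Paul"],
    "sex":    ["F", "F", "M", "F", "M", "M"],
    "births": [60, 40, 100, 30, 75, 25],
    "year":   [1880, 1880, 1880, 1881, 1881, 1881],
})

# Total births of each (year, sex) group, repeated on every row of that group
group_totals = names.groupby(["year", "sex"])["births"].transform("sum")
names["prop"] = names["births"] / group_totals

# Sanity check: the shares inside every group must sum to 1
prop_sums = names.groupby(["year", "sex"])["prop"].sum()
check = np.allclose(prop_sums, 1)
print(names)
print(check)
```

In 1880, Mary accounts for 60 of the 100 female births, so her `prop` is 0.6. `np.allclose` is used instead of `==` because floating-point sums may differ from 1 by a tiny rounding error.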
6.
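def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]
# group_keys=False keeps the original row labels instead of adding year/sex index levels.
# Note: newer pandas releases may exclude the grouping columns inside apply;
# the loop variant below does not depend on that behavior.
grouped = names.groupby(['year', 'sex'], group_keys=False)
top1000 = grouped.apply(get_top1000)
or
pieces = []
for (year, sex), group in names.groupby(['year', 'sex']):
    pieces.append(group.sort_values(by='births', ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)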
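A third way to take the top N rows per group is to sort once and then call `head` on the grouped frame. The sketch below is an alternative under stated assumptions, not the method used above: the counts are invented and `top_n` is 2 rather than 1000.

```python
import pandas as pd

# Invented counts for a single year
names = pd.DataFrame({
    "name":   ["Mary", "Anna", "Emma", "John", "Paul", "Carl"],
    "sex":    ["F", "F", "F", "M", "M", "M"],
    "births": [60, 40, 10, 100, 20, 50],
    "year":   [1880] * 6,
})

top_n = 2  # hypothetical cut-off for the toy data
top = (names.sort_values(by="births", ascending=False)
            .groupby(["year", "sex"])
            .head(top_n)               # first top_n rows of every group
            .reset_index(drop=True))
print(top)
```

Because the frame is sorted by births before grouping, the first `top_n` rows of each group are its most popular names, and all original columns (including `year` and `sex`) are kept.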
7.
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
# One column per name, one row per year (index=/columns= replace rows=/cols=)
total_births = top1000.pivot_table('births', index='year',
                                   columns='name', aggfunc='sum')
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False,
            title="Number of births per year")
Minimise residual sum of squares

I start with an x-y data set that I believe has a linear relationship, so I'd like to fit y against x by minimising the residual sum of squares. The snippets in this part are written in R.
dat <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(1, 3, 5, 6, 8, 12))
Create a function that calculates the residual sum of squares of my data against a linear model with two parameters. Think of y = par[1] + par[2] * x.

min.RSS <- function(data, par) {
  with(data, sum((par[1] + par[2] * x - y)^2))
}
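For readers following the Python parts of this question bank, the same idea can be sketched with NumPy. This is an illustrative translation rather than the original author's code: `min_rss` mirrors `min.RSS` with 0-based indexing, and the minimiser is found with `np.linalg.lstsq` on a design matrix instead of a numerical optimiser.

```python
import numpy as np

# Same x-y data as the R data frame above
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 3, 5, 6, 8, 12], dtype=float)

def min_rss(par, x, y):
    """Residual sum of squares for the line y = par[0] + par[1] * x."""
    return np.sum((par[0] + par[1] * x - y) ** 2)

# Least-squares fit: columns of the design matrix are [1, x]
design = np.column_stack([np.ones_like(x), x])
par_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print(par_hat, min_rss(par_hat, x, y))
```

For this data the fit comes out at an intercept of roughly -1.27 and a slope of roughly 2.03. Any other parameter pair gives a larger value of `min_rss`.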
