Data Set 1: Users giving ratings to movies, IMDB Data (Scraped Simpler Version)
Questions:
Solutions:
1.
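import pandas as pd
# Column names for users.dat, which has no header row.
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A multi-character separator such as '::' is only handled by the Python parsing engine.
users = pd.read_table('/Users/yesbabu/Desktop/users.dat',
                      sep='::', header=None, names=unames, engine='python')
# ratings and movies are assumed to have been loaded in the same way.
# With no `on` argument, merge joins on the column names the frames share.
movie_data = pd.merge(pd.merge(ratings, users), movies)
movie_data.describe()
movie_data.info()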
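To make the nested merge concrete, here is a minimal sketch on tiny made-up tables. The column names movie_id, rating and title, and all of the values, are assumptions for illustration only; the real files may differ.

```python
import pandas as pd

# Tiny made-up stand-ins for the users, ratings and movies tables.
users = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["F", "M"],
    "age": [25, 35],
    "occupation": [10, 16],
    "zip": ["00001", "00002"],
})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],   # hypothetical key column
    "rating": [5, 3, 4],
})
movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# The inner merge joins ratings to users on user_id (their only shared column);
# the outer merge then joins that result to movies on movie_id.
movie_data = pd.merge(pd.merge(ratings, users), movies)
print(movie_data[["user_id", "title", "gender", "rating"]])
```

Each rating row ends up carrying the user's attributes and the movie title, which is what the later groupby steps rely on.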
2.
3.
ratings_by_title = movie_data.groupby('title').size()
4.
active_titles = ratings_by_title.index[ratings_by_title >= 250]
# Keep only the titles with at least 250 ratings (label-based selection).
mean_ratings_250 = mean_ratings.loc[active_titles]
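The solution for step 2 is missing here, so mean_ratings is never defined. One plausible way to build it, shown as a sketch only, is a pivot table of mean rating per title and gender; the data and the lowered threshold (MIN_RATINGS) are made up so the example stays small.

```python
import pandas as pd

# Made-up ratings; the real movie_data comes from the merge in step 1.
movie_data = pd.DataFrame({
    "title": ["Movie A"] * 3 + ["Movie B"] * 2 + ["Movie C"],
    "gender": ["F", "M", "M", "F", "M", "F"],
    "rating": [5, 4, 3, 2, 4, 1],
})

# Assumed definition: mean rating per title, one column per gender.
mean_ratings = movie_data.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)

# Number of ratings each title received.
ratings_by_title = movie_data.groupby("title").size()

# The handout uses 250; a threshold of 2 fits this toy data.
MIN_RATINGS = 2
active_titles = ratings_by_title.index[ratings_by_title >= MIN_RATINGS]
mean_ratings_active = mean_ratings.loc[active_titles]
print(mean_ratings_active)
```

Filtering by rating count first keeps titles with only a handful of ratings from dominating the rankings.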
5.
top_female_ratings = mean_ratings.sort_values(by='F',
                                              ascending=False)
top_male_ratings = mean_ratings.sort_values(by='M',
                                            ascending=False)
6.
# Assumes a 'diff' column already exists, e.g. mean_ratings['M'] - mean_ratings['F'].
sorted_by_diff = mean_ratings.sort_values(by='diff')
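The 'diff' column is not defined anywhere in these solutions. A minimal sketch, assuming it means the male mean minus the female mean (the sign convention is an assumption) and using made-up values:

```python
import pandas as pd

# Made-up mean ratings per title and gender.
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 4.0], "M": [3.5, 4.2, 4.0]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="title"),
)

# Assumed convention: positive means men rated the title higher.
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

sorted_by_diff = mean_ratings.sort_values(by="diff")
preferred_by_women = sorted_by_diff.head(1)       # most negative diff
preferred_by_men = sorted_by_diff[::-1].head(1)   # most positive diff
print(sorted_by_diff)
```

Reversing the sorted frame gives the other end of the ranking without sorting a second time.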
7.
# Standard deviation of rating per title, restricted to the active titles.
rating_std_by_title = movie_data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.sort_values(ascending=False)[:10]
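As a small illustration of why standard deviation measures disagreement among viewers, here is a sketch with two made-up titles, one polarizing and one rated consistently:

```python
import pandas as pd

# Made-up ratings: Movie A splits viewers, Movie B does not.
movie_data = pd.DataFrame({
    "title": ["Movie A"] * 4 + ["Movie B"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation (ddof=1, the pandas default) per title.
rating_std_by_title = movie_data.groupby("title")["rating"].std()
most_divisive = rating_std_by_title.sort_values(ascending=False)
print(most_divisive)
```

The title whose ratings are split between 1 and 5 comes first.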
Data Set 2: US baby names from 1880 to 2010. A very rich dataset; for example, one can:
Visualize the proportion of babies given a particular name (your own, or another name)
over time.
Determine the most popular names in each year or the names with the largest increases or
decreases.
DataSource - https://github.com/wesm/pydata-book/tree/master/ch02/names
Questions:
1. Since the data set is split into files by year, assemble all of the data into a single
DataFrame (blob) and add a year field to it.
4. Insert a column (a score value) that gives the fraction of babies given each name
relative to the total number of births. A column value of 0.02 would indicate that 2
out of every 100 babies were given a particular name.
7. Analyze naming trends from the above set and plot a few for each year.
Solutions:
1.
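The solution for this step is missing. One possible approach, as a sketch only: read each yearly file, tag it with its year, and concatenate. The file pattern yobYYYY.txt, the column names, and the helper load_names are assumptions; the demo writes two tiny made-up files to a temporary directory so it runs without the real data.

```python
import os
import tempfile

import pandas as pd


def load_names(directory, years):
    """Read one file per year and stack them into a single DataFrame."""
    columns = ["name", "sex", "births"]  # assumed column layout
    pieces = []
    for year in years:
        path = os.path.join(directory, "yob%d.txt" % year)  # assumed file name
        frame = pd.read_csv(path, names=columns)
        frame["year"] = year  # the added year field
        pieces.append(frame)
    # ignore_index=True discards the per-file row numbers.
    return pd.concat(pieces, ignore_index=True)


# Demo on two tiny made-up files.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "yob1880.txt"), "w") as f:
        f.write("Mary,F,70\nJohn,M,96\n")
    with open(os.path.join(demo_dir, "yob1881.txt"), "w") as f:
        f.write("Mary,F,69\nJohn,M,87\n")
    names = load_names(demo_dir, range(1880, 1882))

print(names)
```

With the real data, the call would presumably look like load_names(path_to_names_folder, range(1880, 2011)).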
2.
4.
def add_prop(group):
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
5.
import numpy as np
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
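Steps 4 and 5 can also be sketched without a helper function by using groupby(...).transform, which returns the group total aligned to every row. This is an alternative formulation rather than the handout's method, and the data below is made up.

```python
import numpy as np
import pandas as pd

# Made-up births data.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [60, 40, 100, 30, 90],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# Total births for each row's (year, sex) group, aligned to the original rows.
group_totals = names.groupby(["year", "sex"])["births"].transform("sum")
names["prop"] = names["births"] / group_totals

# Sanity check: the proportions within each group should sum to 1.
props_sum_to_one = np.allclose(names.groupby(["year", "sex"])["prop"].sum(), 1)
print(names)
print(props_sum_to_one)
```

Here Mary accounts for 0.6 of the girls born in 1880 in the toy data, matching the definition in question 4.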
6.
def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
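or
pieces = []
for year, group in names.groupby(['year', 'sex']):
    pieces.append(group.sort_values(by='births',
                                    ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)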
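A third way to take the top N rows per group, sketched here as an alternative to the two versions above, is to sort once and then call groupby(...).head(N). The data is made up and TOP_N is lowered to 2 so the effect is visible.

```python
import pandas as pd

# Made-up births data with three girls' names and two boys' names in one year.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "Will"],
    "sex": ["F", "F", "F", "M", "M"],
    "births": [60, 40, 20, 100, 50],
    "year": [1880] * 5,
})

TOP_N = 2  # the handout uses 1000

top = (
    names.sort_values(by="births", ascending=False)
    .groupby(["year", "sex"])
    .head(TOP_N)  # first TOP_N rows of each group, in sorted order
    .sort_values(by=["year", "sex", "births"], ascending=[True, True, False])
    .reset_index(drop=True)
)
print(top)
```

Because the frame is sorted by births before grouping, head keeps the most common names in each (year, sex) group.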
7.
dat=data.frame(x=c(1,2,3,4,5,6),
y=c(1,3,5,6,8,12))
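For question 7, one common way to look at naming trends in pandas is to pivot births into a year-by-name table and then plot selected columns. The sketch below uses made-up numbers and stops short of plotting so that it runs without matplotlib.

```python
import pandas as pd

# Made-up births data across three years.
names = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John", "Mary", "John"],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "births": [60, 100, 55, 90, 40, 95],
    "year": [1880, 1880, 1881, 1881, 1882, 1882],
})

# One row per year, one column per name, values are total births.
total_births = names.pivot_table(
    values="births", index="year", columns="name", aggfunc="sum"
)

subset = total_births[["Mary", "John"]]
print(subset)

# With matplotlib installed, subset.plot(subplots=True) would draw one
# panel per name, with year on the x-axis.
```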