This work analyses some aspects of the movies dataset. The two plots below show
the distribution of movies by category and by the rating
each movie received, grouped as “stars”. A new variable, “stars”, was created:
it takes the value 1 if the rating is between 0 and 2, 2 if the rating is between 2 and 4,
and so forth up to a rating of ten (i.e., stars = rating/2, rounded up), giving values from 1 to 5.
# --------------------------------------------------------
# Plot 1, 2 and 3 - Barplots with categories.
library(ggplot2)
p <- ggplot(mov2)
# Color bar plot
print(
p + geom_bar(aes(x=Category, fill=stars))
)
# --------------------------------------------------------
The graphs are created by the code shown above, and the creation of the
accessory variables is shown below.
The following code shows the creation of a variable for the category of each movie (a single
variable naming the category to which the movie belongs). A second variable (sumCateg)
counts the number of categories in which a single movie was classified, because
it was observed that some movies belong to many different categories while others
are not classified at all. Finally, a third variable holds the number of stars attributed
to each movie, based on “ratings”, from 1 to 5.
# -----------------------------------------------------------
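# Creation of the factor variable Category, which summarises the binary
# information held in 7 other variables of the original data frame.
# Assignments run in order, so a movie flagged in several categories
# keeps the last one assigned.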
Category <- vector()
Category[Action==1] <- "Action"
Category[Animation==1] <- "Animation"
Category[Comedy==1] <- "Comedy"
Category[Drama==1] <- "Drama"
Category[Documentary==1] <- "Documentary"
Category[Romance==1] <- "Romance"
Category[Short==1] <- "Short"
Category<-as.factor(Category)
# --------------------------------------------------------
# The following plot introduces two changes:
# 1 - only the subset of movies that received a single category
#     classification was considered
# 2 - the y scale is proportional, from 0 to 1, allowing for a
#     comparison between categories of movies.
p <- ggplot(subset(mov2,sumCateg == 1))
print(
p + geom_bar(aes(x=Category, fill=stars),
position="fill")
)
# --------------------------------------------------------
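The listing above builds only Category; the code for sumCateg and stars is not included in the text. A minimal sketch of how they could be derived, assuming the seven genre columns are 0/1 indicators. The data frame toy and its rows are invented stand-ins for mov2:

```r
# Toy stand-in for mov2; the rows are invented for illustration only.
toy <- data.frame(
  rating      = c(1.5, 2.0, 6.3, 8.1, 10.0),
  Action      = c(1, 0, 0, 0, 0),
  Animation   = c(0, 0, 0, 0, 0),
  Comedy      = c(0, 1, 0, 0, 0),
  Drama       = c(0, 1, 0, 1, 0),
  Documentary = c(0, 0, 0, 0, 0),
  Romance     = c(0, 1, 0, 0, 0),
  Short       = c(0, 0, 0, 0, 1)
)

genres <- c("Action", "Animation", "Comedy", "Drama",
            "Documentary", "Romance", "Short")

# Number of categories each movie belongs to (0 means unclassified).
toy$sumCateg <- rowSums(toy[, genres])

# Stars: rating divided by 2 and rounded up, stored as a factor
# so that it can be mapped to a discrete fill in geom_bar().
toy$stars <- factor(ceiling(toy$rating / 2), levels = 1:5)

toy[, c("rating", "sumCateg", "stars")]
```

Filtering on sumCateg == 1, as the proportional plot does, would keep only the movies with exactly one category and drop both the unclassified and the multiply classified ones.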
I also wanted to investigate how many movies provide information on budget. In this
analysis, two factors were taken into account:
- Inflation makes it unrealistic to directly compare the budgets of movies from
  different years.
- A small number of movies in the dataset can make the analysis more difficult.
The first analysis didn’t involve graphics. I just verified the number of movies with
budget information with the following code:
# --------------------------------------------------------
# Evaluation of how many movies provide budget information
numBudge <- sum(!is.na(budget))
numBudgeRel <- round(100 * numBudge / length(budget), 2)
paste0("Only ", numBudge, " movies, out of ", length(budget), " (",
       numBudgeRel, "%), provide Budget information.")
# --------------------------------------------------------
Still, this is a good number for analysis. The next problem is to understand how
these movies are distributed over the years, and a histogram was made for this purpose.
The histogram below shows that recent years account for the largest number of movies.
It makes sense that the number of movies in the dataset grows over the years,
particularly around the year 2000. To give a better idea of the numbers available,
another plot was made considering only movies from 1990 on (on the next page).
# -------------------------------------------------------
# Histogram of movies with budget information available
qplot(subset(mov2,is.na(budget)==FALSE,select=year)[,],
geom="histogram",
binwidth=5,
main="Movies with information on budget",
xlab="Year of the movie",
fill=I("steelblue3"),
col=I("red"))
# -------------------------------------------------------
Since more than 450 movies from 2004 provide information on budget, the following
analysis will focus on this subset, thereby largely eliminating the effect of inflation.
# -------------------------------------------------------
# Only movies after 1990:
qplot(subset(mov2,is.na(budget)==FALSE & year>=1990,select=year)[,],
geom="histogram",
binwidth=1,
main="Movies with information on budget from 1990 on",
xlab="Year of the movie",
fill=I("steelblue3"),
col=I("red"))
# -------------------------------------------------------
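The yearly counts behind these histograms can also be read off directly with table(). A small sketch on invented data; the year and budget vectors below are hypothetical stand-ins for the columns of mov2:

```r
# Invented example values; NA marks a movie without budget information.
year   <- c(1995, 2003, 2004, 2004, 2004, 1988)
budget <- c(2e6,  NA,   5e7,  NA,   1e6,  3e5)

# Keep movies from 1990 on that report a budget, then count them per year.
keep     <- !is.na(budget) & year >= 1990
per_year <- table(year[keep])
per_year
```

On the real data, the entry for 2004 in such a table is the figure that justifies focusing on that year.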
Below we see the subset of movies from 2004, plotting budget (x-axis) against the number
of votes each movie received (y-axis) as an indirect measure of the movie's popularity
(assuming viewers of each category have a similar probability of voting on a
movie; otherwise this measure is biased). The size of each point also shows
the rating received by the movie. It is curious that only Drama and Romance
reach the top-left corner (many votes for a low budget). Action movies show a
correlation between budget and votes, suggesting that an action movie can reach a large
audience only with a large budget, whereas a Drama or Romance with a
good idea can reach a large audience even with a considerably smaller budget (7
times smaller for a similar number of votes, in this case).
# -------------------------------------------------------
# Relation between budget and number of votes per movie
print(
ggplot(subset(mov2,is.na(budget)==FALSE & year==2004),
aes(x=budget/1e6, y=votes)) +
geom_point(aes(colour = Category, size = rating)) +
labs(x = "Movie Budget (Millions of Dollars $)")
)
# --------------------------------------------------------
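The visual impression that budget and votes move together for action movies, but not for romances, could be checked with a per-category correlation. A minimal sketch on made-up numbers; the data frame toy04 and all its values are assumptions, not figures from the dataset:

```r
# Invented budgets and vote counts for two categories.
toy04 <- data.frame(
  Category = c("Action", "Action", "Action", "Action",
               "Romance", "Romance", "Romance", "Romance"),
  budget   = c(20e6, 80e6, 150e6, 200e6, 5e6, 20e6, 40e6, 60e6),
  votes    = c(2000, 9000, 15000, 25000, 12000, 30000, 4000, 9000)
)

# Pearson correlation between budget and votes within each category.
cor_by_cat <- sapply(split(toy04, toy04$Category),
                     function(d) cor(d$budget, d$votes))
round(cor_by_cat, 2)
```

A value close to 1 for one category and a value near zero or below for another would match the pattern described in the text. With only one year of data the estimates are rough.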
For a more detailed evaluation, here we consider only romances and action movies,
identifying by name those with more than 18 thousand votes. The most successful
action movie was Spider-Man 2, with a budget of 200 million dollars. But there is
a Romance, “Eternal Sunshine of the Spotless Mind”, which reached a higher
number of votes with a budget of less than 25 million. It suggests, as noted above,
that a Romance can reach a large audience with a much smaller budget than an action movie.
# --------------------------------------------------------
# Relation between budget and number of votes per movie
print(
ggplot(subset(mov2,is.na(budget)==FALSE & year==2004 &
(Category == "Action" | Category =="Romance")
), aes(x=budget/1e6, y=votes)) +
geom_point(aes(colour = Category, size = rating)) +
labs(x = "Movie Budget (Millions of Dollars $)") +
geom_text(aes(label=ifelse(votes>18000,title,""),hjust=0,
vjust=0)) +
expand_limits(x = 250)
)
# --------------------------------------------------------
Here we excluded movies classified in more than one category, and showed names for
movies with more than 18 thousand votes or budget of more than 124 million dollars.
# --------------------------------------------------------
# Relation between budget and number of votes per movie
# Excluding multiple category classification (sumCateg ==1)
print(
ggplot(subset(mov2,is.na(budget)==FALSE &
year==2004 & sumCateg==1),
aes(x=budget/1e6, y=votes)) +
geom_point(aes(colour = Category, size = rating)) +
labs(x = "Movie Budget (Millions of Dollars $)") +
geom_text(aes(
label=ifelse(votes>18000 |
budget>124e6,title,""),
hjust=0, vjust=0)) +
expand_limits(x = 250)
)
# --------------------------------------------------------
Here we see the distribution of ratings in box plots, separated by category, for movies
from 2004 that were classified in only one category. We also see that short
movies and documentaries had the highest ratings.
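The statement about which categories rate highest can be backed by comparing the median rating per category, which is what the centre line of each box shows. A minimal sketch on invented ratings; the values are toy data, not taken from the dataset:

```r
# Invented ratings for four categories.
ratings <- data.frame(
  Category = c("Action", "Action", "Short", "Short",
               "Documentary", "Documentary", "Drama", "Drama"),
  rating   = c(5.1, 6.0, 7.4, 8.0, 7.9, 7.1, 6.2, 6.8)
)

# Median rating per category, sorted from highest to lowest.
med <- tapply(ratings$rating, ratings$Category, median)
sort(med, decreasing = TRUE)
```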
# --------------------------------------------------------
# Boxplot of ratings for each category of movie in the
# year 2004
# --------------------------------------------------------
# New categorical variable based on budget:
BudgetGroup <- vector()
BudgetGroup[budget<=50e6] <- "1.< 50 mi"
BudgetGroup[budget>50e6 & budget<=100e6] <- "2.50 to 100 mi"
BudgetGroup[budget>100e6 & budget<=150e6] <- "3.100 to 150 mi"
BudgetGroup[budget>150e6 & budget<=200e6] <- "4.150 to 200 mi"
mov2 <-cbind(mov2,BudgetGroup)
# --------------------------------------------------------
# BoxPlot for ratings distributed by Category And by Budget Group
ggplot(subset(mov2, is.na(BudgetGroup)==FALSE & is.na(Category)==FALSE),
       aes(x = Category, y = rating, fill = Category)) +
  geom_boxplot(alpha = 0.6, outlier.colour = "grey40", outlier.size = 3.5) +
  facet_wrap(~BudgetGroup) + theme_bw() +
  labs(title = "Distribution of ratings by Category and Investment \n",
       x = "\n Movie Category", y = "Movie Ratings \n") +
  guides(fill = guide_legend("\n Category")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
# --------------------------------------------------------
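The budget groups built above by repeated logical assignment could also be produced in one step with cut(). A sketch assuming the same break points and labels; the budget values below are invented:

```r
# Invented budgets, including one above 200 million and one missing value.
budget <- c(1e6, 50e6, 75e6, 120e6, 200e6, 250e6, NA)

# right = TRUE (the default) gives intervals of the form (lower, upper],
# matching the "> lower & <= upper" conditions of the original code.
BudgetGroup <- cut(budget,
                   breaks = c(-Inf, 50e6, 100e6, 150e6, 200e6),
                   labels = c("1.< 50 mi", "2.50 to 100 mi",
                              "3.100 to 150 mi", "4.150 to 200 mi"))
as.character(BudgetGroup)
```

As in the original code, budgets above 200 million and missing budgets fall outside every group and come out as NA, so they are dropped by the is.na(BudgetGroup) filter in the plot.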
# --------------------------------------------------------
# Analysis of the duration of the movies by category
p <- ggplot(subset(mov2,is.na(BudgetGroup)==FALSE &
is.na(Category)==FALSE & sumCateg==1),
aes(length, colour=Category, fill=Category))
p <- p + geom_density(alpha=0.55) + xlim(0,180)
p <- p + ggtitle("Distribution of length of movies by Category")
p
# --------------------------------------------------------
In this last plot we study the distribution of movie categories over the years. Once
again, only movies with a single classification were considered in the subset. Several
points can be observed here:
- Prior to 1920 almost all movies were classified simply as “Short” movies, but it
  must also be said that the number of movies from this period is rather small, as
  can be seen in the histogram shown on the next page.
- During the first expansion of the movie industry, the dominant category,
  according to this dataset, was Romance, followed by Comedy and Drama.
- After 1930 we see an initial expansion in the share of Action movies, but it is
  interesting that it peaks at the end of World War II and falls by almost
  50% during the following years, resuming growth around 1955, 10 years
  later. During these 10 years right after WWII, the most remarkable expansion
  was that of Drama. It is important to note that this plot shows RELATIVE
  participation, so it is impossible to know, from the plot alone, whether the number of
  action movies declined after WWII, whether the number of dramas jumped, or whether
  both effects occurred together.
- From 1955 on, the participation of Drama and Romance steadily declines whereas
  the share of Action and Animation grows. This is the general pattern until
  around 1985; after that year, growth in the share of both Short movies and
  Documentaries shrinks the relative shares of the other categories.
The code used to generate the previous plot is shown below:
# --------------------------------------------------------
# Distribution of categories of movies along the years
hist(year[which(sumCateg==1)])
# --------------------------------------------------------
# Numbers of Drama and Action movies around the
# period of WWII
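# Note: t is not defined in this listing; it is presumably a year-by-category
# contingency table created earlier (e.g. with table()).
df2<-data.frame(t)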
colnames(df2) <- c("Year","Category","NumberOfMovies")
df2 <- subset(df2,Category=="Action" | Category=="Drama")
p<-ggplot(df2, aes(x=Year, y=NumberOfMovies, group=Category)) +
geom_line(aes(color=Category))+
geom_point(aes(color=Category)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
# --------------------------------------------------------
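Because the category plot discussed above shows relative shares, it helps to look at counts and shares side by side. The sketch below builds a year-by-category table and its row-wise proportions with table() and prop.table(). The years and categories are invented to illustrate the point and are not the dataset's figures:

```r
# Invented data: the number of Action movies stays at 2 in both years,
# while the number of Drama movies grows from 2 to 6.
year     <- c(rep(1945, 4), rep(1950, 8))
Category <- c("Action", "Action", "Drama", "Drama",
              "Action", "Action", rep("Drama", 6))

counts <- table(year, Category)           # absolute number of movies
shares <- prop.table(counts, margin = 1)  # share of each category per year

counts
round(shares, 2)
```

In this toy case the Action share halves even though the Action count does not change, which is exactly the ambiguity a relative plot cannot resolve. Calling as.data.frame(counts) gives a long table with columns year, Category and Freq, a shape suitable for a line plot of absolute numbers.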