
# Adriano Axel Pliopas Pereira

# SGH Number 83393


December 4th, 2018

This work analyses some aspects of the movies dataset. The two plots below show
the distribution of movies according to their categories and according to the rating
the movies received, grouped as “stars”. A new variable “stars” was created: it
receives 1 star if the rating is between 0 and 2, 2 stars if the rating is above 2 and up
to 4, and so forth up to 5 stars for ratings above 8 and up to 10 (i.e., stars = round up
(rating/2)).
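The binning rule can be checked on a few values. The sketch below uses made-up ratings (not taken from the dataset) and compares the "round up" formula with the same rule written as right-closed intervals; the names `stars_num` and `stars_cut` are illustrative only.

```r
# Hypothetical ratings, chosen to sit on and around the bin edges
rating <- c(1.0, 2.0, 2.1, 4.0, 5.5, 8.0, 8.1, 10.0)

# Rule from the text: stars = round up (rating / 2)
stars_num <- ceiling(rating / 2)

# Same rule as right-closed intervals (0,2], (2,4], ..., (8,10];
# labels = FALSE returns the interval number instead of a factor
stars_cut <- cut(rating, breaks = seq(0, 10, by = 2), labels = FALSE)

# Side-by-side view of both results
data.frame(rating, stars_num, stars_cut)
```

Both columns agree, so a rating exactly on an edge (2, 4, 6, 8) falls into the lower star group.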

# --------------------------------------------------------
# Plot 1, 2 and 3 - Barplots with categories.
library(ggplot2)
p <- ggplot(mov2)

# Using grayscale for the bar plot
print(
  p + geom_bar(aes(x = Category, fill = stars), color = "black") +
    scale_fill_manual(values = gray(1:5/5))
)

# Color bar plot
print(
  p + geom_bar(aes(x = Category, fill = stars))
)
# --------------------------------------------------------

The graphs are created by the code shown above; the creation of the accessory
variables is shown on the next page.
The following code creates three variables: Category, a single variable naming the
category to which the movie belongs; sumCateg, which counts the number of categories
in which a single movie was classified, because it was observed that some movies
belong to many different categories while others are not classified at all; and stars,
the number of stars attributed to each movie, from 1 to 5, based on “rating”.

# -----------------------------------------------------------
# Creation of factor variable Category, summarising the binary
# information present in 7 other variables of the original dataframe
# (columns are referenced by name, which assumes attach(movies) was run)
Category <- vector()
Category[Action==1] <- "Action"
Category[Animation==1] <- "Animation"
Category[Comedy==1] <- "Comedy"
Category[Drama==1] <- "Drama"
Category[Documentary==1] <- "Documentary"
Category[Romance==1] <- "Romance"
Category[Short==1] <- "Short"
Category<-as.factor(Category)

# Addition of the new column to the workable dataframe mov2
# (preserving original movies without change)
mov2 <- cbind(movies,Category)
# -----------------------------------------------------------
# Variable sumCateg counts how many different categories were
# attributed to each movie.
sumCateg <- rep(0,length(Comedy))
sumCateg[Action==1] <- 1
sumCateg[Animation==1] <- sumCateg[Animation==1]+1
sumCateg[Comedy==1] <- sumCateg[Comedy==1]+1
sumCateg[Drama==1] <- sumCateg[Drama==1]+1
sumCateg[Documentary==1] <- sumCateg[Documentary==1]+1
sumCateg[Romance==1] <- sumCateg[Romance==1]+1
sumCateg[Short==1] <- sumCateg[Short==1]+1

mov2 <- cbind(mov2,sumCateg)


# -----------------------------------------------------------
# Creation of variable with ordered ratings from 1-5 stars
# based on rating values
stars <- vector()
stars[rating<=2] <- "1 star"
stars[rating>2 & rating<=4] <- "2 stars"
stars[rating>4 & rating<=6] <- "3 stars"
stars[rating>6 & rating<=8] <- "4 stars"
stars[rating>8 & rating<=10] <- "5 stars"
stars <- ordered(stars, levels = c("1 star","2 stars","3 stars",
                                   "4 stars","5 stars"))

mov2 <- cbind(mov2,stars)


#attach(mov2)
# -----------------------------------------------------------
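As a side note, the category count can also be obtained in one call. The sketch below is not the code used in this report: it runs on a small made-up data frame (`toy`) with the same seven 0/1 columns, and it assigns a Category label only to movies with exactly one flag set. In the code above, a movie with several flags keeps the label assigned last, which is why the later plots filter on sumCateg == 1.

```r
# Toy stand-in for the seven 0/1 genre columns (values are made up)
toy <- data.frame(
  Action      = c(1, 0, 0, 1),
  Animation   = c(0, 0, 0, 0),
  Comedy      = c(0, 1, 0, 1),
  Drama       = c(0, 0, 0, 0),
  Documentary = c(0, 0, 0, 0),
  Romance     = c(0, 1, 0, 0),
  Short       = c(0, 0, 0, 0)
)
genres <- names(toy)

# Number of categories per movie, computed in a single call
sumCateg <- unname(rowSums(toy[, genres]))

# Column index of the first flag set in each row
first_flag <- max.col(as.matrix(toy[, genres]), ties.method = "first")

# Keep the label only when exactly one flag is set; NA otherwise
Category <- factor(ifelse(sumCateg == 1, genres[first_flag], NA),
                   levels = genres)

data.frame(sumCateg, Category)
```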
The plot below shows the vertical bars proportionally (all of them fill the entire vertical
Y range of the graph), so we can get an idea of the distribution of stars within each
category. The dataset used was a subset of the original one,
subset(mov2,sumCateg == 1), meaning that only movies classified in a single
category were considered. This is intended to make the evaluation of each category
less biased.
We see there that Documentaries receive a large proportion of positive reviews,
followed by Short movies and Animation. Remarkably, Action movies are at the bottom
of the list, with the smallest proportion of positive reviews and the largest proportion of
ratings of 3 stars or lower.

# --------------------------------------------------------
# The following plot introduces two changes:
# 1 - only the subset in which movies received only one category
#     classification was considered
# 2 - The y scale is proportional, from 0 to 1, allowing for a
#     comparison between categories.
p <- ggplot(subset(mov2,sumCateg == 1))
print(
p + geom_bar(aes(x=Category, fill=stars),
position="fill")
)
# --------------------------------------------------------
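The shares that a proportional bar plot displays can also be read as numbers. The following is a minimal sketch on made-up data (the `toy` data frame stands in for subset(mov2, sumCateg == 1) and its values are invented), using base R's table() and prop.table().

```r
# Toy data standing in for subset(mov2, sumCateg == 1); values are made up
toy <- data.frame(
  Category = c("Action", "Action", "Action", "Action", "Drama", "Drama"),
  stars    = c("2 stars", "3 stars", "3 stars", "4 stars", "4 stars", "5 stars")
)

# Counts of movies per category (rows) and star group (columns)
counts <- table(toy$Category, toy$stars)

# Row-wise shares: each category sums to 1, like each bar under position="fill"
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Reading along a row gives the height of each colored segment in the corresponding bar.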
I also wanted to investigate how many movies provide information on budget. For this
analysis, two factors were taken into account:
• Inflation makes it unrealistic to directly compare budgets of movies from
different years.
• A small number of movies in the dataset can make the analysis more difficult.
The first analysis did not involve graphics. I just verified the number of movies with
budget information with the following code:

# --------------------------------------------------------
# Evaluation of how many movies provide budget information
numBudge <- length(budget)-length(which(is.na(budget)))
numBudgeRel <- round(10000*numBudge/length(budget))/100
fr <- c("Only ", numBudge, " movies, out of ", length(budget), " (",
numBudgeRel, "%),",
" provide Budget information.")
paste(fr,sep="",collapse="")
# --------------------------------------------------------

The output was:

[1] "Only 5215 movies, out of 58788 (8.87%), provide Budget information."

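The same count can be written more compactly. This is only a sketch of an alternative, run on a made-up `budget` vector rather than the real column; sum() and mean() on the logical vector !is.na(budget) give the count and the share directly, and sprintf() builds the sentence.

```r
# Toy budget vector (made-up values); NA marks a movie without budget data
budget <- c(NA, 25e6, NA, NA, 200e6, NA, NA, NA, 1e6, NA)

# Number of movies with a budget, and the same figure as a percentage
numBudge    <- sum(!is.na(budget))
numBudgeRel <- round(100 * mean(!is.na(budget)), 2)

# Same sentence as in the report, assembled with sprintf()
msg <- sprintf("Only %d movies, out of %d (%.2f%%), provide Budget information.",
               numBudge, length(budget), numBudgeRel)
print(msg)
```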
Still, it is a good number for analysis. The next problem is to understand the
distribution of these movies along the years, and for this a histogram was made.
The histogram below shows that recent years concentrate the largest number of movies.
It makes sense that the number of movies in the dataset grows along the years,
particularly around the year 2000. To get a better idea of the numbers available,
another plot was made considering only movies from 1990 on (on the next page).

# -------------------------------------------------------
# Histogram of movies with budget information available
qplot(subset(mov2,is.na(budget)==FALSE,select=year)[,],
geom="histogram",
binwidth=5,
main="Movies with information on budget",
xlab="Year of the movie",
fill=I("steelblue3"),
col=I("red"))
# -------------------------------------------------------
Since more than 450 movies from 2004 provide information on budget, the following
analysis will focus on this subset, therefore eliminating most of the effect of inflation.

# -------------------------------------------------------
# Only movies after 1990:
qplot(subset(mov2,is.na(budget)==FALSE & year>=1990,select=year)[,],
geom="histogram",
binwidth=1,
main="Movies with information on budget from 1990 on",
xlab="Year of the movie",
fill=I("steelblue3"),
col=I("red"))
# -------------------------------------------------------
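The bar heights of a histogram with binwidth = 1 are simply counts per year, so they can also be listed as a table. A minimal sketch on invented data (the `toy` data frame and its values are assumptions, not the real dataset):

```r
# Toy data (made-up values): year and budget for six movies
toy <- data.frame(
  year   = c(1989, 1990, 2003, 2004, 2004, 2004),
  budget = c(1e6, NA, 5e6, 2e6, NA, 9e6)
)

# Same filter as in the histogram: budget available and year >= 1990
with_budget <- subset(toy, !is.na(budget) & year >= 1990)

# Number of movies with budget information in each year
per_year <- table(with_budget$year)
per_year
```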
Below we see the subset of movies from 2004, looking at budget (X axis) and the number
of votes the movie received (Y axis) as an indirect measurement of the popularity of the
movie (assuming viewers of each category have a similar probability of voting on a
movie; otherwise this measurement is biased). The size of each point also shows
the rating received by the movie. It is curious to see that only Drama and Romance
reached the top-left corner (high votes for low budget). Action movies show a
correlation between budget and votes, suggesting that an action movie can only reach a
large audience with a large budget, whereas for a Drama or Romance a
good movie idea can reach a large audience even with a considerably smaller budget (7
times smaller for a similar number of votes, in this case).

# -------------------------------------------------------
# Relation between budget and number of votes per movie
print(
ggplot(subset(mov2,is.na(budget)==FALSE & year==2004),
aes(x=budget/1e6, y=votes)) +
geom_point(aes(colour = Category, size = rating)) +
labs(x = "Movie Budget (Millions of Dollars $)")
)
# --------------------------------------------------------
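The visual impression of a budget/votes relationship can be backed by a number. The sketch below computes Pearson's correlation within each category on invented data; the `toy` values are made up to illustrate the calculation and say nothing about the real 2004 subset.

```r
# Toy data (made-up values): four Action and four Drama movies
toy <- data.frame(
  Category = rep(c("Action", "Drama"), each = 4),
  budget   = c(20, 60, 120, 200, 5, 10, 20, 40) * 1e6,
  votes    = c(2000, 6000, 11000, 21000, 9000, 1000, 15000, 3000)
)

# Pearson correlation between budget and votes within each category
cors <- sapply(split(toy, toy$Category),
               function(d) cor(d$budget, d$votes))
round(cors, 2)
```

On the real data the same call would take subset(mov2, is.na(budget)==FALSE & year==2004) in place of `toy`, after dropping rows with a missing Category.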
For a more detailed evaluation, here we consider only Romance and Action movies,
identifying by name those with more than 18 thousand votes. The most successful
action movie there was Spider-Man 2, with a budget of 200 million dollars. But we have
a Romance called “Eternal Sunshine of the Spotless Mind” which reached a higher
number of votes with a budget of less than 25 million. It suggests

# --------------------------------------------------------
# Relation between budget and number of votes per movie
print(
ggplot(subset(mov2,is.na(budget)==FALSE & year==2004 &
(Category == "Action" | Category =="Romance")
), aes(x=budget/1e6, y=votes)) +
geom_point(aes(colour = Category, size = rating)) +
labs(x = "Movie Budget (Millions of Dollars $)") +
geom_text(aes(label=ifelse(votes>18000,title,""),hjust=0,
vjust=0)) +
expand_limits(x = 250)
)
# --------------------------------------------------------
Here we excluded movies classified in more than one category, and showed names for
movies with more than 18 thousand votes or budget of more than 124 million dollars.

# --------------------------------------------------------
# Relation between budget and number of votes per movie
# Excluding multiple category classification (sumCateg ==1)

print(
ggplot(subset(mov2,is.na(budget)==FALSE &
year==2004 & sumCateg==1),
aes(x=budget/1e6, y=votes)) +
geom_point(aes(colour = Category, size = rating)) +
labs(x = "Movie Budget (Millions of Dollars $)") +
geom_text(aes(
label=ifelse(votes>18000 |
budget>124e6,title,""),
hjust=0, vjust=0)) +
expand_limits(x = 250)
)
# --------------------------------------------------------
Here we see the distribution of ratings in box plots, separated by category, for movies
from 2004 that were classified in only one category. We also see that short
movies and documentaries had the highest ratings.

# --------------------------------------------------------
# Boxplot of ratings for each category of movie at the
# year 2004

ggplot(subset(mov2,is.na(budget)==FALSE & year==2004 & sumCateg==1),
       aes(Category, rating, fill = Category)) +
  geom_boxplot(size = 1) +
  ggtitle("Distribution of Ratings per movie Category at 2004")
# --------------------------------------------------------
In the next plot, we analyse the distribution of ratings for groupings of budget range.
Here all movies were considered, and this explains an outlier action movie in the lowest
budget group, which was not visible in the former analysis, concerned only with the
year 2004. Inflation effects were not corrected here because that is beyond the scope of
the present plotting exercise, but it is worth noting that they distort the data (old movies,
which here are grouped as lower budget, could move to higher budget groups once
inflation effects were taken into account).

# --------------------------------------------------------
# New categorical variable based on budget:
BudgetGroup <- vector()
BudgetGroup[budget<=50e6] <- "1.< 50 mi"
BudgetGroup[budget>50e6 & budget<=100e6] <- "2.50 to 100 mi"
BudgetGroup[budget>100e6 & budget<=150e6] <- "3.100 to 150 mi"
BudgetGroup[budget>150e6 & budget<=200e6] <- "4.150 to 200 mi"
mov2 <-cbind(mov2,BudgetGroup)
# --------------------------------------------------------
# BoxPlot for ratings distributed by Category And by Budget Group
ggplot(subset(mov2, is.na(BudgetGroup)==FALSE & is.na(Category)==FALSE),
       aes(x = Category, y = rating, fill = Category)) +
  geom_boxplot(alpha = 0.6, outlier.colour = c("grey40"), outlier.size = 3.5) +
  facet_wrap(~BudgetGroup) + theme_bw() +
  labs(title = "Distribution of ratings by Category and Investment \n",
       x = "\n Movie Category", y = "Movie Ratings \n") +
  guides(fill = guide_legend("\n Supplement")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
# --------------------------------------------------------
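The budget groups can also be built with cut(), which applies the same right-closed limits in one call. This is a sketch of an alternative, not the code used above; the `budget` values are made up to sit on the group edges.

```r
# Toy budgets (made-up values), including edge cases and a missing value
budget <- c(1e6, 50e6, 50000001, 150e6, 200e6, 250e6, NA)

# Right-closed intervals: (-Inf,50e6], (50e6,100e6], (100e6,150e6], (150e6,200e6]
BudgetGroup <- cut(budget,
                   breaks = c(-Inf, 50e6, 100e6, 150e6, 200e6),
                   labels = c("1.< 50 mi", "2.50 to 100 mi",
                              "3.100 to 150 mi", "4.150 to 200 mi"))

data.frame(budget, BudgetGroup)
```

As with the explicit assignments, budgets above 200 million and missing budgets end up as NA and are dropped by the is.na(BudgetGroup)==FALSE filter.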
# --------------------------------------------------------
# Analysis of the duration of the movies by category
p <- ggplot(subset(mov2,is.na(BudgetGroup)==FALSE &
is.na(Category)==FALSE & sumCateg==1),
aes(length, colour=Category, fill=Category))
p <- p + geom_density(alpha=0.55) + xlim(0,180)
p <- p + ggtitle("Distribution of length of movies by Category")
p
# --------------------------------------------------------
In this last plot we study the distribution of categories of movies along the years. Once
again, only movies with a single classification were considered in the subset. Several
points can be observed here:
• Prior to 1920 almost all movies were classified just as “Short” movies, but it
must also be said that the number of movies from this period is rather small, as
can be seen in the histogram shown on the next page.
• During the first expansion of the movie industry, the dominant category,
according to this dataset, was Romance, followed by Comedy and Drama.
• After 1930 we see an initial expansion in the share of Action movies, but it is
interesting to see that it peaks at the end of World War II and falls almost
50% during the following years, resuming growth only around 1955, 10 years
later. During these 10 years right after WWII, the most remarkable expansion
was that of Drama. It is important to remark that this plot shows RELATIVE
participation, so it is impossible to know, from the plot itself, whether the number
of action movies declined after WWII, whether the number of dramas jumped up,
or whether both effects happened together.
• From 1955 on, the participation of Drama and Romance steadily declines whereas
the share of Action and Animation grows. This is the general pattern until
around 1985; after this year, a growth in the share of both Short movies and
Documentaries shrinks the relative shares of the other categories.
The code used to generate the former plot is shown below:
# --------------------------------------------------------
# Distribution of categories of movies along the years
hist(year[which(sumCateg==1)])

ggplot(subset(mov2,sumCateg==1 & year>0), aes(year, fill = Category)) +
geom_density(position="fill") +
ggtitle("Proportion of categories of movies along the years") +
scale_x_continuous(breaks=seq(from=1895, to=2005, by=5)) +
theme(axis.text.x = element_text(size=12, angle=90))
# --------------------------------------------------------

This histogram, generated with base R, shows the number of movies belonging to
a single category along the years. We see that the number of movies in the catalog
jumps around 1930.

The table below is useful to resolve the doubt regarding the explanation of the
post-war phenomena observed in the former plot. We see that the number of Action
movies declined after 1943 but kept relatively stable until 1948, and then
dropped for many of the following years. The number of Drama movies, on the other
hand, jumped 26% right after 1945 and kept growing, with oscillations, since then.

This same information can, of course, also be viewed graphically, and this is done on
the next page.
In this last plot, we see that the variation in the number of Action movies
following 1945 was important in relative terms but small in absolute numbers. The
biggest change was a huge growth in the absolute number of Drama movies, especially
from 1945 to 1948.

# --------------------------------------------------------
# Numbers of Drama and Action movies around the
# period of WWII
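# t is assumed to hold the year-by-category table of counts mentioned
# above; its creation is not shown in this report
df2<-data.frame(t)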
colnames(df2) <- c("Year","Category","NumberOfMovies")
df2 <- subset(df2,Category=="Action" | Category=="Drama")
p<-ggplot(df2, aes(x=Year, y=NumberOfMovies, group=Category)) +
geom_line(aes(color=Category))+
geom_point(aes(color=Category)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
# --------------------------------------------------------
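The object t used above is not created anywhere in this report. One plausible way to build such a table, shown here only as an assumption, is table() on year and Category; the sketch runs on a made-up data frame (`toy`) and uses the name `tab` to avoid masking base R's t() function.

```r
# Toy data (made-up values): year and category of seven movies
toy <- data.frame(
  year     = c(1945, 1945, 1945, 1946, 1946, 1946, 1946),
  Category = c("Action", "Drama", "Drama", "Drama", "Drama", "Drama", "Comedy")
)

# Assumed construction of the year-by-category table of counts
tab <- table(toy$year, toy$Category)

# Long format: one row per (year, category) pair with its count
df2 <- data.frame(tab)
colnames(df2) <- c("Year", "Category", "NumberOfMovies")
df2 <- subset(df2, Category == "Action" | Category == "Drama")

# Year comes back as a factor; convert it if a numeric x axis is wanted
df2$Year <- as.numeric(as.character(df2$Year))
df2
```

Year/category pairs with no movies appear with a count of 0, so the plotted lines do not skip years.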
