1 Introduction
Our search for an interesting and viable project led us to movie rating data, chosen
for its significance, its practicality, and its capacity to provoke thought. As we began
to gather ideas about how to analyze this data, a question arose: “What determines a
‘good’ movie?” In a 2015 study,
McCullough and Conway state that award-winning films consistently tend to have a
lower state of integrative complexity [1]. The idea of integrative complexity is defined
as “a person’s ability to differentiate between the different but relevant perspectives of
a problem and, at higher levels, the ability to integrate those perspectives in some
coherent manner” [2]. These conclusions originate in the study of human behavior,
and the corresponding studies were conducted by psychologists. Our study, by
contrast, draws its conclusions from trends and analytical data identified in an IMDb
dataset containing information on the top 250 movies of all time, as ranked by IMDb.
The dataset consisted of variables that
naturally correspond to movie data, including but not limited to: movie title, release
year, parental rating, genre, director, actors, and box office results. Typically, in a
supervised learning model, a target variable is assigned, and algorithms are applied to
the corresponding data to reveal trends. With the IMDb top 250 movies of all time,
a target variable like “reception” or “rating” with corresponding values like “good” or
“bad” would be applied. However, in the case of this study, we focused on applying an
unsupervised clustering algorithm, naturally lacking a target variable, which allowed
freedom to discover trends without a predetermined target to fulfill. We anticipate
that the resulting trends will provide insight into our original question, “What
determines a ‘good’ movie?” This project seeks to aggregate these intrinsic qualities and structure
an informed prediction of the characteristics of a “good” movie. Moreover, the project
systematically employs machine learning algorithmic techniques via the conduit of
web scraping, k-means clustering, the elbow method, text mining, dimensionality
reduction (Principal Component Analysis (PCA)) and integrates minute components
of natural language processing (NLP). The intention to use predictive modeling
rests on the ability to determine, from common characteristics in the data, what
makes a movie likely to be considered “good.” This notion could be extended to
predict whether a movie would be award-winning, but that question is beyond the scope of this project. To
reiterate, our initial questions are as follows: Given this particular data set, can we
predict why a movie would be considered a “good” movie? Based on our results, are
there identifiable attributes among these movies that are more likely to contribute to a
“good” movie? In a practical sense, choosing to answer these questions does not take
into consideration any influence from other realms of study, such as concepts from the
aforementioned applicable study in human behavior, and relies solely on machine
learning methods and statistical analysis.
2 Dataset
Our dataset is located on the Internet Movie Database (IMDb), and can be downloaded
through its API (see source file for API key). The dataset we analyzed contains the top
rated 250 movies, according to IMDb. The method of acquiring the top 250 movies
requires that the accessor of the data inputs a movie ID, and relevant content
concerning the movie is returned. Downloading the dataset occurs natively within
Python, and in our case the Jupyter Notebook. To view the raw list of the top 250
movies (not including relevant data that corresponds to each of the movies) please see
the following link: http://www.imdb.com/chart/top. As a note to future accessors of
the dataset: the dataset used is dynamically changing (as new information is added, the
database changes), and all data collected and analyzed within this document reflect a
respective instance of this dataset. The data that was imported into Python contains
250 rows and 9 original columns (programmatic addition caused this number to
change as categorical variables were converted to numerical and corresponding
dummy variables were added). Generally speaking, the dataset contains a higher
number of features when compared to the number of samples. We have created a data
dictionary, Table 1, which describes our raw dataset, includes details about the
features, and can be applied as a ledger for the methodologies in our research study.
Table 1. Data dictionary for the original data (before programmatic additions) [3]. Contents of
the table appear on the following page. Legend: $ Plot of the movie, stored as a string. $$
Metascore is considered the rating of a film: scores are derived from the reviews of a large
group of the world's most respected critics, and a weighted average summarizes the range of
their opinions.
3 Methodology
Our goal is to learn from the dataset why a movie is considered a “good” movie, and
to develop general trends or a prediction outcome. We endeavor to use unsupervised
learning models to establish natural clusters within the top 250 movies of all time,
using several intrinsic features of the dataset. We also apply pre-processing and
dimensionality reduction techniques to improve the resulting clusters.
We followed a loosely scientific approach for this project, using a step-by-step
trial-and-error methodology and a variety of cited resources. Although we did not
formulate an explicit hypothesis, we used the initial questions posed in our proposal as
a guide. We resolved to experiment with what would work, what would not, and
why. We continued finding solutions and improving our model after answering our
initial questions. Initially, we sought to find a viable solution by adhering to online
tutorials and guides. Ultimately, we applied k-means clustering, text mining, the elbow
method, and Principal Component Analysis as our solutions to answering our initial
questions. The following sub-sections describe the methods used, from acquiring the
data to employing unsupervised learning to arrive at answers to our questions.
The initial process began with seeking out data that would correspond to a topic of
interest, choosing a predictive model, and applying Python code to execute it (see
beginning of methodology section for holistic methodology). The process of executing
our code began with the importing of various packages (and corresponding models)
that we deemed necessary for this process. These packages include, but are not limited
to: NumPy, Pandas, Requests, components of BeautifulSoup, components
of sklearn, json, datetime, nltk (the natural language processing library), and
Matplotlib. After importing our libraries, we imported the Top 250 Movies
database from IMDb. This was a relatively long process. The data was a direct
download from IMDb’s website and had to be accessed through an application
program interface (see Dataset section above for more details). The method of
acquiring the top 250 movies required that the accessor of the data input a movie ID,
and relevant content concerning the movie would be returned. Initially, we had to
scrape the top 250 movies page from IMDb to acquire a list of IMDb’s designated
movie IDs. Then the list of IMDb movie IDs was used to query the API. After
querying the API, a set of information was returned for each value queried. This
process was facilitated by the BeautifulSoup Python library, which parsed the
HTML code and returned a cleanly structured set of information.
This information consisted of general data concerning each movie, such as: the plot,
parental rating, language, actors, title, genre, etc. Please see our data dictionary to
review features within our model and corresponding information (Table 1). Within this
set of information, we also queried the budget and revenue for each of the movies.
From there, we performed a general analysis on the data, using code to determine
unique values within a variable (unique function), discern natural trends within the
data (head and tail functions), and understand the shape of our data (shape
function).
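The ID-collection step can be sketched as follows. The authors used BeautifulSoup; this minimal sketch instead pulls the title IDs with a regular expression (IMDb IDs follow the pattern `tt` plus digits), and the `sample` string stands in for the HTML that `requests.get` would return from the live chart page:

```python
import re

def extract_movie_ids(html):
    """Collect unique IMDb title IDs (e.g. 'tt0111161') in page order."""
    ids = []
    for match in re.findall(r"/title/(tt\d{7,8})/", html):
        if match not in ids:  # the chart page links each title several times
            ids.append(match)
    return ids

# In the real pipeline the HTML comes from
# requests.get("http://www.imdb.com/chart/top").text;
# a small sample stands in for it here:
sample = '<a href="/title/tt0111161/">1.</a> <a href="/title/tt0068646/">2.</a>'
print(extract_movie_ids(sample))  # ['tt0111161', 'tt0068646']
```

Each returned ID is then used to query the API for that movie's record.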
Next, we used custom functions, created for this project, to analyze the data and
subsets of the data through visualization. Initially, we ran the
plotColumn function on the Director variable to discern top producing directors
(line 18 of source code). The function freqSort was used to further determine top
values in the Director variable. Unique values were extracted from the Genre
variable and displayed as a bar chart, using plotColumn (line 22 of source code).
Ratings was then converted from a categorical variable with several values into
binary form. The splitting function was used to create new columns that serve
as binary indicators, with 0 and 1 as possible values.
Actors was analyzed by visualization, and the top-occurring actors identified via
the freqSort function were added to the dataFrame. A similar procedure was performed
for the Production variable, and top production companies were added to our
dataFrame. These visualizations can be found in the source code starting on line 18.
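plotColumn and freqSort are this project's custom helpers, whose source is not reproduced here; a hypothetical re-creation of freqSort using pandas' value_counts illustrates the idea of ranking a column's values by frequency:

```python
import pandas as pd

def freqSort(series, top=5):
    """Hypothetical re-creation of the project's freqSort helper:
    value frequencies, most common first."""
    return series.value_counts().head(top)

# Illustrative data, not the actual Director column
directors = pd.Series(
    ["Nolan", "Kubrick", "Nolan", "Spielberg", "Nolan", "Kubrick"],
    name="Director",
)
top_directors = freqSort(directors, top=2)
print(top_directors)  # Nolan (3) ranks above Kubrick (2)
```

The same pattern applies to the Actors and Production columns described below.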
For any of the columns that contained text-intense contents (for example variable
Plot), a process was employed to extract the most frequently occurring words and
create corresponding dummy columns. Plot was a challenging column from which to
extract top words, as it contains natural spoken language (as opposed to a generic list
of information, categorical data, or short descriptive clauses). Because of this, the
most commonly extracted words from the IMDb top 250 data tended to be articles,
conjunctions, prepositions, and pronouns, like “the”, “a”, “and”, “for”, etc. (see
Figure 1). Consequently, we compiled a list of the top 21 significant words occurring
in this column, designated
to a new variable, wordsOfInterest. The words appropriated to this variable
were: “man”, “boy”, “him”, “woman”, “girl”, “her”, “love”, “war”, “journey”,
“murder”, “friendship”, “police”, “battle”, “beautiful”, “team”, “detective”, “fight”,
“death”, “crime”, “struggles”, and “family”. The wordsOfInterest variable was
then integrated into the existing dataFrame as a new variable, plotNew, using the
function discoverPlot and wordExist.
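The discoverPlot and wordExist helpers are not reproduced here; the sketch below shows the same idea under that assumption: count word frequencies across Plot, then add one 0/1 dummy column per word of interest (the two plots and the shortened word list are illustrative):

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({"Plot": [
    "Two imprisoned men bond over a number of years",
    "The aging patriarch of an organized crime family transfers control",
]})

# Frequency of every lower-cased word across the Plot column; on the real
# data this surfaces mostly articles and prepositions (Figure 1)
counts = Counter(word for plot in df["Plot"] for word in plot.lower().split())
print(counts.most_common(3))

# One binary dummy column per hand-picked word of interest
wordsOfInterest = ["crime", "family", "war"]
for word in wordsOfInterest:
    df[word] = df["Plot"].str.lower().str.contains(word).astype(int)

print(df[wordsOfInterest].values.tolist())  # [[0, 0, 0], [1, 1, 0]]
```

On the real data, stop words would be filtered from `counts` before selecting the significant terms.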
At the halfway point of this project, we determined the shape of the dataFrame
and realized we had a dataframe containing 250 entries and 103 features, from the
amalgamation of data within the previous sections of the project. In the context of
modeling, a high number of features relative to the number of data points can mute
relevant information. According to machine learning researcher Matthew Mayo,
irrelevant attributes mixed in with powerful predictors have a negative effect on the
resulting model [4]. Moreover, “…as the
feature set grows, the feature space grows exponentially, pushing the relative
separation between data points to reach parity. That means that the feature set ceases
to provide predictive power for grouping similar points because all points are equally
dissimilar” [5]. We recognized further feature selection tools would need to be
implemented to reduce dimensionality. After corresponding dummy variables were
added to each of the existing variables within the dataFrame, a final dataframe
called lastDataFrame was created containing said dummy variables, narrowed
down by feature selection. Each variable is now a binary indicator designating the
presence (or absence) of its respective value.
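A minimal sketch of the dummy-variable step, using pandas' get_dummies (the Rated values shown are illustrative of the kind of categorical column being converted):

```python
import pandas as pd

df = pd.DataFrame({"Rated": ["R", "PG-13", "R", "PG"]})

# One 0/1 column per category; each row has exactly one 1
dummies = pd.get_dummies(df["Rated"], prefix="Rated").astype(int)
df = pd.concat([df, dummies], axis=1)

print(sorted(dummies.columns))  # ['Rated_PG', 'Rated_PG-13', 'Rated_R']
print(int(df["Rated_R"].sum()))  # 2
```

Concatenating such dummies for every categorical variable is what grows the frame from 9 original columns to over 100 features.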
Before employing our clustering algorithm, we reduced the number of
features within our model by performing a Principal Component Analysis. We began
by converting lastDataFrame to a NumPy matrix, assigned to the variable X. We
instantiated the StandardScaler() class from sklearn.preprocessing and
passed X through its fit_transform() method. We then called the pca.fit()
function on the scaled X to perform the
Principal Component Analysis. After reviewing the results of the initial PCA, we
decided to additionally analyze the first component and determine which variables
were most influential to the first component.
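The scaling-then-PCA sequence described above can be sketched as follows; the random matrix is a stand-in for lastDataFrame, not the actual movie data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 10))  # stand-in for lastDataFrame.values

# Standardize the features, then fit PCA on the scaled matrix
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
pca.fit(X_scaled)

# Share of variance explained by each component (the ratios sum to 1.0)
print(pca.explained_variance_ratio_.round(3))

# The loadings of the first component show which original features
# influence it most
first = pca.components_[0]
print(np.argsort(np.abs(first))[::-1][:5])  # five most influential features
```

Sorting the first row of `components_` by absolute loading is one way to carry out the first-component analysis mentioned above.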
The last phase of the project entailed ascertaining which clustering model to employ
on the processed data and fitting the respective algorithm to the model. The
determination process was extensive, reflecting a trial-and-error process, and arrived
at the conclusion that k-Means clustering was the optimal solution for this model,
largely because the data was already pre-processed. For the k-means clustering
algorithm we introduced a new variable, Xpca, a derivative of the variable X, which
also employs the pcaComponents variable, from the Principal Component
Analysis. The parameters that were employed within our model include:
n_clusters, init, and random_state. The n_clusters parameter allows the
user to indicate “the number of clusters to form as well as the number of centroids to
generate” [6]. In our model, we set n_clusters=i within a loop over a range of
values, which helped us implement the elbow method to determine the optimal number of clusters. The
Elbow Method and its findings are further discussed in the k-Means section of the
results. The init parameter in our model was set to k-means++, which “selects
initial cluster centers for k-means clustering in a smart way to speed up convergence”
[6]. The random_state parameter “determines random number generation for centroid
initialization” [6] and, for our model, was set to 0. Following the establishment of the
optimal number of clusters, by the Elbow Method, a subsequent k-Means model was
run using MiniBatchKMeans. We employed this method as it reduces
computational cost, while maintaining the quality of the output. Scikit-learn
documentation demonstrates the high-quality output of MiniBatchKMeans, when
compared to a normal instantiation of k-Means [7]. Within this iteration, we used the
following parameters and respective arguments: init=k-means++
(aforementioned), max_iter=500 (“Maximum number of iterations over the
complete dataset before stopping independently of any early stopping criterion
heuristics [8]”), n_init=1000 (“Number of random initializations that are tried. In
contrast to KMeans, the algorithm is only run once, using the best of the n_init
initializations as measured by inertia [8]”), init_size=1000 (“Number of
samples to randomly sample for speeding up the initialization [8]”), and
batch_size=1000 (“Size of the mini batches [8]”).
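The elbow loop and the final MiniBatchKMeans fit can be sketched as follows. The synthetic blob data stands in for Xpca, and the parameter values here are smaller than the paper's n_init=1000 and batch_size=1000 to keep the example fast:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic 250-sample stand-in for the PCA-reduced matrix Xpca
X, _ = make_blobs(n_samples=250, centers=4, random_state=0)

# Elbow method: record inertia (within-cluster sum of squares) for k = 1..9
wcss = []
for i in range(1, 10):
    km = MiniBatchKMeans(n_clusters=i, init="k-means++", random_state=0,
                         n_init=10, batch_size=256)
    km.fit(X)
    wcss.append(km.inertia_)
# Inertia drops sharply up to the true k, then levels off: the "elbow"
print([round(w) for w in wcss])

# Final fit at the chosen number of clusters
final = MiniBatchKMeans(n_clusters=4, init="k-means++", random_state=0,
                        n_init=10, batch_size=256)
labels = final.fit_predict(X)
print(np.bincount(labels))  # movies per cluster
```

Plotting `wcss` against k and picking the bend is the Elbow Method step discussed in the results.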
Figure 1. [From source file.] Illustration of code employed to extract the most frequently
occurring words from Plot variable and create corresponding dummy columns. Full list is
shown in output 149. Input 151 shows the new variable, wordsOfInterest, created
from words deemed significant, omitting parts of speech common in natural spoken language.
4 Experimental Results
We have broken down our results into corresponding phases of our project, as
designated by the following sections. Each section contains objective experimental
results with minor amounts of reasoning or interpretation. A subsequent Interpretation
and Conclusion section follows the respective phases of our project and helps develop
subjective rationales for our findings and concludes our initially posed questions.
Text mining and components of natural language processing were applied to find
insight in text-intense variables like Plot. From the IMDb top 250 data, the most
commonly extracted words, considering Plot contained data that reflects spoken
language, tended to be articles, conjunctions, prepositions and pronouns common to
natural language, like “the”, “a”, “and”, “for”, etc. Consequently, we compiled a list
of the top 21 significant words occurring in this column, designated
to a new variable, wordsOfInterest. The words appropriated to this
variable were: “man”, “boy”, “him”, “woman”, “girl”, “her”, “love”, “war”, “journey”,
“murder”, “friendship”, “police”, “battle”, “beautiful”, “team”, “detective”, “fight”,
“death”, “crime”, “struggles”, and “family”. (See Section 5, Interpretation &
Conclusion).
“Principal component analysis is a method that rotates the dataset in a way such that
the rotated features are statistically uncorrelated” [9]. The algorithm progresses by
determining the direction of maximum variance (contains the most information), then
determines the direction that contains the most information while being orthogonal
(oriented at a right angle) to the first direction [9]. In our application of Principal
Component Analysis, with intention of dimensionality reduction, we found that (for
this instance of the dataset) 40 components explained 72.26% of the variance in the
data, with the first component being attributed to 5% of the variance (line 55 of source
code). We then took the components greater than 1% (to determine which of the
features has the greatest influence on our first component) and printed these most
influential features. Out of the 141 original features, we printed the first 40 as they
accounted for most of the variance. (See Section 5, Interpretation & Conclusion).
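The choice of 40 components follows from the cumulative explained variance; a sketch of how such a cutoff can be computed, on random stand-in data (so the resulting count will not match the study's 40):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(250, 60))  # stand-in for the scaled feature matrix

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance
# crosses the chosen threshold (72% here, mirroring the study's 72.26%)
n_components = int(np.searchsorted(cumulative, 0.72)) + 1
print(n_components, round(float(cumulative[n_components - 1]), 4))
```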
From the Principal Component Analysis, we employed the Elbow Method to help us
determine the optimal number of clusters to include in our k-Means model (line 57 of
source code). The Elbow Method employs the within clusters sum of squares to return
this information [10]. We employed a MiniBatchKMeans model with parameter
n_clusters=8 on the variable Xpca and coupled it with the k-Means labels to
determine how many movies were in each cluster. The first cluster had 182 movies in it, while
the eighth cluster had four in it. We initially tried to visualize the clusters and their
respective centroids but found that, due to the dimensionality of the dataset, we were
not able to display the visualization intuitively. Instead, we opted to list
each cluster with its associated features (line 66 of source code). Notice that Cluster 0
contained the following words: genre: Romance, genre: Sci-Fi, Director: Akira
Kurosawa, genre: Adventure, genre: Music, while the titles include movies such as:
The Shawshank Redemption, The Godfather, The Godfather: Part II, 12 Angry Men,
Schindler's List, Pulp Fiction.
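Listing each cluster with its member titles reduces to a pandas groupby over the cluster labels; the titles and label values below are illustrative stand-ins, not the study's actual assignments:

```python
import pandas as pd

# Hypothetical labels; in the project these come from the
# MiniBatchKMeans fit on Xpca
df = pd.DataFrame({
    "Title": ["The Shawshank Redemption", "The Godfather",
              "12 Angry Men", "Seven Samurai"],
    "cluster": [0, 0, 0, 3],
})

# One line per cluster, with its member titles
for cluster, group in df.groupby("cluster"):
    print(f"Cluster {cluster}: {group['Title'].tolist()}")
```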
References
3. The Internet Movie Database (2018) Top Rated Movies: Top 250 as Rated by IMDb Users. https://www.imdb.com/chart/top
5. Man H (2017) Data Clustering Using Unsupervised Learning - What type of movies are in the IMDB top 250? In: medium.com. https://medium.com/hanman/data-clustering-what-type-of-movies-are-in-the-imdb-top-250-7ef59372a93b. Accessed 2018
9. Müller AC, Guido S (2017) Introduction to machine learning with Python: a guide for data scientists. O'Reilly, Beijing
10. Naik K (2018) K means clustering in python- Machine Learning Tutorial with Python
and R-Part 12. In: YouTube. https://www.youtube.com/watch?v=tAY6jtFoNEA.
Accessed 2 Dec 2018