Вы находитесь на странице: 1из 9

IMDB Movie Dataset Analysis

SARADA SARIPALLI
IMDB Movie dataset analysis to predict movie success rate.

Purpose: To do Linear regression analysis on the data with 12 variables to predict the Movie
rating.
1. which of them are critical in telling the IMDB rating of a movie.
2. Is there any correlation between genre & IMDB rating & IMDB rating, director name &
IMDB rating and runtime & IMDB rating.
3. Predict the IMDB Score using model.
Filename : IMDB-Movie-Data.csv.
Source : https://www.kaggle.com/kerneler/starter-idmb-movie-76b7f8b7-1
 1000 movie titles
 Dataset has 12 columns and 1000 rows.
IMDB Schema details.

root
|-- Rank: integer (nullable = true)
|-- Title: string (nullable = true)
|-- Genre: string (nullable = true)
|-- Description: string (nullable = true)
|-- Director: string (nullable = true)
|-- Actors: string (nullable = true)
|-- Year: string (nullable = true)
|-- Runtime (Minutes): string (nullable = true)
|-- Rating: string (nullable = true)
|-- Votes: string (nullable = true)
|-- Revenue (Millions): double (nullable = true)
|-- Metascore: double (nullable = true)
Title and Content Layout with SmartArt

Step 1 Data Step 2 Data Step 3 Step 4 Making


Collection Cleaning & Regression Predictions
Exploration Analysis • RMSE
• Remove duplicates • Fitting model • R2
Collecting Data. • Replace missing • Train data
values.
Reading Data.
• Task description
Step 1 Data Collection

Filename : IMDB-Movie-Data.csv.
Source : https://www.kaggle.com/kerneler/starter-idmb-movie-76b7f8b7-1
 1000 movie titles
 Dataset has 12 columns and 1000 rows.
 Read data into dataset.
Step 2 Data Cleaning & Exploration

View data and study information.


Remove duplicates by dropping duplicates.
Verify missing values and replace/drop.
Verify for outliers and replace with appropriate values
Step 3 Regression Analysis

The dataset contains 12 columns with the first 11 columns being the features that
determine the rating(the last column) of a movie.
We create transformers and assemble first 11 columns into vector.
Linear regression models can only operate on numeric data, we need to transform those
features which are categorical into numerical data.
Create estimator linear regression model.
Create pipeline and run estimator.
Step 4 Making Predictions

Use the Regression Evaluator to test how well our model performed.
Find RMSE.
Conclusion :

Find the most important factor that affects movie rating.


Among these following :
Director
Runtime
Genre
Actors

Thank you !!!

Вам также может понравиться