Description
This research project examines the data set titled “Highest Rated Films Annually from 1984-2014.” The data set consists of 510 films and records each film’s title, Motion Picture Association rating, genre, budget, gross, runtime, release date, rating, and rating count. The investigation was conducted using R, a programming language for statistical computing and graphics. The purpose of this study is to explore the factors that influence high film ratings and to visualize the data in this set. R was used for data visualization and for multiple linear regression on the quantitative variables. Multiple linear regression is a statistical model that examines the relationship between a dependent variable and two or more independent variables. Here, the regression model tests whether the independent variables budget, gross, rating count, and runtime have a significant effect on a film’s rating. The results indicate that budget, runtime, and rating count are significant predictors of a film’s rating, while gross is not. Scatter plots, histograms, bar plots, and pie charts were created to visualize the data. In addition, simple statistics were used to determine which films have the highest and lowest values in each category. Predictions were also made using several statistical learning models.
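As an illustration of the regression described above, the following is a minimal sketch in R rather than the project's actual code. It assumes a data frame with hypothetical column names (`budget`, `gross`, `rating_count`, `runtime`, `rating`) and fills it with randomly generated placeholder values, so its output says nothing about the real films.

```r
# Synthetic stand-in for the film data. The column names and all values
# are assumptions for illustration only, not the real data set.
set.seed(42)
n <- 510
films <- data.frame(
  budget       = runif(n, 1e6, 2e8),    # placeholder budgets
  gross        = runif(n, 1e6, 8e8),    # placeholder gross figures
  rating_count = runif(n, 1e4, 1.5e6),  # placeholder number of ratings
  runtime      = runif(n, 80, 200)      # placeholder runtimes in minutes
)

# Made-up relationship so the example has something to fit.
films$rating <- 6 + 0.005 * films$runtime +
  1e-6 * films$rating_count + rnorm(n, sd = 0.3)

# Multiple linear regression: rating as the dependent variable,
# the four quantitative variables as predictors.
fit <- lm(rating ~ budget + gross + rating_count + runtime, data = films)

# The summary reports a t-test and p-value for each coefficient.
print(summary(fit))

# Collect the predictors whose p-value falls below 0.05.
p_values <- coef(summary(fit))[, "Pr(>|t|)"]
predictors <- setdiff(names(p_values), "(Intercept)")
significant <- predictors[p_values[predictors] < 0.05]
print(significant)
```

With the real data, `films` would instead be loaded from the data set (for example with `read.csv`), and the p-values in the summary would be the basis for judging which predictors are significant.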