1. Problem Statement
Netflix executives need clear, data-driven narratives about historical programming to inform current production strategy. Specifically, they need to know whether the industry-wide assumption that “movies are getting shorter” holds over time and how genre affects runtime distributions, so that pacing can be optimized for modern streaming audiences.
2. Dataset
- Source: Kaggle ‘Netflix Movies and TV Shows’ extended dataset.
- Dataset Size: ~9,000 titles originally, filtered strictly to 4,200 movies released post-1990.
- Features: Title, duration, release year, genre/category tags, director, cast, and IMDb rating.
- Preprocessing: Filtered out all TV shows to isolate film runtime variables. Dropped rows with null values in critical duration or release-year columns to maintain statistical integrity.
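The preprocessing step above could be sketched in pandas roughly as follows. The column names `type`, `duration`, and `release_year`, the `"93 min"` duration format, and the reading of "post-1990" as 1990 onward are all assumptions, as is the helper name `prepare_movies`; the sample rows are made up for illustration.

```python
import pandas as pd


def prepare_movies(raw: pd.DataFrame, min_year: int = 1990) -> pd.DataFrame:
    """Keep only movies with a usable runtime and release year (hypothetical helper)."""
    # Assumed column: "type" holds "Movie" or "TV Show".
    movies = raw[raw["type"] == "Movie"].copy()

    # Assumed format: durations look like "93 min"; anything else becomes NaN.
    movies["duration_min"] = pd.to_numeric(
        movies["duration"].str.extract(r"(\d+)\s*min", expand=False),
        errors="coerce",
    )
    movies["release_year"] = pd.to_numeric(movies["release_year"], errors="coerce")

    # Drop rows missing either critical field, then apply the year cutoff.
    movies = movies.dropna(subset=["duration_min", "release_year"])
    movies = movies[movies["release_year"] >= min_year]
    return movies.astype({"duration_min": int, "release_year": int}).reset_index(drop=True)


# Made-up sample rows, not taken from the real dataset.
raw = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "Movie"],
    "duration": ["93 min", "2 Seasons", None, "120 min", "88 min"],
    "release_year": [2015, 2018, 2001, 1985, None],
})
movies = prepare_movies(raw)
```

In this toy frame only title "A" survives: "B" is a TV show, "C" has no duration, "D" predates the cutoff, and "E" has no release year.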
3. Feature Engineering
- Genre Bucketization: Exploded the multi-label ‘listed_in’ genre tags and mapped them to core aggregated categories (Action, Drama, Comedy, Documentary) to simplify aggregation.
- Decade Grouping: Created a categorical ‘Decade’ feature mapping raw release years into distinct bins (1990s, 2000s, 2010s) to perform cohort analysis over time.
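A minimal sketch of the two features above, assuming comma-separated tags in `listed_in` and an integer `release_year`. The `GENRE_BUCKETS` mapping is hypothetical and deliberately short; the project's actual mapping is not given in this write-up.

```python
import pandas as pd

# Hypothetical mapping from raw tags to core buckets; a real table would be longer.
GENRE_BUCKETS = {
    "Action & Adventure": "Action",
    "Dramas": "Drama",
    "Comedies": "Comedy",
    "Stand-Up Comedy": "Comedy",
    "Documentaries": "Documentary",
}


def add_genre_and_decade(movies: pd.DataFrame) -> pd.DataFrame:
    """Explode multi-label genre tags into buckets and add a decade label."""
    # Give every film a stable id so duplicates can be detected after exploding.
    out = movies.reset_index(drop=True).rename_axis("movie_id").reset_index()

    # One row per (film, raw tag).
    out["genre_tag"] = out["listed_in"].str.split(",")
    out = out.explode("genre_tag", ignore_index=True)
    out["genre_tag"] = out["genre_tag"].str.strip()

    # Map raw tags to core buckets; unmapped tags are dropped.
    out["genre"] = out["genre_tag"].map(GENRE_BUCKETS)
    out = out.dropna(subset=["genre"])

    # A film with two raw tags in the same bucket should count once.
    out = out.drop_duplicates(subset=["movie_id", "genre"])

    # 1994 -> "1990s", 2008 -> "2000s", and so on.
    out["decade"] = (out["release_year"].astype(int) // 10 * 10).astype(str) + "s"
    return out.reset_index(drop=True)


# Made-up sample rows for illustration.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Stand-Up Comedy",
        "Documentaries",
    ],
    "release_year": [1994, 2008, 2017],
})
tagged = add_genre_and_decade(sample)
```

Because a film can carry several genre tags, the exploded frame can contain the same film in more than one bucket, which matters when comparing per-genre counts against the overall total.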
4. Models Tested
(Note: Because this is an Exploratory Data Analysis project, inferential statistical models were prioritized over predictive ones.)
| Model | R-Squared | Mean Squared Error (min²) |
|---|---|---|
| Linear Regression (Baseline Runtime Trend) | 0.42 | 520.4 |
| Polynomial Regression (Degree 2) | 0.58 | 390.1 |
| ARIMA (Time Series Runtime Forecasting) | N/A | 342.7 |
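The linear and degree-2 rows of the table compare two nested least-squares fits. The sketch below shows how such a comparison can be computed with NumPy on synthetic data; it does not reproduce the figures in the table, the helper name `fit_trend` is hypothetical, and the metrics here are in-sample because the write-up does not say whether a held-out split was used. The ARIMA row is not covered.

```python
import numpy as np


def fit_trend(years, runtimes, degree):
    """Least-squares polynomial fit; returns (R-squared, MSE) on the fitted data."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(runtimes, dtype=float)

    # Centering the years keeps the polynomial fit numerically well conditioned.
    x_centered = x - x.mean()
    coeffs = np.polyfit(x_centered, y, degree)
    residual = y - np.polyval(coeffs, x_centered)

    mse = float(np.mean(residual ** 2))
    r_squared = 1.0 - float(np.sum(residual ** 2)) / float(np.sum((y - y.mean()) ** 2))
    return r_squared, mse


# Synthetic runtimes (minutes) with a gentle curve plus noise; illustration only.
rng = np.random.default_rng(0)
years = np.arange(1990, 2021)
runtimes = 108 - 0.03 * (years - 2000) ** 2 + rng.normal(0, 3, size=years.size)

linear_r2, linear_mse = fit_trend(years, runtimes, degree=1)
quad_r2, quad_mse = fit_trend(years, runtimes, degree=2)
```

On the same data, a degree-2 fit can never have a higher in-sample MSE than the linear fit, so an improvement like the one in the table is only evidence of curvature if it is large relative to the noise or holds on unseen data.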
5. Final Model
The degree-2 Polynomial Regression captured a slight downward curve in runtime starting in the late 2010s. The primary findings, however, rest on Kruskal-Wallis H-tests, which indicated that the differences in median runtime across decades were statistically significant rather than random fluctuation.
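A Kruskal-Wallis comparison across decades can be run with `scipy.stats.kruskal`, as sketched below. The runtimes are invented for illustration and the 0.05 threshold is an assumed convention; neither comes from the project.

```python
from scipy.stats import kruskal

# Invented runtimes in minutes, grouped by decade; not project data.
runtimes_by_decade = {
    "1990s": [118, 125, 102, 131, 110, 122],
    "2000s": [105, 112, 98, 120, 101, 109],
    "2010s": [92, 97, 88, 101, 95, 90],
}

# kruskal takes one sequence per group and returns the H statistic and p-value.
h_stat, p_value = kruskal(*runtimes_by_decade.values())

# Assumed significance level; the write-up does not state the one actually used.
ALPHA = 0.05
is_significant = p_value < ALPHA
```

A significant result says only that at least one decade differs from the others. Identifying which decades differ would need a follow-up pairwise comparison with a multiple-testing correction.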
6. Evaluation
- Visual Distributions: KDE and violin plots showed runtime variance tightening in the modern era (clustering around 90-100 minutes), compared with the wide spread of the 1990s.
- Correlation: Identified a weak negative correlation (-0.18) between release year and movie duration specifically for Action/Drama genres, while Documentaries maintained a flat trendline.
- Cross-Validation: N/A for raw EDA.
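The per-genre correlation described above can be computed along these lines. This sketch uses Pearson correlation, the default of `Series.corr`, which is an assumption because the write-up does not name the coefficient. The column names and the toy data are also hypothetical.

```python
import pandas as pd


def year_runtime_correlation(tagged: pd.DataFrame) -> pd.Series:
    """Correlation between release year and runtime, computed per genre."""
    return pd.Series({
        genre: group["release_year"].corr(group["duration_min"])  # Pearson by default
        for genre, group in tagged.groupby("genre")
    })


# Toy data: Action runtimes fall over time, Documentary runtimes stay flat.
toy = pd.DataFrame({
    "genre": ["Action"] * 5 + ["Documentary"] * 5,
    "release_year": [1995, 2000, 2005, 2010, 2015] * 2,
    "duration_min": [130, 124, 118, 110, 101, 90, 94, 90, 94, 90],
})
correlations = year_runtime_correlation(toy)
```

With real data the coefficients would be far weaker than in this toy frame; a value near -0.18 means release year explains only a small share of the variation in runtime.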
7. System Architecture
8. Key Learnings
Axis scaling shapes the narrative a plot tells. Anchoring axes correctly and identifying anomalies (such as 10-hour avant-garde films) before generating standard plots keeps executives from misreading the typical viewer’s experience.
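One common way to flag such anomalies before plotting is the 1.5 × IQR rule, sketched below. This is a stand-in heuristic, not necessarily the method used in the project, and the helper name `flag_runtime_outliers` and the sample runtimes are hypothetical.

```python
import numpy as np


def flag_runtime_outliers(runtimes, k=1.5):
    """Return a boolean outlier mask and the (lower, upper) IQR fences."""
    values = np.asarray(runtimes, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper), (lower, upper)


# Invented runtimes in minutes, including one 10-hour film.
sample_runtimes = [88, 92, 95, 97, 101, 104, 110, 600]
outlier_mask, (lower_fence, upper_fence) = flag_runtime_outliers(sample_runtimes)
```

The fences can double as axis limits for KDE or violin plots, with the flagged titles reported separately so they are not silently discarded.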
9. GitHub Repository
10. Future Improvements
- Scrape IMDb for exact production budgets to identify ROI metrics correlated with film length.
- Build an interactive Plotly dashboard so stakeholders can dynamically filter runtimes by country of origin.