Netflix Movies Data Analysis

1. Problem Statement

Netflix executives need clear, data-driven narratives about historical programming to inform current production strategy. Specifically, they need to know whether the industry-wide assumption that “movies are getting shorter” holds over time, and how genre affects runtime distributions, so that pacing can be tuned for modern streaming audiences.

2. Dataset

  • Source: Kaggle ‘Netflix Movies and TV Shows’ extended dataset.
  • Dataset Size: ~9,000 titles originally, filtered strictly to 4,200 movies released post-1990.
  • Features: Title, duration, release year, genre/category tags, director, cast, and IMDb rating.
  • Preprocessing: Filtered out all TV shows to isolate film runtime variables. Dropped rows with null values in critical duration or release-year columns to maintain statistical integrity.
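
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (type, release_year, duration, listed_in) are assumed to follow the public Kaggle schema, the "95 min" duration format is assumed, the sample rows are made up, and "post-1990" is read here as 1990 or later.

```python
import pandas as pd

# Tiny stand-in for the raw CSV. Column names and the "95 min" duration
# format are assumptions based on the public Kaggle schema.
raw = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "Movie"],
    "release_year": [1995, 2015, 2018, 1985, 2005],
    "duration": ["120 min", "2 Seasons", "95 min", "110 min", None],
    "listed_in": ["Dramas", "TV Dramas", "Comedies",
                  "Action & Adventure", "Documentaries"],
})


def clean_movies(df: pd.DataFrame, min_year: int = 1990) -> pd.DataFrame:
    """Keep movies only, drop rows missing duration or year, parse minutes."""
    movies = df[df["type"] == "Movie"].copy()
    movies = movies.dropna(subset=["duration", "release_year"])
    # "95 min" -> 95; anything that fails to parse becomes NaN and is dropped
    movies["duration_min"] = pd.to_numeric(
        movies["duration"].str.extract(r"(\d+)", expand=False),
        errors="coerce",
    )
    movies = movies.dropna(subset=["duration_min"])
    # Assumption: "post-1990" includes 1990 itself
    movies = movies[movies["release_year"] >= min_year]
    return movies.reset_index(drop=True)


movies = clean_movies(raw)
print(movies[["title", "release_year", "duration_min"]])
```

Parsing the duration into a numeric column at this stage keeps every later aggregation free of string handling.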

3. Feature Engineering

  • Genre Bucketization: Exploded the multi-label ‘listed_in’ genre tags and mapped them to four core categories (Action, Drama, Comedy, Documentary) to simplify aggregation.
  • Decade Grouping: Created a categorical ‘Decade’ feature mapping raw release years into distinct bins (1990s, 2000s, 2010s) to perform cohort analysis over time.
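
Both features can be derived in a few lines of pandas. The sketch below is illustrative only: the GENRE_MAP dictionary is a hypothetical mapping (the project's real tag-to-bucket rules are not shown here), and the sample rows are invented.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_year": [1994, 2007, 2019],
    "listed_in": ["Dramas, Action & Adventure", "Comedies",
                  "Documentaries, Dramas"],
})

# Hypothetical mapping from raw tags to the four core buckets
GENRE_MAP = {
    "Action & Adventure": "Action",
    "Dramas": "Drama",
    "Comedies": "Comedy",
    "Documentaries": "Documentary",
}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'decade' label and one row per (title, core genre) pair."""
    out = df.copy()
    # 1994 -> 1990 -> "1990s"
    out["decade"] = (out["release_year"] // 10 * 10).astype(str) + "s"
    # Split the comma-separated tags and give each tag its own row
    out["genre_tag"] = out["listed_in"].str.split(",")
    out = out.explode("genre_tag")
    out["genre_tag"] = out["genre_tag"].str.strip()
    # Tags outside the mapping become NaN and are dropped
    out["genre"] = out["genre_tag"].map(GENRE_MAP)
    return out.dropna(subset=["genre"]).reset_index(drop=True)


features = add_features(movies)
print(features[["title", "decade", "genre"]])
```

Note that exploding duplicates a title once per genre, so per-genre statistics count multi-genre films in each of their buckets.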

4. Models Tested

(Note: As an Exploratory Data Analysis project, inferential statistical models were prioritized over predictive ones.)

Model | R-Squared | Mean Squared Error
--- | --- | ---
Linear Regression (Baseline Runtime Trend) | 0.42 | 520.4
Polynomial Regression (Degree 2) | 0.58 | 390.1
ARIMA (Time Series Runtime Forecasting) | N/A | 342.7
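
To show how the linear and degree-2 rows of such a comparison can be produced, here is a small sketch using NumPy's least-squares polynomial fit. The data is synthetic and the metrics are computed in-sample, so the numbers it prints are not the ones reported above; the helper name fit_polynomial is hypothetical.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns (r_squared, mse), in-sample."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre the years so the fit stays numerically well-conditioned
    xc = x - x.mean()
    coeffs = np.polyfit(xc, y, degree)
    pred = np.polyval(coeffs, xc)
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot, ss_res / len(y)


# Synthetic yearly mean runtimes that rise slightly, then curve downward
years = np.arange(1990, 2020)
runtime = 105 + 0.4 * (years - 1990) - 0.02 * (years - 1990) ** 2

r2_lin, mse_lin = fit_polynomial(years, runtime, degree=1)
r2_quad, mse_quad = fit_polynomial(years, runtime, degree=2)
print(f"linear:    R2={r2_lin:.3f}  MSE={mse_lin:.3f}")
print(f"quadratic: R2={r2_quad:.3f}  MSE={mse_quad:.3f}")
```

On curved data the degree-2 fit will always score at least as well in-sample, which is why the gap between the two rows is better read as evidence of curvature than as predictive skill.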

5. Final Model

The Polynomial Regression captured a slight downward curve in runtime starting in the late 2010s. The primary findings, however, relied on Kruskal-Wallis H-tests, which indicated that differences in median runtimes across decades were statistically significant rather than random fluctuation.
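
A Kruskal-Wallis test across decade cohorts might look like the sketch below, using scipy.stats.kruskal. The runtimes are invented for illustration, and the column names decade and duration_min are assumptions rather than the project's actual schema.

```python
import pandas as pd
from scipy.stats import kruskal

# Invented runtimes (minutes) for three decade cohorts
df = pd.DataFrame({
    "decade": ["1990s"] * 6 + ["2000s"] * 6 + ["2010s"] * 6,
    "duration_min": [128, 135, 119, 142, 110, 125,
                     112, 118, 105, 121, 109, 115,
                     95, 98, 92, 101, 90, 97],
})

# One array of runtimes per decade
groups = [grp["duration_min"].to_numpy() for _, grp in df.groupby("decade")]

# Rank-based test: no normality assumption on the runtimes
h_stat, p_value = kruskal(*groups)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Runtime distributions differ across decades at the 5% level.")
```

A significant result only says that at least one decade differs from the others; identifying which pairs differ would need a follow-up pairwise test with a multiple-comparison correction.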

6. Evaluation

  • Visual Distributions: Plotted KDE and violin plots clearly showing the tightening of runtime variance in the modern era (clustering around 90-100 minutes) compared to the wide spread of the 1990s.
  • Correlation: Identified a weak negative correlation (-0.18) between release year and movie duration specifically for Action/Drama genres, while Documentaries maintained a flat trendline.
  • Cross-Validation: N/A for raw EDA.
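
The per-genre correlation check can be sketched as follows. The coefficient type is not stated above, so Pearson (the pandas default) is assumed; the data and the helper name year_runtime_corr are illustrative only.

```python
import pandas as pd

# Invented data: Action runtimes fall over time, Documentaries stay flat
df = pd.DataFrame({
    "genre": ["Action"] * 5 + ["Documentary"] * 5,
    "release_year": [1995, 2000, 2005, 2010, 2015] * 2,
    "duration_min": [130, 124, 121, 112, 108,
                     88, 92, 90, 92, 88],
})


def year_runtime_corr(frame: pd.DataFrame) -> pd.Series:
    """Pearson correlation between release year and runtime, per genre."""
    rows = {
        genre: grp["release_year"].corr(grp["duration_min"])
        for genre, grp in frame.groupby("genre")
    }
    return pd.Series(rows, name="pearson_r")


corr_by_genre = year_runtime_corr(df)
print(corr_by_genre.round(2))
```

Computing the coefficient within each genre, rather than on the pooled data, is what lets a declining genre and a flat one show up as separate results.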

7. System Architecture

graph TD
    A[Raw Netflix CSV] --> B[Pandas Data Filtering]
    B --> C[TV Show Removal & Null Handling]
    C --> D[Explode Genre Columns into Bins]
    subgraph Analytical Pipeline
        D --> E[Matplotlib Temporal Aggregation]
        D --> F[Seaborn Distribution Plots]
        F --> G[Statistical Significance Testing]
    end
    subgraph Business Narrative
        E --> H[Generate Executive Dashboard]
        G --> H
        H --> I[Actionable Content Output Rules]
    end

8. Key Learnings

How data is scaled on a chart shapes the narrative. Anchoring axes correctly and identifying anomalies (like 10-hour avant-garde films) before generating standard plots prevents executives from misreading the average viewer’s experience.
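
One common way to surface such anomalies before plotting is the 1.5 × IQR rule, sketched below. This is an assumed technique for illustration, not necessarily the one used in the project; the runtimes and the helper name flag_runtime_outliers are made up.

```python
import pandas as pd


def flag_runtime_outliers(durations: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = durations.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (durations < q1 - k * iqr) | (durations > q3 + k * iqr)


# Invented runtimes in minutes, including one 10-hour film
runtimes = pd.Series([88, 92, 95, 97, 101, 104, 110, 600])
is_outlier = flag_runtime_outliers(runtimes)

print("flagged:", runtimes[is_outlier].tolist())
print("mean with outlier:   ", round(runtimes.mean(), 1))
print("mean without outlier:", round(runtimes[~is_outlier].mean(), 1))
```

Flagging rather than silently dropping the rows leaves the decision about each anomaly visible, and the same mask can be used to set axis limits on the plots.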

9. GitHub Repository

Source Code: Netflix EDA

10. Future Improvements

  • Scrape IMDb for production budgets to identify ROI metrics correlated with film length.
  • Build an interactive Plotly dashboard so stakeholders can dynamically filter runtimes by their specific country of origin.