Conjoint Analysis for Netflix
Project Introduction
This project explores how data analysis in Python can help Netflix reignite its growth. Rather than relying on traditional surveys that rate individual features in isolation, the strategy employs Choice-Based Conjoint Analysis, a method that infers consumer preferences from the trade-offs respondents make between complete product profiles.
Project Structure
Below is an overview of the project's components.
Libraries and Data Setup
In this section, several Python libraries are imported to aid in data analysis and visualization.
Data Preparation
Isolating Dependent and Independent Variables
Building on the previously imported data, this section delves into preparing the dataset for analysis. The initial phase involves distinguishing between the target variable and the predictor variables.
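The original cell is not shown; a sketch of the split, assuming a DataFrame `df` whose binary 'Choice' column is the target (all column names and values here are illustrative, not from the source data):

```python
import pandas as pd

# Hypothetical survey data: each row is one profile shown to a respondent;
# 'Choice' is 1 if the profile was selected, 0 otherwise.
df = pd.DataFrame({
    "price":          ["8", "12", "20", "8"],
    "ads":            ["none", "one_per_day", "one_per_show", "none"],
    "ExtraContent":   ["HBO", "Disney", "less content", "Marvel"],
    "NumberAccounts": ["1", "2", "4", "6"],
    "Choice":         [1, 0, 0, 1],
})

y = df["Choice"]                   # dependent (target) variable
X = df.drop(columns=["Choice"])    # independent (predictor) variables
```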
Transforming Variables using Dummy Encoding
To accommodate categorical variables in the dataset, a transformation is applied through dummy encoding. This method, also known as one-hot encoding, converts each categorical variable into a set of binary columns. Each column represents one category and is marked 1 if the observation falls in that category, and 0 otherwise.
The transformation uses the 'get_dummies' function from pandas, which automatically identifies and encodes the categorical variables.
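A small sketch of the encoding step, on made-up levels that mirror the ones discussed later:

```python
import pandas as pd

X = pd.DataFrame({
    "ads":   ["none", "one_per_day", "one_per_show"],
    "price": ["8", "12", "20"],
})

# get_dummies detects the categorical (object) columns on its own and
# expands each into one 0/1 indicator column per level.
X_dummies = pd.get_dummies(X, columns=["ads", "price"])
print(X_dummies.columns.tolist())
# → ['ads_none', 'ads_one_per_day', 'ads_one_per_show',
#    'price_12', 'price_20', 'price_8']
```

Note that the levels come out alphabetically sorted within each attribute, which is why 'price_12' precedes 'price_8'.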
Delving into Logistic Regression
Following the data preparation phase, the next logical step is to apply a statistical method to discern the relationships inherent within the data. Logistic regression is the natural candidate here, since it is tailored to binary outcomes like our target variable.
Model Building
Although the outcome is binary, the partworths here are estimated with the Ordinary Least Squares (OLS) method, regressing the choice indicator on the transformed predictors as a linear probability model. The 'sm.OLS' function from the 'statsmodels' library makes this possible. Once the model is built, a comprehensive summary can be produced, offering critical statistics about the regression: coefficients, t-values, p-values, and overall model fit, all essential for judging the model's reliability and significance.
Diving Deep into Conjoint Analysis Interpretation
Transitioning from the regression analysis, the journey now leads toward actionable insights, in particular the contribution (or 'partworth') of each feature level implied by the regression coefficients.
Collating Key Metrics:
The first task at hand is to gather vital metrics from the regression model, such as feature names, their associated partworths, and significance levels.
Streamlining Data for Visualization:
Before diving into plotting, the dataset is reorganized, ensuring a seamless and coherent visual presentation.
The significance of each feature's partworth is then highlighted. Here, a p-value less than 0.05 indicates statistical significance at the 95% confidence level.
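The two steps above, reorganizing for plotting and flagging significance, can be sketched as follows (the partworths and p-values are invented for illustration):

```python
import pandas as pd

# Hypothetical partworths and p-values collected from the fitted model.
results = pd.DataFrame({
    "feature":   ["ads_none", "price_20", "ExtraContent_HBO"],
    "partworth": [0.42, -0.35, 0.10],
    "pvalue":    [0.001, 0.020, 0.300],
})

results = results.sort_values("partworth")          # coherent plotting order
results["significant"] = results["pvalue"] < 0.05   # 95% confidence level
```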
Crafting the Visualization:
The final piece in this analytical journey is a compelling visual representation. A horizontal bar graph is plotted, emphasizing the partworths of each feature.
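A minimal version of the plot, assuming the sorted partworths from the previous step (values are illustrative; the red/blue colouring mirrors the negative/positive bars described in the inferences below):

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

results = pd.DataFrame(
    {"feature": ["price_20", "ExtraContent_HBO", "ads_none"],
     "partworth": [-0.35, 0.10, 0.42]})

colors = ["red" if p < 0 else "blue" for p in results["partworth"]]
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(results["feature"], results["partworth"], color=colors)
ax.set_xlabel("Partworth")
ax.set_title("Feature partworths from the choice model")
fig.tight_layout()
```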
Inferences:
- Most and Least Preferred Features: "ads_none" appears to be the most positively influential feature, suggesting consumers greatly prefer options with no ads. Conversely, the red bar indicates a feature with a strong negative impact on consumer preference, in this case "ExtraContent_less content."
- Feature Pricing: The features prefixed with "price_" denote different price points, with higher prices generally having a more negative impact on consumer preference.
- Content Preferences: The features prefixed with "ExtraContent_" capture different content offerings and their effect on preference. Some content types are preferred over others, with "HBO", "Disney", and "Marvel" having positive partworths.
- Account Limitations: The features beginning with "NumberAccounts_" indicate the number of accounts or simultaneous streams offered, with a greater number of accounts generally preferred.
Honing in on Ads: A Deep Dive into Specific Drivers
After a comprehensive overview of feature importance, it's now time to zoom in on specific aspects of the analysis: the impact of various advertisement drivers. This focused lens will provide insights into how different ad-related attributes influence viewer preferences.
Extracting Ad-Related and Price-Related Attributes:
From the broader pool of metrics, attributes related to ads and price are singled out, collecting their corresponding coefficients (or 'partworths').
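A sketch of the filtering step, operating on a hypothetical Series of partworths keyed by dummy-column name:

```python
import pandas as pd

# Illustrative partworths; in the project these come from the fitted model.
partworths = pd.Series({
    "ads_none": 0.42, "ads_one_per_day": 0.05, "ads_one_per_show": -0.30,
    "price_8": 0.25, "price_12": 0.02, "price_20": -0.27,
    "ExtraContent_HBO": 0.10,
})

ad_drivers    = partworths[partworths.index.str.startswith("ads_")]
price_drivers = partworths[partworths.index.str.startswith("price_")]
```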
Visualization: A Stem Plot of Ad Drivers and Price Drivers:
A stem plot is the choice of visualization here, offering a clear representation of how each ad-related driver and price-related driver impacts overall viewer sentiment.
Ads:
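A sketch of the ad-driver stem plot, using the illustrative partworth values from above:

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen
import matplotlib.pyplot as plt

levels = ["ads_none", "ads_one_per_day", "ads_one_per_show"]
partworths = [0.42, 0.05, -0.30]   # illustrative values

fig, ax = plt.subplots()
x = range(len(levels))
ax.stem(x, partworths)          # one stem per ad level
ax.set_xticks(list(x))
ax.set_xticklabels(levels)
ax.set_ylabel("Partworth")
ax.set_title("Ad-frequency drivers")
fig.tight_layout()
```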
Inference:
"ads_none" has the highest partworth value, suggesting that among the levels shown, having no ads is the most preferred option for respondents. "ads_one_per_show" has the lowest partworth value, indicating that it is the least preferred option compared to the others displayed. "ads_one_per_day" has a partworth value between the two, placing it in a middle preference position.
Price:
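The price drivers follow the same pattern; a brief sketch with illustrative values:

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen
import matplotlib.pyplot as plt

levels = ["price_8", "price_12", "price_20"]
partworths = [0.25, 0.02, -0.27]   # illustrative values

fig, ax = plt.subplots()
x = range(len(levels))
ax.stem(x, partworths)
ax.set_xticks(list(x))
ax.set_xticklabels(levels)
ax.set_ylabel("Partworth")
ax.set_title("Price drivers")
fig.tight_layout()
```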
Inference:
"price_20" has a negative partworth value, suggesting that this price level is the least preferred among the options shown. "price_8" has the highest partworth value, indicating that it is the most preferred price level. There is a consistent pattern that as the price decreases from "price_20" to "price_8", the partworth values increase, suggesting that customers are more likely to prefer lower-priced options in this scenario.
The Hierarchy of Features: Ranking by Influence
Transitioning from a granular look at specific drivers, it's pivotal to understand the broader landscape of feature influence on user choices. This section harnesses the data to calculate and visualize which features have the greatest sway on decisions.
Breaking Down Feature Coefficients:
Every feature has a set of coefficients or 'partworths' associated with its different levels. These coefficients are extracted and organized for subsequent calculations.
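One way to sketch this grouping step: map each dummy column back to its parent attribute, taking the text before the first underscore (the partworth values below are illustrative):

```python
import pandas as pd

partworths = pd.Series({
    "ads_none": 0.42, "ads_one_per_show": -0.30,
    "price_8": 0.25, "price_20": -0.27,
})

# The parent attribute is the text before the first "_" in the column name.
attribute = partworths.index.str.split("_").str[0]
by_feature = {feat: partworths[attribute == feat]
              for feat in attribute.unique()}
print({f: len(v) for f, v in by_feature.items()})
```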
Computing Feature Importance:
For each feature, its influence is calculated as the difference between its maximum and minimum coefficients.
To ensure the calculations are on track, the total of all feature importance is displayed.
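The two steps above, the max-minus-min spread per feature and the sanity-check total, can be sketched as follows (coefficient values invented for illustration):

```python
import pandas as pd

by_feature = {
    "ads":   pd.Series([0.42, 0.05, -0.30]),
    "price": pd.Series([0.25, 0.02, -0.27]),
}

# A feature's importance is the spread between its best and worst level.
importance = {f: s.max() - s.min() for f, s in by_feature.items()}
total_importance = sum(importance.values())
print(importance, total_importance)
```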
Gauging Relative Importance:
Relative importance translates a feature's influence into a percentage of the total feature importance.
Preparing Data for Visualization:
To depict feature importance visually, a DataFrame is constructed, sorting features by their relative importance.
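The relative-importance calculation and the sorted DataFrame can be sketched together (the importance values are hypothetical placeholders, not the project's actual results):

```python
import pandas as pd

importance = {"NumberAccounts": 1.30, "price": 1.06,
              "ExtraContent": 0.87, "ads": 0.62}
total = sum(importance.values())

# Express each feature's spread as a share of the total, in percent,
# then sort so the most influential feature comes first.
rel = pd.DataFrame({
    "feature": list(importance),
    "relative_importance": [100 * v / total for v in importance.values()],
}).sort_values("relative_importance", ascending=False).reset_index(drop=True)
```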
Treemap: A Bird's-Eye View of Feature Influence
A treemap offers an intuitive representation of feature importance, with the size of each square corresponding to a feature's relative importance.
Inferences:
- The largest rectangle is labeled "NumberAccounts" with a value of 33.8, suggesting that the number of accounts is the most significant feature, with the highest relative importance among the attributes considered in the analysis.
- The second-largest rectangle is labeled "price" with a value of 27.5, making price the second most significant factor in this analysis.
- The rectangle labeled "ExtraContent" has a value of 22.5, showing that extra content also has a considerable impact as the third most significant feature.
- The smallest rectangle is labeled "ads" with a value of 16.1, suggesting that ads have the lowest relative importance among the features displayed.
Deepening the Analysis: The Power of Interaction Terms
Following the exploration of individual feature importance, it becomes imperative to understand how combinations of different features might influence viewer preferences. This section delves into the concept of "interaction terms" and investigates the combined effect of multiple features.
Crafting Interaction Terms:
The primary idea behind interaction terms is to study how two or more features, when taken together, might have a different effect on the dependent variable than if they were considered individually. Here, an interaction term between 'ExtraContent' and 'ads' is introduced.
Refining the Dataset:
To ensure that the analysis doesn't get muddied with redundancy, it's necessary to remove the original variables that are now part of the interaction term.
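The creation of the interaction column and the removal of its parents can be sketched together (DataFrame contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ExtraContent": ["HBO", "Disney"],
    "ads": ["none", "one_per_show"],
    "Choice": [1, 0],
})

# Combined level, e.g. "HBO_none": the joint setting of content and ads.
df["ExtraContent_ads"] = df["ExtraContent"] + "_" + df["ads"]

# Drop the originals so the model sees only the interaction column.
df = df.drop(columns=["ExtraContent", "ads"])
print(df.columns.tolist())
```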
Preparing Data for Modeling:
Similar to earlier steps, the dataset is divided into target and predictor variables. And then, these predictor variables are transformed into dummy variables to ensure they are in a format suitable for logistic regression.
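These steps mirror the earlier preparation; a compact sketch on the refined DataFrame (contents illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ExtraContent_ads": ["HBO_none", "Disney_one_per_show", "HBO_none"],
    "price": ["8", "20", "12"],
    "Choice": [1, 0, 1],
})

y = df["Choice"]                              # target variable
X = pd.get_dummies(df.drop(columns=["Choice"]))   # dummy-encoded predictors
print(sorted(X.columns))
```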
Employing Logistic Regression with Interaction Terms:
A logistic regression model is fitted, incorporating the interaction term.
Decoding Interaction Term Results
Having incorporated interaction terms in the analysis, it's paramount to dissect their results to comprehend the synergy between 'ExtraContent' and 'ads'. This section meticulously breaks down the findings, providing a graphical representation to better visualize the interdependencies.
Extracting Model Results:
From the logistic regression model, results concerning the coefficients (part-worths) are extracted.
Honing in on Specific Interaction Effects:
While the model results offer insights into all features, a keen focus is directed towards the interaction terms between 'ExtraContent' and 'ads'.
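A sketch of the filtering step, on a hypothetical Series of fitted coefficients:

```python
import pandas as pd

# Illustrative coefficients; in the project these come from the Logit fit.
params = pd.Series({
    "ExtraContent_ads_HBO_none": 0.55,
    "ExtraContent_ads_HBO_one_per_show": -0.10,
    "price_8": 0.25,
})

# Keep only the combined ExtraContent x ads levels.
interaction_effects = params[params.index.str.startswith("ExtraContent_ads_")]
print(len(interaction_effects))
```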
Graphical Representation of Interaction Effects:
A stem plot is employed to provide a clear visual depiction of the relationship.
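The same stem-plot pattern as before, now over the interaction levels (values illustrative):

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen
import matplotlib.pyplot as plt

levels = ["Disney_none", "Disney_one_per_show", "HBO_none", "HBO_one_per_show"]
effects = [0.35, -0.05, 0.55, -0.10]   # illustrative interaction partworths

fig, ax = plt.subplots(figsize=(7, 4))
x = range(len(levels))
ax.stem(x, effects)
ax.set_xticks(list(x))
ax.set_xticklabels(levels, rotation=30, ha="right")
ax.set_ylabel("Partworth")
ax.set_title("ExtraContent x ads interaction effects")
fig.tight_layout()
```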
Conclusion from Netflix's Conjoint Analysis
Insights Derived:
- Advertisements play a crucial role in influencing user preferences. The specific drivers associated with ads, when visualized, showcased their significance.
- The combined effect of extra content and ads is pivotal. The interaction term analysis revealed that it is not just about offering additional content or controlling ad frequency, but about understanding how they work together.
- The treemap's relative importance analysis painted a clear picture of feature priorities: some features significantly influence user choice, while others are far less influential.
Final Takeaway:
The entire analysis underscores the importance of understanding user preferences at a granular level. For a platform like Netflix, it is pivotal to recognize not just what users like, but how different platform features, both individually and in tandem, influence those preferences. By harnessing these insights, Netflix can make more informed decisions, optimizing its platform to align with what users want.