Surviving the App-pocalypse
In an ever-evolving digital marketplace, understanding why apps fail or succeed is crucial. Our project embarks on a deep dive into the dynamics of app churn on the Google Play Store. By harnessing the power of Survival Analysis and Cox Proportional Hazards Regression, this project aims to unravel the mysteries behind app longevity and determine the key factors that contribute to an app's survival or demise.
Libraries and Data
In the initial phase of our "Surviving the App-pocalypse" capstone project, we establish the groundwork by assembling the necessary tools and data. We begin by loading a suite of Python libraries, each serving a specific function that will be integral throughout our project.
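A minimal sketch of the imports used throughout this walkthrough (the exact list and aliases are assumptions):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Survival analysis tools used in the later sections
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.utils import concordance_index

# Train/test splitting for the Cox model evaluation
from sklearn.model_selection import train_test_split
```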
The dataset, sourced from the Google Play Store and contained within a file named googleplaystore.csv, is loaded into a DataFrame using pandas. This dataset includes various features of apps that potentially influence their longevity and success, such as user ratings, category, updates, and more.
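Loading the file is a single pandas call (assuming the CSV sits in the working directory):

```python
# Read the raw Google Play Store export into a DataFrame
df = pd.read_csv("googleplaystore.csv")
df.head()
```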
Data Cleaning
After setting up our tools and loading the data in the first step, we now move into the critical phase of data cleaning for our "Surviving the App-pocalypse" project. This step is essential to ensure that the data we analyze is accurate, complete, and formatted correctly, which will directly influence the reliability of our findings.
​
We begin by invoking df.info(), a powerful function in pandas that provides a concise summary of the DataFrame. This includes the total number of entries, the number of non-null entries per column, and the data type of each column (e.g., integer, float, object). This overview helps us quickly spot issues like missing values and incorrect data types that could skew our analysis.
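The check itself is a one-liner:

```python
# Entry counts, non-null counts per column, and dtypes
df.info()
```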
-
Dropping the "App" Column
We begin by removing the "App" column, which contains the names of the apps. While app names are useful for identification, they provide no quantitative value for survival analysis.
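A sketch of the drop (assuming the column is literally named "App"):

```python
# App names identify rows but carry no quantitative signal for survival analysis
df = df.drop(columns=["App"])
```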
-
Analyzing the 'Category' Distribution
Before making any modifications, it's important to understand how apps are distributed across different categories. We use df.Category.value_counts() to get a count of apps in each category. This helps identify any anomalies or incorrect category data that might affect the analysis.
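The count itself:

```python
# Number of apps per category
df.Category.value_counts()
```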
Upon reviewing the category data, we discover an entry labeled "1.9," which does not conform to the standard naming convention of app categories. This could be a data entry error or a placeholder that needs to be addressed to maintain data integrity. To clean this up, we remove rows where the category is "1.9."
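A sketch of the filter:

```python
# Drop the malformed row(s) whose Category is the stray value "1.9"
df = df[df["Category"] != "1.9"]
```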
-
Converting the "Reviews" Variable
We now address the "Reviews" variable. Originally stored as a string due to inconsistencies like non-numeric characters or errors in data entry, this variable is crucial for our analysis as it represents the number of reviews an app has received, an indicator of its popularity and user engagement.
​
To perform quantitative analysis and include the "Reviews" variable in our survival models, we need to ensure that it is in a numeric format. We use the pd.to_numeric() function from pandas, which attempts to convert values in the 'Reviews' column to a numeric data type.
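A sketch of the conversion (coercing unparseable entries to NaN is our assumption):

```python
# Convert review counts from strings to numbers; bad entries become NaN
df["Reviews"] = pd.to_numeric(df["Reviews"], errors="coerce")
```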
-
Standardizing the "Size" Variable
Our next task is to tackle the "Size" variable. This step is crucial as the size of an app can influence its attractiveness to users and potentially affect its survival on the platform due to device compatibility and download preferences.
​
We define a function convert_size that converts the app size into a consistent numeric format (megabytes). The app sizes in the dataset are represented in different units and formats, including Megabytes (M), Kilobytes (k), and sometimes as 'Varies with device', which indicates a lack of fixed size.
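One way to write such a helper (treating 'Varies with device' as missing is an assumption):

```python
def convert_size(size):
    """Convert a Play Store size string to megabytes, or NaN if it has no fixed size."""
    if isinstance(size, str):
        if size.endswith("M"):
            return float(size[:-1])          # already in megabytes
        if size.endswith("k"):
            return float(size[:-1]) / 1024   # kilobytes -> megabytes
    return np.nan                            # 'Varies with device' and anything unparseable

df["Size"] = df["Size"].apply(convert_size)
```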
-
Refining the "Installs" Variable
The next step involves addressing the "Installs" variable. This variable is critical as it quantifies the number of times an app has been downloaded, which is a key indicator of its popularity and potential survivability. First, we assess the distribution and formatting of the "Installs" data. This step helps us understand the range of values and any inconsistencies in data formatting, such as commas and plus signs that can interfere with numerical analysis.
The "Installs" data is stored as strings with commas and plus signs, making it unsuitable for numerical operations. We clean this data by removing these characters.
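A sketch of the inspection and the cleanup:

```python
# Inspect the raw install buckets (e.g. "1,000+", "10,000+", ...)
df["Installs"].value_counts()

# Strip the formatting characters and convert to numbers
df["Installs"] = (
    df["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(float)
)
```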
-
Standardizing the "Price" Variable
We now turn our attention to refining the "Price" variable. Properly formatting this variable is crucial because the price of an app can significantly influence consumer behavior and, consequently, the app's survival on the Google Play Store.
​
We begin by examining how the "Price" data is presented using df.Price.value_counts(). This function helps us identify any formatting inconsistencies, such as the presence of dollar signs which could impede numerical analysis.
​
To make the "Price" data suitable for analysis, we strip out the dollar signs, which are non-numeric and would block conversion, and then cast the cleaned strings to floats so that mathematical operations can be performed on the "Price" data.
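A sketch of the cleanup:

```python
# Inspect the raw values, then drop the dollar sign and cast to float
df.Price.value_counts()
df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(float)
```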
-
Refining the "Content Rating" Variable
The next critical step in our data cleaning process focuses on the "Content Rating" variable. This variable, which categorizes apps based on their suitability for different age groups, can significantly influence user engagement and app retention rates. Properly managing this data is key to understanding demographic impacts on app churn.
​
Initially, we examine the distribution of values within the "Content Rating" column. This overview helps us identify any categories that are inappropriate for our analysis or potentially mislabeled.
From the distribution, we identify categories that are not suitable for our analysis, such as "Unrated" and "Adults only 18+". These categories might be less relevant due to their sparse data or specific market segmentation, which could skew the broader analysis of general app survival.
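A sketch of the inspection and filter:

```python
# Distribution of content ratings
df["Content Rating"].value_counts()

# Drop the sparse categories that would skew the broader analysis
df = df[~df["Content Rating"].isin(["Unrated", "Adults only 18+"])]
```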
-
Streamlining the Dataset by Dropping Variables
As we continue to refine our dataset for the "Surviving the App-pocalypse" project, an important part of the process involves simplifying the data by selectively removing variables that may not contribute significantly to our analysis of app survival. This step ensures that our focus remains on the most impactful factors, reducing complexity and enhancing the clarity of our predictive models.
After careful review of the dataset and understanding the role of each variable in app survival analysis, we decide to drop the "Genres", "Current Ver", and "Android Ver" columns.
Genres: While potentially informative, the genre of an app might overlap significantly with the 'Category' variable, which is already included in our analysis. Reducing redundancy can simplify the model without losing crucial information.
Current Ver: The current version of an app, while indicative of updates and improvements, often contains varied formatting and is not standardized across entries, making it difficult to use effectively in a predictive model.
Android Ver: This variable indicates the minimum Android version required to run the app. Similar to 'Current Ver', the diversity in version formatting and its indirect impact on app survival make it less valuable for our primary analysis focus.
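A sketch of the drop:

```python
# Remove redundant or inconsistently formatted columns
df = df.drop(columns=["Genres", "Current Ver", "Android Ver"])
```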
With the completion of our meticulous data cleaning and preparation process, our dataset for the "Surviving the App-pocalypse" project now presents a streamlined and focused view of the critical factors that may influence app survival on the Google Play Store, and it is ready for the analytical stages ahead.
Defining the Dependent Variable: Identifying App Churn
Transitioning from data cleaning to deeper analysis, we now focus on defining the dependent variable for our survival analysis. This variable will capture the churn status of apps based on their update frequency, an important indicator of an app’s active maintenance and engagement with its user base.
​
-
Converting 'Last Updated' to Datetime
First, we convert the "Last Updated" column from a string format to a datetime object. This transformation allows us to perform date-time calculations more accurately and efficiently, crucial for the subsequent steps.​
-
Identifying the Most Recent Update
We determine the most recent update across all apps by finding the maximum date in the "Last Updated" column. This date represents the most current point of reference for assessing whether an app has been actively updated.
-
Setting the Churn Threshold
To define app churn, we set a threshold date, calculated as six months prior to the most recent update date. This threshold helps us identify apps that have not been updated recently and are potentially less engaged with maintaining user interest or compatibility with new device standards.
-
Creating the 'Churn' Variable
We define the churn variable by comparing each app's last update date against the threshold date. If an app’s last update was before this threshold, it is marked as churned (1), otherwise not churned (0). This binary variable is created by comparing "Last Updated" to threshold_date and converting the result to an integer format.
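Putting the four steps together, a sketch looks like this (using pd.DateOffset for the six-month window is our assumption):

```python
# 1. Parse the update dates
df["Last Updated"] = pd.to_datetime(df["Last Updated"])

# 2. Most recent update anywhere in the dataset
max_date = df["Last Updated"].max()

# 3. Churn threshold: six months before the most recent update
threshold_date = max_date - pd.DateOffset(months=6)

# 4. Flag apps whose last update predates the threshold
df["churn"] = (df["Last Updated"] < threshold_date).astype(int)
```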
-
Calculating the Mean of the Churn Variable
To understand the overall churn rate within our dataset, we compute the mean of the 'churn' variable. This statistic tells us the proportion of all apps that have not been updated in the past six months, providing a high-level view of app maintenance activity on the Google Play Store.
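The calculation itself:

```python
# Proportion of churned apps (reported below as roughly 35.72%)
df["churn"].mean()
```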
A churn rate of 35.72% suggests that more than one-third of the apps have potentially been neglected or abandoned by their developers, indicating a lack of recent updates. This could point to several underlying issues such as reduced user interest, developer abandonment, or a shift in market dynamics that have made continued updates unfeasible or unnecessary.
​
-
Adding Time Duration Since Last Update
We introduce a new variable, 'days_since_last_update', which measures the number of days that have elapsed since each app was last updated. This is calculated by subtracting the "Last Updated" date from the most recent update date (max_date) across all apps and converting the result to days.
​
This variable provides a continuous measure of time, offering a more granular insight than the binary 'churn' variable. It helps us understand not just whether apps are being updated, but how long they have been inactive.
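A sketch of the calculation (reusing max_date from the churn step):

```python
# Days elapsed between each app's last update and the newest update in the data
df["days_since_last_update"] = (max_date - df["Last Updated"]).dt.days
df["days_since_last_update"].head()
```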
-
Removing the "Last Updated" Variable
We make a strategic decision to streamline our data by removing the "Last Updated" variable. This adjustment is aimed at simplifying the dataset and focusing more directly on variables that will be used in our predictive models and survival analysis.
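A sketch of the drop (saving the result as df_final, the name used in the modeling sections, is our assumption):

```python
# The raw datetime is no longer needed once churn and duration are derived
df_final = df.drop(columns=["Last Updated"])
```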
The dataset is now fully prepared and ready for the analysis phase. This clean and structured dataset includes a range of variables that are essential for understanding the factors affecting app survival on the Google Play Store.
Kaplan-Meier Survival Analysis: Comparing Free vs Paid Apps
In this phase of our project, we employ the Kaplan-Meier Estimator, a non-parametric statistic used to estimate the survival function from lifetime data. This approach is particularly valuable in our context to analyze the impact of an app being free versus paid on its survival, defined as the time since the last update before being considered as churned.
​
-
Setting Up Kaplan-Meier Fitter
We initiate a Kaplan-Meier Fitter object from the lifelines library. This tool will help us model the survival probability of apps over time, allowing us to visually and statistically compare the longevity of free and paid apps.​
-
Segmenting the Data
The dataset is divided into two groups based on the type of app: free and paid. This distinction is crucial as it allows us to explore how monetary strategies might influence app updates and user retention.
-
Fitting the Model to Free Apps
We fit the Kaplan-Meier Estimator to the data for free apps, using "days_since_last_update" as the time variable and "churn" as the event observed (churn being defined as not having updated in the past six months).
-
Fitting the Model to Paid Apps
Similarly, we fit the model to the data for paid apps. Each group is labeled appropriately in the plots to distinguish between the survival curves of free and paid apps.
-
Plotting and Comparing Survival Curves
The survival functions for both groups are plotted on the same graph to facilitate direct comparison. This visualization shows the probability of an app not churning as a function of the number of days since its last update.
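A sketch of the whole Kaplan-Meier comparison (the "Free"/"Paid" labels follow the dataset's standard coding of the Type column):

```python
kmf = KaplanMeierFitter()

# Split the apps by monetization type
free_apps = df_final[df_final["Type"] == "Free"]
paid_apps = df_final[df_final["Type"] == "Paid"]

ax = plt.subplot(111)

# Fit and plot the survival curve for free apps
kmf.fit(free_apps["days_since_last_update"], event_observed=free_apps["churn"], label="Free")
kmf.plot_survival_function(ax=ax)

# Fit and plot the survival curve for paid apps on the same axes
kmf.fit(paid_apps["days_since_last_update"], event_observed=paid_apps["churn"], label="Paid")
kmf.plot_survival_function(ax=ax)

plt.title("Kaplan-Meier Survival Curves: Free vs Paid Apps")
plt.xlabel("Days since last update")
plt.ylabel("Survival probability")
plt.show()
```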
Insights from the Survival Curve - Free vs Paid Apps
-
Initial Similarity
At the beginning of the observed period, both free and paid apps show similar survival probabilities. This suggests that initially, both types of apps are equally likely to receive updates, regardless of whether they are free or paid.
-
Rapid Decline in Survival Probability
Both curves demonstrate a sharp decline within the first 500 days. This steep drop indicates that a significant portion of apps, whether free or paid, tend not to receive updates after this period. The rapid decrease could be attributed to developers either abandoning the apps or shifting focus to newer projects.
-
Convergence and Divergence
Around the 500 to 1000 days mark, the curves start to show slight differences. The survival probability for paid apps appears to decline at a slower rate compared to free apps. This could imply that paid apps, possibly due to financial incentives or smaller but more dedicated user bases, receive updates for a longer period.
-
Long-term Stability
Post 1000 days, both curves begin to flatten out, indicating a stabilization. However, the survival probability for paid apps tends to remain consistently higher than for free apps as time progresses. This suggests that while the majority of apps stop receiving updates relatively early in their lifecycle, paid apps are more likely to receive sporadic updates or maintain support over a longer period.
-
Tail End Behavior
Towards the end of the observed period (beyond 2500 days), the survival probability of free apps approaches close to zero, suggesting almost all free apps have stopped receiving updates. In contrast, a small proportion of paid apps still show a slightly higher survival probability, reinforcing the notion that paid apps may receive longer-term support.
Setting Up for Cox Proportional Hazards Model​
After analyzing the survival probabilities of apps using the Kaplan-Meier estimator, the next step in our project is to build and assess a Cox Proportional Hazards Model. This model will allow us to explore the relative impacts of various factors on the survival times of apps, giving us deeper insights into what influences app churn.
​
-
Assessing Data Completeness
Before building the Cox model, it's crucial to ensure that our data is complete with no missing values. Missing data can significantly distort the results of a survival analysis. We start by using df_final.info() to get a summary of the dataset, which shows the count of non-null entries in each column and helps identify columns with missing values.
-
Removing NaN Values
To maintain the integrity of the Cox model, we need to remove any rows containing NaN values. This is because the Cox model requires complete cases to properly compute the risks associated with different covariates.
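A sketch of the completeness check and the filter:

```python
# Confirm which columns still contain missing values, then keep complete rows only
df_final.info()
df_final = df_final.dropna()
```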
-
Converting Categorical Variables into Dummy Variables
To incorporate categorical variables in the Cox model, we first need to transform these into dummy (or indicator) variables. This process involves creating binary columns for each category except for the base category, which is excluded to avoid multicollinearity.
-
Splitting the Data into Training and Testing Sets
The dataset is divided into an 80% training set and a 20% testing set, using stratified sampling to ensure that each set is representative of the overall dataset. This split allows for robust model training while reserving a portion of the data for model validation.
​​
-
Fitting the Cox Proportional Hazards Model
The model is then fitted to the training dataset, with 'days_since_last_update' serving as the duration column and 'churn' as the event column. This setup directs the model to analyze how the predictors influence the likelihood of an app becoming inactive over time.
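A sketch of the full setup (the dummy-encoded column list, the stratification on churn, and the random seed are assumptions):

```python
# One-hot encode the categorical predictors, dropping each base level to avoid multicollinearity
df_dummies = pd.get_dummies(
    df_final, columns=["Category", "Type", "Content Rating"], drop_first=True
)

# 80/20 split, stratified so both sets keep the same churn proportion
train, test = train_test_split(
    df_dummies, test_size=0.2, stratify=df_dummies["churn"], random_state=42
)

# Fit the Cox Proportional Hazards model on the training set
cph = CoxPHFitter()
cph.fit(train, duration_col="days_since_last_update", event_col="churn")
cph.print_summary()
```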
-
Plotting Coefficients from the Cox Proportional Hazards Model
After fitting the Cox Proportional Hazards Model, an insightful next step is to visually analyze the impact of each predictor on app churn. Plotting the coefficients of the model can provide a clear and immediate understanding of which factors are most influential in determining the survival of apps on the Google Play Store.
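A sketch of the plot:

```python
# Plot each coefficient with its confidence interval
plt.figure(figsize=(8, 12))
cph.plot()
plt.title("Cox Proportional Hazards Model Coefficients")
plt.show()
```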
Key Insights from the Coefficient Plot​
Positive and Negative Influences:
-
Positive Coefficients
Variables with positive coefficients increase the hazard of churn, meaning these factors are associated with a higher likelihood of an app ceasing updates sooner.
​
-
Negative Coefficients
Conversely, variables with negative coefficients decrease the hazard of churn, indicating these elements potentially extend the update lifecycle of apps. Categories like "Category_GAME" and "Category_PERSONALIZATION" exhibit negative coefficients, implying a lower risk of churn, possibly due to higher engagement or user demand.
​
-
Significance of Variables
The width of the confidence intervals (CIs) for each variable's coefficient indicates the precision of the estimate. Narrower CIs suggest a higher level of certainty about the effect size. For instance, basic attributes such as "Rating" and "Size" have narrow CIs, indicating a robust estimation of their impact.
​​
Variables with CIs that cross the zero line (indicated by the dashed vertical line) are not statistically significant at typical confidence levels, meaning we cannot conclusively say they affect the churn hazard. Variables such as "Category_VIDEO_PLAYERS" and "Category_WEATHER" straddle the line, highlighting uncertainty in their effects.
​
-
Impact of Content Rating and App Type
Different content ratings and the type of app (Free or Paid) also show varied impacts. Notably, "Type_Paid" has a negative coefficient, suggesting paid apps may have a lower risk of churn, aligning with the notion that financial incentives might encourage more consistent updates.
​
-
Category-Specific Trends
The impact of app categories on churn varies widely, with some categories like "Category_BUSINESS" and "Category_EDUCATION" showing negative impacts, suggesting these apps might be updated more frequently or for longer durations.
Assessing the Cox Proportional Hazards Model on the Test Set​
After fitting the Cox Proportional Hazards Model to the training data, evaluating its performance on the test set is crucial to ensure that the model is effective and generalizes well to new data. This step involves isolating the test data, predicting the hazard, and then assessing the model's accuracy using the Concordance Index.
​
-
Isolate and Prepare the Test Data
Extract 'days_since_last_update' and 'churn' from the test set to use as the outcome variables. The rest of the data, which consists of predictors used in the Cox model, is isolated into test_X.
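A sketch of the split (variable names other than test_X are assumptions):

```python
# Outcome variables for the held-out apps
test_durations = test["days_since_last_update"]
test_events = test["churn"]

# Covariates only
test_X = test.drop(columns=["days_since_last_update", "churn"])
```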
-
Calculate the Predicted Hazard
The predict_partial_hazard method from the CoxPHFitter object is used to compute the hazard for each individual in the test set based on their covariates. This metric reflects the risk of the event occurring at each time point, given the covariates.
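The prediction step:

```python
# Relative risk score for each app in the test set
predicted_hazard = cph.predict_partial_hazard(test_X)
```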
-
Concordance Index Evaluation
This statistic is used to evaluate the predictive accuracy of the model. It measures the model's ability to correctly provide a relative ranking of the risk of events (churn) based on the covariates. A Concordance Index (C-index) of 0.5 suggests no better than random predictions, while 1.0 indicates perfect predictive accuracy. The index is calculated using the actual times, the predicted hazards (with sign inversion for correct directional interpretation), and the event indicators from the test set.
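A sketch of the evaluation (the write-up reports a value of roughly 0.6):

```python
# C-index: how often the model ranks a pair of apps' churn risks in the right order.
# Hazards are negated so that higher risk corresponds to shorter time until churn.
c_index = concordance_index(test_durations, -predicted_hazard, test_events)
print(c_index)
```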
A C-index of 0.6 indicates that the model performs better than random chance, which would score a 0.5. This suggests that the model has learned to identify some patterns or relationships in the data that are predictive of the outcome (churn).
While the C-index shows that the model has predictive validity, it also highlights room for improvement. A score close to 0.6 is not especially high: it means the model correctly orders the relative churn risk of a randomly chosen pair of apps only about 60% of the time.
​
-
Implications for Model Improvement
Feature Review and Engineering: Consider reviewing the features used in the model or engineering new features that might capture more nuances of the data. Sometimes, incorporating interaction terms or polynomial features can uncover relationships not captured by linear terms alone.
​
Model Complexity: If the model is too simple, it might not capture all the complexities of the data. Experimenting with different sets of variables or different types of survival models (like those including random effects or time-dependent covariates) might enhance predictive accuracy.
​
Data Quality and Size: Ensuring high-quality, comprehensive data can significantly affect model performance. More data, or more representative data, might improve the model's ability to generalize. Also, checking for data imbalances or biases that could affect model training and testing is crucial.
​
Alternative Modeling Techniques: Considering other modeling techniques or advanced machine learning algorithms designed for censored data might provide better predictions. Techniques like random forests for survival analysis or deep learning approaches could be explored.