Titanic Survivor Prediction

Dive into the Titanic dataset to predict survival outcomes with a logistic regression model. This project explores how age, sex, and fare influenced the chances of surviving the historic sinking. By analyzing these factors, we aim to uncover deeper insights into the social dynamics and personal stories of those aboard the Titanic.

Libraries and Data

In the first step of our Titanic Survivor Prediction project, we lay the groundwork by loading the essential tools and data. Here’s how we start:

Gathering Our Tools:

We begin by importing a suite of Python libraries that are crucial for data manipulation and analysis. numpy and pandas help us handle and explore the dataset, while statsmodels and sklearn bring in the statistical firepower for building and evaluating our logistic regression model.

Loading the Dataset:

With our tools ready, we load the Titanic dataset using pandas, a library that makes data manipulation intuitive and straightforward. The dataset, stored in a CSV file named titanic.csv, contains detailed records of the Titanic passengers which will serve as the foundation of our analysis.

Understanding the Dataframe:

We dive deeper into the dataset's details by invoking df.info(). This method gives us a concise summary of the dataframe, including the number of entries, the type of data in each column, and the non-null count per column, which shows how much missing data we might need to address.
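The loading and inspection steps above can be sketched as follows. This is a minimal sketch, not the article's original code: the helper names `load_passengers` and `summarize` are hypothetical, and only pandas is imported because the other libraries are not needed at this stage.

```python
import io

import pandas as pd


def load_passengers(path="titanic.csv"):
    """Read the passenger records into a DataFrame (path or file-like object)."""
    return pd.read_csv(path)


def summarize(df):
    """Return the text printed by df.info(): entry count, dtypes, non-null counts."""
    buffer = io.StringIO()
    df.info(buf=buffer)
    return buffer.getvalue()
```

Typical usage would be `df = load_passengers("titanic.csv")` followed by `print(summarize(df))`, or simply `df.info()` in a notebook.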

Outlier Removal and Exploratory Data Analysis

In this pivotal step, we focus on refining our dataset to ensure accuracy and enhance the clarity of our predictions. Here’s how we proceed:

Understanding the Basics:

We start with a basic statistical summary. This provides a quick overview of the numerical columns—revealing tendencies, variability, and potential irregularities in our data that might require closer examination.
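A summary of this kind is most likely produced with pandas' `describe()`, although the article does not show the call. The sketch below runs it on a tiny placeholder frame with made-up values so that it works on its own; on the real data it would be called on the full dataframe.

```python
import pandas as pd

# Placeholder rows with made-up values, only so the snippet is runnable.
demo = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.10],
    }
)

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = demo.describe()
print(summary.loc[["mean", "std", "min", "max"]])
```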

Output:

Survival Rates:

The Survived column indicates that approximately 38.56% of the passengers in our dataset survived the disaster. This survival rate sets the stage for our analysis, highlighting the grim reality faced by the majority of passengers.

Passenger Class Distribution:

The mean value of the Pclass column is approximately 2.31, suggesting that a significant number of passengers were in second and third class. The diversity in passenger class may influence survival rates, as historical accounts suggest that first-class passengers had better access to lifeboats.

Age Distribution:

The average age of passengers was about 29.47 years, with a standard deviation of 14.12 years, indicating a wide spread of ages. The age range (from 0.42 to 80 years) spans infants to the elderly and could be pivotal in analyzing survival patterns, as age might have played a role in survival chances.

Family Aboard:

On average, passengers had about 0.53 siblings or spouses and 0.38 parents or children aboard. The presence of family might have influenced decisions during evacuation, affecting survival outcomes.

Fare Details:

The average fare paid was approximately £32.31 (fares in this dataset are recorded in pounds sterling), but with a high standard deviation of £49.78, indicating significant variation in what passengers paid to board the ship. The fare range (from £0 to £512.33) suggests economic diversity among the passengers, which might correlate with passenger class and potentially survival.

Outliers and Extremes:

The presence of high maximum values in fare and family numbers (like paying £512.33 for a ticket or having 8 siblings/spouses aboard) points to outliers that may need further investigation to understand their impact on the analysis.

Streamlining and Transforming the Data:

Certain columns like "Name" may not be directly useful for our predictive model, so we streamline the dataset by removing these to focus solely on impactful variables. We convert the "Sex" column from categorical (male/female) to binary (0 for male, 1 for female), simplifying the model’s input without losing information. The "Pclass" (Passenger Class) column is transformed into dummy variables to retain its categorical significance without imposing ordinality. This involves creating binary columns for second and third class, with first class serving as the reference category, so that each class coefficient is read as a comparison against first class.
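The transformations described above could look roughly like this. It is a sketch under assumptions: the column names "Name", "Sex" and "Pclass" are taken from the text, the helper name `prepare_features` is hypothetical, and `drop_first=True` is assumed because the results later compare second and third class against first class.

```python
import pandas as pd


def prepare_features(df):
    """Drop the name column, encode sex as 0/1, and one-hot encode passenger class."""
    out = df.drop(columns=["Name"])
    # male -> 0, female -> 1
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # drop_first=True omits the first-class dummy, making first class the baseline.
    out = pd.get_dummies(out, columns=["Pclass"], drop_first=True, dtype=int)
    return out


# Placeholder rows with made-up values, only to show the resulting columns.
demo = pd.DataFrame(
    {
        "Survived": [1, 0, 1],
        "Pclass": [1, 3, 2],
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Sex": ["female", "male", "female"],
        "Age": [29.0, 40.0, 18.0],
    }
)
prepared = prepare_features(demo)
print(prepared.columns.tolist())
```

Dropping one dummy also avoids perfect collinearity with the intercept that is added to the feature set later.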

Visual Exploration with Correlations:

We visualize the relationships between continuous variables with a correlation heatmap. This visualization helps us see how variables relate to each other, potentially highlighting dependencies or conflicts that could inform further adjustments to our model.
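The numbers behind such a heatmap come from a pairwise correlation matrix. Here is a minimal sketch using pandas' `corr()` (Pearson correlation by default); the helper name and the demo values are illustrative, and the plotting call the article used is not shown in the text.

```python
import pandas as pd


def correlation_matrix(df, columns):
    """Pairwise Pearson correlations between the selected numeric columns."""
    return df[columns].corr()


# Placeholder rows with made-up values, only so the snippet is runnable.
demo = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
        "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
        "Survived": [0, 1, 1, 1, 0],
    }
)
corr = correlation_matrix(demo, ["Age", "Fare", "Survived"])
print(corr.round(2))
```

The resulting matrix can then be passed to a plotting function such as `seaborn.heatmap(corr, annot=True)` to draw the heatmap.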

Output:

Outlier Detection and Removal:

Outliers can skew results, leading to less reliable predictions. We implement a function to remove outliers based on the standard deviation of each numeric variable, setting a threshold at three standard deviations away from the mean. This method helps in maintaining the integrity of our data, ensuring that our model learns from patterns that are representative of the general population of the dataset.
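One possible shape for such a function is sketched below. The function name, its signature, and the choice to keep rows with missing values are assumptions, not the article's actual implementation.

```python
import pandas as pd


def remove_outliers(df, columns, threshold=3.0):
    """Keep rows whose values lie within `threshold` standard deviations of each column mean."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        std = df[col].std()
        if std == 0:
            continue  # a constant column has no outliers
        z = (df[col] - df[col].mean()) / std
        # Rows with a missing value in this column are kept (assumption).
        keep &= (z.abs() <= threshold) | df[col].isna()
    return df[keep]


# Made-up fares: nineteen ordinary values and one extreme one.
demo = pd.DataFrame({"Fare": [10.0] * 10 + [11.0] * 9 + [500.0]})
cleaned = remove_outliers(demo, ["Fare"])
print(len(demo), "->", len(cleaned))
```

Note that in a small sample a single extreme value inflates the standard deviation itself, so the three-sigma rule works best on columns with many observations.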

Building and Evaluating the Logistic Regression Model

As we continue our journey through the Titanic dataset, our next port of call is constructing and assessing a logistic regression model. This model will help us predict survival outcomes based on various passenger characteristics. Here’s how we navigate through this crucial phase:

Preparing the Data:

We start by defining our target variable y, which represents whether a passenger survived (1) or not (0). The feature set X includes all other relevant passenger data, excluding the Survived column. Because statsmodels estimators do not add an intercept automatically, we add a constant column to the feature set. The intercept captures the baseline log-odds of survival when all other features are zero.

Splitting the Data:

We divide our dataset into training and testing sets, using 80% of the data for training and the remaining 20% for testing. This split allows us to train our model on a large subset while reserving a separate portion for unbiased evaluation.
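The 80/20 split can be sketched with scikit-learn's `train_test_split`. The arrays below are placeholders, and `random_state=42` is an assumption added for reproducibility; the article does not say which seed, if any, was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels, only so the snippet is runnable.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```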

Fitting the Model and Model Summary:

With our data prepared, we create and fit a logistic regression model using the training set. This process involves finding the coefficients that maximize the likelihood of the observed survival outcomes. After fitting the model, we review a detailed summary that includes coefficients, significance levels, and other diagnostic metrics. This summary helps us understand the strength and influence of each predictor within the model.

Output:

Making Predictions:

Moving forward, we use our model to predict the probabilities of survival for passengers in the test set. These probabilities give us a nuanced view of survival chances rather than a simple binary outcome. For practical application, we convert these probabilities into binary predictions. We classify a passenger as likely to have survived (1) if their predicted probability exceeds 0.5, and as not survived (0) otherwise.
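The thresholding step is simple enough to show directly. In this sketch the probabilities are placeholders; in the workflow described above they would come from the fitted model (for a statsmodels result, `result.predict(X_test)` returns them). The helper name is hypothetical.

```python
import numpy as np


def to_binary_predictions(probabilities, threshold=0.5):
    """Label a passenger 1 (survived) only if the probability exceeds the threshold."""
    return (np.asarray(probabilities) > threshold).astype(int)


# Placeholder probabilities, only so the snippet is runnable.
probs = [0.91, 0.50, 0.12, 0.67]
print(to_binary_predictions(probs))  # [1 0 0 1]
```

A probability of exactly 0.5 is classified as 0 here because the rule is "exceeds 0.5".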

Interpreting Our Logistic Regression Model

Now that our logistic regression model is built and operational, it's crucial to understand what the model's coefficients are telling us about the factors affecting survival on the Titanic. This phase of our analysis is dedicated to decoding the model's output to extract meaningful insights that can inform both historical understanding and predictive accuracy.

Understanding the Model Coefficients:

We employ a custom function designed to elucidate the coefficients of our logistic regression model. This function not only calculates the impact of each variable but also expresses it in terms of the change in odds of survival.

Deep Dive into Variables:

Each coefficient in the model is explored to understand its influence on the likelihood of survival. For binary variables (such as 'Sex' or the 'Pclass' dummy columns), the function explains how a change from 0 to 1 (e.g., from male to female) alters the survival odds. For continuous variables (like 'Age' or 'Fare'), it reveals how a one-unit increase changes the survival odds.

Percentage Change in Odds:

The function converts the logistic regression coefficients into a more intuitive metric: the percentage change in the odds of survival. For a coefficient b, this is (exp(b) - 1) × 100, so a positive coefficient raises the odds and a negative one lowers them. This transformation helps in understanding the practical implications of each variable more clearly.

Statistical Significance:

It’s not just about the size of the coefficients but also their reliability. The function checks the statistical significance of each coefficient to determine whether the observed effects are likely to reflect true patterns, or if they might just be due to random variation.
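A function along these lines could be sketched as follows. The name `explain_coefficients`, the 0.05 significance level, and the example values are all assumptions; the article's own function is not shown. It expects the coefficient and p-value Series that a fitted statsmodels result exposes as `params` and `pvalues`.

```python
import numpy as np
import pandas as pd


def explain_coefficients(params, pvalues, alpha=0.05):
    """Turn log-odds coefficients into odds ratios and percentage changes in odds."""
    rows = []
    for name in params.index:
        if name == "const":
            continue  # the intercept is not a predictor effect
        odds_ratio = np.exp(params[name])
        rows.append(
            {
                "variable": name,
                "odds_ratio": odds_ratio,
                "pct_change_in_odds": (odds_ratio - 1) * 100,
                "significant": bool(pvalues[name] < alpha),
            }
        )
    return pd.DataFrame(rows).set_index("variable")


# Hypothetical coefficients and p-values, chosen only to illustrate the conversion.
params = pd.Series({"const": -1.0, "x_binary": np.log(2.0), "x_cont": -0.05})
pvalues = pd.Series({"const": 0.01, "x_binary": 0.001, "x_cont": 0.20})
table = explain_coefficients(params, pvalues)
print(table.round(2))
```

A coefficient of log(2) doubles the odds, which the table reports as a 100% increase; a coefficient of -0.05 lowers the odds by roughly 4.9% per unit.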

Evaluating the Predictive Performance of Our Model

In this critical phase of our Titanic Survivor Prediction project, we turn our attention to evaluating how well our logistic regression model performs in classifying whether passengers survived. To do this, we use a comprehensive suite of metrics that offer insights into the accuracy and reliability of our predictions.

Understanding the Evaluation Metrics:

Accuracy:

This is our starting point, providing a straightforward measure of overall correctness. It tells us what proportion of total predictions made by our model was accurate—both true positives and true negatives.

F1-Score:

This metric is the harmonic mean of precision (quality of positive predictions) and recall (ability to find all positive instances). It is particularly useful when the classes are imbalanced or when both false positives and false negatives are costly, since accuracy alone can hide poor performance on the minority class.

Sensitivity (Recall):

This focuses exclusively on the model’s ability to correctly identify actual survivors. High sensitivity is crucial for ensuring that most survivors are correctly predicted by the model.

Specificity:

Complementing sensitivity, specificity measures how well the model can identify those who did not survive. High specificity means that few non-survivors are wrongly predicted as survivors.
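The four metrics above all derive from the confusion-matrix counts, which the following plain-Python sketch makes explicit. The function name and the example labels are illustrative; scikit-learn offers equivalents such as `accuracy_score`, `f1_score` and `recall_score`, which the article may have used instead.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, F1, sensitivity and specificity from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-survivors found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed survivors

    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if precision + sensitivity
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Made-up labels: 4 actual survivors, 6 actual non-survivors.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example the model finds 3 of 4 survivors (sensitivity 0.75) and 4 of 6 non-survivors (specificity about 0.67), for an overall accuracy of 0.7.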

Decoding the Logistic Regression Coefficients

Now that our logistic regression model is trained and evaluated, it's time to delve deeper into what the coefficients tell us about the factors influencing survival on the Titanic.

Output:

Impact of Sex:

Significant Boost in Survival Odds for Females: The change from 'male' to 'female' (0 to 1 in our model) results in a 2170.25% increase in the odds of survival, meaning the odds for a woman are roughly 22.7 times those of an otherwise comparable man. This increase is statistically significant and consistent with historical accounts that women had a higher priority for lifeboat access during the evacuation.

Effect of Age:

Slight Decrease with Age: As age increases by one year, the odds of surviving decrease slightly by 5.34%. This finding is statistically significant and could reflect the physical limitations or social norms that made survival less likely for older passengers.

Family Aboard – Siblings/Spouses and Parents/Children:

Negative Impact of More Siblings/Spouses: An additional sibling or spouse aboard leads to a 42.10% decrease in survival odds, significantly affecting the chances. This could indicate difficulties in organizing larger family groups during the evacuation.

Marginal Effect of Parents/Children: Each additional parent or child decreases survival odds by 5.16%, although this result isn't statistically significant, suggesting that the effect might be less uniform or influenced by other factors.

Influence of Fare:

Minimal Impact on Survival: Each additional unit of fare raises the survival odds by only 0.14%, and this effect is not statistically significant, indicating that within the confines of our model, fare alone doesn't strongly determine survival.

Role of Passenger Class:

Second Class Disadvantage: Moving from first class to second class reduces survival odds by 73.68%, a significant drop that highlights the disparities in survival based on passenger class.

Third Class Severe Disadvantage: Even more drastic is the shift from first to third class, which sees a reduction in survival odds by 92.74%, confirming the severe disadvantage faced by third-class passengers.