Retail Store Sales Drivers
Let's face it, the retail world is tough and getting the edge in sales can feel like searching for a needle in a haystack. But what if you could know exactly what affects your store's bottom line? That's where our project steps in. With a curious mind and a dash of data science, we're peeling back the layers of retail complexity to discover what really makes the cash registers ring. Using multilinear regression, we're not just crunching numbers; we're uncovering the story behind each sale.
Libraries and Data - Choosing Your Variables
​
Being a business analyst is a bit like being a chef: you need the right ingredients to make a great dish. Similarly, understanding what drives retail sales starts with picking the right variables. This isn't just about data; it's about knowing your business like the back of your hand and choosing the factors that truly matter.
​
- Set Up the Environment:
Just like a chef sharpens their knives, we begin by setting up our analytical tools. We load essential libraries that are our best friends in the journey of data analysis—think of them as our pots and pans in data cooking.
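The original snippet isn't reproduced here, but a typical toolkit for this kind of analysis is pandas and NumPy for data handling, statsmodels for the regression itself, and scikit-learn for the train/test split and error metrics. The imports below are an assumed setup along those lines.

```python
import numpy as np                 # numerical helpers (e.g. square root for RMSE)
import pandas as pd                # tabular data handling
import statsmodels.api as sm       # OLS regression and model summaries
from sklearn.model_selection import train_test_split                  # train/test split
from sklearn.metrics import mean_absolute_error, mean_squared_error   # accuracy metrics
```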
- Loading the Data:
Our analysis starts by loading up our data.
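The data source isn't shown in this excerpt; assuming a CSV export of the store records (the file name below is a placeholder), loading it could look like this:

```python
# Read the store-level records into a DataFrame (file name is a placeholder)
df = pd.read_csv("retail_stores.csv")
print(df.shape)   # how many stores and columns we are working with
```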
- Picking the Right Variables:
Like selecting the best spices for a meal, choosing the right variables is crucial. We're not just throwing everything into the pot; we're carefully selecting data points that can tell us the most about our sales dynamics. Here's how we refine our dataset to include only the most impactful variables:
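A minimal sketch of that refinement step, using the column names referenced throughout the rest of the analysis (tsales, margin, nown, inv1, inv2, ssize, start); the DataFrame name is an assumption.

```python
# Keep only the candidate sales drivers plus the target itself
selected_cols = ["tsales", "margin", "nown", "inv1", "inv2", "ssize", "start"]
df = df[selected_cols]
df.head()   # quick look at the first few stores
```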
- Understanding Our Data:
Before we dive deeper, let's take a moment to get to know our data a little better.
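Since the paragraph below discusses the mix of integer and float columns, this step is presumably pandas' info(); a sketch:

```python
# Column names, data types, and non-null counts for each variable
df.info()
```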
We have a mix of integers and floats, which tells us about the nature of the data we are analyzing: solid counts like total sales alongside more fluid measurements like margin percentages.
Analyzing Your Data
​
Now that we’ve set the stage with our selected variables, it’s time to really get to know them. Think of this step as having a quick chat with each variable to understand its personality and quirks.
​
- Delving Deeper Into the Data:
Here’s how we initiate this insightful conversation.
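The summary statistics discussed below (means, minimums, maximums, spreads) are what pandas' describe() reports, so this step likely looks like:

```python
# Count, mean, standard deviation, min, quartiles, and max for every variable
df.describe()
```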
By analyzing these statistics, we can start to see patterns or anomalies. Does the data spread widely or is it tightly grouped? Are there any variables that seem unusually high or low? These insights can lead us to deeper questions and hypotheses, setting the stage for more complex analysis.
​
- Overview of the Variables:
tsales (Total Sales): From a modest $50,000 to a whopping $5,000,000, with an average around $833,584, the range here is vast. This tells us our stores vary greatly in performance, a key point of interest for pinpointing what drives sales.
margin (Profit Margin): The margins span from 16% to 66%, averaging about 38.77%. This variability suggests different stores manage costs and pricing strategies quite differently.
nown (Ownership Factor): With most stores having a value close to 1, and a maximum of 10, this could indicate different levels of ownership or franchise status influencing operations.
inv1 and inv2 (Inventory Levels): These figures highlight the stock volume stores handle, varying significantly from store to store. The large standard deviations suggest some stores might be inventory-heavy, potentially affecting liquidity and sales efficiency.
ssize (Store Size): Store sizes range from 16 to 1,214 square meters, with a typical store around 151 square meters, indicating a mix of boutique and large-format stores that shapes customer experience and sales volume.
start (Business Start Year or Period): Most stores seem to have started around the year 40 (assuming a coding format), with some outliers starting as early as year 16 and as late as 90, hinting at varying levels of maturity and market experience.
Deciphering Relationships with the Correlation Matrix
​
After getting familiar with our individual variables, it's time to see how they interact with each other. Think of this step as drawing a map of the relationships within our dataset. We're not just looking for friendships (positive correlations) but also rivalries (negative correlations), and everything in between.
​
- Navigating the Web of Relationships:
Here's how we draw that map, computing the pairwise correlations across all our selected variables.
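A plausible version of this step, using pandas' built-in corr():

```python
# Pairwise Pearson correlations between all selected variables
corr_matrix = df.corr()
print(corr_matrix.round(2))
```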
- Decoding the Correlation Matrix:
The Loudest Relationships:
- Total Sales and Store Size (0.53): A moderate, positive relationship. It seems the larger the store, the higher the sales. This could be a case of 'more room, more revenue', suggesting that customers enjoy a wider selection or a more pleasant shopping experience.
- Profit Margin and Start (0.48): Also moderately positive. Perhaps stores that have been around longer have fine-tuned their operations, maximizing margins as they grow more seasoned in the retail game.
​
Subtle Nuances:
- Ownership and Inventory 2 (0.12): There's a hint of a relationship here. The way a store is owned or managed might just have a slight sway over certain types of inventory.
- Profit Margin and Inventory 2 (0.20): Another whisper of a relationship. It's not loud, but it's there, suggesting that how much inventory is carried might play a role in profitability.
​
Quiet Corners:
- Inventory 1 and Start (-0.012): There's barely a murmur between these variables. The age or start period of a store doesn't seem to significantly affect this type of inventory level.
Setting the Stage for Success – Training and Test Sets
Imagine preparing for a play. Before the grand premiere, you'd run dress rehearsals to make sure everything goes off without a hitch. That's exactly what we're doing here, but our stage is the world of data, and our actors are the predictive models we're eager to debut.
​
- Isolating Variables for Prediction:
In this step, we're selecting our champion, y (our total sales), and its team, X (the influencing factors), gearing them up with the essential stats. Adding a constant to X is like giving our runners a consistent starting block to launch from.
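Assuming tsales is the target and the remaining columns are the influencing factors, this step might look like the following; the constant added by sm.add_constant becomes the model's intercept.

```python
# Target: total sales; predictors: everything else we selected
y = df["tsales"]
X = df[["margin", "nown", "inv1", "inv2", "ssize", "start"]]

# Add a constant column so the OLS model can estimate an intercept
X = sm.add_constant(X)
```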
- Training and Validation Splits:
We divide our path: 80% for training, where our model builds endurance and learns the intricacies of the sales landscape, and 20% for testing, where we see if our model can sprint as well in a new environment as it does in familiar terrain.
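An 80/20 split with scikit-learn would look roughly like this (the random_state is an assumption, included only to make the split reproducible):

```python
# Hold out 20% of the stores as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```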
​
Building the Multilinear Regression Model
Imagine you're the coach on the sidelines, and the players are about to take the field. It's the moment we've been preparing for: translating all our drills, strategy sessions, and practice games into real action. This is where we build our multilinear regression model, the playbook we'll use to predict our retail sales.
​
- Constructing the Regression Model:
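The discussion below references sm.OLS(), fit(), and the printed summary, so the fitting step is presumably along these lines:

```python
# Fit an ordinary least squares regression on the training data
model = sm.OLS(y_train, X_train).fit()

# Coefficients, p-values, R-squared, and other diagnostics
print(model.summary())
```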
By executing this code, we're effectively setting our players—the variables—into motion. The sm.OLS() function is like the coach calling the plays, telling each variable how to move and interact to reach the end goal: reliable sales predictions.
​
The fit() method is the practice session turning into performance. It's where the model learns from the training data, making all the intricate adjustments needed to understand the dynamics between store size, inventory, profit margins, and more.
​
When we print the model's summary, it’s like looking at the post-game statistics. We see which plays worked—variables that significantly impact sales. We also identify strategies that might need tweaking—variables that don't have a clear link to sales outcomes. This summary provides the coefficients, p-values, R-squared, and other statistical measures that tell us how well our regression model is expected to perform.
​
Multilinear Regression Results Explained
- R-squared Insights:
Sitting at 0.779, this tells us that approximately 77.9% of the variation in total sales across our stores can be explained by the model we've built. In other words, our current lineup of variables accounts for nearly four-fifths of the store-to-store differences in sales.
​
- Variable Impact:
Margin: For every unit increase in margin, we expect to see an increase of approximately $6,365 in sales, although this effect is right on the edge of statistical significance with a p-value just above 0.05.
Ownership (nown): Surprisingly, ownership doesn't seem to play a starring role in this model, given its p-value is far from the conventional threshold of significance.
Inventory 1 (inv1): Similarly, inv1's coefficient is not telling a strong story here with a high p-value, suggesting we might need to reassess its role in our strategy.
Inventory 2 (inv2): This variable is a bit more promising than inv1, suggesting a small positive effect on sales, but it's still not a showstopper in terms of statistical significance.
Store Size (ssize): Here's our MVP! With a very significant p-value, store size is a key player. It's making a strong and statistically significant contribution to sales, with every square meter increase potentially upping sales by $2,816.
Start: This might indicate that older stores sell more, but with a p-value of 0.176 it's not a statistically significant predictor in this lineup.
Assessing Our Model's Accuracy
It's time to shine a spotlight on our model and see how well it predicts sales when faced with new data. This is like the taste test after meticulously following a new recipe: you want to know if your efforts have paid off in flavor, or in our case, accuracy.
​
With the lines of code below, we coax our model to take the stage, making predictions about total sales based on the test set, data it hasn't seen before. It's the equivalent of a chef trying out a new dish on discerning diners for the first time.
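A plausible sketch of those lines, using scikit-learn's error metrics on the held-out test set:

```python
# Predict total sales for stores the model has never seen
y_pred = model.predict(X_test)

# Mean Absolute Error: typical size of the miss, regardless of direction
mae = mean_absolute_error(y_test, y_pred)

# Root Mean Squared Error: like MAE, but penalises large misses more heavily
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```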
The Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) are crucial metrics for evaluating the accuracy of a regression model. Here's what they tell us about our model's performance:
- Mean Absolute Error (MAE): $351,329
This value represents the average absolute difference between the predicted sales and the actual sales across all observations in the test set. In simpler terms, on average, our model's predictions are about $351,329 away from the actual sales figures. This is a measure of the typical error magnitude without considering the direction (over- or under-estimation).
- Root Mean Squared Error (RMSE): $532,643
The RMSE is similar to the MAE but gives a higher weight to larger errors, because the errors are squared before averaging, which disproportionately increases the impact of bigger mistakes. An RMSE of $532,643 is the square root of the average squared prediction error; it is more sensitive to outliers than the MAE and is a good measure of accuracy when large errors are particularly undesirable.
​
Both the MAE and RMSE are quite substantial, suggesting that the model has some difficulty predicting the sales figures accurately. The high values could be due to various factors, such as outliers in the data, underfitting (the model is too simple to capture the underlying patterns), or the omission of important predictors that significantly influence sales.
The fact that RMSE is noticeably higher than MAE points to some larger errors in our predictions: while the model performs reasonably well most of the time, there are a few cases where it misses by a large margin.
Advancing Our Understanding
While the initial results have provided valuable revelations, the path to perfecting our predictive prowess is paved with further exploration and refinement. To enhance the accuracy and reliability of our model, several strategic steps are proposed. These steps aim to deepen our analysis, mitigate potential pitfalls, and expand the scope of our understanding, ensuring that our model not only captures the complexities of retail dynamics but also adapts to the evolving marketplace. Here’s how we plan to move forward:
​
- Review Model Complexity:
Ensure that the model is neither too simple (underfitting) nor too complex (overfitting). You might need to explore adding more features, creating polynomial features, or adjusting other aspects of the model.
​
- Check for Outliers:
Large errors leading to high RMSE could be due to outliers in your data. Investigating and potentially mitigating the influence of outliers might improve model performance.
​
- Feature Engineering:
Consider whether there are additional variables not included in the model that could improve prediction accuracy. For example, external factors like economic indicators or competitive activity might be relevant.
​
- Cross-Validation:
Use cross-validation to assess the model’s performance across different subsets of your data. This can provide a more robust estimate of your model's accuracy and stability.
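As a rough illustration of that last point, a 5-fold cross-validation of an equivalent linear model with scikit-learn could look like this (LinearRegression stands in for the statsmodels OLS fit, and the fold count and scoring choice are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Drop the constant column added earlier; LinearRegression fits its own intercept
X_features = X.drop(columns="const")

# Evaluate the same specification on 5 different train/validation splits
scores = cross_val_score(
    LinearRegression(), X_features, y, cv=5, scoring="neg_mean_absolute_error"
)
print("MAE per fold:", -scores.round(0))
print("Average MAE :", -scores.mean().round(0))
```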