I. Introduction and Problem Description

Project Overview

In this project, we present a robust end-to-end framework for forecasting daily revenue across multiple locations of an anonymous supermarket chain. Leveraging historical transaction data, the framework employs advanced time series modeling techniques to produce accurate short-term forecasts for the upcoming seven days. The approach incorporates both temporal patterns and relevant external factors to enhance predictive accuracy.

Key Features

Utilizes multiple forecasting models:
- Prophet
- ARIMA
- XGBoost
- Random Forest
Captures complex temporal patterns, including:
- Weekly (day-of-week), monthly, and yearly seasonality
Incorporates external covariates such as:
- Promotional activity
- Temperature data
Implements a walk-forward validation scheme for robust model evaluation
Generates forward-looking revenue predictions for each store location

Business Objective

Produce reliable 7-day revenue forecasts to support operational decision-making and resource planning.

Practical Applications

Inventory and stock optimization
Workforce scheduling and planning
Promotion and campaign alignment

II. Data Preparation

## # A tibble: 6 x 5
##   location date                revenue promo  temp
##      <dbl> <dttm>                <dbl> <dbl> <dbl>
## 1        1 2014-07-01 00:00:00  11694.     1  21.4
## 2        1 2014-07-02 00:00:00  11468.     1  20.5
## 3        1 2014-07-03 00:00:00  12833.     0  30.3
## 4        1 2014-07-04 00:00:00  15693.     1  28.9
## 5        1 2014-07-05 00:00:00  18001.     0  25.9
## 6        1 2014-07-07 00:00:00  10411.     0   9.6

II.1. Variable description

The data set has 5 variables:

location is a number identifying the supermarket location
revenue is the total daily revenue in euros
date is the day on which the revenue was recorded
promo is a binary variable indicating whether the day was a promotional day
temp is a numeric variable giving the daily temperature, included because temperature can influence shopping behavior

II.2. Add day, week, month, and year for better data exploration

## # A tibble: 25,963 x 5
##    location date       revenue promo  temp
##       <dbl> <date>       <dbl> <dbl> <dbl>
##  1        1 2014-07-01  11694.     1  21.4
##  2        1 2014-07-02  11468.     1  20.5
##  3        1 2014-07-03  12833.     0  30.3
##  4        1 2014-07-04  15693.     1  28.9
##  5        1 2014-07-05  18001.     0  25.9
##  6        1 2014-07-07  10411.     0   9.6
##  7        1 2014-07-08  10498.     0  32.4
##  8        1 2014-07-09   9827.     0  16.6
##  9        1 2014-07-10  12275.     0  23.9
## 10        1 2014-07-11  15567.     0  23.3
## # i 25,953 more rows

II.3. Variable summary

##    1    2    3    4    5    6    7    8   10 
## 3030 3012 3005 3030 3030 1784 3012 3030 3030

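The supermarket chain has 9 locations in total, numbered as shown in the first row above (there is no location 9). The second row gives the number of daily records per location.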

location	start_date	end_date	total_days	mean_revenue	median_revenue	sd_revenue
1	2014-07-01	2024-06-29	3030	14329.99	13230.132	4207.914
2	2014-07-01	2024-06-29	3012	14318.03	13014.349	4321.590
3	2014-07-01	2024-06-29	3005	9249.62	8449.249	2391.801
4	2014-07-01	2024-06-29	3030	10621.86	9983.729	3138.429
5	2014-07-01	2024-06-29	3030	27337.17	24276.131	10896.631
6	2018-08-16	2024-06-29	1784	11501.36	10726.464	3631.567
7	2014-07-01	2024-06-29	3012	12250.90	11001.964	3870.386
8	2014-07-01	2024-06-29	3030	41104.16	38868.945	14116.451
10	2014-07-01	2024-06-29	3030	10777.80	10084.744	3119.286

III. Exploratory Data Analysis

III.1. Revenue by Day of Week

Define the correct order of weekdays

Ensure day_of_week is an ordered factor

Across all supermarket locations, revenue consistently peaks on Fridays and Saturdays.
This pattern aligns with typical consumer behavior, as these days generally provide more opportunities for shopping.
The observation underscores the importance of explicitly incorporating day-of-week seasonality into revenue forecasting models.

III.2. Revenue by Store/Location

Stores 5 and 8 exhibit higher revenue levels compared to other locations.

III.3. Revenue by Month

December appears to be the month with the highest revenue.

III.4. Revenue by Year

A notable increase in revenue was observed during 2020 and 2021.
This period coincides with the onset of the COVID-19 pandemic.
While the growth may appear counterintuitive, it aligns with major shifts in consumer behavior.

Key contributing factors include:

Widespread stockpiling of essential goods.
Increased at-home consumption due to lockdowns and restrictions.

These changes significantly boosted supermarket revenues during the pandemic.

III.5. Median Revenue by Day of the Week

This analysis considers the median daily revenue per location, providing a robust measure of central tendency that reduces the influence of outliers.

Consistently higher revenues are observed on Fridays and Saturdays across all supermarket locations.
This recurring pattern aligns with increased consumer availability and shopping behavior at the end of the workweek.
The trend is further reinforced by Germany’s widespread Sunday retail closures,
which concentrate consumer activity into the preceding days, particularly Friday and Saturday.
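
As a minimal sketch of the aggregation behind this view, the following Python snippet groups revenue by weekday and takes the median per group. It is written in Python for illustration and is not the code that produced the figures in this report. The ten rows are the rounded sample values for location 1 shown in the data preview; the function name `median_by_weekday` is hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Ten rounded sample rows for location 1 from the data preview: (date, revenue in EUR).
rows = [
    (date(2014, 7, 1), 11694.0),
    (date(2014, 7, 2), 11468.0),
    (date(2014, 7, 3), 12833.0),
    (date(2014, 7, 4), 15693.0),
    (date(2014, 7, 5), 18001.0),
    (date(2014, 7, 7), 10411.0),
    (date(2014, 7, 8), 10498.0),
    (date(2014, 7, 9), 9827.0),
    (date(2014, 7, 10), 12275.0),
    (date(2014, 7, 11), 15567.0),
]

# Fixed calendar order, so the result is not sorted alphabetically.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def median_by_weekday(records):
    """Return {weekday name: median revenue}, in calendar order."""
    grouped = defaultdict(list)
    for day, revenue in records:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        grouped[WEEKDAYS[day.weekday()]].append(revenue)
    # Weekdays without any record (here: Sunday) are skipped.
    return {name: median(grouped[name]) for name in WEEKDAYS if name in grouped}


weekday_medians = median_by_weekday(rows)
for name, value in weekday_medians.items():
    print(f"{name:<9} {value:>9.1f}")
```

Even in this small sample, Friday and Saturday have the highest medians, and Sunday does not appear because the sample contains no Sunday record.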

III.6. Revenue Trends Over Time

We examine the overall trend by displaying the raw daily revenue values, offering an unfiltered view of temporal fluctuations.

All stores show a consistent upward trend in daily revenue over the years.
Locations 8 and 5 stand out with notably higher variability in daily revenue.
This variability may reflect differences in customer behavior, promotional activity, or store-specific factors.

III.7. Revenue vs. External Regressors: Relationship Analysis

We examine how key external factors relate to daily revenue across store locations.

Temperature vs Revenue

No clear linear pattern is observed between revenue and temperature. However, this does not necessarily imply the absence of a relationship, as the association may be nonlinear or influenced by additional factors.
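
To illustrate why a missing linear pattern does not rule out a relationship, the Python sketch below computes the Pearson correlation for a purely synthetic example. The data are made up and are not taken from the project. Revenue is constructed to peak at 20 degrees and to fall off symmetrically on both sides, so it depends entirely on temperature, yet the linear correlation is essentially zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic, made-up data: an inverted-U relationship centered at 20 degrees.
temps = list(range(10, 31))
revenue = [12000 - 25 * (t - 20) ** 2 for t in temps]

r = pearson(temps, revenue)
print(f"Pearson r = {r:.3f}")
```

A scatter plot or a flexible model would reveal such a pattern, which is one reason to keep temperature as a candidate covariate for the tree-based models.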

Revenue by Promotion Days

Revenue tends to be slightly higher on promotion days compared to regular days.

IV. Early-Stage Reflections on Model Selection: Key Considerations

The final modeling phase will incorporate Random Forest and XGBoost regression models, which are particularly effective at capturing the influence of covariates. In this initial stage, however, we focus on traditional time series models such as ARIMA, AR, and Prophet, which are promising for short-term forecasting and are fitted here in their basic form, without covariates.

Objective:

Proceed with time series modeling using traditional approaches without covariates, with a focus on short-term forecasting.

Candidate models for the first modeling step:

ARIMA (AutoRegressive Integrated Moving Average)
AR (AutoRegressive)
MA (Moving Average)
Prophet (by Facebook, suitable for trend and seasonality)

Guidance by Exploratory Data Analysis (EDA):

Evidence of moderate seasonality, e.g.:
- Increased sales on Fridays and Saturdays
- Sales peaks in December
Presence of a trend or changing mean, indicating possible non-stationarity

Implication for model selection:

The non-stationary behavior suggests ARIMA as a promising candidate, since its differencing step is designed for such series
However, model suitability should be confirmed with statistical stationarity tests
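
The "I" in ARIMA refers to differencing, which is how the model handles a changing mean. The following Python sketch shows the idea on a made-up toy series with a linear trend and a repeating 7-day pattern. The series, the pattern values, and the helper name `difference` are assumptions for illustration only.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up toy series: a trend of +5 per day plus a repeating 7-day pattern.
weekly_pattern = [0, -2, -1, 1, 6, 9, -13]
series = [100 + 5 * t + weekly_pattern[t % 7] for t in range(28)]

# Lag-1 differencing (the d = 1 step) removes the linear trend:
# what remains repeats every 7 days and no longer drifts upward.
first_diff = difference(series, lag=1)

# Lag-7 differencing removes the weekly pattern as well:
# every value equals the weekly growth of 7 * 5.
seasonal_diff = difference(series, lag=7)

print(first_diff[:7])
print(seasonal_diff)
```

On real data the differenced series is noisy rather than exactly periodic or constant, which is why formal stationarity tests are still needed.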

Planned statistical tests

Augmented Dickey-Fuller (ADF) test: to detect unit roots
KPSS test: to assess trend stationarity

Exclusion of external regressors/covariates:

At this stage, modeling will be performed without exogenous variables
Future iterations may incorporate covariates as needed

Next steps:

Key statistical assumption: locations are assumed to be independent of one another, and a separate model is fitted for each location
Select ARIMA and Prophet as candidate models
Prophet is particularly well suited to modeling seasonality
Perform stationarity diagnostics to validate the appropriateness of ARIMA and similar models
Compare Prophet and ARIMA

Load & Prepare Data

Stationarity Test (ADF)

## ADF p-value: 0.01

Prophet: Decompose Seasonality

IV. Fit Prophet & ARIMA – Forecast vs. Actual

ARIMA: model, fitted and forecast

## Series: ts_arima 
## ARIMA(4,1,3)(0,0,2)[7] 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     ma2      ma3     sma1     sma2
##       0.9027  -1.2067  0.2426  -0.2664  -1.3044  1.1335  -0.4736  -0.1953  -0.0466
## s.e.  0.0595   0.0458  0.0434   0.0322   0.0597  0.0618   0.0383   0.0211   0.0213
## 
## sigma^2 = 8505962:  log likelihood = -28460.63
## AIC=56941.26   AICc=56941.34   BIC=57001.42
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
## Training set 3.476719 2911.682 2038.095 -3.145853 14.59483 0.5449085 0.001403995

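First-order differencing (d = 1) was applied to remove the non-stationary trend in the series.
The lag-1 autocorrelation of the residuals is essentially zero (ACF1 ≈ 0.001), which is a necessary condition for a good ARIMA fit; higher lags would need a separate check, for example with a Ljung-Box test.
In-sample accuracy is reasonable (MAPE ≈ 14.6%), and a MASE of about 0.54 means the in-sample MAE is roughly half that of the naive benchmark used for scaling.
Model complexity (4 AR, 3 MA, and 2 seasonal MA terms) could be a sign of overfitting.
If overfitting is a concern, consider a simpler specification. Setting stepwise = FALSE and approximation = FALSE in auto.arima() makes the order search exhaustive and exact, so simpler candidates with a comparable AICc are not missed.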
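
The information criteria in the model summary above can be checked by hand from the reported log-likelihood, which also shows how each extra parameter is penalized. The Python sketch below is a worked calculation only. The parameter count of 10 (9 coefficients plus the innovation variance) follows from the summary; the effective sample size of 3029 is an assumption (3030 daily records minus one observation lost to differencing) chosen because it is consistent with the reported BIC.

```python
from math import log

log_lik = -28460.63   # log likelihood from the model summary
n_coef = 9            # 4 AR + 3 MA + 2 seasonal MA coefficients
k = n_coef + 1        # plus the innovation variance sigma^2
n_obs = 3029          # ASSUMPTION: 3030 records minus 1 lost to differencing

aic = -2 * log_lik + 2 * k
aicc = aic + (2 * k * (k + 1)) / (n_obs - k - 1)
bic = -2 * log_lik + k * log(n_obs)

print(f"AIC  = {aic:.2f}")
print(f"AICc = {aicc:.2f}")
print(f"BIC  = {bic:.2f}")
```

Under these assumptions, AIC and BIC agree with the summary, and AICc agrees to within rounding of the reported log-likelihood. Each additional coefficient costs 2 points of AIC and about 8 points of BIC at this sample size, which is the yardstick for judging whether a simpler model is competitive.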

Plot: Prophet vs ARIMA vs Actual

ARIMA vs Actual plot

Prophet vs Actual

Both Prophet and ARIMA track the actual series closely on visual inspection.
The data exhibit clear yearly, monthly, and weekly seasonal patterns.
These characteristics make Prophet a more suitable choice than ARIMA.
For the subsequent analyses, we continue with Prophet and incorporate external regressors (covariates).
Additionally, we explore two widely used machine learning regression models, Random Forest and XGBoost, particularly for their strength in handling external regressors.
The Random Forest and XGBoost models account for the time series structure of the data through time-based features such as year, month, week, and day of week.

V. Final Model Comparison and Forecasting

V.1. Feature Engineering: Extract time-based features and handle gaps

## # A tibble: 6 x 10
##   location date       revenue promo  temp  year month  week day_of_week day_of_year
##      <dbl> <date>       <dbl> <dbl> <dbl> <int> <ord> <int> <ord>             <int>
## 1        1 2014-07-01  11694.     1  21.4  2014 July     27 Tuesday             182
## 2        1 2014-07-02  11468.     1  20.5  2014 July     27 Wednesday           183
## 3        1 2014-07-03  12833.     0  30.3  2014 July     27 Thursday            184
## 4        1 2014-07-04  15693.     1  28.9  2014 July     27 Friday              185
## 5        1 2014-07-05  18001.     0  25.9  2014 July     27 Saturday            186
## 6        1 2014-07-06  14206.     0  25.9  2014 July     27 Sunday              187

## -- Data Summary ------------------------
##                            Values
## Name                       df    
## Number of rows             31361 
## Number of columns          10    
## _______________________          
## Column type frequency:           
##   Date                     1     
##   factor                   2     
##   numeric                  7     
## ________________________         
## Group variables            None  
## 
## -- Variable type: Date --------------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate min        max        median     n_unique
## 1 date                  0             1 2014-07-01 2024-06-29 2019-09-22     3652
## 
## -- Variable type: factor ------------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate ordered n_unique top_counts                                
## 1 month                 0             1 TRUE          12 Jan: 2666, Mar: 2666, May: 2666, Oct: 2666
## 2 day_of_week           0             1 TRUE           7 Thu: 4483, Fri: 4483, Sat: 4483, Tue: 4482
## 
## -- Variable type: numeric -----------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate      mean        sd     p0    p25     p50    p75     p100
## 1 location              0             1     5.07      2.83     1      3       5       7      10  
## 2 revenue               0             1 17516.    12288.    1728.  9960.  12948.  19236. 109512. 
## 3 promo                 0             1     0.200     0.400    0      0       0       0       1  
## 4 temp                  0             1    19.3       5.50    -0.6   15.6    19.2    23      39.8
## 5 year                  0             1  2019.        2.90  2014   2017    2019    2022    2024  
## 6 week                  0             1    26.6      15.1      1     14      27      40      53  
## 7 day_of_year           0             1   183.      105.       1     92     183     275     366  
##   hist 
## 1 ▇▇▆▇▃
## 2 ▇▂▁▁▁
## 3 ▇▁▁▁▂
## 4 ▁▃▇▃▁
## 5 ▇▆▇▇▆
## 6 ▇▇▇▇▇
## 7 ▇▇▇▇▇
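
The two steps in this section, filling calendar gaps and deriving time-based features, can be sketched in Python as follows. This is an illustration and not the code used for the report. The fill rules are assumptions inferred from the displayed rows: the inserted 2014-07-06 record has a revenue equal to the midpoint of its neighbors and the temperature of the previous day, which is consistent with linear interpolation for revenue and carry-forward for temp. Setting promo to 0 on inserted days and using the ISO week number are likewise assumptions, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Rounded rows for location 1 from the raw preview: (date, revenue, promo, temp).
raw = [
    (date(2014, 7, 5), 18001.0, 0, 25.9),
    (date(2014, 7, 7), 10411.0, 0, 9.6),   # 2014-07-06 is missing in the raw data
]


def fill_gaps(rows):
    """Insert missing calendar days between consecutive records.

    ASSUMPTIONS: revenue is linearly interpolated, temp is carried forward
    from the previous record, and promo is set to 0 on inserted days.
    """
    filled = [rows[0]]
    for nxt in rows[1:]:
        prev = filled[-1]
        gap = (nxt[0] - prev[0]).days
        for step in range(1, gap):
            share = step / gap
            revenue = prev[1] + share * (nxt[1] - prev[1])
            filled.append((prev[0] + timedelta(days=step), revenue, 0, prev[3]))
        filled.append(nxt)
    return filled


def time_features(day):
    """Derive the calendar features used as model inputs."""
    _, iso_week, _ = day.isocalendar()
    return {
        "year": day.year,
        "month": day.strftime("%B"),        # English names under the default locale
        "week": iso_week,                   # ASSUMPTION: ISO week number
        "day_of_week": day.strftime("%A"),
        "day_of_year": day.timetuple().tm_yday,
    }


filled = fill_gaps(raw)
for day, revenue, promo, temp in filled:
    print(day, round(revenue, 1), promo, temp, time_features(day))
```

For 2014-07-01 the derived features match the first row of the feature table (2014, July, week 27, Tuesday, day 182).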

V.2. Walk-forward Daily Validation Setup

We define a walk-forward validation strategy: each of the last 5 full years of data provides one validation fold, and each fold is evaluated with a 7-day-ahead forecast.

Define validation periods

Each fold validates on the last 7 days of one of the last 5 full years available in the training data.
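
One way to read this setup is sketched below in Python. It assumes that "full years" means calendar years covered through 31 December, so that data ending on 2024-06-29 yield folds for 2019 to 2023, and that each fold trains on everything before its validation window (an expanding window). Both points are assumptions, as is the function name `walk_forward_folds`.

```python
from datetime import date, timedelta


def walk_forward_folds(last_date, n_years=5, horizon=7):
    """Build (train_end, val_start, val_end) triples, oldest fold first."""
    # A calendar year counts as full only if the data run through 31 December.
    if (last_date.month, last_date.day) == (12, 31):
        last_full_year = last_date.year
    else:
        last_full_year = last_date.year - 1

    folds = []
    for year in range(last_full_year - n_years + 1, last_full_year + 1):
        val_end = date(year, 12, 31)
        val_start = val_end - timedelta(days=horizon - 1)
        train_end = val_start - timedelta(days=1)   # train on all data up to here
        folds.append((train_end, val_start, val_end))
    return folds


folds = walk_forward_folds(date(2024, 6, 29))
for train_end, val_start, val_end in folds:
    print(f"train <= {train_end}   validate {val_start} .. {val_end}")
```

Because every fold only uses data that precede its validation window, no future information leaks into training, which is the point of walk-forward validation. Note that under this reading all folds fall in late December, so the scores reflect year-end behavior in particular.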

Validation Framework

Prophet Validation
Prophet is well suited to daily data with multiple seasonalities (weekly, yearly).
We include promo and temp as extra regressors.

Machine Learning Validation (XGBoost, Random Forest)

Run Validation

Define the forecast horizon (next 7 days from the latest date in the dataset)

Generate final forecasts for the next 7 days for all models and locations

Result Analysis and Visualization

## # A tibble: 3 x 3
##   Model         Mean_MAE Mean_RMSE
##   <chr>            <dbl>     <dbl>
## 1 Random Forest    1596.     1975.
## 2 XGBoost          1739.     2173.
## 3 Prophet          3550.     4222.
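
For reference, the two metrics in this table can be computed as in the Python sketch below. The 7-day values are made up for illustration and are not project data; in the validation, the metrics would be computed per fold and location and then averaged per model.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: the average size of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more than MAE does."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up 7-day validation fold, for illustration only (revenue in EUR).
actual = [12000.0, 11500.0, 13000.0, 15500.0, 18000.0, 14000.0, 10500.0]
predicted = [11000.0, 12000.0, 13500.0, 14000.0, 19500.0, 14000.0, 11000.0]

fold_mae = mae(actual, predicted)
fold_rmse = rmse(actual, predicted)
print(f"MAE  = {fold_mae:.1f}")
print(f"RMSE = {fold_rmse:.1f}")
```

RMSE is never smaller than MAE, and the gap between them widens when a few days are forecast much worse than the rest, so reporting both gives a sense of how evenly the errors are spread.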

VI. Conclusion and recommendation

The Random Forest model performed best in walk-forward validation, with the lowest mean RMSE (1975, versus 2173 for XGBoost and 4222 for Prophet) and the lowest mean MAE (1596).
We selected Random Forest for the final forecasting task delivered to our client because it offers:
- Better predictive accuracy,
- Transparency in feature contributions through variable-importance measures.
For any forecasting task, we strongly recommend:
- Comparing multiple candidate models,
- Evaluating each model’s performance using appropriate metrics,
- Selecting the most suitable one based on both accuracy and explainability.
This approach can be applied to most supermarket chains.

If you’re struggling with your data and want to produce precise, informed forecasts, feel free to reach out; we’re here to help!

VII. Outlook

Stores/locations should not all be treated as independent of one another.
Locations that are geographically closer may behave more similarly than those that are farther apart.
Consider adopting a space-time modeling approach instead of fitting separate time series models for each location.
Mixed-effects models: random intercepts per location (Djeudeu et al., 2022).
GAMs with temporal smoothing: penalized splines for time trends.
Consider a time-aware train/test split in cross-validation for better comparability.

3D Statistical Learning is well-equipped to implement this solution, including the components outlined in the project outlook.

We would like to thank Dr. Dany Djeudeu for preparing this document, based on a client use case for which we received permission to publish using simulated data closely reflecting the original.

Literature
Djeudeu, D. et al. Multilevel Conditional Autoregressive Models for Longitudinal and Spatially Referenced Epidemiological Data, Spatial and Spatio-temporal Epidemiology, Volume 41, 2022, 100477. https://doi.org/10.1016/j.sste.2022.100477

3D Statistical Learning

We help businesses and researchers solve complex challenges by providing expert guidance in statistics, machine learning, and tailored education.

Our core services include:

– Statistical Consulting:
Comprehensive consulting tailored to your data-driven needs.

– Training and Coaching:
In-depth instruction in statistics, machine learning, and the use of statistical software such as SAS, R, and Python.

– Reproducible Data Analysis Pipelines:
Development of documented, reproducible workflows using SAS macros and customized R and Python code.

– Interactive Data Visualization and Web Applications:
Creation of dynamic visualizations and web apps with R (Shiny, Plotly), Python (Streamlit, Dash by Plotly), and SAS (SAS Viya, SAS Web Report Studio).

– Automated Reporting and Presentation:
Generation of automated reports and presentations using Markdown and Quarto.

– Scientific Data Analysis:
Advanced analytical support for scientific research projects.

Revenue Forecasting for Supermarket Chains: A Case Study with an Anonymous Supermarket Chain in Germany