I. Introduction and Problem Description

Project Overview

In this project, we present a robust end-to-end framework for forecasting daily revenue across multiple locations of an anonymous supermarket chain. Leveraging historical transaction data, the framework employs advanced time series modeling techniques to produce accurate short-term forecasts for the upcoming seven days. The approach incorporates both temporal patterns and relevant external factors to enhance predictive accuracy.

Key Features

  • Utilizes multiple forecasting models:

    • Prophet

    • ARIMA

    • XGBoost

    • Random Forest

  • Captures complex temporal patterns, including:

    • Daily, weekly, monthly, and yearly seasonality

  • Incorporates external covariates such as:

    • Promotional activity

    • Temperature data

  • Implements a walk-forward validation scheme for robust model evaluation

  • Generates forward-looking revenue predictions for each store location

Business Objective

Produce reliable 7-day revenue forecasts to support operational decision-making and resource planning.

Practical Applications

  • Inventory and stock optimization

  • Workforce scheduling and planning

  • Promotion and campaign alignment

II. Data Preparation

## # A tibble: 6 x 5
##   location date                revenue promo  temp
##      <dbl> <dttm>                <dbl> <dbl> <dbl>
## 1        1 2014-07-01 00:00:00  11694.     1  21.4
## 2        1 2014-07-02 00:00:00  11468.     1  20.5
## 3        1 2014-07-03 00:00:00  12833.     0  30.3
## 4        1 2014-07-04 00:00:00  15693.     1  28.9
## 5        1 2014-07-05 00:00:00  18001.     0  25.9
## 6        1 2014-07-07 00:00:00  10411.     0   9.6

II.1. Variable description

The data set has 5 variables:

  • location is a number identifying the supermarket location

  • revenue indicates the total daily revenue in euros

  • date refers to the day on which the revenue was obtained

  • promo is a binary variable that indicates whether the day was a promotional day (1) or not (0)

  • temp is a numeric variable giving the daily temperature; it is included because temperature can influence shopping behavior

II.2. Add day, weeks, months, years for better data exploration

## # A tibble: 25,963 x 5
##    location date       revenue promo  temp
##       <dbl> <date>       <dbl> <dbl> <dbl>
##  1        1 2014-07-01  11694.     1  21.4
##  2        1 2014-07-02  11468.     1  20.5
##  3        1 2014-07-03  12833.     0  30.3
##  4        1 2014-07-04  15693.     1  28.9
##  5        1 2014-07-05  18001.     0  25.9
##  6        1 2014-07-07  10411.     0   9.6
##  7        1 2014-07-08  10498.     0  32.4
##  8        1 2014-07-09   9827.     0  16.6
##  9        1 2014-07-10  12275.     0  23.9
## 10        1 2014-07-11  15567.     0  23.3
## # i 25,953 more rows

II.3. Variable summary

##    1    2    3    4    5    6    7    8   10 
## 3030 3012 3005 3030 3030 1784 3012 3030 3030

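The supermarket chain has 9 locations in total, identified by the numbers in the first row above (there is no location 9). The second row gives the number of daily observations per location.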

location  start_date  end_date    total_days  mean_revenue  median_revenue  sd_revenue
       1  2014-07-01  2024-06-29        3030      14329.99       13230.132    4207.914
       2  2014-07-01  2024-06-29        3012      14318.03       13014.349    4321.590
       3  2014-07-01  2024-06-29        3005       9249.62        8449.249    2391.801
       4  2014-07-01  2024-06-29        3030      10621.86        9983.729    3138.429
       5  2014-07-01  2024-06-29        3030      27337.17       24276.131   10896.631
       6  2018-08-16  2024-06-29        1784      11501.36       10726.464    3631.567
       7  2014-07-01  2024-06-29        3012      12250.90       11001.964    3870.386
       8  2014-07-01  2024-06-29        3030      41104.16       38868.945   14116.451
      10  2014-07-01  2024-06-29        3030      10777.80       10084.744    3119.286

III. Exploratory Data Analysis

III.1. Revenue by Day of Week

Define the correct order of weekdays

Ensure day_of_week is an ordered factor
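The two steps above can be sketched in base R as follows. This is a minimal illustration, not the original code: it assumes a date column of class Date as in the data shown earlier, and the object names are hypothetical. The weekday is derived from as.POSIXlt()$wday rather than weekdays(), so the labels do not depend on the system locale.

```r
# Illustrative dates taken from the first rows of the data set
dates <- as.Date(c("2014-07-01", "2014-07-04", "2014-07-05", "2014-07-07"))

# Define the correct order of weekdays (Monday first)
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# as.POSIXlt()$wday counts 0 = Sunday ... 6 = Saturday; remap Sunday to 7
wday_index <- as.POSIXlt(dates)$wday
wday_index[wday_index == 0] <- 7

# Ensure day_of_week is an ordered factor
day_of_week <- factor(weekday_levels[wday_index],
                      levels = weekday_levels, ordered = TRUE)
print(day_of_week)
```

With an ordered factor, plots and summaries list the weekdays from Monday to Sunday instead of alphabetically.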

[Figure: revenue by day of week]

  • Across all supermarket locations, revenue consistently peaks on Fridays and Saturdays.

  • This pattern aligns with typical consumer behavior, as these days generally provide more
    opportunities for shopping.

  • The observation underscores the importance of explicitly
    incorporating day-of-week seasonality into revenue forecasting models.

III.2. Revenue by Store/Location

[Figure: revenue by store/location]

  • Stores 5 and 8 exhibit higher revenue levels compared to other locations.

III.3. Revenue by month

[Figure: revenue by month]

  • December appears to be the month with the highest revenue.

III.4. Revenue by Year

[Figure: revenue by year]

  • A notable increase in revenue was observed during 2020 and 2021.

  • This period coincides with the onset of the COVID-19 pandemic.

  • While the growth may appear counterintuitive, it aligns with major shifts in consumer behavior.

Key contributing factors include:

  • Widespread stockpiling of essential goods.

  • Increased at-home consumption due to lockdowns and restrictions.

These changes significantly boosted supermarket revenues during the pandemic.

III.5. Median Revenue by Day of the Week

This analysis considers the median daily revenue per location, providing a robust measure of central tendency that reduces the influence of outliers.
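As a small illustration of why the median is used here, the sketch below aggregates revenue by location and weekday with base R's aggregate(). The data frame is a hypothetical toy example (including one deliberately extreme Friday), not the project data.

```r
# Hypothetical toy data: two locations, a few Fridays and Saturdays
sales <- data.frame(
  location    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  day_of_week = c("Friday", "Friday", "Friday",
                  "Saturday", "Saturday", "Saturday",
                  "Friday", "Friday", "Saturday", "Saturday"),
  revenue     = c(15000, 15500, 60000,   # 60000 is an artificial outlier
                  18000, 17500, 18500,
                  9000, 9400, 11000, 11400)
)

# Median and mean revenue per location and weekday
median_by_day <- aggregate(revenue ~ location + day_of_week, data = sales, FUN = median)
mean_by_day   <- aggregate(revenue ~ location + day_of_week, data = sales, FUN = mean)

print(median_by_day)
print(mean_by_day)
```

For location 1 on Fridays, the single extreme value pulls the mean far above the typical level, while the median stays at a typical Friday value.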

[Figure: median daily revenue by day of week, per location]

  • Consistently higher revenues are observed on Fridays and Saturdays across all supermarket locations.

  • This recurring pattern aligns with increased consumer availability and shopping behavior at the end of the workweek.

  • The trend is further reinforced by Germany’s widespread Sunday retail closures,
    which concentrate consumer activity into the preceding days, particularly Fridays and Saturdays.

III.6. Overall Revenue Trend

We examine the overall trend by displaying the raw daily revenue values, offering an unfiltered view of temporal fluctuations.

[Figure: raw daily revenue over time]

  • All stores show a consistent upward trend in daily revenue over the years.

  • Locations 8 and 5 stand out with notably higher variability in daily revenue.

  • This variability may reflect differences in customer behavior, promotional activity, or store-specific factors.

III.7. Revenue vs. External Regressors: Relationship Analysis

We examine how key external factors influence daily revenue across store locations.

Temperature vs Revenue

[Figure: temperature vs. revenue]

No clear linear pattern is observed between revenue and temperature. However, this does not necessarily imply the absence of a relationship, as the association may be nonlinear or influenced by additional factors.
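The point that a weak linear pattern does not rule out a relationship can be illustrated with a short base R sketch. The data below are synthetic and the inverted-U shape is an assumption chosen for illustration only; it is not a finding about the project data.

```r
set.seed(10)

# Synthetic example: revenue is highest at mild temperatures (assumed shape)
temp    <- 0:40
revenue <- 15000 - 10 * (temp - 20)^2 + rnorm(length(temp), sd = 100)

# Pearson correlation only measures the linear association
pearson <- cor(temp, revenue)

# A quadratic fit captures the nonlinear dependence
fit       <- lm(revenue ~ poly(temp, 2))
r_squared <- summary(fit)$r.squared

cat("Pearson correlation:", round(pearson, 3), "\n")
cat("R-squared of quadratic fit:", round(r_squared, 3), "\n")
```

Here the correlation is close to zero although revenue depends strongly on temperature, which is why tree-based models that can pick up nonlinear effects are considered later.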

Revenue by Promotion Days

[Figure: revenue on promotion days vs. regular days]

Revenue tends to be slightly higher on promotion days compared to regular days.

IV. Early-Stage Reflections on Model Selection: Key Considerations

The final modeling phase will incorporate Random Forest and XGBoost regression models, which are particularly effective at capturing the influence of covariates. In this initial stage, however, we focus on traditional time series models such as ARIMA, AR, MA, and Prophet, which are promising for short-term forecasting and are fitted here without covariates.

Objective:

Proceed with time series modeling using traditional approaches without covariates, with a focus on short-term forecasting.

Candidate models for the first step modeling:

  • ARIMA (AutoRegressive Integrated Moving Average)

  • AR (AutoRegressive)

  • MA (Moving Average)

  • Prophet (by Facebook, suitable for trend and seasonality)

Guidance from Exploratory Data Analysis (EDA):

  • Evidence of moderate seasonality, e.g.:

    • Increased sales on Fridays and Saturdays

    • Sales peaks in December

  • Presence of a trend or changing mean, indicating possible non-stationarity

Implication for model selection:

  • The non-stationary behavior suggests ARIMA as a promising candidate, since its differencing step can handle a changing mean

  • However, model suitability should be confirmed with statistical stationarity tests

Planned statistical tests

  • Augmented Dickey-Fuller (ADF) test: to detect unit roots

  • KPSS test: to assess trend stationarity

Exclusion of external regressors/covariates:

  • At this stage, modeling will be performed without exogenous variables

  • Future iterations may incorporate covariates as needed

Next step:

  • Important statistical assumption: locations are assumed to be independent of one another, and a separate model is fitted for each location.

  • Select ARIMA and Prophet

  • Prophet is particularly well suited to modeling seasonality

  • Perform stationarity diagnostics to validate the appropriateness of ARIMA and similar models

  • Compare Prophet and ARIMA

Load & Prepare Data

Stationarity Test (ADF)

## ADF p-value: 0.01
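The code behind the ADF output above is not shown. As a hedged sketch of the same kind of diagnostic, the block below uses PP.test() from base R's stats package. This is the Phillips-Perron unit-root test, a stand-in that shares the ADF null hypothesis (the series has a unit root); it is not the ADF test itself, for which adf.test() from the tseries package would be the direct equivalent. The simulated series are purely illustrative.

```r
set.seed(42)

# Simulated examples: a random walk (unit root) and its first difference
n           <- 500
random_walk <- cumsum(rnorm(n))
differenced <- diff(random_walk)   # white noise, hence stationary

# Null hypothesis of PP.test(): the series has a unit root
pp_level <- PP.test(random_walk)
pp_diff  <- PP.test(differenced)

cat("PP p-value (level series):      ", pp_level$p.value, "\n")
cat("PP p-value (differenced series):", pp_diff$p.value, "\n")
```

A small p-value rejects the unit-root hypothesis. The KPSS test planned above reverses the roles (its null hypothesis is stationarity), so the two tests are best read together.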

Prophet: Decompose Seasonality

[Figure: Prophet seasonality decomposition]

IV. Fit Prophet & ARIMA – Forecast vs. Actual

ARIMA: model, fitted and forecast

## Series: ts_arima 
## ARIMA(4,1,3)(0,0,2)[7] 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     ma2      ma3     sma1     sma2
##       0.9027  -1.2067  0.2426  -0.2664  -1.3044  1.1335  -0.4736  -0.1953  -0.0466
## s.e.  0.0595   0.0458  0.0434   0.0322   0.0597  0.0618   0.0383   0.0211   0.0213
## 
## sigma^2 = 8505962:  log likelihood = -28460.63
## AIC=56941.26   AICc=56941.34   BIC=57001.42
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
## Training set 3.476719 2911.682 2038.095 -3.145853 14.59483 0.5449085 0.001403995
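
  • First-order differencing (d = 1) was applied, so the model treats the level series as non-stationary. The ADF p-value of 0.01 reported above points toward stationarity instead, so the planned KPSS test is worth running to settle the choice of d.

  • Lag-1 residual autocorrelation is close to zero (ACF1 ≈ 0.001), which suggests little remaining short-term structure in the residuals.

  • In-sample accuracy is reasonable (training MAPE ≈ 14.6%); out-of-sample error is typically higher.

  • The model is fairly complex (4 AR, 3 MA, and 2 seasonal MA terms), which could be a sign of overfitting.

  • If overfitting is a concern, consider a simpler specification, e.g., by lowering max.p and max.q in auto.arima(). Setting stepwise = FALSE and approximation = FALSE makes the search exhaustive and exact rather than simpler.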
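For comparison with the model above, here is a minimal sketch of fitting a much simpler seasonal ARIMA and producing a 7-day forecast using only base R (stats::arima() and predict()). The simulated series, the weekday effects, and the chosen order (0,1,1)(0,1,1)[7] are illustrative assumptions, not the specification selected in this project.

```r
set.seed(123)

# Simulated daily revenue: slow trend + weekly pattern + noise (all assumed)
n              <- 7 * 60
weekday_effect <- rep(c(0, 200, 600, 1500, 4000, 6000, 2500), length.out = n)
trend          <- seq(10000, 12000, length.out = n)
revenue        <- trend + weekday_effect + rnorm(n, sd = 700)
ts_revenue     <- ts(revenue, frequency = 7)

# Simple seasonal ARIMA with weekly period: 2 coefficients instead of 9
fit <- arima(ts_revenue,
             order    = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 7))

# 7-day-ahead point forecasts with standard errors
fc <- predict(fit, n.ahead = 7)
forecast_table <- data.frame(h        = 1:7,
                             forecast = as.numeric(fc$pred),
                             se       = as.numeric(fc$se))
print(round(forecast_table))
```

Comparing the AIC and the out-of-sample error of such a compact model with those of the larger one is a direct way to check whether the extra terms are worth keeping.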

Plot: Prophet vs ARIMA vs Actual

[Figure: Prophet vs. ARIMA vs. actual revenue]

ARIMA vs Actual plot

[Figure: ARIMA vs. actual revenue]

Prophet vs Actual

[Figure: Prophet vs. actual revenue]

  • Both Prophet and ARIMA models exhibit visually strong predictive performance

  • The data exhibit clear weekly, monthly, and yearly seasonal patterns

  • These characteristics make the Prophet model a more suitable choice compared to the ARIMA model

  • For future analyses, we will continue with the Prophet model and incorporate external regressors (covariates).

  • Additionally, we will explore two widely used and effective machine learning regression models: Random Forest and XGBoost, particularly for their strength in handling external regressors.

  • The Random Forest and XGBoost models will be given time-based features so that they account for the time series structure of the data.
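One common way to make tree-based models aware of temporal structure is to add lagged values of the target as features. The sketch below is an assumption about how this could be done, not the feature set used in this project (which is shown in section V.1); make_lag and the column names are hypothetical. For a 7-day horizon, only lags of at least 7 days are known for every day being forecast, so shorter lags would leak future information.

```r
# Shift a vector back by k positions, padding the start with NA
make_lag <- function(x, k) c(rep(NA, k), head(x, -k))

set.seed(7)

# Hypothetical daily revenue for a single location
features <- data.frame(
  date    = seq(as.Date("2024-01-01"), by = "day", length.out = 28),
  revenue = round(12000 + rnorm(28, sd = 1000))
)

# Lags that are available for the whole 7-day forecast horizon
features$lag_7  <- make_lag(features$revenue, 7)
features$lag_14 <- make_lag(features$revenue, 14)

# Rows without a full lag history cannot be used for training
model_data <- features[complete.cases(features), ]
print(head(model_data))
```

In a multi-location data set the lags would need to be computed separately within each location.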

V. Final Model Comparison and Forecasting

V.1. Feature Engineering: Extract time-based features and handle gaps

## # A tibble: 6 x 10
##   location date       revenue promo  temp  year month  week day_of_week day_of_year
##      <dbl> <date>       <dbl> <dbl> <dbl> <int> <ord> <int> <ord>             <int>
## 1        1 2014-07-01  11694.     1  21.4  2014 July     27 Tuesday             182
## 2        1 2014-07-02  11468.     1  20.5  2014 July     27 Wednesday           183
## 3        1 2014-07-03  12833.     0  30.3  2014 July     27 Thursday            184
## 4        1 2014-07-04  15693.     1  28.9  2014 July     27 Friday              185
## 5        1 2014-07-05  18001.     0  25.9  2014 July     27 Saturday            186
## 6        1 2014-07-06  14206.     0  25.9  2014 July     27 Sunday              187
## -- Data Summary ------------------------
##                            Values
## Name                       df    
## Number of rows             31361 
## Number of columns          10    
## _______________________          
## Column type frequency:           
##   Date                     1     
##   factor                   2     
##   numeric                  7     
## ________________________         
## Group variables            None  
## 
## -- Variable type: Date --------------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate min        max        median     n_unique
## 1 date                  0             1 2014-07-01 2024-06-29 2019-09-22     3652
## 
## -- Variable type: factor ------------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate ordered n_unique top_counts                                
## 1 month                 0             1 TRUE          12 Jan: 2666, Mar: 2666, May: 2666, Oct: 2666
## 2 day_of_week           0             1 TRUE           7 Thu: 4483, Fri: 4483, Sat: 4483, Tue: 4482
## 
## -- Variable type: numeric -----------------------------------------------------------------------------------
##   skim_variable n_missing complete_rate      mean        sd     p0    p25     p50    p75     p100
## 1 location              0             1     5.07      2.83     1      3       5       7      10  
## 2 revenue               0             1 17516.    12288.    1728.  9960.  12948.  19236. 109512. 
## 3 promo                 0             1     0.200     0.400    0      0       0       0       1  
## 4 temp                  0             1    19.3       5.50    -0.6   15.6    19.2    23      39.8
## 5 year                  0             1  2019.        2.90  2014   2017    2019    2022    2024  
## 6 week                  0             1    26.6      15.1      1     14      27      40      53  
## 7 day_of_year           0             1   183.      105.       1     92     183     275     366  
##   hist 
## 1 ▇▇▆▇▃
## 2 ▇▂▁▁▁
## 3 ▇▁▁▁▂
## 4 ▁▃▇▃▁
## 5 ▇▆▇▇▆
## 6 ▇▇▇▇▇
## 7 ▇▇▇▇▇
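The feature-engineering code itself is not shown, so the following base R sketch is a reconstruction under stated assumptions. It fills missing calendar days by linearly interpolating revenue and carrying promo and temp forward, which is consistent with the 2014-07-06 row in the output above (revenue 14206, temp 25.9) but may not be the rule actually used. The week is taken as the ISO 8601 week, which matches the printed value 27; locf and the other object names are hypothetical.

```r
# First rows of location 1 as printed in section II (2014-07-06 is missing)
raw <- data.frame(
  date    = as.Date(c("2014-07-01", "2014-07-02", "2014-07-03",
                      "2014-07-04", "2014-07-05", "2014-07-07")),
  revenue = c(11694, 11468, 12833, 15693, 18001, 10411),
  promo   = c(1, 1, 0, 1, 0, 0),
  temp    = c(21.4, 20.5, 30.3, 28.9, 25.9, 9.6)
)

# 1. Build the complete daily calendar and left-join the observations onto it
calendar <- data.frame(date = seq(min(raw$date), max(raw$date), by = "day"))
df <- merge(calendar, raw, by = "date", all.x = TRUE)

# 2. Fill gaps (assumed rules): interpolate revenue, carry promo/temp forward
df$revenue <- approx(as.numeric(raw$date), raw$revenue,
                     xout = as.numeric(df$date))$y
locf <- function(x) {
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  idx[idx == 0L] <- NA          # leading gaps stay NA
  x[idx]
}
df$promo <- locf(df$promo)
df$temp  <- locf(df$temp)

# 3. Time-based features
df$year        <- as.integer(format(df$date, "%Y"))
df$month       <- factor(month.name[as.integer(format(df$date, "%m"))],
                         levels = month.name, ordered = TRUE)
df$week        <- as.integer(format(df$date, "%V"))   # ISO 8601 week
df$day_of_year <- as.POSIXlt(df$date)$yday + 1L
print(df)
```

In the full data set these steps would be applied per location, starting each calendar at that location's first observed date.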

V.2. Walk-forward Daily Validation Setup

We define the walk-forward validation strategy as follows: the last 5 full years of data are used for validation,
with 7-day-ahead forecasts produced for each validation fold.

Define validation periods

Validate on the last 7 days of the last 5 full years available in the training data.
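The validation scheme can be sketched in base R as follows. This is one reading of the description above, stated as an assumption: for each of the years 2019 to 2023, the model is trained on all data before December 25 and evaluated on December 25 to 31. The seasonal naive forecaster (repeat the value from 7 days earlier) is only a placeholder for Prophet, XGBoost, or Random Forest, and the simulated series is illustrative.

```r
set.seed(1)

# Simulated daily revenue for one location (weekday effects are assumed)
dates   <- seq(as.Date("2014-07-01"), as.Date("2024-06-29"), by = "day")
weekly  <- c(-2000, 0, 0, 500, 1500, 3000, 5000)   # Sunday ... Saturday
revenue <- 12000 + weekly[as.POSIXlt(dates)$wday + 1] +
           rnorm(length(dates), sd = 800)
series  <- data.frame(date = dates, revenue = revenue)

# Placeholder forecaster: each forecast day repeats the value 7 days earlier
seasonal_naive <- function(train, horizon = 7) {
  tail(train$revenue, 7)[((seq_len(horizon) - 1) %% 7) + 1]
}

# One fold per full year: train on everything before the test window
validation_years <- 2019:2023
folds <- lapply(validation_years, function(yr) {
  test_start <- as.Date(sprintf("%d-12-25", yr))
  test_end   <- as.Date(sprintf("%d-12-31", yr))
  train <- series[series$date < test_start, ]
  test  <- series[series$date >= test_start & series$date <= test_end, ]
  err   <- test$revenue - seasonal_naive(train, horizon = nrow(test))
  data.frame(year = yr, MAE = mean(abs(err)), RMSE = sqrt(mean(err^2)))
})
results <- do.call(rbind, folds)

print(results)
print(colMeans(results[, c("MAE", "RMSE")]))   # mean MAE and RMSE over folds
```

Because every fold trains only on data that precede its test window, no future information leaks into the forecasts, and averaging MAE and RMSE over the folds gives summary figures of the kind reported in the results table.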

Validation Framework

Prophet Validation
Prophet is well suited to daily data with multiple seasonalities (weekly, yearly).
We include promo and temp as extra regressors.

Machine Learning Validation (XGBoost, Random Forest)

Run Validation

Define the forecast horizon (next 7 days from the latest date in the dataset)

Generate final forecasts for the next 7 days for all models and locations

Result Analysis and Visualization

## # A tibble: 3 x 3
##   Model         Mean_MAE Mean_RMSE
##   <chr>            <dbl>     <dbl>
## 1 Random Forest    1596.     1975.
## 2 XGBoost          1739.     2173.
## 3 Prophet          3550.     4222.

[Figure: validation results by model]

VI. Conclusion and recommendation

  • The Random Forest model achieved the lowest validation errors (mean MAE ≈ 1596, mean RMSE ≈ 1975), ahead of XGBoost (1739 and 2173) and Prophet (3550 and 4222).

  • We selected Random Forest for the final forecasting task delivered to our client because it offers:

    • Better predictive accuracy,

    • Transparency about feature contributions through variable-importance measures.

  • For any forecasting task, we strongly recommend:

    • Comparing multiple candidate models,

    • Evaluating each model’s performance using appropriate metrics,

    • Selecting the most suitable one based on both accuracy and explainability.

    • This approach should carry over to most supermarket chains.

If you’re struggling with your data and want to produce precise, informed forecasts, feel free to reach out; we’re here to help!

VII. Outlook

  • Stores/locations should not all be treated as independent of one another.

  • Locations that are geographically closer may exhibit more similar behaviors than those that are farther apart.

  • Adopt a space-time modeling approach instead of fitting separate time series models for each location.

  • Mixed-effects models: random intercepts per location (Djeudeu et al., 2022).

  • GAMs with temporal smoothing: penalized splines for time trends.

  • Use time-aware train/test splits in cross-validation so that model comparisons are fair.

3D Statistical Learning is well-equipped to implement this solution, including the components outlined in the project outlook.

We would like to thank Dr. Dany Djeudeu for preparing this document, based on a client use case for which we received permission to publish using simulated data closely reflecting the original.

Literature
Djeudeu, D. et al. Multilevel Conditional Autoregressive Models for Longitudinal and Spatially Referenced Epidemiological Data, Spatial and Spatio-temporal Epidemiology, Volume 41, 2022, 100477. https://doi.org/10.1016/j.sste.2022.100477