Forecasting sales with Gaussian Processes and Autoregressive models

Problem definition

Sales forecasting is a very common problem faced by many companies. Given a history of sales of a certain product we would like to predict the demand of that product for a time window in the future. This is useful as it allows companies and industries to plan their workload and reduce waste of resources. This problem is a common example of time series forecasting and there are many approaches to tackle it.

Today we will learn a bit more about this with a practical example: the Kaggle Predict Future Sales dataset. This challenge aims to predict future sales of different products for the next month in different shops of a retail chain. In this notebook we do not aim at solving this challenge, rather, we will explore some concepts of time series forecasting that could be used to solve such problem.

You can download the Jupyter notebook version of this tutorial here.

Exploring the dataset

First we need to download the whole dataset and extract it to a data folder in the root of this notebook path.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact_manual'ggplot')
%matplotlib inline
#load sales and rename columns to ease understanding
dtype = {'date_block_number':np.uint16, 'shop_id':np.uint32, 'item_id':np.uint32, 'item_cnt_day':np.float32}
sales = pd.read_csv('data/sales_train.csv', dtype=dtype) \
          .rename(columns={'date_block_num': 'month', 'item_cnt_day':'sold'})

date month shop_id item_id item_price sold
0 02.01.2013 0 59 22154 999.00 1.0
1 03.01.2013 0 25 2552 899.00 1.0
2 05.01.2013 0 25 2552 899.00 -1.0
3 06.01.2013 0 25 2554 1709.05 1.0
4 15.01.2013 0 25 2555 1099.00 1.0

As we can see, we have daily records for each shop and item. Note that some of the sold values are negative because they include returns.

Firstly, since we are interested in estimating the monthly sales, we will aggregate the dataset by month using the handy month feature, which ranges from 0 (representing January 2013) to 33 (October 2015). We will group the sales by the shop and item ids. The aggregation will sum all sold fields during the month for each product and shop, and average the item_price (it is likely to change from month to month).

gsales = sales.groupby(['shop_id','item_id','month']).agg({'item_price':np.mean, 'sold':np.sum})

item_price sold
shop_id item_id month
0 30 1 265.0 31.0
31 1 434.0 11.0
32 0 221.0 6.0
1 221.0 10.0
33 0 347.0 3.0
1 347.0 3.0
35 0 247.0 1.0
1 247.0 14.0
36 1 357.0 1.0
40 1 127.0 1.0
42 1 127.0 1.0
43 0 221.0 1.0
49 1 127.0 2.0
51 0 128.5 2.0
1 127.0 3.0
57 1 167.0 1.0
59 1 110.0 1.0
61 0 195.0 1.0
75 0 76.0 1.0
85 1 190.0 1.0

We can now observe that for the first shop many items only have selling records for the first two months. We would like to investigate how this varies for different products across all 60 shops.

@interact_manual(shop_id = (0,59,1))
def plot_product_record_frequency(shop_id):
    count_months = gsales.reset_index().groupby(['shop_id','item_id']).size()[shop_id], count_months)
    plt.ylabel('Num months available')
interactive(children=(IntSlider(value=29, description='shop_id', max=59), Button(description='Run Interact', s…

However, this interactive visualisation will not work unless you have a notebook running, so we will plot the sales frequency for some of the stores below:

def plot_product_record_frequency(shop_id):
    count_months = gsales.reset_index().groupby(['shop_id','item_id']).size()[shop_id], count_months)
    plt.ylabel('Num months available')
    plt.title(f'shop_id {shop_id}')
fig=plt.figure(figsize=(12, 8), dpi= 80, facecolor='w', edgecolor='k')    
for i, shop_id in enumerate([0,1,29,31]):


We can observe that some shops have a very limited record of sales, e.g. shop 0 and 1. On the other hand, shop 29 and 31 have a considerable number of items with a record above 10 weeks.

Perhaps we should look into the shops with the larger number of sales, which we can inspect by observing the distribution of shop_id:



If we observe the previous histogram for the number of weeks of records for each product of a given shop, we will observe indeed that the shop 31 has a comprehensive number of products with a significant history of sales.

Considering the behaviour of multiple shops introduce complexity steaming from this biased distribution of sales. Some shops will have very little preior information about their sales, so it would be difficult to make a good prediction of sales in these shops. For this reason we will neglect different shops and instead try to estimate a new volume of sales for all products in the whole supermarket chain. So, we modify our aggregated sales DataFrame gsales accordingly:

gsales = sales.groupby(['item_id','month']).agg({'item_price':np.mean, 'sold':np.sum})

We can also observe what is the number of months available for each product sale history in this chain-wide collection:

count_months = gsales.reset_index().groupby(['item_id']).size(),count_months);
plt.ylabel('Num months available');


Although in a real forecast scenario we would like to have a good prediction of sales even when the history for a given product is minimal, in this example we will focus on products with full history for all 34 months.

productsIdx = count_months[count_months == 34].index.values
selected_sales = gsales.loc[productsIdx]

We obtained 523 unique products that have a selling history for all of the 33 months. An example of the first 60 items are below:


item_price sold
item_id month
32 0 338.110349 299.0
1 337.771930 208.0
2 343.794702 178.0
3 341.888889 97.0
4 347.000000 66.0
5 342.070886 79.0
6 345.951190 87.0
7 340.172727 72.0
8 340.017544 59.0
9 184.592593 58.0
10 144.316456 81.0
11 147.994444 89.0
12 144.710526 84.0
13 144.066667 48.0
14 149.000000 44.0
15 143.714286 30.0
16 149.000000 26.0
17 149.000000 26.0
18 130.500000 12.0
19 148.991176 34.0
20 144.771429 37.0
21 148.991667 37.0
22 146.060714 29.0
23 149.000000 40.0
24 145.572727 42.0
25 146.387333 32.0
26 146.869167 40.0
27 149.000000 20.0
28 149.000000 20.0
29 149.000000 26.0
30 149.000000 21.0
31 148.714286 30.0
32 149.000000 19.0
33 149.000000 22.0
33 0 488.517241 61.0
1 484.170732 39.0
2 490.870968 32.0
3 489.500000 16.0
4 499.000000 12.0
5 205.046512 44.0
6 195.439130 46.0
7 197.277778 35.0
8 198.052381 43.0
9 195.915152 33.0
10 194.866667 15.0
11 195.900000 42.0
12 197.487805 42.0
13 196.862069 29.0
14 197.673333 30.0
15 199.000000 18.0
16 196.658824 17.0
17 199.000000 21.0
18 199.000000 14.0
19 197.756250 17.0
20 199.000000 13.0
21 199.000000 14.0
22 195.460000 20.0
23 199.000000 21.0
24 199.000000 19.0
25 199.000000 26.0

We have explored the dataset and observed some items of interest. The next step is using this information to predict the number of sales over another month

Time Series Analysis

Firstly, let’s observe what the sales time series look like for some of the products:

fig=plt.figure(figsize=(12, 8), dpi= 80, facecolor='w', edgecolor='k')
for i, idx in enumerate(selected_sales.index.get_level_values(0).unique()[10:20]):
    plt.title(f'Sales for item_id {idx}')


We can see some seasonality patterns in the plotted data, so we will decompose one of the observed time series into a trend and seasonal effects with a multiplicative model $Y(t) = T(t) S(t) r(t)$ where $T(t)$ is the trend, $S(t)$ the the seasonal component and $r(t)$ is the residual.

from statsmodels.tsa.seasonal import seasonal_decompose
item_id = 491
sales_product = selected_sales.loc[item_id]['sold'].values
dec = seasonal_decompose(sales_product, model='multiplicative', period=12)


We observed that the residual values tend to be close to 1, which means the observed series has a good fit to the seasonal and trend decomposition.

This analysis shows that the observed time series have a trend that decreases throughout the months, and a seasonal component, which seems to peak around August and December. This decomposition could be exploited to improve prediction results, however we would require a model that incorporate this seasonality pattern.

Time series forecasting

In this section we will evaluate two different models for time series forecasting: Auto Regressive Integrated Moving Average (ARIMA) and Gaussian Process (GP).

Firstly, we train the models on each product time series up to month 30. Then, we evaluate for the remaining of months available across all selected products. Our metric will be the Root Mean Squared Error (RMSE) computed with the predicted and ground-truth time series. Please note that this metric is the same as used in the original challenge.

Gaussian Processes

GPs are non-parametric models that can represent a posterior over functions. They work well when we do not wish to impose strong assumptions on the generative data process. For a more detailed explanation of GPs, please visit this distill post.

We chose experimented with a multitude of kernels, the best performing were the simpler ones: RBF and Matern. Note that the parameters of these kernels are optimised (within the given bounds) during the fit process. Please note that we normalise the target sales by the maximum value such that the target number of sales ranges from [0,1].

First, let’s observe what happens using a single item_id and extrapolating between the months:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF, Matern, ExpSineSquared, ConstantKernel

kernel = RBF()
gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer=2, alpha = 1e-5, normalize_y=True)
item_id = 53
X = np.arange(0,34,0.05).reshape(-1,1)
Y = selected_sales['sold'][item_id].values.reshape(-1,1)
ymax = Y.max(),1), Y[:30]/ymax)
Y_pred, Y_pred_std = gp.predict(X, return_std=True)
Y_pred = Y_pred.reshape(-1)*ymax

fig=plt.figure(figsize=(6, 4), dpi= 80, facecolor='w', edgecolor='k')
plt.plot(X, Y_pred, 'r.', label='Pred')
plt.fill_between(X.reshape(-1), Y_pred - Y_pred_std, Y_pred + Y_pred_std, alpha=0.2, color='k');
plt.plot(np.arange(34), Y, 'bx', label='GT')
plt.axvline(29.5, 0, 1, color='black', ls='--', label='Train/Test split')


Now we will create a GP for each item_id time series within our selected sales:

Y_preds_gp = []
Y_gts = []
for item_id in selected_sales.index.get_level_values(0).unique():
    X = np.arange(34).reshape(-1,1)
    X_train, X_test = X[:30], X[30:]
    Y = selected_sales['sold'][item_id].values.reshape(-1,1)
    Y_train, Y_test = Y[:30], Y[30:]
    ymax = Y_train.max()
    kernel = RBF()
    gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer=0, alpha = 1e-2, normalize_y=True), Y_train/ymax)
    ypred = gp.predict(X_test)*ymax
Y_preds_gp = np.concatenate(Y_preds_gp, axis=0)
Y_gts = np.concatenate(Y_gts, axis=0)


Auto Regressive Integrated Moving Average models the time series using $$Y(t) = \alpha + \beta_1 Y(t-1) + \beta_2 Y(t-2) + \dots + \beta_p Y(t-p) + \gamma_1 \epsilon(t-1) + \gamma_2 \epsilon(t-2) + \dots + \gamma_q \epsilon(t-q) + \epsilon(t)$$ where $\epsilon(t)$ is the residual from the ground-truth and estimated value. The parameters $\alpha, \beta, \gamma$ are optimised during fitting. The hyper-parameters $p,d,q$ correspond to the order of the process, i.e. how many terms of previous timestamps, how many previous error terms and how many times to differentiate the time series until it becomes stationary.

For simplicity, we assume our time-series are stationary and use $p=1$, $q=0$, $d=0$

Again, let’s start by visualising a single time-series and the resulting ARIMA prediction.

from statsmodels.tsa.arima_model import ARIMA

item_id = 53
Y = selected_sales['sold'][item_id].values.reshape(-1,1)

model = ARIMA(Y[:30], order=(1,0,0)).fit(trend='nc')
Y_pred = model.predict(0,33)

fig=plt.figure(figsize=(6, 4), dpi= 80, facecolor='w', edgecolor='k')
plt.plot(X, Y_pred, 'r.', label='Pred')
plt.plot(np.arange(34), Y, 'bx', label='GT')
plt.axvline(29.5, 0, 1, color='black', ls='--', label='Train/Test split')


We may also observe the approximate distribution of the residuals:

residuals = pd.DataFrame(model.resid)


Next, we forecast for all items in the selected sales for the remaining 4 months (30,31,32,33):

Y_preds_arima = []
for item_id in selected_sales.index.get_level_values(0).unique():  
    Y = selected_sales['sold'][item_id].values.reshape(-1,1)
    Y_train, Y_test = Y[:30], Y[30:]
    ypred = gp.predict(X_test)
    model = ARIMA(Y[:30], order=(1,0,0)).fit(trend='nc')
    ypred = model.predict(30,33)
Y_preds_arima = np.concatenate(Y_preds_arima, axis=0)


Finally computing the RMSE metric between predictions (of GP and ARIMA) and ground-truths for all selected items for the remaining 4 months (30,31,32,33):

from sklearn.metrics import mean_squared_error as mse
rmse_gp = mse(Y_gts, Y_preds_gp, squared=False)
rmse_arima = mse(Y_gts, Y_preds_arima, squared=False)
print(f"RMSE GP {rmse_gp} \nRMSE ARIMA {rmse_arima}")
RMSE GP 58.58414105624866 
RMSE ARIMA 50.27118697649368

Knowingly, forecasting 4 months into the future would be difficult. Instead we could consider only the 30th month forecast of all items:

rmse_gp = mse(Y_gts[::4], Y_preds_gp[::4], squared=False)
rmse_arima = mse(Y_gts[::4], Y_preds_arima[::4], squared=False)
print(f"RMSE GP {rmse_gp} \nRMSE ARIMA {rmse_arima}")
RMSE GP 28.212626816686967 
RMSE ARIMA 7.92125195871772


We observed that the ARIMA model performed better under the RMSE metric for the presented dataset. Still, both RMSE are quite high and should be reduced if the model were to be used in practice.

To further improve the results and to account to a more realistic setting where some items or shops will have a limited history of sales, we must work on feature engineering to overcome these limitations. Example of features that could be exploited include, but are not limited to:

  1. Shop-specific aggregated stats (mean num of sales)
  2. City-specific aggregated stats (mean num of sales, pop. size, etc).
  3. Item categories aggregated stats (mean num of sales for a specific product category)
  4. Item price and price variations

Further reading

If you are interested in time series forecasting I highly recommend the above blog posts:

Eduardo Arnold
Eduardo Arnold
Machine Learning Engineer

I’m a Machine Learning Engineer at Niantic. Previously, I obtained my PhD degree at the University of Warwick, supervised by Mehrdad Dianati and Paul Jennings, and focusing on perception methods for autonomous driving.