Forecasting sales with Gaussian Processes and Autoregressive models

Jul 23, 2020 10 min read

Problem definition

Sales forecasting is a very common problem faced by many companies. Given a history of sales of a certain product we would like to predict the demand of that product for a time window in the future. This is useful as it allows companies and industries to plan their workload and reduce waste of resources. This problem is a common example of time series forecasting and there are many approaches to tackle it.

Today we will learn a bit more about this with a practical example: the Kaggle Predict Future Sales dataset. This challenge aims to predict future sales of different products for the next month in different shops of a retail chain. In this notebook we do not aim at solving this challenge, rather, we will explore some concepts of time series forecasting that could be used to solve such problem.

You can download the Jupyter notebook version of this tutorial here.

Exploring the dataset

First we need to download the whole dataset and extract it to a data folder in the root of this notebook path.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact_manual
plt.style.use('ggplot')
%matplotlib inline

#load sales and rename columns to ease understanding
dtype = {'date_block_number':np.uint16, 'shop_id':np.uint32, 'item_id':np.uint32, 'item_cnt_day':np.float32}
sales = pd.read_csv('data/sales_train.csv', dtype=dtype) \
          .rename(columns={'date_block_num': 'month', 'item_cnt_day':'sold'})

sales.head()

	date	shop_id	item_id	item_price	sold
0	02.01.2013	59	22154	999.00	1.0
1	03.01.2013	25	2552	899.00	1.0
2	05.01.2013	25	2552	899.00	-1.0
3	06.01.2013	25	2554	1709.05	1.0
4	15.01.2013	25	2555	1099.00	1.0

As we can see, we have daily records for each shop and item. Note that some of the sold values are negative because they include returns.

Firstly, since we are interested in estimating the monthly sales, we will aggregate the dataset by month using the handy month feature, which ranges from 0 (representing January 2013) to 33 (October 2015). We will group the sales by the shop and item ids. The aggregation will sum all sold fields during the month for each product and shop, and average the item_price (it is likely to change from month to month).

gsales = sales.groupby(['shop_id','item_id','month']).agg({'item_price':np.mean, 'sold':np.sum})

gsales[:20]

			item_price	sold
shop_id	item_id	month
0	30	1	265.0	31.0
	31	1	434.0	11.0
	32	0	221.0	6.0
	32	1	221.0	10.0
	33	0	347.0	3.0
	33	1	347.0	3.0
	35	0	247.0	1.0
	35	1	247.0	14.0
	36	1	357.0	1.0
	40	1	127.0	1.0
	42	1	127.0	1.0
	43	0	221.0	1.0
	49	1	127.0	2.0
	51	0	128.5	2.0
	51	1	127.0	3.0
	57	1	167.0	1.0
	59	1	110.0	1.0
	61	0	195.0	1.0
	75	0	76.0	1.0
	85	1	190.0	1.0

We can now observe that for the first shop many items only have selling records for the first two months. We would like to investigate how this varies for different products across all 60 shops.

@interact_manual(shop_id = (0,59,1))
def plot_product_record_frequency(shop_id):
    count_months = gsales.reset_index().groupby(['shop_id','item_id']).size()[shop_id]
    plt.bar(count_months.keys(), count_months)
    plt.xlabel('product_id')
    plt.ylabel('Num months available')

interactive(children=(IntSlider(value=29, description='shop_id', max=59), Button(description='Run Interact', s…

However, this interactive visualisation will not work unless you have a notebook running, so we will plot the sales frequency for some of the stores below:

def plot_product_record_frequency(shop_id):
    count_months = gsales.reset_index().groupby(['shop_id','item_id']).size()[shop_id]
    plt.bar(count_months.keys(), count_months)
    plt.xlabel('product_id')
    plt.ylabel('Num months available')
    plt.title(f'shop_id {shop_id}')
    
fig=plt.figure(figsize=(12, 8), dpi= 80, facecolor='w', edgecolor='k')    
for i, shop_id in enumerate([0,1,29,31]):
    plt.subplot(2,2,i+1)
    plot_product_record_frequency(shop_id)
plt.tight_layout()

png

We can observe that some shops have a very limited record of sales, e.g. shop 0 and 1. On the other hand, shop 29 and 31 have a considerable number of items with a record above 10 weeks.

Perhaps we should look into the shops with the larger number of sales, which we can inspect by observing the distribution of shop_id:

sales['shop_id'].plot.hist(bins=60)
plt.xlabel('product_id');

png

If we observe the previous histogram for the number of weeks of records for each product of a given shop, we will observe indeed that the shop 31 has a comprehensive number of products with a significant history of sales.

Considering the behaviour of multiple shops introduce complexity steaming from this biased distribution of sales. Some shops will have very little preior information about their sales, so it would be difficult to make a good prediction of sales in these shops. For this reason we will neglect different shops and instead try to estimate a new volume of sales for all products in the whole supermarket chain. So, we modify our aggregated sales DataFrame gsales accordingly:

gsales = sales.groupby(['item_id','month']).agg({'item_price':np.mean, 'sold':np.sum})

We can also observe what is the number of months available for each product sale history in this chain-wide collection:

count_months = gsales.reset_index().groupby(['item_id']).size()
plt.bar(count_months.keys(),count_months);
plt.xlabel('product_id')
plt.ylabel('Num months available');

png

Although in a real forecast scenario we would like to have a good prediction of sales even when the history for a given product is minimal, in this example we will focus on products with full history for all 34 months.

productsIdx = count_months[count_months == 34].index.values
selected_sales = gsales.loc[productsIdx]
productsIdx.shape

(523,)

We obtained 523 unique products that have a selling history for all of the 33 months. An example of the first 60 items are below:

selected_sales[:60]

		item_price	sold
item_id	month
32	0	338.110349	299.0
	1	337.771930	208.0
	2	343.794702	178.0
	3	341.888889	97.0
	4	347.000000	66.0
	5	342.070886	79.0
	6	345.951190	87.0
	7	340.172727	72.0
	8	340.017544	59.0
	9	184.592593	58.0
	10	144.316456	81.0
	11	147.994444	89.0
	12	144.710526	84.0
	13	144.066667	48.0
	14	149.000000	44.0
	15	143.714286	30.0
	16	149.000000	26.0
	17	149.000000	26.0
	18	130.500000	12.0
	19	148.991176	34.0
	20	144.771429	37.0
	21	148.991667	37.0
	22	146.060714	29.0
	23	149.000000	40.0
	24	145.572727	42.0
	25	146.387333	32.0
	26	146.869167	40.0
	27	149.000000	20.0
	28	149.000000	20.0
	29	149.000000	26.0
	30	149.000000	21.0
	31	148.714286	30.0
	32	149.000000	19.0
	33	149.000000	22.0
33	0	488.517241	61.0
	1	484.170732	39.0
	2	490.870968	32.0
	3	489.500000	16.0
	4	499.000000	12.0
	5	205.046512	44.0
	6	195.439130	46.0
	7	197.277778	35.0
	8	198.052381	43.0
	9	195.915152	33.0
	10	194.866667	15.0
	11	195.900000	42.0
	12	197.487805	42.0
	13	196.862069	29.0
	14	197.673333	30.0
	15	199.000000	18.0
	16	196.658824	17.0
	17	199.000000	21.0
	18	199.000000	14.0
	19	197.756250	17.0
	20	199.000000	13.0
	21	199.000000	14.0
	22	195.460000	20.0
	23	199.000000	21.0
	24	199.000000	19.0
	25	199.000000	26.0

We have explored the dataset and observed some items of interest. The next step is using this information to predict the number of sales over another month

Time Series Analysis

Firstly, let’s observe what the sales time series look like for some of the products:

fig=plt.figure(figsize=(12, 8), dpi= 80, facecolor='w', edgecolor='k')
for i, idx in enumerate(selected_sales.index.get_level_values(0).unique()[10:20]):
    plt.subplot(5,2,i+1)
    selected_sales.loc[idx]['sold'].plot(style='r.')
    plt.title(f'Sales for item_id {idx}')
    plt.tight_layout()

png

We can see some seasonality patterns in the plotted data, so we will decompose one of the observed time series into a trend and seasonal effects with a multiplicative model $Y(t) = T(t) S(t) r(t)$ where $T(t)$ is the trend, $S(t)$ the the seasonal component and $r(t)$ is the residual.

from statsmodels.tsa.seasonal import seasonal_decompose
item_id = 491
sales_product = selected_sales.loc[item_id]['sold'].values
dec = seasonal_decompose(sales_product, model='multiplicative', period=12)
dec.plot();

png

We observed that the residual values tend to be close to 1, which means the observed series has a good fit to the seasonal and trend decomposition.

This analysis shows that the observed time series have a trend that decreases throughout the months, and a seasonal component, which seems to peak around August and December. This decomposition could be exploited to improve prediction results, however we would require a model that incorporate this seasonality pattern.

Time series forecasting

In this section we will evaluate two different models for time series forecasting: Auto Regressive Integrated Moving Average (ARIMA) and Gaussian Process (GP).

Firstly, we train the models on each product time series up to month 30. Then, we evaluate for the remaining of months available across all selected products. Our metric will be the Root Mean Squared Error (RMSE) computed with the predicted and ground-truth time series. Please note that this metric is the same as used in the original challenge.

Gaussian Processes

GPs are non-parametric models that can represent a posterior over functions. They work well when we do not wish to impose strong assumptions on the generative data process. For a more detailed explanation of GPs, please visit this distill post.

We chose experimented with a multitude of kernels, the best performing were the simpler ones: RBF and Matern. Note that the parameters of these kernels are optimised (within the given bounds) during the fit process. Please note that we normalise the target sales by the maximum value such that the target number of sales ranges from [0,1].

First, let’s observe what happens using a single item_id and extrapolating between the months:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF, Matern, ExpSineSquared, ConstantKernel

kernel = RBF()
gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer=2, alpha = 1e-5, normalize_y=True)

item_id = 53
X = np.arange(0,34,0.05).reshape(-1,1)
Y = selected_sales['sold'][item_id].values.reshape(-1,1)
ymax = Y.max()

gp.fit(np.arange(30).reshape(-1,1), Y[:30]/ymax)
Y_pred, Y_pred_std = gp.predict(X, return_std=True)
Y_pred = Y_pred.reshape(-1)*ymax

fig=plt.figure(figsize=(6, 4), dpi= 80, facecolor='w', edgecolor='k')
plt.plot(X, Y_pred, 'r.', label='Pred')
plt.fill_between(X.reshape(-1), Y_pred - Y_pred_std, Y_pred + Y_pred_std, alpha=0.2, color='k');
plt.plot(np.arange(34), Y, 'bx', label='GT')
plt.axvline(29.5, 0, 1, color='black', ls='--', label='Train/Test split')
plt.legend();

png

Now we will create a GP for each item_id time series within our selected sales:

Y_preds_gp = []
Y_gts = []
for item_id in selected_sales.index.get_level_values(0).unique():
    X = np.arange(34).reshape(-1,1)
    X_train, X_test = X[:30], X[30:]
    
    Y = selected_sales['sold'][item_id].values.reshape(-1,1)
    Y_train, Y_test = Y[:30], Y[30:]
    ymax = Y_train.max()
    
    kernel = RBF()
    gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer=0, alpha = 1e-2, normalize_y=True)
    gp.fit(X_train, Y_train/ymax)
    ypred = gp.predict(X_test)*ymax
    
    Y_preds_gp.append(ypred)
    Y_gts.append(Y_test)
    
Y_preds_gp = np.concatenate(Y_preds_gp, axis=0)
Y_gts = np.concatenate(Y_gts, axis=0)

ARIMA

Auto Regressive Integrated Moving Average models the time series using $$Y(t) = \alpha + \beta_1 Y(t-1) + \beta_2 Y(t-2) + \dots + \beta_p Y(t-p) + \gamma_1 \epsilon(t-1) + \gamma_2 \epsilon(t-2) + \dots + \gamma_q \epsilon(t-q) + \epsilon(t)$$ where $\epsilon(t)$ is the residual from the ground-truth and estimated value. The parameters $\alpha, \beta, \gamma$ are optimised during fitting. The hyper-parameters $p,d,q$ correspond to the order of the process, i.e. how many terms of previous timestamps, how many previous error terms and how many times to differentiate the time series until it becomes stationary.

For simplicity, we assume our time-series are stationary and use $p=1$, $q=0$, $d=0$

Again, let’s start by visualising a single time-series and the resulting ARIMA prediction.

from statsmodels.tsa.arima_model import ARIMA

item_id = 53
Y = selected_sales['sold'][item_id].values.reshape(-1,1)

model = ARIMA(Y[:30], order=(1,0,0)).fit(trend='nc')
Y_pred = model.predict(0,33)

fig=plt.figure(figsize=(6, 4), dpi= 80, facecolor='w', edgecolor='k')
plt.plot(X, Y_pred, 'r.', label='Pred')
plt.plot(np.arange(34), Y, 'bx', label='GT')
plt.axvline(29.5, 0, 1, color='black', ls='--', label='Train/Test split')
plt.legend();

png

We may also observe the approximate distribution of the residuals:

residuals = pd.DataFrame(model.resid)
residuals.plot.kde();

png

Next, we forecast for all items in the selected sales for the remaining 4 months (30,31,32,33):

Y_preds_arima = []
for item_id in selected_sales.index.get_level_values(0).unique():  
    Y = selected_sales['sold'][item_id].values.reshape(-1,1)
    Y_train, Y_test = Y[:30], Y[30:]
    
    ypred = gp.predict(X_test)
    model = ARIMA(Y[:30], order=(1,0,0)).fit(trend='nc')
    ypred = model.predict(30,33)
    
    Y_preds_arima.append(ypred)
    
Y_preds_arima = np.concatenate(Y_preds_arima, axis=0)

Evaluation

Finally computing the RMSE metric between predictions (of GP and ARIMA) and ground-truths for all selected items for the remaining 4 months (30,31,32,33):

from sklearn.metrics import mean_squared_error as mse
rmse_gp = mse(Y_gts, Y_preds_gp, squared=False)
rmse_arima = mse(Y_gts, Y_preds_arima, squared=False)
print(f"RMSE GP {rmse_gp} \nRMSE ARIMA {rmse_arima}")

RMSE GP 58.58414105624866 
RMSE ARIMA 50.27118697649368

Knowingly, forecasting 4 months into the future would be difficult. Instead we could consider only the 30th month forecast of all items:

rmse_gp = mse(Y_gts[::4], Y_preds_gp[::4], squared=False)
rmse_arima = mse(Y_gts[::4], Y_preds_arima[::4], squared=False)
print(f"RMSE GP {rmse_gp} \nRMSE ARIMA {rmse_arima}")

RMSE GP 28.212626816686967 
RMSE ARIMA 7.92125195871772

Conclusion

We observed that the ARIMA model performed better under the RMSE metric for the presented dataset. Still, both RMSE are quite high and should be reduced if the model were to be used in practice.

To further improve the results and to account to a more realistic setting where some items or shops will have a limited history of sales, we must work on feature engineering to overcome these limitations. Example of features that could be exploited include, but are not limited to:

Shop-specific aggregated stats (mean num of sales)
City-specific aggregated stats (mean num of sales, pop. size, etc).
Item categories aggregated stats (mean num of sales for a specific product category)
Item price and price variations