Time Series Forecasting with a SARIMA Model

Predicting the future is a tricky task: in many situations there are too many factors at play to be certain. In stock market analysis, for instance, the academic convention is that the best predictor of tomorrow is the state of the market today. Still, there is plenty of room for data-driven forecasting models, whether through text analysis or by including other economic indicators in a multivariate analysis. One feature that can help is a seasonal pattern of change. In this example I will apply a seasonal ARIMA (SARIMA) model to forecast the production of utilities.

ARIMA stands for AutoRegressive Integrated Moving Average:
Autoregression (AR) models a changing variable as a regression on its own prior values.
Integrated (I) refers to differencing the raw observations, so that each value is replaced by the difference between it and the previous value.
Moving average (MA) models the dependency between an observation and the residual errors from a moving average model applied to lagged observations.

A seasonal ARIMA (SARIMA) extends this with seasonal counterparts of the same three terms, (P, D, Q), applied at a fixed seasonal lag m (here m = 12, since the data is monthly).
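To make the differencing step concrete, here is a minimal sketch using pandas (illustrative only; auto_arima will handle the differencing for us later):

import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 14.0])
# first-order differencing (d=1): each value minus its predecessor
print(s.diff().dropna())  # 2.0, 3.0, -1.0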

The dataset I'll be using describes the monthly industrial production of electric and gas utilities in the United States from January 1985 to July 2018, where each value is the production output for that month. It is available here: https://fred.stlouisfed.org/series/IPG2211A2N

Let's get started!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pmdarima import auto_arima
import warnings
from pylab import rcParams
import statsmodels.api as sm
In [2]:
df = pd.read_csv('electric_production_dataset.csv',index_col=0)
df.head()
Out[2]:
IPG2211A2N
DATE
1985-01-01 72.6803
1985-02-01 70.8479
1985-03-01 62.6166
1985-04-01 57.6106
1985-05-01 55.4467

Before we move on, let's rename the energy production column and, importantly, convert the DATE index to datetime format:

In [3]:
df.columns = ['Energy Production']
df.index = pd.to_datetime(df.index)
df.index
Out[3]:
DatetimeIndex(['1985-01-01', '1985-02-01', '1985-03-01', '1985-04-01',
               '1985-05-01', '1985-06-01', '1985-07-01', '1985-08-01',
               '1985-09-01', '1985-10-01',
               ...
               '2017-10-01', '2017-11-01', '2017-12-01', '2018-01-01',
               '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01',
               '2018-06-01', '2018-07-01'],
              dtype='datetime64[ns]', name='DATE', length=403, freq=None)
In [4]:
df.head()
Out[4]:
Energy Production
DATE
1985-01-01 72.6803
1985-02-01 70.8479
1985-03-01 62.6166
1985-04-01 57.6106
1985-05-01 55.4467
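Note that freq=None in the DatetimeIndex output above. As an optional extra step (not in the original flow), you can declare the monthly frequency explicitly, which lets statsmodels infer the seasonal period on its own:

# optional: mark the index as monthly ("month start") frequency
df = df.asfreq('MS')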

Looking at the data on the graphs, one can see that it exhibits a strong seasonal pattern, which we'll try to capture with the model, and a positive trend (though the trend has become less prominent in the last decade).

The residuals (visible in the decomposition below) grow as the amplitude of the observed values increases, a hint that a multiplicative model could also be worth trying.

In [5]:
warnings.filterwarnings("ignore")
plt.figure(figsize = (15,5))
plt.plot(df, 'b')
plt.title('Energy Production by Years')
plt.xlabel('Years')
plt.ylabel('Amount')
plt.autoscale(enable=True, axis='x', tight=True)
In [6]:
rcParams['figure.figsize'] = 15, 8
# note: depending on your statsmodels version, seasonal_decompose may need
# an explicit period (e.g. period=12) if the index has no frequency set
decomposition = sm.tsa.seasonal_decompose(df, model='additive')
fig = decomposition.plot()
plt.show()
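As an additional sanity check (not part of the original notebook), an Augmented Dickey-Fuller test can back up the need for differencing; a high p-value means the null hypothesis of non-stationarity cannot be rejected:

from statsmodels.tsa.stattools import adfuller

# ADF test: the null hypothesis is that the series is non-stationary
adf_stat, p_value = adfuller(df['Energy Production'])[:2]
print('ADF statistic: {}, p-value: {}'.format(round(adf_stat, 3), round(p_value, 3)))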

Now that the data has been prepared and explored, let's create a model for the time series.

First, it is essential that we split the data into train and test sets.

Training the model on the very data that we want to predict would naturally yield high accuracy, but such an experiment would make no sense: there would be no need for a prediction in the first place if we already had the true values.

It is also important that we do not split the data randomly: a time series must be split chronologically, otherwise observations from the future would leak into training and bias the model.
Let's create the train and test sets with two slices of dates:

In [7]:
train_percent = 0.9
split_point = round(len(df) * train_percent)
train, test = df[0:split_point], df[split_point:]
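As a quick check, we can confirm that the two slices are contiguous and that the test set covers the most recent dates:

print('Train: {} rows, {} to {}'.format(len(train), train.index.min().date(), train.index.max().date()))
print('Test: {} rows, {} to {}'.format(len(test), test.index.min().date(), test.index.max().date()))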

In this example, we'll be using the auto_arima algorithm (you may need to install the pmdarima library with pip first).

With stepwise=True, the algorithm performs a stepwise search (rather than an exhaustive grid search) to choose the best combination of values for the p, d and q hyperparameters and for their seasonal counterparts P, D and Q. This automated, seasonality-aware tuning is what auto_arima adds on top of fitting a SARIMA model by hand.

As a result, we don't need to difference the time series ourselves or plot the autocorrelation and partial autocorrelation functions to find the best p, d and q.

We use only the train subset to tune the model.

In [8]:
model = auto_arima(train, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                   start_P=0, seasonal=True, d=1, D=1, trace=True,
                   error_action='ignore', suppress_warnings=True,
                   stepwise=True)
print(model.aic())
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=1584.316, BIC=1603.605, Fit time=6.040 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=1782.450, BIC=1790.166, Fit time=0.122 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=1722.110, BIC=1737.541, Fit time=1.456 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=1636.882, BIC=1652.314, Fit time=1.700 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=1584.548, BIC=1607.695, Fit time=9.040 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=1698.511, BIC=1713.943, Fit time=3.930 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=1583.678, BIC=1606.825, Fit time=17.366 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=1636.881, BIC=1656.171, Fit time=5.248 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=1584.168, BIC=1611.173, Fit time=13.254 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 1, 2, 12); AIC=1651.354, BIC=1670.644, Fit time=5.574 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 2, 12); AIC=1583.672, BIC=1610.677, Fit time=35.658 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 1, 2, 12); AIC=1587.457, BIC=1622.179, Fit time=47.227 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=1582.757, BIC=1613.621, Fit time=32.061 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=1584.632, BIC=1611.637, Fit time=9.288 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=1584.626, BIC=1607.773, Fit time=6.734 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=1592.022, BIC=1619.028, Fit time=16.755 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=1584.507, BIC=1619.228, Fit time=23.405 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1582.935, BIC=1609.941, Fit time=20.793 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(1, 1, 2, 12); AIC=1590.278, BIC=1625.000, Fit time=37.459 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1636.678, BIC=1659.826, Fit time=9.025 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(1, 1, 2, 12); AIC=1586.247, BIC=1624.826, Fit time=38.622 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(2, 1, 2, 12); AIC=1581.643, BIC=1616.365, Fit time=34.679 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=1579.567, BIC=1610.431, Fit time=16.705 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 0, 12); AIC=1651.191, BIC=1674.338, Fit time=9.625 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=1588.757, BIC=1615.762, Fit time=11.785 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=1582.704, BIC=1617.426, Fit time=32.387 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1579.239, BIC=1606.244, Fit time=20.590 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(2, 1, 1, 12); AIC=1656.363, BIC=1675.653, Fit time=3.883 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 0, 12); AIC=1605.982, BIC=1629.130, Fit time=15.863 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 2, 12); AIC=1581.043, BIC=1611.906, Fit time=19.535 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 0, 12); AIC=1650.638, BIC=1669.928, Fit time=7.602 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1630.991, BIC=1654.138, Fit time=6.979 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1579.872, BIC=1610.736, Fit time=28.255 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(2, 1, 1, 12); AIC=1645.320, BIC=1668.467, Fit time=5.739 seconds
Total fit time: 554.603 seconds
1579.23873632733
In [9]:
model.fit(train)
Out[9]:
ARIMA(callback=None, disp=0, maxiter=None, method=None, order=(1, 1, 1),
      out_of_sample_size=0, scoring='mse', scoring_args={},
      seasonal_order=(2, 1, 1, 12), solver='lbfgs', start_params=None,
      suppress_warnings=True, transparams=True, trend=None,
      with_intercept=True)
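To inspect the fitted coefficients and their significance (an optional diagnostic step), pmdarima exposes the underlying statsmodels summary:

# coefficients, standard errors and information criteria of the fitted model
print(model.summary())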

Now we need to check the size of the test sample, since it determines the forecast horizon (the n_periods argument) for prediction:

In [10]:
test.size
Out[10]:
40
In [11]:
predict = model.predict(n_periods=40)
print(predict)
[ 91.77814897  93.44740783 103.41147879 112.28882753 111.37283204
 100.25792434  92.17206033  96.55013439 109.70998714 118.35695065
 110.98928129 101.99053351  91.11172508  92.8935815  103.14243531
 112.64820233 111.48774341 100.20129035  92.24185877  96.31983819
 110.67316708 119.26577684 110.50358569 102.20673495  90.88382048
  92.7303268  102.93810955 112.08880183 111.04489312 100.3312074
  92.32724311  96.87622119 110.61511662 119.62514448 111.51466726
 102.6356859   91.05419074  92.87252427 103.02046889 112.04817408]
In [14]:
predict = pd.DataFrame(predict, index = test.index, columns=['Prediction'])
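pmdarima can also return confidence intervals alongside the point forecast, which is useful for visualizing forecast uncertainty; a minimal sketch (alpha=0.05 gives 95% intervals):

# point forecast plus 95% confidence intervals
forecast, conf_int = model.predict(n_periods=test.size, return_conf_int=True, alpha=0.05)
lower, upper = conf_int[:, 0], conf_int[:, 1]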

The plot of the forecasted values against the actual ones shows that the model, although trained on only part of the data, did a good job capturing the seasonal pattern and produced an accurate forecast. We can also calculate the RMSE of the prediction:

In [15]:
plt.plot(predict, c = 'r', label = 'Predicted Value')
plt.plot(test, c = 'b', label = 'Actual Value')
plt.legend(loc = 'upper left')
plt.show()
In [16]:
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test, predict))
print('The Root Mean Square Error of the forecasts is {}'.format(round(rms, 2)))
The Root Mean Square Error of the forecasts is 3.82
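To put this RMSE into perspective (an extra step, not part of the original notebook), we can express it relative to the scale of the series, e.g. as a percentage of the mean actual value:

# RMSE as a percentage of the mean actual value in the test set
relative_error = rms / test['Energy Production'].mean() * 100
print('RMSE is {}% of the mean actual value'.format(round(relative_error, 1)))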
