Predicting the future is a tricky task. In many situations, there are too many factors in play to be certain. As a result, in the stock market analysis for instance, the academic convention is that the best predictor of tomorrow is the situation on the market today. Still there is a big room for data-driven forecasting models: using text analysis or including other economic indicators in a multivariate analysis. One of the features, that can help is a seasonal pattern of change. In this example I will apply a seasonal ARIMA model to forecast the production of utilities.
ARIMA stands for AutoRegressive Integrated Moving Average:
Autoregression (AR) shows a changing variable that regresses on its own prior values.
Integrated (I) represents the differencing of raw observations so that data values are replaced by the difference between the data values and the previous values.
Moving average (MA) incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.
The dataset I'll be using here describes the industrial production of electric and gas utilities in the United States between 1985–2018 with monthly frequency being production output. Available here https://fred.stlouisfed.org/series/IPG2211A2N
Let's get started!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pmdarima import auto_arima
import warnings
from pylab import rcParams
import statsmodels.api as sm
df = pd.read_csv('electric_production_dataset.csv',index_col=0)
df.head()
Before we move on, let's rename the energy production column and importantly transform Date column to datetime format:
df.columns = ['Energy Production']
df.index = pd.to_datetime(df.index)
df.index
df.head()
Looking at the data on the graphs, one can see that the data exhibits a strong seasonal pattern which we'll try to grasp with the model, and a positive trend (though it became less prominent in the last decade).
Residuals have increased as the amplitude of the observed values has also increased.
warnings.filterwarnings("ignore")
plt.figure(figsize = (15,5))
plt.plot(df, 'b')
plt.title('Energy Production by Years')
plt.xlabel('Years')
plt.ylabel('Amount')
plt.autoscale(enable=True, axis='x', tight=True)
from pylab import rcParams
import statsmodels.api as sm
rcParams['figure.figsize'] = 15, 8
decomposition = sm.tsa.seasonal_decompose(df, model='additive')
fig = decomposition.plot()
plt.show()
Now that the data has been prepared and explored, let's create a model for the time series.
First, it is essential that we split the data into train and test sets.
Training the model on the data that we want to predict will naturally result in a very good precision, but such an experiment would make no sense: there would be no need to make prediction in the first place if we had the true data before the experiment.
It is also of importance that we do not apply a random split method: random dates would introduce significant bias into the model.
Let's create train and test sets with two slices of dates:
train_percent = 0.9
split_point = round(len(df) * train_percent)
train, test = df[0:split_point], df[split_point:]
In this example, we'll be using Auto ARIMA algorithm (to import it, you may need to install 'pmdarima' library with pip first)
The algorithm does a GRID search to choose the best combination of values for p, d and q hyperparameters. Auto ARIMA supports seasonality and can autotune its hyperparameters as well. That is where it differs from SARIMA.
As a result, we don't need to perform differentiation on the time series and plot autocorrelation and partial autocorrelation functions to find the best p, d, q.
We use only the train subset to fine-tune the model.
from pmdarima import auto_arima
model = auto_arima(train, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
start_P=0, seasonal=True, d=1, D=1, trace=True,
error_action='ignore', suppress_warnings=True,
stepwise=True)
print(model.aic())
model.fit(train)
Now we need to check the size of the test sample to tune 'number of periods' hyperparameter for prediction:
test.size
predict = model.predict (n_periods = 40)
print(predict)
predict = pd.DataFrame(predict, index = test.index, columns=['Prediction'])
The plot with the forecasted and the actual values shows that the model that was trained only on a part of the data did a good job grasping the seasonal pattern and made a prediction of high accuracy. We can also calculate RMSE value for the prediction:
plt.plot(predict, c = 'r', label = 'Predicted Value')
plt.plot(test, c = 'b', label = 'Actual Value')
plt.legend(loc = 'upper left')
plt.show()
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(test, predict))
print('The Root Mean Square Error of the forecasts is {}'.format(round(rms, 2)))
References: