# Bayesian Estimation and Forecasting of Time Series in statsmodels

## Abstract¶

`Statsmodels`

, a Python library for statistical and econometric analysis,
has traditionally focused on frequentist inference, including in its
models for time series data. This paper introduces the powerful features
for Bayesian inference of time series models that exist in `statsmodels`

, with
applications to model fitting, forecasting, time series decomposition,
data simulation, and impulse response functions.

## Introduction¶

`Statsmodels`

Seabold & Perktold, 2010 is a well-established Python
library for statistical and econometric analysis, with support for a wide range
of important model classes, including linear regression, ANOVA, generalized
linear models (GLM), generalized additive models (GAM), mixed effects models,
and time series models, among many others. In most cases, model fitting proceeds
by using frequentist inference, such as maximum likelihood estimation (MLE). In
this paper, we focus on the class of time series models
McKinney *et al.*, 2011, support for which has grown substantially in
`statsmodels`

over the last decade. After introducing several of the most
important new model classes – which are by default fitted using MLE – and
their features – which include forecasting, time series decomposition and
seasonal adjustment, data simulation, and impulse response analysis – we
describe the powerful functions that enable users to apply Bayesian methods to
a wide range of time series models.

Support for Bayesian inference in Python outside of `statsmodels`

has also grown
tremendously, particularly in the realm of probabilistic programming, and
includes powerful libraries such as
`PyMC3`

Salvatier *et al.*, 2016, `PyStan`

Carpenter *et al.*, 2017,
and `TensorFlow Probability`

Dillon *et al.*, 2017. Meanwhile,
`ArviZ`

Kumar *et al.*, 2019 provides many excellent tools for associated
diagnostics and vizualisations. The aim of these libraries is to provide support
for Bayesian analysis of a large class of models, and they make available both
advanced techniques, including auto-tuning algorithms, and flexible model
specification. By contrast, here we focus on simpler techniques. However, while
the libraries above do include some support for time series models, this has not
been their primary focus. As a result, introducing Bayesian inference for the
well-developed stable of time series models in `statsmodels`

, and providing
access to the rich associated feature set already mentioned, presents a
complementary option to these more general-purpose libraries.^{[1]}

## Time series analysis in `statsmodels`

¶

A time series is a sequence of observations ordered in time, and time series data appear commonly in statistics, economics, finance, climate science, control systems, and signal processing, among many other fields. One distinguishing characteristic of many time series is that observations that are close in time tend to be more correlated, a feature known as autocorrelation. While successful analyses of time series data must account for this, statistical models can harness it to decompose a time series into trend, seasonal, and cyclical components, produce forecasts of future data, and study the propagation of shocks over time.

We now briefly review the models for time series data that are available in
`statsmodels`

and describe their features.^{[2]}

### Exponential smoothing models¶

Exponential smoothing models are constructed by combining one or more simple
equations that each describe some aspect of the evolution of univariate
time series data. While originally somewhat *ad hoc*, these models can be
defined in terms of a proper statistical model (for example, see
Hyndman *et al.* (2008)). They have enjoyed
considerable popularity in forecasting (for example, see the implementation in
R described by Hyndman & Athanasopoulos (2018)). A prototypical example that
allows for trending data and a seasonal component – often known as the additive
“Holt-Winters’ method” – can be written as

where $l_t$ is the level of the series, $b_t$ is the trend,
$s_t$ is the seasonal component of period $m$, and
$\alpha, \beta, \gamma$ are parameters of the model. When augmented with
an error term with some given probability distribution (usually Gaussian),
likelihood-based inference can be used to estimate the parameters.
In `statsmodels`

, additive exponential smoothing models
can be constructed using the `statespace.ExponentialSmoothing`

class.^{[3]}
The following code shows how to apply the additive Holt-Winters model above to
model quarterly data on consumer prices:

```
import statsmodels.api as sm
# Load data
mdata = sm.datasets.macrodata.load().data
# Compute annualized consumer price inflation
y = np.log(mdata['cpi']).diff().iloc[1:] * 400
# Construct the Holt-Winters model
model_hw = sm.tsa.statespace.ExponentialSmoothing(
y, trend=True, seasonal=12)
```

### Structural time series models¶

Structural time series models, introduced by Harvey (1990) and also sometimes known as unobserved components models, similarly decompose a univariate time series into trend, seasonal, cyclical, and irregular components:

where $\mu_t$ is the trend, $\gamma_t$ is the seasonal component,
$c_t$ is the cyclical component, and
$\varepsilon_t \sim N(0, \sigma^2)$ is the error term. However, this
equation can be augmented in many ways, for example to include explanatory
variables or an autoregressive component. In addition, there are many possible
specifications for the trend, seasonal, and cyclical components, so that a wide
variety of time series characteristics can be accommodated. In `statsmodels`

,
these models can be constructed from the `UnobservedComponents`

class; a
few examples are given in the following code:

```
# "Local level" model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
# "Local linear trend", with seasonal component
model_arma11 = sm.tsa.UnobservedComponents(
y, 'lltrend', seasonal=4)
```

These models have become popular for time series analysis and forecasting, as
they are flexible and the estimated components are intuitive. Indeed, Google’s
Causal Impact library Brodersen *et al.*, 2015 uses a Bayesian
structural time series approach directly, and Facebook’s Prophet library
Taylor & Letham, 2017 uses a conceptually similar framework and is
estimated using PyStan.

### Autoregressive moving-average models¶

Autoregressive moving-average (ARMA) models, ubiquitous in time series
applications, are well-supported in `statsmodels`

, including their
generalizations, abbreviated as “SARIMAX”, that allow for integrated time series
data, explanatory variables, and seasonal effects.^{[4]} A general version of
this model, excluding integration, can be written as

where $\varepsilon_t \sim N(0, \sigma^2)$. These are constructed in
`statsmodels`

with the `ARIMA`

class; the following code shows how to
construct a variety of autoregressive moving-average models for consumer price
data:

```
# AR(2) model
model_ar2 = sm.tsa.ARIMA(y, order=(2, 0, 0))
# ARMA(1, 1) model with explanatory variable
X = mdata['realint']
model_arma11 = sm.tsa.ARIMA(
y, order=(1, 0, 1), exog=X)
# SARIMAX(p, d, q)x(P, D, Q, s) model
model_sarimax = sm.tsa.ARIMA(
y, order=(p, d, q), seasonal_order=(P, D, Q, s))
```

While this class of models often produces highly competitive forecasts, it does not produce a decomposition of a time series into, for example, trend and seasonal components.

### Vector autoregressive models¶

While the SARIMAX models above handle univariate series, `statsmodels`

also has
support for the multivariate generalization to vector autoregressive (VAR)
models.^{[5]} These models are written

where $y_t$ is now considered as an $m \times 1$ vector. As a result,
the intercept ν is also an $m \times 1$ vector, the
coefficients $\Phi_i$ are each $m \times m$ matrices, and the error
term is $\varepsilon_t \sim N(0_m, \Omega)$, with Ω an
$m \times m$ matrix. These models can be constructed in `statsmodels`

using the `VARMAX`

class, as follows^{[6]}

```
# Multivariate dataset
z = (np.log(mdata['realgdp', 'realcons', 'cpi'])
.diff().iloc[1:])
# VAR(1) model
model_var = sm.tsa.VARMAX(z, order=(1, 0))
```

### Dynamic factor models¶

`statsmodels`

also supports a second model for multivariate time series: the
dynamic factor model (DFM). These models, often used for dimension reduction,
posit a few unobserved factors, with autoregressive dynamics, that are used to
explain the variation in the observed dataset. In `statsmodels`

, there are two
model classes, `DynamicFactor`

and `DynamicFactorMQ`

, that can fit
versions of the DFM. Here we focus on the `DynamicFactor`

class, for which
the model can be written

Here again, the observation is assumed to be $m \times 1$, but the factors are $k \times 1$, where it is possible that $k << m$. As before, we assume conformable coefficient matrices and Gaussian errors.

The following code shows how to construct a DFM in `statsmodels`

```
# DFM with 2 factors that evolve as a VAR(3)
model_dfm = sm.tsa.DynamicFactor(
z, k_factors=2, factor_order=3)
```

### Linear Gaussian state space models¶

In `statsmodels`

, each of the model classes introduced above (
`statespace.ExponentialSmoothing`

, `UnobservedComponents`

,
`ARIMA`

, `VARMAX`

, `DynamicFactor`

, and
`DynamicFactorMQ`

) are implemented as part of a broader class of models,
referred to as linear Gaussian state space models (hereafter for brevity, simply
“state space models” or SSM). This class of models can be written as

where $\alpha_t$ represents an unobserved vector containing the “state” of the dynamic system. In general, the model is multivariate, with $y_t$ and $\varepsilon_t$ $m \times 1$ vector, $\alpha_t$ $k \times 1$, and $\eta_t$ $r \times 1$.

Powerful tools exist for state space models to estimate the
values of the unobserved state vector, compute the value of the likelihood
function for frequentist inference, and perform posterior sampling for Bayesian
inference. These tools include the celebrated Kalman filter and smoother and
a simulation smoother, all of which are important for conducting Bayesian
inference for these models.^{[7]} The implementation in `statsmodels`

largely follows
the treatment in Durbin & Koopman (2012), and is described in more detail in
Fulton (2015).

In addition to these key tools, state space models also admit general implementations of useful features such as forecasting, data simulation, time series decomposition, and impulse response analysis. As a consequence, each of these features extends to each of the time series models described above. Figure 1 presents a diagram showing how to produce these features, and the code below briefly introduces a subset of them.

```
# Construct the Model
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
# Construct a simulation smoother
sim_ll = model_ll.simulation_smoother()
# Parameter values (variance of error and
# variance of level innovation, respectively)
params = [4, 0.75]
# Compute the log-likelihood of these parameters
llf = model_ll.loglike(params)
# `smooth` applies the Kalman filter and smoother
# with a given set of parameters and returns a
# Results object
results_ll = model_ll.smooth(params)
# Produce forecasts for the next 4 periods
fcast = results_ll.forecast(4)
# Produce a draw from the posterior distribution
# of the state vector
sim_ll.simulate()
draw = sim_ll.simulated_state
```

Nearly identical code could be used for any of the model classes introduced above, since they are all implemented as part of the same state space model framework. In the next section, we show how these features can be used to perform Bayesian inference with these models.

## Bayesian inference via Markov chain Monte Carlo¶

We begin by giving a cursory overview of the key elements of Bayesian
inference required for our purposes here.^{[8]} In brief, the Bayesian approach
stems from Bayes’ theorem, in which the posterior distribution for an object of
interest is derived as proportional to the combination of a prior distribution
and the likelihood function

Here, we will be interested in the posterior distribution of the parameters
of our model and of the unobserved states, conditional on the chosen model
specification and the observed time series data. While in most cases the form
of the posterior cannot be derived analytically, simulation-based methods such
as Markov chain Monte Carlo (MCMC) can be used to draw samples that approximate
the posterior distribution nonetheless. While PyMC3, PyStan, and TensorFlow
Probability emphasize Hamiltonian Monte Carlo (HMC) and no-U-turn sampling
(NUTS) MCMC methods, we focus on the simpler random walk Metropolis-Hastings
(MH) and Gibbs sampling (GS) methods. These are standard MCMC methods that
have enjoyed great success in time series applications and which are simple to
implement, given the state space framework already available in `statsmodels`

.
In addition, the ArviZ library is designed to work with MCMC output from any
source, and we can easily adapt it to our use.

With either Metropolis-Hastings or Gibbs sampling, our procedure will produce a sequence of sample values (of parameters and / or the unobserved state vector) that approximate draws from the posterior distribution arbitrarily well, as the number of length of the chain of samples becomes very large.

### Random walk Metropolis-Hastings¶

In random walk Metropolis-Hastings (MH), we begin with an arbitrary point as the initial sample, and then iteratively construct new samples in the chain as follows. At each iteration, (a) construct a proposal by perturbing the previous sample by a Gaussian random variable, and then (b) accept the proposal with some probability. If a proposal is accepted, it becomes the next sample in the chain, while if it is rejected then the previous sample value is carried over. Here, we show how to implement Metropolis-Hastings estimation of the variance parameter in a simple model, which only requires the use of the log-likelihood computation introduced above.

```
import arviz as az
from scipy import stats
# Construct the model
model_rw = sm.tsa.UnobservedComponents(y, 'rwalk')
# Specify the prior distribution. With MH, this
# can be freely chosen by the user
prior = stats.uniform(0.0001, 100)
# Specify the Gaussian perturbation distribution
perturb = stats.norm(scale=0.1)
# Storage
niter = 100000
samples_rw = np.zeros(niter + 1)
# Initialization
samples_rw[0] = y.diff().var()
llf = model_rw.loglike(samples_rw[0])
prior_llf = prior.logpdf(samples_rw[0])
# Iterations
for i in range(1, niter + 1):
# Compute the proposal value
proposal = samples_rw[i - 1] + perturb.rvs()
# Compute the acceptance probability
proposal_llf = model_rw.loglike(proposal)
proposal_prior_llf = prior.logpdf(proposal)
accept_prob = np.exp(
proposal_llf - llf
+ prior_llf - proposal_prior_llf)
# Accept or reject the value
if accept_prob > stats.uniform.rvs():
samples_rw[i] = proposal
llf = proposal_llf
prior_llf = proposal_prior_llf
else:
samples_rw[i] = samples_rw[i - 1]
# Convert for use with ArviZ and plot posterior
samples_rw = az.convert_to_inference_data(
samples_rw)
# Eliminate the first 10000 samples as burn-in;
# thin by factor of 10 to reduce autocorrelation
az.plot_posterior(samples_rw.posterior.sel(
{'draw': np.s_[10000::10]}), kind='bin',
point_estimate='median')
```

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 2.

### Gibbs sampling¶

Gibbs sampling (GS) is a special case of Metropolis-Hastings (MH) that is applicable when it is possible to produce draws directly from the conditional distributions of every variable, even though it is still not possible to derive the general form of the joint posterior. While this approach can be superior to random walk MH when it is applicable, the ability to derive the conditional distributions typically requires the use of a “conjugate” prior – i.e., a prior from some specific family of distributions. For example, above we specified a uniform distribution as the prior when sampling via MH, but that is not possible with Gibbs sampling. Here, we show how to implement Gibbs sampling estimation of the variance parameter, now making use of an inverse Gamma prior, and the simulation smoother introduced above.

```
# Construct the model and simulation smoother
model_ll = sm.tsa.UnobservedComponents(y, 'llevel')
sim_ll = model_ll.simulation_smoother()
# Specify the prior distributions. With GS, we must
# choose an inverse Gamma prior for each variance
priors = [stats.invgamma(0.01, scale=0.01)] * 2
# Storage
niter = 100000
samples_ll = np.zeros((niter + 1, 2))
# Initialization
samples_ll[0] = [y.diff().var(), 1e-5]
# Iterations
for i in range(1, niter + 1):
# (a) Update the model parameters
model_ll.update(samples_ll[i - 1])
# (b) Draw from the conditional posterior of
# the state vector
sim_ll.simulate()
sample_state = sim_ll.simulated_state.T
# (c) Compute / draw from conditional posterior
# of the parameters:
# ...observation error variance
resid = y - sample_state[:, 0]
post_shape = len(resid) / 2 + 0.01
post_scale = np.sum(resid**2) / 2 + 0.01
samples_ll[i, 0] = stats.invgamma(
post_shape, scale=post_scale).rvs()
# ...level error variance
resid = sample_state[1:] - sample_state[:-1]
post_shape = len(resid) / 2 + 0.01
post_scale = np.sum(resid**2) / 2 + 0.01
samples_ll[i, 1] = stats.invgamma(
post_shape, scale=post_scale).rvs()
# Convert for use with ArviZ and plot posterior
samples_ll = az.convert_to_inference_data(
{'parameters': samples_ll[None, ...]},
coords={'parameter': model_ll.param_names},
dims={'parameters': ['parameter']})
az.plot_pair(samples_ll.posterior.sel(
{'draw': np.s_[10000::10]}), kind='hexbin');
```

The approximate posterior distribution, constructed from the sample chain, is shown in Figure 3.

## Illustrative examples¶

For clarity and brevity, the examples in the previous section gave results for simple cases. However, these basic methods carry through to each of the models introduced earlier, including in cases with multivariate data and hundreds of parameters. Moreover, the Metropolis-Hastings approach can be combined with the Gibbs sampling approach, so that if the end user wishes to use Gibbs sampling for some parameters, they are not restricted to choose only conjugate priors for all parameters.

In addition to sampling the posterior distributions of the parameters, this method allows sampling other objects of interest, including forecasts of observed variables, impulse response functions, and the unobserved state vector. This last possibility is especially useful in cases such as the structural time series model, in which the unobserved states correspond to interpretable elements such as the trend and seasonal components. We provide several illustrative examples of the various types of analysis that are possible.

### Forecasting and Time Series Decomposition¶

In our first example, we apply the Gibbs sampling approach to a structural time series model in order to forecast U.S. Industrial Production and to produce a decomposition of the series into level, trend, and seasonal components. The model is

Here, we set the seasonal periodicity to `s=12`

, since Industrial Production is
a monthly variable. We can construct this model in Statsmodels as^{[9]}

```
model = sm.tsa.UnobservedComponents(
y, 'lltrend', seasonal=12)
```

To produce the time-series decomposition into level, trend, and seasonal
components, we will use samples from the posterior of the state vector
$(\mu_t, \beta_t, \gamma_t)$ for each time period $t$. These are
immediately available when using the Gibbs sampling approach; in the earlier
example, the draw at each iteration was assigned to the variable `sample_state`

.
To produce forecasts, we need to draw from the posterior predictive
distribution for horizons $h = 1, 2, \dots H$. This can
be easily accomplished by using the `simulate`

method introduced earlier. To be
concrete, we can accomplish these tasks by modifying section (b) of our Gibbs
sampler iterations as follows:

```
# (b') Draw from the conditional posterior of
# the state vector
model.update(params[i - 1])
sim.simulate()
# save the draw for use later in time series
# decomposition
states[i] = sim.simulated_state.T
# Draw from the posterior predictive distribution
# using the `simulate` method
n_fcast = 48
fcast[i] = model.simulate(
params[i - 1], n_fcast,
initial_state=states[i, -1]).to_frame()
```

These forecasts and the decomposition into level, trend, and seasonal components
are summarized in Figure 4 and Figure 5,
which show the median values along with 80% credible intervals. Notably, the
intervals shown incorporate for both the uncertainty arising from the stochastic
terms in the model as well as the need to estimate the models’ parameters.^{[10]}

### Casual impacts¶

A closely related procedure described in Brodersen *et al.* (2015) uses
a Bayesian structural time series model to estimate the “causal impact” of
some event on some observed variable. This approach stops estimation of the
model just before the date of an event and produces a forecast by drawing from
the posterior predictive density, using the procedure described just above. It
then uses the difference between the actual path of the data and the forecast to
estimate impact of the event.

An example of this approach is shown in Figure 6, in which we
use this method to illustrate the effect of the COVID-19 pandemic on U.S. Sales
in Manufacturing and Trade Industries.^{[11]}

## Extensions¶

There are many extensions to the time series models presented here that are
made possible when using Bayesian inference. First, it is easy to create custom
state space models within the `statsmodels`

framework. As one example, the
`statsmodels`

documentation describes how to create a model that extends the
typical VAR described above with time-varying parameters.^{[12]} These custom
state space models automatically inherit all the functionality described above,
so that Bayesian inference can be conducted in exactly the same way.

Second, because the general state space model available in `statsmodels`

and
introduced above allows for time-varying system matrices, it is possible using
Gibbs sampling methods to introduce support for automatic outlier handling,
stochastic volatility, and regime switching models, even though these are
largely infeasible in `statsmodels`

when using frequentist methods such as
maximum likelihood estimation.^{[13]}

## Conclusion¶

This paper introduces the suite of time series models available in `statsmodels`

and shows how Bayesian inference using Markov chain Monte Carlo methods can be
applied to estimate their parameters and produce analyses of interest, including
time series decompositions and forecasts.

- Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and Statistical Modeling with Python. In S. van der Walt & J. Millman (Eds.),
*Proceedings of the 9th Python in Science Conference*(pp. 92–96). 10.25080/Majora-92bf1922-011 - McKinney, W., Perktold, J., & Seabold, S. (2011). Time Series Analysis in Python with statsmodels. In S. van der Walt & J. Millman (Eds.),
*Proceedings of the 10th Python in Science Conference*(pp. 107–113). 10.25080/Majora-ebaa42b7-012 - Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3.
*PeerJ Computer Science*,*2*, e55. 10.7717/peerj-cs.55 - Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan : A Probabilistic Programming Language.
*Journal of Statistical Software*,*76*(1). 10.18637/jss.v076.i01 - Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., & Saurous, R. A. (2017).
*TensorFlow Distributions*(Techreport arXiv:1711.10604). arXiv. 10.48550/arXiv.1711.10604