Abstract

TensorFlow Probability is a powerful library for statistical analysis in Python. Using TensorFlow Probability’s implementation of Bayesian methods, modelers can incorporate prior information and obtain parameter estimates and a quantified degree of belief in the results. Resampling methods like Markov Chain Monte Carlo can also be used to perform Bayesian analysis. As an alternative, we show how to use numerical optimization to estimate model parameters, and then show how numerical differentiation can be used to get a quantified degree of belief. How to perform simulation in Python to corroborate our results is also demonstrated.

Keywords:Bayesian statisticsresamplingmaximum likelihoodnumerical differentiation

Introduction

Some machine learning algorithms output only a single number or decision. It can be useful to have a measure of confidence in the output of the algorithm, a quantified degree of belief. Bayesian statistical methods can be used to provide both estimates and confidence for users.

A model with parameters θ\boldsymbol{\theta} governs the process we are investigating. We begin with a prior belief about the probability distribution of θ\boldsymbol{\theta}, the density π(θ)\pi(\boldsymbol{\theta}).

Then the data we observed gives us a refined belief about the distribution θ\boldsymbol{\theta}. We obtain the posterior density π(θx)\displaystyle \pi(\boldsymbol{\theta}\vert {\bf x}).

We can estimate values of θ\boldsymbol{\theta} with the posterior mode of π(θx)\displaystyle \pi(\boldsymbol{\theta}\vert {\bf x}), θ^\widehat{\boldsymbol \theta}.

Then we can estimate the posterior variance of θ\boldsymbol{\theta}, and with some knowledge of π(θx)\displaystyle \pi(\boldsymbol{\theta}\vert {\bf x}) obtain confidence in our estimate θ^\widehat{\boldsymbol \theta}.

Normal Approximation to the Posterior

We will use numerical optimization to obtain the posterior mode θ^\widehat{\boldsymbol \theta}, maximizing the posterior π(θx)\displaystyle \pi(\boldsymbol{\theta}\vert {\bf x}).

The posterior is proportional (where the scaling does not depend on θ\boldsymbol{\theta}) to the prior and likelihood (or density of the data).

π(θx)L(θx)π(θ)\pi(\boldsymbol{\theta}\vert {\bf x}) \propto L(\boldsymbol{\theta}\vert {\bf x}) \pi(\boldsymbol{\theta})

As in maximum likelihood, we directly maximize the log-posterior, logπ(θx)\log \pi(\boldsymbol{\theta}\vert {\bf x}) because it is more numerically stable.

Now, as described in section 4.1 of Gelman et al. (2013), we can approximate lnπ(θx)\ln \pi(\boldsymbol{\theta}\vert {\bf x}) using a second order Taylor Expansion around θ^\widehat{\boldsymbol \theta}.

logπ(θx)logπ(θ^x)+(θθ^)TS(θ)θ=θ^+12(θθ^)TH(θ^)(θθ^)\begin{align*} \log \pi(\boldsymbol{\theta}\vert {\bf x}) & \approx & \log \pi(\widehat{\boldsymbol \theta}\vert{\bf x}) + (\boldsymbol{\theta} - \widehat{\boldsymbol \theta} )^TS({\boldsymbol \theta})\vert_{{\boldsymbol \theta}={\widehat{\boldsymbol \theta}}} \\ & & + \frac{1}{2}(\boldsymbol{\theta} - \widehat{\boldsymbol \theta} )^T H(\widehat{\boldsymbol \theta}) (\boldsymbol{\theta} - \widehat{\boldsymbol \theta} ) \end{align*}

Where S(θ)S(\boldsymbol{\theta}) is the score function

S(θ)=δδθlogπ(θx)S(\boldsymbol{\theta}) = \frac{\delta}{\delta \boldsymbol{\theta}} \log \pi(\boldsymbol{\theta}\vert {\bf x})

and H(θ)H(\boldsymbol{\theta}) is the Hessian function.

H(θ)=δδθTS(θ)H(\boldsymbol{\theta}) = \frac{\delta}{\delta \boldsymbol{\theta}^T} S(\boldsymbol{\theta})

We assume that θ^\widehat{\boldsymbol \theta} is in the interior of the parameter space (or support) of θ\boldsymbol{\theta}. Also, π(θx)\pi(\boldsymbol{\theta}\vert {\bf x}) is a continuous function of θ\boldsymbol{\theta}.

Finally the Hessian matrix, H(θ)H(\boldsymbol{\theta}) is negative definite, so H(θ)-H(\boldsymbol{\theta}) is positive definite. This means that we can invert H(θ)-H(\boldsymbol{\theta}) and get a matrix that is a valid covariance.

With these assumptions, as the sample size nn\to\infty the quadratic approximation for logπ(θx)\log \pi(\boldsymbol{\theta}\vert {\bf x}) becomes more accurate. At the posterior mode θ=θ^{\boldsymbol \theta}={\widehat{\boldsymbol \theta}}, logπ(θx)\log \pi(\boldsymbol{\theta}\vert {\bf x}) is maximized and 0=S(θ)θ=θ^0=S({\boldsymbol \theta})\vert_{{\boldsymbol \theta}={\widehat{\boldsymbol \theta}}}.

Given this, we can exponentiate the approximation to get

π(θx)π(θ^x)exp(12(θθ^)TH(θ^)(θθ^))\pi(\boldsymbol{\theta}\vert {\bf x}) \approx \pi(\widehat{\boldsymbol \theta}\vert{\bf x}) \exp(\frac{1}{2} (\boldsymbol{\theta} - \widehat{\boldsymbol \theta} )^T H(\widehat{\boldsymbol \theta}) (\boldsymbol{\theta} - \widehat{\boldsymbol \theta} ))

So for large nn, the posterior distribution of θ{\boldsymbol \theta} is approximately proportional to a multivariate normal density with mean θ^\widehat{\boldsymbol{\theta}} and covariance H(θ^)1-H(\widehat{\boldsymbol{\theta}})^{-1}.

θxDN(θ^,H(θ^)1){\boldsymbol \theta} \vert x \approx_D N(\widehat{\boldsymbol{\theta}}, -H(\widehat{\boldsymbol{\theta}})^{-1})

Another caveat for this result is that the prior should be proper, or at least lead to a proper posterior. By proper we mean that the function corresponds to a probability density function. Our asymptotic results are depending on probabilities integrating to 1.

We could get a quantified degree of beief by using resampling methods like Markov chain Monte Carlo (MCMC) Gelman et al. (2013) directly. We would have to use fewer assumptions. However, resampling can be computationally intensive.

Parameter Constraints and Transformations

Optimization can be easier if the parameters are defined over the entire real line. Parameters that do not follow this rule are plentiful. Variances are only positive. Probabilities are in [0,1].

We can perform optimization over the real line by creating unconstrained parameters from the original parameters of interest. These are continuous functions of the constrained parameters, which may be defined on intervals of the real line.

For example, the unconstrained version of a standard deviation parameter σ is ψ=logσ\psi= \log \sigma. The parameter ψ is defined over the entire real line.

It will be useful for us to consider the constrained parameters as being functions of the unconstrained parameters. So σ=exp(ψ)\sigma=exp(\psi) is our constrained parameter of ψ.

So the posterior mode of the constrained parameters θc{\boldsymbol{\theta_c}} is θ^c=g(θ^)\widehat{\boldsymbol \theta}_{\boldsymbol c} = g(\widehat{\boldsymbol \theta}). We will call gg the constraint function.

Then we can use the delta method Oehlert (1992) on gg to get the posterior distribution of the constrained parameters.

A first-order Taylor approximation of g(θ)g({\boldsymbol \theta}) at θ^\widehat{\boldsymbol \theta} yields

g(θ)g(θ^)+{δδθ^g(θ^)}(θθ^)g({\boldsymbol \theta}) \approx g(\widehat{\boldsymbol \theta}) + \left\{\frac{\delta}{\delta \widehat{\boldsymbol{\theta}}} g(\widehat{\boldsymbol{\theta}})\right\} ({\boldsymbol \theta} - \widehat{\boldsymbol{\theta}})

Remembering that the posterior of θ\boldsymbol \theta is approximately normal, the rules about linear transformations for multivariate normal random vectors tell us that

θcx=g(θ)xDN[g(θ^),{δδθ^g(θ^)}T{H(θ^)1}{δδθ^g(θ^)}]\begin{align*} & & {\boldsymbol{\theta_c}}\vert x = g({\boldsymbol \theta}) \vert x \approx_D \\ & & N \left\lbrack g(\widehat{\boldsymbol{\theta}}), \left\{\frac{\delta}{\delta \widehat{\boldsymbol{\theta}}} g(\widehat{\boldsymbol{\theta}})\right\}^T \left\{-H(\widehat{\boldsymbol{\theta}})^{-1}\right\} \left\{\frac{\delta}{\delta \widehat{\boldsymbol{\theta}}} g(\widehat{\boldsymbol{\theta}})\right\}\right\rbrack \end{align*}

We could use Numpy’s matmul function to multiply the component matrices together. The inv function in the lingalg library could be used to invert the Hessian. So referring the to the gradient of gg as dg, the following python code could be used to compute the constrained covariance.

np.matmul(
    np.matmul(dg,
              np.linalg.inv(hessian)),
    np.transpose(dg))

This involved a first-order approximation of gg. Earlier we used a second order approximation for taking the numeric derivative. Why would we just do a first-order here? Traditionally the delta-method is taught and used as only a first-order method. Usually the functions used in the delta method are not incredibly complex. It is good enough to to use the first-order approximation.

Hessian and Delta Approximation

To be able to use the normal approximation, we need θ^\widehat{\boldsymbol{\theta}}, H(θ^)1H(\widehat{\boldsymbol{\theta}})^{-1}, and δδθ^g(θ^)\frac{\delta}{\delta \widehat{\boldsymbol{\theta}}} g(\widehat{\boldsymbol{\theta}}). As mentioned before, we use numerical optimization to get θ^\widehat{\boldsymbol{\theta}}. Ideally, we would have analytic expressions for HH and the derivatives of gg.

This can be accomplished with automatic differentiation Baydin et al. (2018), which will calculate the derivatives analytically. We can also perform numerical differentiation to get the Hessian and the gradient of the constraint function gg. This will be less accurate than an analytic expression, but may be less computationally intensive in large models.

But once you learn how to take one numeric derivative, you can take the numeric derivative of anything. So using numerical differentiation is a very flexible technique that we can easily apply to all the models we would use.

Numerical Differentiation

So numeric derivatives can be very pragmatic, and flexible. How do you compute them? Are they accurate? We use section 5.7 of Press & Teukolsky (2007) as a guide.

The derivative of the function ff with respect to xx is

f(x)=limh0f(x+h)f(x)hf'(x) = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}

To approximate f(x)f'(x) numerically, couldn’t we just plugin a small value for hh and compute the scaled difference? Yes. And that is basically what happens. We do a little more work to choose hh and use a second-order approximation instead of a first-order.

We can see that the scaled difference is a first-order approximation by looking at the Taylor series expansion around xx.

Taylor’s theorem with remainder gives

f(x+h)=f(x)+((x+h)x)f(x)+.5((x+h)x)2f(ϵ)=f(x)+hf(x)+.5h2f(ϵ)\begin{align*} f(x+h) &=& f(x) + ((x+h)-x)f'(x) + .5((x+h)-x)^2 f''(\epsilon) \\ &=& f(x) + -h f'(x) + .5 h^2 f''(\epsilon) \\ \end{align*}

where ε is between xx and x+hx+h.

Now we can rearrange to get

f(x+h)f(x)hf(x)=.5hf(ϵ)\frac{f(x+h)-f(x)}{h} - f'(x) = .5 h f''(\epsilon)

The right hand side is the truncation error, ϵt\epsilon_t since it’s linear in hh, the bandwidth we call the this approximation a first order method.

We can do second-order approximations for f(x+h)f(x+h) and f(xh)f(x-h) and get a more accurate second order method of approximation for f(x)f'(x).

f(x+h)=f(x)+((x+h)x)f(x)+((x+h)x)2f(x)2!+((x+h)x)3f(ϵ1)3!f(xh)=f(x)+((xh)x)f(x)+((xh)x)2f(x)2!+((xh)x)3f(ϵ2)3!\begin{align*} f(x+h) &=& f(x) + ((x+h)-x)f'(x) \\ & & + \frac{((x+h)-x)^2 f''(x)}{2!} + \frac{((x+h)-x)^3f'''(\epsilon_1)}{3!} \\ f(x-h) &=& f(x) + ((x-h)-x)f'(x) \\ & & + \frac{((x-h)-x)^2 f''(x)}{2!} + \frac{((x-h)-x)^3f'''(\epsilon_2)}{3!} \end{align*}

were ϵ1\epsilon_1 is between xx and x+hx+h and ϵ2\epsilon_2 is between xhx-h and xx.

Then we have

f(x+h)f(xh)2hf(x)=h2f(ϵ1)+f(ϵ2)12\frac{f(x+h) - f(x-h)}{2h} - f'(x) = h^2 \frac{f'''(\epsilon_1)+ f'''(\epsilon_2)}{12}

This is quadratic in hh. The first term takes equal input from both sides of xx, so we call it a centered derivative.

So we choose a small value of hh and plug it into f(x+h)f(xh)2h\frac{f(x+h) - f(x-h)}{2h} to approximate f(x)f'(x).

Our derivation used a single input function ff. The idea applies to partial derivatives of multi-input functions as well. The inputs that you aren’t taking the derivative with respect to are treated as fixed parts of the function.

Choosing a Bandwidth

In practice, second order approximation actually involves two sources of error. Roundoff error, ϵr\epsilon_r arises from being unable to represent xx and hh or functions of them with exact binary represetation.

ϵrϵff(x)h\epsilon_r \approx \epsilon_f\frac{\mid{f(x)}\mid}{h}

where ϵf\epsilon_f is the fractional accuracy with which ff is computed. This is generally machine accuracy. If we are using NumPy Harris et al. (2020) this would be

ϵf=np.finfo(float).eps\epsilon_f = \mbox{np.finfo(float).eps}

Minimizing the roundoff error and truncation error, we obtain

hϵf1/3(ff)1/3h \sim \epsilon_f^{1/3} \left(\frac{f}{f'''}\right)^{1/3}

where (f/f)1/3\left(f / f'''\right)^{1/3} is shorthand for the ratio of f(x)f(x) and the sum of f(ϵ1)+f(ϵ2)f'''(\epsilon_1)+ f'''(\epsilon_2).

We use shorthand here because because we are not going to approximate ff''' (we are already approximating ff'), so there is no point in writing it out.

Call this shorthand

(ff)1/3=xc\left(\frac{f}{f'''}\right)^{1/3}=x_c

the curvature scale, or characteristic scale of the function ff.

There are several algorithms for choosing an optimal scale. The better the scale chosen, the more accurate the approximation is. A good rule of thumb, which is computationally quick, is to just use the absolute value of xx.

xc=xx_c = \mid{x}\mid

Then we would use

h=ϵf1/3xh = \epsilon_f^{1/3} \mid{x}\mid

But what if xx is 0? This is simple to handle, we just add ϵf1/3\epsilon_f^{1/3} to xc=xx_c = \mid x \mid

h=ϵf1/3(x+ϵf1/3)h = \epsilon_f^{1/3} ( \mid{x}\mid + \epsilon_f^{1/3})

Now, Press & Teukolsky (2007) also suggests performing a final sequence of assignment operations that ensures xx and x+hx+h differ by an exactly representable number. You assign x+hx+h to a temporary variable temptemp. Then hh is assigned the value of temphtemp-h.

In Python, the code would simply be

temp = x + h
h = temp - x

Estimating Confidence Intervals after Optimization

With the posterior mode, variance, and normal approximation to the posterior. It is simple to create confidence (credible) intervals for the parameters.

Let’s talk a little bit about what these intervals are. For the parameter γ we want a (1α)(1-\alpha) interval (u,l)(u,l) (defined on the observed data generated by a realization of γ) to be defined such that

Pr(γ(u,l))=1α\mbox{Pr}(\gamma \in (u,l)) = 1-\alpha

The frequentist confidence interval does not meet this criteria. γ is just one fixed value, so it is either in the interval, or it isn’t! The probability is 0 or 1. A credible interval (Bayesian confidence interval) can meet this criteria.

Suppose that we are able to use the normal approximation for γx\gamma \vert \bf{x}

γxDN(γ^,σ^γ2)\gamma\vert{\bf{x}} \approx_D N(\hat\gamma,\hat\sigma_\gamma^2)

Then we have

1α=Pr(lγux)=Pr(lγ^γγ^uγ^x)=Pr(lγ^σ^γγγ^σ^γuγ^σ^γx)\begin{align*} 1-\alpha &=& \mbox{Pr}(l \leq \gamma \leq u \vert{\bf{x}} ) \\ &=& \mbox{Pr}(l - \hat{\gamma} \leq \gamma - \hat{\gamma} \leq u - \hat{\gamma}\vert{\bf{x}} ) \\ &=& \mbox{Pr}\left(\frac{l - \hat{\gamma}}{{\hat\sigma_\gamma}} \leq \frac{\gamma - \hat{\gamma}}{{\hat\sigma_\gamma}} \leq \frac{u - \hat{\gamma}}{{\hat\sigma_\gamma}}\vert{\bf{x}} \right) \end{align*}

Now, (γγ^)/σ^γ2(\gamma - \hat{\gamma}) / \hat\sigma_\gamma^2 is N(0,1)N(0,1), standard normal. So we can use the standard normal quantiles in solving for ll and uu.

The upper α/2\alpha/ 2 quantile of the standard normal distribution, zα/2z_{\alpha/ 2} satisfies

Pr(Zzα/2)=α/2\mbox{Pr}(Z \geq z_{\alpha/ 2}) = \alpha / 2

for standard normal ZZ.

Noting that the standard normal is symmetric, if we can find ll and uu to satisfy

lγ^σ^γ=zα/2uγ^σ^γ=zα/2\begin{align*} \frac{l - \hat{\gamma}}{\hat\sigma_\gamma} &=& - z_{\alpha/ 2} \\ \frac{u - \hat{\gamma}}{\hat\sigma_\gamma} &=& z_{\alpha/ 2} \end{align*}

then we have a valid Bayesian confidence interval.

Simple calculation shows that the solutions are

l=zα/2σ^γ+γ^u=zα/2σ^γ+γ^\begin{align*} l &=& -z_{\alpha/2}\hat\sigma_\gamma + \hat{\gamma} \\ u &=& z_{\alpha/2}\hat\sigma_\gamma + \hat{\gamma} \end{align*}

The zα/2z_{\alpha/ 2} quantile can be easily generated using scipy.stats from SciPy Virtanen et al. (2020). We would use the norm.ppf function.

In Python, we would have

z_alpha_2 = scipy.stats.norm.ppf(1-alpha/2)
l = -z_alpha_2*se_gamma_hat + gamma_hat
u = z_alpha_2*nsd_gamma_hat + gamma_hat

We can also adjust the intervals for inference on many parameters by using Bonferroni correction Bonferroni (1936).

Now we know how to estimate the posterior mode. We also know how to estimate the posterior variance after computing the posterior mode. And we have seen how confidence intervals are made based on this posterior variance, mode, and the normal approximation to the posterior. Let’s discuss some tools that will enable us to perform these operations.

TensorFlow Probability

Now we will introduce TensorFlow Probability, a Python library that we can use to perform the methods we have been discussing. TensorFlow Probability is library built using TensorFlow, a leading software library for machine learning and artificial intelligence Abadi et al. (2015).

TensorFlow Probability is a probabilistic programming language. This lets us build powerful models in a modular way and estimate them automatically. At the heart of TensorFlow Probability is the Distribution class. In theory, a probability distribution is the set of rules that govern the likelihood of how a random variable (vector, or even general tensor) takes its values.

In TensorFlow Probability, distribution rules for scalars and vectors are parametrized, and these are expanded for higher dimensions as independent samples. A distribution object corresponds to a random variable or vector. The parts of a Bayesian model can be represented using different distribution objects for the parameters and observed data.

Example Distribution

As an example, let’s examine a linear regression with a χ2\chi^2 prior for the intercept a and a normal prior for the slope β. Our observed outcome variable is yy with a normal distribution and the predictor is xx.

yiNormal(xiβ+α,1)y_i \sim \mbox{Normal}(x_i\beta + \alpha, 1)

We can store the distribution objects in a dictionary for clear organization. The prior distribution of β is Normal with mean 1 and variance 1, N(1,1)N(1,1). We use the Normal distribution subclass to encode its information in our dictionary.

tfd = tfp.distributions
dist_dict = {}
dist_dict['beta'] = tfd.Normal(1,1)

The β parameter can range over the real line, but the intercept, α should be nonnegative. The Chi2 distribution subclass has support on only the nonegative reals. However, if we are performing optimization on the α parameter, we may take a step where it became negative. We can avoid any complications like this if we use a TransformedDistribution. Transformed distributions can be used together with a Bijector object that represents the transforming function.

For α, we will model an unconstrained parameter, αu=logα\alpha^u = \log \alpha. The natural logarithm can take values over the real line.

tfb = tfp.bijectors
dist_dict['unconstrained_alpha'] = \
tfd.TransformedDistribution(tfd.Chi2(4),tfb.Log())

We can use the sample method on the distribution objects we created to see random realizations. Before we do that we should set the seed, so that we can replicate our work.

tf.random.set_seed(132)
sample_ex=dist_dict['unconstrained_alpha'].sample(10)
sample_ex
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([ 2.050956  , 0.56120026,  1.8559402,
       -0.05669071, ... ], dtype=float32)>

We see that the results are stored in a tf.Tensor object. This has an easy interface with NumPy, as you can see by the numpy component. We see that the unconstrained α, αu\alpha^u takes positive and negative values.

We can evaluate the density, or it’s natural logarithm using class methods as well. Here is the log density for the sample we just drew.

dist_dict['unconstrained_alpha'].log_prob(sample_ex)
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([-1.1720479 , -1.1402813 , -0.8732692 ,
       -1.9721189 , ...], dtype=float32)>

Now we can get α from αu\alpha^u by using a callable and the Deterministic distribution.

dist_dict['alpha'] = \
    lambda unconstrained_alpha: \
        tfd.Deterministic(\
            loc= tfb.Log().inverse( \
                unconstrained_alpha))

Now we’ve added all of the parameters to dist_dict. We just need to handle the observed variables yy and xx. In this example xx is exogenous, which means it can be treated as fixed and nonrandom in estimating α and β in the model for yy. yy is endogenous, which means it is a response variable in the model, the outcome we are trying to estimate.

We will define xx separately from our dictionary of distributions. For the example we have to generate values of xx, but once this is done we will treat it as fixed and exogenous

The observed variable xx will have a standard normal distribution. We will start by giving it a sample size of 100.

n = 100
x_dist = tfd.Normal(tf.zeros(n),1)
x = x_dist.sample()

The distribution of yy, which would give us the likelihood, can be formulated using a callable function of the parameters and the fixed value of xx we just obtained.

dist_dict['y'] = \
    lambda alpha, beta: \
        tfd.Normal(loc = alpha + beta*x,scale=1)

With a dictionary of distributions and callables indicating their dependencies, we can work with the joint density. This will correspond to the posterior distribution of the model, augmenting the priors with the likelihood.

The JointDistributionNamed class takes a dictionary as input and behaves similarly to a regular univariate distribution object. We can take samples, which are returned as dictionaries keyed by the parameter and observed variable names. We can also compute log probabilities, which gives us the posterior density.

posterior = tfd.JointDistributionNamed(dist_dict)

Now we have a feel for how TensorFlow Probability can be used to store a Bayesian model. We have what we need to start performing optimization and variance estimation.

Maximum A Posteriori (MAP) with SciPy

We can use SciPy’s implementation of the Limited memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) Fletcher (1987) algorithm to estimate the posterior mode. This is a Quasi-Newton optimization method that does not need to store the entire Hessian matrix during the algorithm, so it can be very fast. If the Hessian was fully stored we could just use it directly in variance estimation, but it would be slower. We do to take advantage of automatic differentiation to calculate the score function, the first derivative of the posterior. TensorFlow Probability provides this through the value_and_gradient function of its math library.

We will use minimize from the optimize SciPy library, which operates on a loss function that takes a vector of parameters as input. We will optimize on unconstrained_alpha and beta, the unconstrained space parameters of the model. In the joint distribution representation, they are separate tensors. But in optimization, we will need to evaluate a single tensor.

We will use the first utility function from the bayes_mapvar library, which will be available with this paper, to accomplish this. The par_vec_from_dict function unpacks the tensors in a dictionary into a vector.

Within our loss function, we must move the vector of input parameters back to a dictionary of tensors to be evaluated by TensorFlow probability. The par_dict_from_vec function moves the unconstrained parameters back into a dictionary, and the constrained parameters are generated by the get_constrained function. Then the posterior density is evaluated by augmenting this dictionary of constrained parameters with the observed endogenoous variables. The get_constrained function is also used to get the final posterior model estimates from the SciPy optimization.

Variance Estimation with SciPy

Once the posterior mode is estimated we can estimate the variance. The first step is calculating the bandwidths. The get_bandwidths function handles this.

def get_bandwidths(unconstrained_par_vec):
    abspars = abs(unconstrained_par_vec)
    epsdouble = np.finfo(float).eps
    epsdouble = epsdouble**(1 / 3)
    scale = epsdouble * (abspars + epsdouble)
    scaleparmstable = scale + abspars
    return scaleparmstable - abspars

With the bandwidths calculated, we step through the parameters and create the Hessian and Delta matrices that we need for variance estimation. The get_hessian_delta_variance function use numeric differentiation to calculate the Hessian, based on numeric derivatives of the automatic derivatives computed by TensorFlow probability for the score function. The Delta matrix is calculated using numeric differentiation of the constrained parameter functions.

Simulation

We evaluated our methodology with a simulation based on the α and β parameter setting discussed earlier. This was an investigation into how well we estimated the posterior mode, variance, and distribution using the methods of TensorFlow Probability, SciPy, and bayes_mapvar.

To evaluate the posterior distributions of the parameters we used the MCMC capabilities of TensorFlow Probability. Particularly the the No-U-Turn Sampler Hoffman & Gelman (2011). We were careful to thin the sample based on effective sample size so that autocorrelation would not be a problem. This was accomplished using TensorFlow Probability’s effective_sample_size function from its mcmc library.

We drew npre=1000n_{pre} = 1000 observations from the unconstrained prior parameter distribution for αi\alpha_i and βi\beta_i. For each of these prior draws, we drew a posterior sample of yi{\bf y}_i and xi{\bf x}_i. yi{\bf y}_i and xi{\bf x}_i were npost=600n_{post} = 600 samples based on each αi\alpha_i and βi\beta_i. The posterior mode and variance were estimated, and nMCMC=500n_{MCMC}=500 posterior draws from MCMC were made. The mean was used in the MCMC draws since it sould coincide with the mode if our assumptions are correct.

To check the distributional results, we used the Anderson-Darling test Stephens (1974). This is given by anderson in scipy.stats. We stored a record of whether the test rejects normality at the .05 significance level for each of the npren_{pre} draws. This test actually checks the mean and variance assumptions as well, since it compares to a standard normal and we are standardizing based on the MAP and get_hessian_delta_variance estimates.

Table 1:Simulation Results, npre=1000n_{pre} = 1000, npost=600n_{post} = 600, nMCMC=500n_{MCMC}=500.

StatisticMeanS.D.
αMAP\alpha_{MAP} mean4.1412.475
αMCMC\alpha_{MCMC} mean3.9892.765
αMAP\alpha_{MAP} S.E.0.0370.004
αMCMC\alpha_{MCMC} S.E.0.0410.001
α A.D. Reject0.0420.201
βMAP\beta_{MAP} mode1.0130.504
βMCMC\beta_{MCMC} mean1.0221.003
βMAP\beta_{MAP} S.E.0.0290.001
βMCMC\beta_{MCMC} S.E.0.0410.002
β A.D. Reject0.0450.207

The results of the simulation are shown in Table 1. We use Standard Error (S.E.) to refer to the 1000 estimates of posterior standard deviations from get_hessian_delta_variance and the MCMC sample standard deviations. The Standard Deviation (S.D.) column represents the statistics calculated over the 1000 estimates. The standard errors are not far from each other, and neither are the modes and means. The rejection rates for the Anderson Darling test are not far from .05 either.

We can perform a hypothesis test of whether the rejection rate is .05 by checking whether .05 is in the confidence interval for the proportion. We will use the proportion_confint function from statsmodels Seabold & Perktold (2010). In Table 2, we see that .05 is comfortably within intervals for both parameters. Our simulation successfully corroborated our assumptions about the model and the consistency of our method for estimating the posterior mode, variance, and distribution.

Table 2:A.D. Confidence Intervals, npre=1000n_{pre} = 1000, npost=600n_{post} = 600, nMCMC=500n_{MCMC}=500.

StatisticsLowerUpper
α AD Reject0.0300.056
β A.D. Reject0.0330.060

Conclusion

We have explored how Bayesian analysis can be performed without resampling and still obtain full inference. With adequate amounts of the data, the posterior mode can be estimated with numeric optimization and the posterior variance can be estimated with numeric or automatic differentiation. The asymptotic normality of the posterior distribution enables simple calculation of posterior probabilities and confidence (credible) intervals as well.

Bayesian methods let us use data from past experience, subject matter expertise, and different levels of certainty to solve data sparsity problems and provide a probabilistic basis for inference. Retail Price Optimization benefits from historical data and different granularities of information. Other fields may also take advantage of access to large amounts of data and be able to use these approximation techniques. These techniques and the tools implementing them can be used by practitioners to make their analysis more efficient and less intimidating.

References
  1. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Taylor & Francis. https://books.google.com/books?id=ZXL6AQAAQBAJ
  2. Oehlert, G. W. (1992). A Note on the Delta Method. The American Statistician, 46(1), 27–29. 10.1080/00031305.1992.10475842
  3. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic differentiation in machine learning: a survey.
  4. Press, W. H., & Teukolsky, S. A. (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press. https://books.google.com/books?id=1aAOdzK3FegC
  5. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. 10.1038/s41586-020-2649-2