Bayesian Statistics with Python, No Resampling Necessary
Abstract¶
TensorFlow Probability is a powerful library for statistical analysis in Python. Using TensorFlow Probability’s implementation of Bayesian methods, modelers can incorporate prior information and obtain parameter estimates and a quantified degree of belief in the results. Resampling methods like Markov Chain Monte Carlo can also be used to perform Bayesian analysis. As an alternative, we show how to use numerical optimization to estimate model parameters, and then show how numerical differentiation can be used to get a quantified degree of belief. How to perform simulation in Python to corroborate our results is also demonstrated.
Introduction¶
Some machine learning algorithms output only a single number or decision. It can be useful to have a measure of confidence in the output of the algorithm, a quantified degree of belief. Bayesian statistical methods can be used to provide both estimates and confidence for users.
A model with parameters governs the process we are investigating. We begin with a prior belief about the probability distribution of , the density .
Then the data we observed gives us a refined belief about the distribution . We obtain the posterior density .
We can estimate values of with the posterior mode of , .
Then we can estimate the posterior variance of , and with some knowledge of obtain confidence in our estimate .
Normal Approximation to the Posterior¶
We will use numerical optimization to obtain the posterior mode , maximizing the posterior .
The posterior is proportional (where the scaling does not depend on ) to the prior and likelihood (or density of the data).
As in maximum likelihood, we directly maximize the log-posterior, because it is more numerically stable.
Now, as described in section 4.1 of Gelman et al. (2013), we can approximate using a second order Taylor Expansion around .
Where is the score function
and is the Hessian function.
We assume that is in the interior of the parameter space (or support) of . Also, is a continuous function of .
Finally the Hessian matrix, is negative definite, so is positive definite. This means that we can invert and get a matrix that is a valid covariance.
With these assumptions, as the sample size the quadratic approximation for becomes more accurate. At the posterior mode , is maximized and .
Given this, we can exponentiate the approximation to get
So for large , the posterior distribution of is approximately proportional to a multivariate normal density with mean and covariance .
Another caveat for this result is that the prior should be proper, or at least lead to a proper posterior. By proper we mean that the function corresponds to a probability density function. Our asymptotic results are depending on probabilities integrating to 1.
We could get a quantified degree of beief by using resampling methods like Markov chain Monte Carlo (MCMC) Gelman et al. (2013) directly. We would have to use fewer assumptions. However, resampling can be computationally intensive.
Parameter Constraints and Transformations¶
Optimization can be easier if the parameters are defined over the entire real line. Parameters that do not follow this rule are plentiful. Variances are only positive. Probabilities are in [0,1].
We can perform optimization over the real line by creating unconstrained parameters from the original parameters of interest. These are continuous functions of the constrained parameters, which may be defined on intervals of the real line.
For example, the unconstrained version of a standard deviation parameter σ is . The parameter ψ is defined over the entire real line.
It will be useful for us to consider the constrained parameters as being functions of the unconstrained parameters. So is our constrained parameter of ψ.
So the posterior mode of the constrained parameters is . We will call the constraint function.
Then we can use the delta method Oehlert (1992) on to get the posterior distribution of the constrained parameters.
A first-order Taylor approximation of at yields
Remembering that the posterior of is approximately normal, the rules about linear transformations for multivariate normal random vectors tell us that
We could use Numpy’s matmul function to multiply the component matrices together. The inv function in the lingalg library could be used to invert the Hessian. So referring the to the gradient of as dg, the following python code could be used to compute the constrained covariance.
np.matmul(
np.matmul(dg,
np.linalg.inv(hessian)),
np.transpose(dg))
This involved a first-order approximation of . Earlier we used a second order approximation for taking the numeric derivative. Why would we just do a first-order here? Traditionally the delta-method is taught and used as only a first-order method. Usually the functions used in the delta method are not incredibly complex. It is good enough to to use the first-order approximation.
Hessian and Delta Approximation¶
To be able to use the normal approximation, we need , , and . As mentioned before, we use numerical optimization to get . Ideally, we would have analytic expressions for and the derivatives of .
This can be accomplished with automatic differentiation Baydin et al. (2018), which will calculate the derivatives analytically. We can also perform numerical differentiation to get the Hessian and the gradient of the constraint function . This will be less accurate than an analytic expression, but may be less computationally intensive in large models.
But once you learn how to take one numeric derivative, you can take the numeric derivative of anything. So using numerical differentiation is a very flexible technique that we can easily apply to all the models we would use.
Numerical Differentiation¶
So numeric derivatives can be very pragmatic, and flexible. How do you compute them? Are they accurate? We use section 5.7 of Press & Teukolsky (2007) as a guide.
The derivative of the function with respect to is
To approximate numerically, couldn’t we just plugin a small value for and compute the scaled difference? Yes. And that is basically what happens. We do a little more work to choose and use a second-order approximation instead of a first-order.
We can see that the scaled difference is a first-order approximation by looking at the Taylor series expansion around .
Taylor’s theorem with remainder gives
where ε is between and .
Now we can rearrange to get
The right hand side is the truncation error, since it’s linear in , the bandwidth we call the this approximation a first order method.
We can do second-order approximations for and and get a more accurate second order method of approximation for .
were is between and and is between and .
Then we have
This is quadratic in . The first term takes equal input from both sides of , so we call it a centered derivative.
So we choose a small value of and plug it into to approximate .
Our derivation used a single input function . The idea applies to partial derivatives of multi-input functions as well. The inputs that you aren’t taking the derivative with respect to are treated as fixed parts of the function.
Choosing a Bandwidth¶
In practice, second order approximation actually involves two sources of error. Roundoff error, arises from being unable to represent and or functions of them with exact binary represetation.
where is the fractional accuracy with which is computed. This is generally machine accuracy. If we are using NumPy Harris et al. (2020) this would be
Minimizing the roundoff error and truncation error, we obtain
where is shorthand for the ratio of and the sum of .
We use shorthand here because because we are not going to approximate (we are already approximating ), so there is no point in writing it out.
Call this shorthand
the curvature scale, or characteristic scale of the function .
There are several algorithms for choosing an optimal scale. The better the scale chosen, the more accurate the approximation is. A good rule of thumb, which is computationally quick, is to just use the absolute value of .
Then we would use
But what if is 0? This is simple to handle, we just add to
Now, Press & Teukolsky (2007) also suggests performing a final sequence of assignment operations that ensures and differ by an exactly representable number. You assign to a temporary variable . Then is assigned the value of .
In Python, the code would simply be
temp = x + h
h = temp - x
Estimating Confidence Intervals after Optimization¶
With the posterior mode, variance, and normal approximation to the posterior. It is simple to create confidence (credible) intervals for the parameters.
Let’s talk a little bit about what these intervals are. For the parameter γ we want a interval (defined on the observed data generated by a realization of γ) to be defined such that
The frequentist confidence interval does not meet this criteria. γ is just one fixed value, so it is either in the interval, or it isn’t! The probability is 0 or 1. A credible interval (Bayesian confidence interval) can meet this criteria.
Suppose that we are able to use the normal approximation for
Then we have
Now, is , standard normal. So we can use the standard normal quantiles in solving for and .
The upper quantile of the standard normal distribution, satisfies
for standard normal .
Noting that the standard normal is symmetric, if we can find and to satisfy
then we have a valid Bayesian confidence interval.
Simple calculation shows that the solutions are
The quantile can be easily generated using scipy.stats from SciPy Virtanen et al. (2020). We would use the norm.ppf function.
In Python, we would have
z_alpha_2 = scipy.stats.norm.ppf(1-alpha/2)
l = -z_alpha_2*se_gamma_hat + gamma_hat
u = z_alpha_2*nsd_gamma_hat + gamma_hat
We can also adjust the intervals for inference on many parameters by using Bonferroni correction Bonferroni (1936).
Now we know how to estimate the posterior mode. We also know how to estimate the posterior variance after computing the posterior mode. And we have seen how confidence intervals are made based on this posterior variance, mode, and the normal approximation to the posterior. Let’s discuss some tools that will enable us to perform these operations.
TensorFlow Probability¶
Now we will introduce TensorFlow Probability, a Python library that we can use to perform the methods we have been discussing. TensorFlow Probability is library built using TensorFlow, a leading software library for machine learning and artificial intelligence Abadi et al. (2015).
TensorFlow Probability is a probabilistic programming language. This lets us build powerful models in a modular way and estimate them automatically. At the heart of TensorFlow Probability is the Distribution class. In theory, a probability distribution is the set of rules that govern the likelihood of how a random variable (vector, or even general tensor) takes its values.
In TensorFlow Probability, distribution rules for scalars and vectors are parametrized, and these are expanded for higher dimensions as independent samples. A distribution object corresponds to a random variable or vector. The parts of a Bayesian model can be represented using different distribution objects for the parameters and observed data.
Example Distribution¶
As an example, let’s examine a linear regression with a prior for the intercept a and a normal prior for the slope β. Our observed outcome variable is with a normal distribution and the predictor is .
We can store the distribution objects in a dictionary for clear organization. The prior distribution of β is Normal with mean 1 and variance 1, . We use the Normal distribution subclass to encode its information in our dictionary.
tfd = tfp.distributions
dist_dict = {}
dist_dict['beta'] = tfd.Normal(1,1)
The β parameter can range over the real line, but the intercept, α should be nonnegative. The Chi2 distribution subclass has support on only the nonegative reals. However, if we are performing optimization on the α parameter, we may take a step where it became negative. We can avoid any complications like this if we use a TransformedDistribution. Transformed distributions can be used together with a Bijector object that represents the transforming function.
For α, we will model an unconstrained parameter, . The natural logarithm can take values over the real line.
tfb = tfp.bijectors
dist_dict['unconstrained_alpha'] = \
tfd.TransformedDistribution(tfd.Chi2(4),tfb.Log())
We can use the sample method on the distribution objects we created to see random realizations. Before we do that we should set the seed, so that we can replicate our work.
tf.random.set_seed(132)
sample_ex=dist_dict['unconstrained_alpha'].sample(10)
sample_ex
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([ 2.050956 , 0.56120026, 1.8559402,
-0.05669071, ... ], dtype=float32)>
We see that the results are stored in a tf.Tensor object. This has an easy interface with NumPy, as you can see by the numpy component. We see that the unconstrained α, takes positive and negative values.
We can evaluate the density, or it’s natural logarithm using class methods as well. Here is the log density for the sample we just drew.
dist_dict['unconstrained_alpha'].log_prob(sample_ex)
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([-1.1720479 , -1.1402813 , -0.8732692 ,
-1.9721189 , ...], dtype=float32)>
Now we can get α from by using a callable and the Deterministic distribution.
dist_dict['alpha'] = \
lambda unconstrained_alpha: \
tfd.Deterministic(\
loc= tfb.Log().inverse( \
unconstrained_alpha))
Now we’ve added all of the parameters to dist_dict. We just need to handle the observed variables and . In this example is exogenous, which means it can be treated as fixed and nonrandom in estimating α and β in the model for . is endogenous, which means it is a response variable in the model, the outcome we are trying to estimate.
We will define separately from our dictionary of distributions. For the example we have to generate values of , but once this is done we will treat it as fixed and exogenous
The observed variable will have a standard normal distribution. We will start by giving it a sample size of 100.
n = 100
x_dist = tfd.Normal(tf.zeros(n),1)
x = x_dist.sample()
The distribution of , which would give us the likelihood, can be formulated using a callable function of the parameters and the fixed value of we just obtained.
dist_dict['y'] = \
lambda alpha, beta: \
tfd.Normal(loc = alpha + beta*x,scale=1)
With a dictionary of distributions and callables indicating their dependencies, we can work with the joint density. This will correspond to the posterior distribution of the model, augmenting the priors with the likelihood.
The JointDistributionNamed class takes a dictionary as input and behaves similarly to a regular univariate distribution object. We can take samples, which are returned as dictionaries keyed by the parameter and observed variable names. We can also compute log probabilities, which gives us the posterior density.
posterior = tfd.JointDistributionNamed(dist_dict)
Now we have a feel for how TensorFlow Probability can be used to store a Bayesian model. We have what we need to start performing optimization and variance estimation.
Maximum A Posteriori (MAP) with SciPy¶
We can use SciPy’s implementation of the Limited memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) Fletcher (1987) algorithm to estimate the posterior mode. This is a Quasi-Newton optimization method that does not need to store the entire Hessian matrix during the algorithm, so it can be very fast. If the Hessian was fully stored we could just use it directly in variance estimation, but it would be slower. We do to take advantage of automatic differentiation to calculate the score function, the first derivative of the posterior. TensorFlow Probability provides this through the value_and_gradient function of its math library.
We will use minimize from the optimize SciPy library, which operates on a loss function that takes a vector of parameters as input. We will optimize on unconstrained_alpha and beta, the unconstrained space parameters of the model. In the joint distribution representation, they are separate tensors. But in optimization, we will need to evaluate a single tensor.
We will use the first utility function from the bayes_mapvar library, which will be available with this paper, to accomplish this. The par_vec_from_dict function unpacks the tensors in a dictionary into a vector.
Within our loss function, we must move the vector of input parameters back to a dictionary of tensors to be evaluated by TensorFlow probability. The par_dict_from_vec function moves the unconstrained parameters back into a dictionary, and the constrained parameters are generated by the get_constrained function. Then the posterior density is evaluated by augmenting this dictionary of constrained parameters with the observed endogenoous variables. The get_constrained function is also used to get the final posterior model estimates from the SciPy optimization.
Variance Estimation with SciPy¶
Once the posterior mode is estimated we can estimate the variance. The first step is calculating the bandwidths. The get_bandwidths function handles this.
def get_bandwidths(unconstrained_par_vec):
abspars = abs(unconstrained_par_vec)
epsdouble = np.finfo(float).eps
epsdouble = epsdouble**(1 / 3)
scale = epsdouble * (abspars + epsdouble)
scaleparmstable = scale + abspars
return scaleparmstable - abspars
With the bandwidths calculated, we step through the parameters and create the Hessian and Delta matrices that we need for variance estimation. The get_hessian_delta_variance function use numeric differentiation to calculate the Hessian, based on numeric derivatives of the automatic derivatives computed by TensorFlow probability for the score function. The Delta matrix is calculated using numeric differentiation of the constrained parameter functions.
Simulation¶
We evaluated our methodology with a simulation based on the α and β parameter setting discussed earlier. This was an investigation into how well we estimated the posterior mode, variance, and distribution using the methods of TensorFlow Probability, SciPy, and bayes_mapvar.
To evaluate the posterior distributions of the parameters we used the MCMC capabilities of TensorFlow Probability. Particularly the the No-U-Turn Sampler Hoffman & Gelman (2011). We were careful to thin the sample based on effective sample size so that autocorrelation would not be a problem. This was accomplished using TensorFlow Probability’s effective_sample_size function from its mcmc library.
We drew observations from the unconstrained prior parameter distribution for and . For each of these prior draws, we drew a posterior sample of and . and were samples based on each and . The posterior mode and variance were estimated, and posterior draws from MCMC were made. The mean was used in the MCMC draws since it sould coincide with the mode if our assumptions are correct.
To check the distributional results, we used the Anderson-Darling test Stephens (1974). This is given by anderson in scipy.stats. We stored a record of whether the test rejects normality at the .05 significance level for each of the draws. This test actually checks the mean and variance assumptions as well, since it compares to a standard normal and we are standardizing based on the MAP and get_hessian_delta_variance estimates.
Table 1:Simulation Results, , , .
Statistic | Mean | S.D. |
---|---|---|
mean | 4.141 | 2.475 |
mean | 3.989 | 2.765 |
S.E. | 0.037 | 0.004 |
S.E. | 0.041 | 0.001 |
α A.D. Reject | 0.042 | 0.201 |
mode | 1.013 | 0.504 |
mean | 1.022 | 1.003 |
S.E. | 0.029 | 0.001 |
S.E. | 0.041 | 0.002 |
β A.D. Reject | 0.045 | 0.207 |
The results of the simulation are shown in Table 1. We use Standard Error (S.E.) to refer to the 1000 estimates of posterior standard deviations from get_hessian_delta_variance and the MCMC sample standard deviations. The Standard Deviation (S.D.) column represents the statistics calculated over the 1000 estimates. The standard errors are not far from each other, and neither are the modes and means. The rejection rates for the Anderson Darling test are not far from .05 either.
We can perform a hypothesis test of whether the rejection rate is .05 by checking whether .05 is in the confidence interval for the proportion. We will use the proportion_confint function from statsmodels Seabold & Perktold (2010). In Table 2, we see that .05 is comfortably within intervals for both parameters. Our simulation successfully corroborated our assumptions about the model and the consistency of our method for estimating the posterior mode, variance, and distribution.
Table 2:A.D. Confidence Intervals, , , .
Statistics | Lower | Upper |
---|---|---|
α AD Reject | 0.030 | 0.056 |
β A.D. Reject | 0.033 | 0.060 |
Conclusion¶
We have explored how Bayesian analysis can be performed without resampling and still obtain full inference. With adequate amounts of the data, the posterior mode can be estimated with numeric optimization and the posterior variance can be estimated with numeric or automatic differentiation. The asymptotic normality of the posterior distribution enables simple calculation of posterior probabilities and confidence (credible) intervals as well.
Bayesian methods let us use data from past experience, subject matter expertise, and different levels of certainty to solve data sparsity problems and provide a probabilistic basis for inference. Retail Price Optimization benefits from historical data and different granularities of information. Other fields may also take advantage of access to large amounts of data and be able to use these approximation techniques. These techniques and the tools implementing them can be used by practitioners to make their analysis more efficient and less intimidating.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Taylor & Francis. https://books.google.com/books?id=ZXL6AQAAQBAJ
- Oehlert, G. W. (1992). A Note on the Delta Method. The American Statistician, 46(1), 27–29. 10.1080/00031305.1992.10475842
- Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic differentiation in machine learning: a survey.
- Press, W. H., & Teukolsky, S. A. (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press. https://books.google.com/books?id=1aAOdzK3FegC
- Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. 10.1038/s41586-020-2649-2