Suppose \(y_1, \ldots, y_n \sim \mathcal{N}(\theta, \sigma^{2})\), with \(\sigma^{2}\) known. In such a case, the log-likelihood, up to an additive constant, is:
\[ l(\theta; y) = -\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\theta)^2.\] In a Bayesian analysis this log-likelihood is not maximized to retrieve the MLE; rather, it is the quantity that connects the model to the observed data, and it is combined with a pre-experimental belief about \(\theta\): technically, a prior distribution. Many prior choices are possible, and the literature on prior elicitation is still growing.
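To fix ideas, here is a minimal \(\mathsf{R}\) sketch of this log-likelihood (the function name is ours, for illustration only):
#Log-likelihood of a normal sample with known variance, up to an additive constant
log_lik_normal <- function(theta, y, sigma2){
  -sum((y - theta)^2)/(2*sigma2)
}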
The simplest choice in this case is a normal prior:
\[ \theta \sim \pi(\theta) =\mathcal{N}(\mu, \tau^{2}),\] with fixed hyperparameters \(\mu\) and \(\tau^2\). By Bayes' theorem, the posterior distribution is then:
\[\pi(\theta|y) \propto \pi(\theta)L(\theta;y)\propto \mathcal{N}(\mu^{*}, \tau^{*2}),\]
where:
\[\begin{align} \begin{split} \mu^{*}= & \frac{\frac{n}{\sigma^{2}}\bar{y}+\frac{1}{\tau^{2}}\mu }{\frac{n}{\sigma^{2}}+\frac{1}{\tau^2}}\\ \tau^{*2}= & \left(\frac{n}{\sigma^{2}}+\frac{1}{\tau^2}\right)^{-1} \end{split} \label{eq:posterior_values} \end{align}\]
#Input values
#True mean
theta_sample <- 2
#Likelihood variance
sigma2 <- 2
#Sample size
n <- 10
#Prior mean
mu <- 7
#Prior variance
tau2 <- 2
#Data generation
set.seed(123)
y <- rnorm(n,theta_sample, sqrt(sigma2))
#Posterior mean
mu_star <- ((1/tau2)*mu+(n/sigma2)*mean(y))/( (1/tau2)+(n/sigma2))
mu_star
## [1] 2.550488
#Posterior standard deviation
sd_star <- sqrt(1/( (1/tau2)+(n/sigma2)))
sd_star
## [1] 0.4264014
# Plot of likelihood, prior and posterior distribution
curve(dnorm(x, theta_sample, sqrt(sigma2/n)),xlim=c(-4,15), lty=2, lwd=1, col="black", ylim=c(0,1.4),
ylab="density", xlab=expression(theta))
curve(dnorm(x, mu, sqrt(tau2) ), xlim=c(-4,15), col="red", lty=1,lwd=2, add =T)
curve(dnorm(x, mu_star, sd_star),
xlab=expression(theta), ylab="", col="blue", lwd=2, add=T)
legend(8.5, 0.7, c("Prior", "Likelihood", "Posterior"),
       col=c("red", "black", "blue"), lty=c(1,2,1), lwd=c(2,1,2), cex=1)
\(\mu^{*}\) is a weighted mean of the MLE \(\bar{y}\) and the prior mean \(\mu\). Hence, as either the prior variance \(\tau^2\) or the sample size \(n\) increases, the weight on the prior vanishes: the posterior tends to the likelihood, and the posterior mean approaches the MLE \(\bar{y}\).
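A quick numerical check of this behaviour, reusing the quantities defined above (the helper function is ours, for illustration only):
#Posterior mean as a function of prior and sample information
post_mean <- function(ybar, n, sigma2, mu, tau2){
  ((1/tau2)*mu + (n/sigma2)*ybar)/((1/tau2) + (n/sigma2))
}
post_mean(mean(y), n, sigma2, mu, tau2 = 2)        # the prior still pulls towards mu
post_mean(mean(y), n, sigma2, mu, tau2 = 200)      # vague prior: close to mean(y)
post_mean(mean(y), n = 1000, sigma2, mu, tau2 = 2) # large (hypothetical) n: close to mean(y)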
In this case we obtain a closed form for the posterior distribution, which is again a normal distribution with updated parameters. This property is known as conjugacy: the prior and the posterior distributions belong to the same parametric family. In most cases, however, conjugacy does not hold, because the chosen prior does not combine with the likelihood to give a posterior in the same family.
In all these cases a closed form is not available, and we require simulation methods, such as Markov chain Monte Carlo (MCMC). With these procedures, the posterior distribution is approximated by a sequence of random draws, a Markov chain.
Several software packages have been developed in recent years for this purpose, and this ongoing effort has made Bayesian inference a central tool for computational scientists. For illustration purposes, we will use here the \(\mathsf{Stan}\) software to compute the posterior distribution via simulation.
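The file \(\mathsf{normal.stan}\) is not reproduced in this document; a minimal sketch of what it might contain, consistent with the data list passed below (N, y, sigma, mu, tau), is the following (our reconstruction, not necessarily the original file):
// normal.stan (sketch): normal likelihood with known sigma, normal prior on theta
data {
  int<lower=0> N;
  vector[N] y;
  real<lower=0> sigma;
  real mu;
  real<lower=0> tau;
}
parameters {
  real theta;
}
model {
  theta ~ normal(mu, tau);   // prior
  y ~ normal(theta, sigma);  // likelihood
}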
library(rstan)
#launch Stan model:
# Data specification
data<- list(N=n, y=y, sigma =sqrt(sigma2), mu = mu, tau = sqrt(tau2))
# Fit a Stan model
fit <- stan(file="normal.stan", data = data, chains = 4, iter=2000)
##
## SAMPLING FOR MODEL 'normal' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0 seconds (Warm-up)
## Chain 1: 0.016 seconds (Sampling)
## Chain 1: 0.016 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'normal' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.015 seconds (Warm-up)
## Chain 2: 0 seconds (Sampling)
## Chain 2: 0.015 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'normal' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.01 seconds (Warm-up)
## Chain 3: 0.01 seconds (Sampling)
## Chain 3: 0.02 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'normal' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.01 seconds (Warm-up)
## Chain 4: 0.007 seconds (Sampling)
## Chain 4: 0.017 seconds (Total)
## Chain 4:
#extract Stan output
sim <- extract(fit)
#draw the true analytical posterior and the Stan simulated posterior
par(mfrow=c(1,2), pty ="m", oma=c(0,0,0,0))
curve(dnorm(x, theta_sample, sqrt(sigma2/n)),xlim=c(-4,15), lty=2, lwd=1, col="black", ylim=c(0,1.2),
ylab="density", xlab=expression(theta), cex.lab=2)
curve(dnorm(x, mu, sqrt(tau2) ), xlim=c(-4,15), col="red", lty=1,lwd=2, add =T)
curve(dnorm(x, mu_star, sd_star),
xlab=expression(theta), ylab="", col="blue", lwd=2,
cex.lab=2, add=T)
legend(5, 1, c("Prior", "Likelihood", "True Posterior"),
c("red", "black", "blue", "blue" ), lty=c(1,2,1),lwd=c(1,1,2), cex=0.8)
curve(dnorm(x, theta_sample, sqrt(sigma2/n)),xlim=c(-4,15), lty=2, lwd=1, col="black", ylim=c(0,1.2),
ylab="density", xlab=expression(theta), cex.lab=2)
curve(dnorm(x, mu, sqrt(tau2) ), xlim=c(-4,15), col="red", lty=1,lwd=2, add =T)
lines(density(sim$theta, adj=2), col ="blue", lwd=2, lty =1)
legend(5, 1, c("Prior", "Likelihood", "Stan Posterior"),
c("red", "black", "blue", "blue" ), lty=c(1,2,1),lwd=c(1,1,2), cex=0.8)
As may be seen, the two posteriors essentially coincide. In this simple case, MCMC provides no benefit over the analytical computation. But as soon as we move away from a convenient prior-likelihood pair such as this normal-normal example, MCMC becomes essential.
Exercise 4 The object \(\mathsf{sim}\) in the code above contains the MCMC output, which approximates the posterior distribution of our parameter of interest with \(S\) draws of \(\theta\). Plot the empirical cumulative distribution function and compare it with the theoretical cumulative distribution function of the posterior distribution.
Exercise 5 Launch the following line of \(\mathsf{R}\) code:
posterior <- as.array(fit)
Now use the \(\mathsf{bayesplot}\) package. Read the help and, using the object posterior, produce the following plots for this example:
Quickly comment.
Let’s change the assumption of the model above, and consider now a uniform prior for \(\theta\):
\[ \theta \sim \mathsf{Unif}(-10,10).\]
If we compute the numerator of Bayes' theorem we obtain:
\[ \pi(\theta|y) \propto L(\theta;y)\pi(\theta)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}\sum_{i}(y_i-\theta)^2}\mbox{I}_{[-10,10]}(\theta), \]
where \(\mbox{I}_{[-10,10]}(\theta)\) is the indicator function associated with the uniform prior, equal to 1 if \(\theta \in[-10,10]\) and 0 otherwise. This product is not proportional to the kernel of any standard distribution, and in such a case we resort to simulation.
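As before, the file \(\mathsf{normal\_uniform.stan}\) is not shown; a plausible sketch, matching the data list below (N, y, sigma, a, b), is the following (our reconstruction):
// normal_uniform.stan (sketch): normal likelihood, uniform prior on theta
data {
  int<lower=0> N;
  vector[N] y;
  real<lower=0> sigma;
  real a;
  real b;
}
parameters {
  real<lower=a, upper=b> theta;  // support of the uniform prior
}
model {
  theta ~ uniform(a, b);     // prior
  y ~ normal(theta, sigma);  // likelihood
}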
#launch Stan model with the uniform prior in place of normal prior
data2<- list(N=n, y=y, sigma =sqrt(sigma2), a=-10, b=10)
fit2 <- stan(file="normal_uniform.stan", data = data2, chains = 4, iter=2000,
refresh=-1)
#extract the Stan output
sim2 <- extract(fit2)
#plot the Stan posterior
curve(dnorm(x, theta_sample, sqrt(sigma2/n)),xlim=c(-12,12), lty=2, lwd=1, col="black", ylim=c(0,1.2),
ylab="density", xlab=expression(theta))
curve(dunif(x, -10,10 ), xlim=c(-12,12), col="red", lty=1,lwd=2, add =T)
lines(density(sim2$theta, adj=2), col ="blue", lwd=2, lty =1)
legend(5, 1, c("Prior", "Likelihood", "Stan Posterior"),
c("red", "black", "blue", "blue" ), lty=c(1,2,1),lwd=c(1,1,2), cex=0.8)
Of course, we may control the variance of the uniform prior \(\mathsf{Unif}(a,b)\), which equals \(\frac{1}{12}(b-a)^{2}\), and consequently influence the posterior distribution.
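For instance, a quick comparison of the prior variances (the helper name is ours):
#Variance of a Unif(a,b) prior: a narrower support gives a more informative prior
unif_var <- function(a, b) (b - a)^2/12
unif_var(-10, 10)  # the prior used above
unif_var(-2, 2)    # a much tighter alternative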
Exercise 6 Suppose you receive \(n = 15\) phone calls in a day, and you want to build a model to assess their average length. Your likelihood for each call length is \(y_i \sim \mathsf{Exponential}(\lambda)\). Now you have to choose the prior \(\pi(\lambda)\). State which of these priors is adequate to describe the problem, and provide a short motivation for each of them:
\(\pi(\lambda)= \mathsf{Beta}(4,2)\);
\(\pi(\lambda)= \mathsf{Normal}(1,2)\);
\(\pi(\lambda)=\mathsf{Gamma}(4,2)\);
Now, compute your posterior as \(\pi(\lambda|y)\propto L(\lambda;y)\pi(\lambda)\) for the selected prior. If your first choice was correct, you will be able to compute it analytically.
Inspecting the posterior distribution via a set of Markov chain draws may sound like magic. In fact, there are strong theorems and delicate technical conditions behind this result. In the example above, we retrieved the posterior distribution for \(\theta\) via the \(\mathsf{Stan}\) software, which relies on an extension of MCMC, the so-called Hamiltonian Monte Carlo sampling. But how can we visualize these values?
data2<- list(N=n, y=y, sigma =sqrt(sigma2), a=-10, b=10)
fit2 <- stan(file="normal_uniform.stan", data = data2, chains = 4, iter=2000,
refresh=-1)
sim2 <- extract(fit2)
#traceplot
traceplot(fit2, pars ="theta")
theta_est <- mean(sim2$theta)
theta_est
The usual practice is to run several chains and check, for instance via the traceplot above, that they all converge to the same stationary distribution.
We can also obtain MCMC area plots:
library("bayesplot")
library("rstanarm")
library("ggplot2")
#MCMC areas
posterior <- as.matrix(fit2)
plot_title <- ggtitle("Posterior distributions",
"with medians and 80% intervals")
mcmc_areas(posterior,
pars = "theta",
prob = 0.8) + plot_title
print(fit2)
## Inference for Stan model: normal_uniform.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## theta 2.11 0.01 0.45 1.24 1.80 2.12 2.41 2.97 1552 1
## lp__ -20.25 0.02 0.73 -22.25 -20.41 -19.98 -19.80 -19.74 1787 1
##
## Samples were drawn using NUTS(diag_e) at Wed Nov 17 12:30:54 2021.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
If the variance \(\sigma^2\) of the model above is not known in advance, the model becomes biparametric. Let us then extend it and treat \(\sigma^{2}\) as unknown:
\[\begin{align*} y_{1},\ldots,y_{n} & \sim \mathcal{N}(\theta, \sigma^2)\\ \theta &\sim \mathsf{Unif}(-10,10)\\ {\sigma} &\sim \mathsf{Cauchy^{+}}(0, 2.5), \end{align*}\]
where \(\mathsf{Cauchy^{+}}(0, 2.5)\) denotes a half-Cauchy distribution with location 0 and scale 2.5.
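The file \(\mathsf{biparametric.stan}\) is also not shown; a minimal sketch consistent with the data list below (N, y, a, b) and with the prior line quoted in Exercise 7 below is the following (our reconstruction):
// biparametric.stan (sketch): normal likelihood with unknown theta and sigma
data {
  int<lower=0> N;
  vector[N] y;
  real a;
  real b;
}
parameters {
  real<lower=a, upper=b> theta;
  real<lower=0> sigma;  // the positivity constraint turns the Cauchy prior into a half-Cauchy
}
model {
  target += uniform_lpdf(theta | a, b);
  target += cauchy_lpdf(sigma | 0, 2.5);
  target += normal_lpdf(y | theta, sigma);
}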
#launch biparametric Stan model
data3<- list(N=n, y=y, a=-10, b=10)
fit3 <- stan(file="biparametric.stan", data = data3, chains = 4, iter=2000,
refresh=-1)
#extract stan output for biparametric model
sim3 <- extract(fit3)
posterior_biv <- as.matrix(fit3)
theta_est <- mean(sim3$theta)
sigma_est <- mean(sim3$sigma)
c(theta_est, sigma_est)
traceplot(fit3, pars=c("theta", "sigma"))
plot_title <- ggtitle("Posterior distributions",
"with medians and 80% intervals")
mcmc_areas(posterior_biv,
pars = c("theta","sigma"),
prob = 0.8) + plot_title
Exercise 7 Go to this link: rstan, and follow the instructions to download and install the \(\mathsf{rstan}\) library. Once you have done so successfully, open the model file called \(\mathsf{biparametric.stan}\), and replace the line:
\[\mathsf{ target+=cauchy\_lpdf(sigma| 0, 2.5);} \]
with the following one:
\[\mathsf{ target+=uniform\_lpdf(sigma| 0.1, 10);} \]
Which prior are you now assuming for your parameter \(\sigma\)? Reproduce the same plots as above and briefly comment.
Extra: Model comparisons
In Bayesian inference we can still perform tests in order to verify some assumptions. Rather than testing a particular value of a parameter, here we are more interested in comparing models in terms of their prior choices. Suppose you want to test the hypotheses
\[H_0: \theta \in \Theta_0; H_1: \theta \in \Theta_1,\]
where \(\Theta_0\) and \(\Theta_1\) form a partition of the parameter space.
The beliefs about the two hypotheses are summarized by the posterior odds ratio:
\[\frac{p_0}{p_1}=\frac{P(\theta\in \Theta_0|y)}{P(\theta\in \Theta_1|y)}=\frac{\int_{\Theta_0}\pi(\theta|y)d\theta}{\int_{\Theta_1}\pi(\theta|y)d\theta}\]
A measure of the evidence provided by the data in support of \(H_0\) over \(H_1\) is the Bayes factor:
\[BF_{01}= \frac{\mbox{posterior odds}}{\mbox{prior odds}}=\frac{p_0/p_1}{\pi_0/\pi_1}=\frac{m_{0}(y)}{m_1(y)},\]
where \(\pi_0=\int_{\Theta_0}\pi(\theta)d\theta\) and \(\pi_1=\int_{\Theta_1}\pi(\theta)d\theta\) are the probabilities of the two hypotheses prior to observing the data, while \(m_0(y)=\int_{\Theta_0}L(\theta_0;y)\pi(\theta_0)d\theta_0\) and \(m_1(y)=\int_{\Theta_1}L(\theta_1;y)\pi(\theta_1)d\theta_1\) are the corresponding marginal likelihoods.
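As a toy illustration of these quantities (our own example, reusing the closed-form normal posterior \(\mathcal{N}(\mu^{*}, \tau^{*2})\) obtained at the beginning of this section), the posterior odds of \(H_0: \theta \le 2\) against \(H_1: \theta > 2\) can be computed directly:
#Toy illustration (not in the original text): posterior odds of
#H0: theta <= 2 versus H1: theta > 2 under the conjugate normal posterior
p0 <- pnorm(2, mu_star, sd_star)  # posterior probability of H0
p1 <- 1 - p0                      # posterior probability of H1
p0/p1                             # posterior odds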
Suppose we are interested in assessing the average number of goals scored by a given team in Major League Soccer, and denote the goals for \(n\) games as \(y_1,\ldots,y_n\). Since goals are relatively rare events, we assume that:
\[y_i \sim \mathsf{Poisson}(\lambda), \ i=1,\ldots,n,\] thus, the likelihood is:
\[L(\lambda; y)=\prod_{i=1}^{n} e^{-\lambda}\frac{\lambda^{y_{i}}}{y_i!}.\] Now, we have to elicit a prior for the parameter \(\lambda\), and several choices are plausible:
Prior 1: \(\lambda \sim \mathsf{Gamma}(4.57, 1.43)\), meaning that the average is \(\alpha/\beta= 3.195804\), with \(\alpha\) the shape parameter and \(\beta\) the rate parameter, and the quartiles for \(\lambda\) are given by 2.10 and 4.04.
Prior 2: \(\log(\lambda) \sim \mathcal{N}(1, .5^2)\). The quartiles for this prior for \(\log(\lambda)\) are 0.66 and 1.34, which translates to prior quartiles for \(\lambda\) of 1.94 and 3.81.
Prior 3: \(\log(\lambda) \sim \mathcal{N}(2, .5^2)\). The quartiles for \(\log(\lambda)\) are 1.66 and 2.33, which translates to prior quartiles for \(\lambda\) of 5.25 and 10.27.
Prior 4: \(\log(\lambda) \sim \mathcal{N}(1, 2^2)\). The quartiles for \(\log(\lambda)\) are -0.34 and 2.34, which translates to prior quartiles for \(\lambda\) of 0.71 and 10.38.
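The prior quartiles quoted above can be verified with a few lines of \(\mathsf{R}\):
#Quartiles of the Gamma(4.57, 1.43) prior for lambda
qgamma(c(0.25, 0.75), shape = 4.57, rate = 1.43)
#Quartiles for lambda implied by the normal priors on log(lambda)
exp(qnorm(c(0.25, 0.75), mean = 1, sd = 0.5))
exp(qnorm(c(0.25, 0.75), mean = 2, sd = 0.5))
exp(qnorm(c(0.25, 0.75), mean = 1, sd = 2))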
The dataset \(\mathsf{soccergoals}\) in the \(\mathsf{LearnBayes}\) package contains the number of goals across the 35 matches of the 2006 season for a given team. Let’s write the prior functions and plot the four priors and the likelihood for \(\theta=\log(\lambda)\).
The likelihood function, written in terms of \(\lambda=e^{\theta}\), is proportional to \[L(\theta) \propto e^{-n\lambda} \lambda^{\sum_{i=1}^{n}y_i}, \] in which we recognize the kernel of a gamma distribution.
If \(\Lambda \sim \mathsf{Ga}(\alpha, \beta)\) with shape \(\alpha\) and scale \(\beta\), then \[f_\Lambda(\lambda) \propto \lambda^{\alpha-1} e^{-\frac{\lambda}{\beta}}.\] Matching the two expressions gives \(\alpha=\sum_{i=1}^{n}y_i + 1\) and scale \(\beta=1/n\), which is exactly the \(\mathsf{dgamma}\) call used below.
library(LearnBayes)
data(soccergoals)
y <- soccergoals$goals
#write the likelihood function via the gamma distribution
lik_pois<- function(data, theta){
n <- length(data)
lambda <- exp(theta)
dgamma(lambda, shape =sum(data)+1, scale=1/n)
}
# Write the gamma prior on lambda as a density for theta = log(lambda)
prior_gamma <- function(par, theta){
lambda <- exp(theta)
# the factor lambda is the Jacobian of the transformation theta -> lambda
dgamma(lambda, par[1], rate=par[2])*lambda
}
# Write the prior (Normal)
prior_norm <- function(npar, theta){
dnorm(theta, npar[1], npar[2])
}
# Vectorize the likelihood and the priors
lik_pois_v <- Vectorize(lik_pois, "theta")
prior_gamma_v <- Vectorize(prior_gamma, "theta")
prior_norm_v <- Vectorize(prior_norm, "theta")
#Plot likelihood
curve(lik_pois_v(theta=x, data=y), xlim=c(-1,4), xlab=expression(theta), ylab = "density", lwd =2 )
#prior 1
curve(prior_gamma_v(theta=x, par=c(4.57, 1.43)), lty =2, col="red", add = TRUE, lwd =2)
# prior 2
curve(prior_norm_v(theta=x, npar=c(1, .5)), lty =3, col="blue", add =TRUE, lwd=2)
#prior 3
curve(prior_norm_v(theta=x, npar=c(2, .5)), lty =4, col="green", add =TRUE, lwd =2)
#prior 4
curve(prior_norm_v(theta=x, npar=c(1, 2)), lty =5, col="violet", add =TRUE, lwd =2)
legend(2.6, 1.8, c("Lik.", "Ga(4.57,1.43)", "N(1, 0.25)", "N(2,0.25)","N(1, 4)" ),
lty=c(1,2,3,4,5), col=c("black", "red", "blue", "green", "violet"),lwd=2, cex=0.9)
Let’s quickly comment. Priors 1 and 2 almost coincide in location and scale, and they suggest a mean, on the log scale, of around 1, slightly greater than the likelihood mean \(\log(\bar{\lambda})=0.49\). Prior 3 is quite extreme and is in conflict with the likelihood. Prior 4 is very flat and provides little information about the parameter \(\theta\).
Now we have to compute the log-posteriors, using the reparametrization \(\theta= \log(\lambda)\) and calling the previous functions:
\[\log \pi(\theta|y) \propto l(\theta;y)+\log \pi(\theta).\]
# Poisson-Gamma
logpoissongamma <- function(theta, datapar){
data <- datapar$data
par <- datapar$par
log_lik <- log(lik_pois(data, theta))
log_prior <- log(prior_gamma(par, theta))
return(log_lik+log_prior)
}
logpoissongamma.v <- Vectorize( logpoissongamma, "theta")
#Poisson-Normal
logpoissonnormal <- function( theta, datapar){
data <- datapar$data
npar <- datapar$par
log_lik <- log(lik_pois(data, theta))
log_prior <- log(prior_norm(npar, theta))
return(log_lik+log_prior)
}
logpoissonnormal.v <- Vectorize( logpoissonnormal, "theta")
# Plot:
#log-likelihood
curve(log(lik_pois(y, theta=x)), xlim=c(-1,4),ylim=c(-20,2), lty =1,
ylab="log-posteriors", xlab=expression(theta))
#log posterior 1
curve(logpoissongamma.v(theta=x, list(data=y, par=c(4.57, 1.43))), col="red", xlim=c(-1,4),ylim=c(-20,2), lty =1, add =TRUE)
#log posterior 2
curve(logpoissonnormal.v(theta=x, list(data=y, par=c(1, .5))), lty =1, col="blue", add =TRUE)
#log posterior 3
curve(logpoissonnormal.v(theta=x, list(data=y, par=c(2, .5))), lty =1, col="green", add =TRUE, lwd =2)
#log posterior 4
curve(logpoissonnormal.v( theta=x, list(data=y, par=c(1, 2))), lty =1, col="violet", add =TRUE, lwd =2)
legend(2.6, 1.3, c( "loglik", "lpost 1", "lpost 2", "lpost 3", "lpost 4" ),
lty=1, col=c("black", "red", "blue", "green", "violet"),lwd=2, cex=0.9)
Now we have to compute the Bayes factors. The Bayes factor in support of Prior 1 over Prior 2 is:
\[ BF_{12}=\frac{m_1(y)}{m_2(y)}=\frac{\int_{\Theta_1} L(\theta_1;y)\pi(\theta_1)d\theta_1}{\int_{\Theta_2} L(\theta_2;y)\pi(\theta_2)d\theta_2}.\]
We use the function \(\mathsf{laplace()}\) contained in the \(\mathsf{LearnBayes}\) package for computing the posterior modes, the posterior standard deviations and the log-marginal likelihoods, \(\log(m(y))\).
#posterior modes, posterior standard deviations and the log-marginal likelihoods by using laplace function
datapar <- list(data=y, par=c(4.57, 1.43))
fit1 <- laplace(logpoissongamma, .5, datapar)
datapar <- list(data=y, par=c(1, .5))
fit2 <- laplace(logpoissonnormal, .5, datapar)
datapar <- list(data=y, par=c(2, .5))
fit3 <- laplace(logpoissonnormal, .5, datapar)
datapar <- list(data=y, par=c(1, 2))
fit4 <- laplace(logpoissonnormal, .5, datapar)
postmode <- c(fit1$mode, fit2$mode, fit3$mode, fit4$mode )
postsds <- sqrt(c(fit1$var, fit2$var, fit3$var, fit4$var))
logmarg <- c(fit1$int, fit2$int, fit3$int, fit4$int)
cbind(postmode, postsds, logmarg)
## postmode postsds logmarg
## [1,] 0.5248047 0.1274414 -1.502977
## [2,] 0.5207825 0.1260712 -1.255171
## [3,] 0.5825195 0.1224723 -5.076316
## [4,] 0.4899414 0.1320165 -2.137216
Now we may compute the Bayes factors, for instance:
\[BF_{21}=m_{2}(y)/m_{1}(y)=\exp\{-1.255171+1.502977 \}= 1.281,\] which means that the data support Prior 2 over Prior 1 by a factor of about 1.28. This makes sense: in the first plot, Prior 2 was slightly closer to the likelihood.
# bayes factors
BF_matrix <- matrix(1, 4,4)
for (i in 1:3){
for (j in 2:4){
BF_matrix[i,j]<- exp(logmarg[i]-logmarg[j])
BF_matrix[j,i]=(1/BF_matrix[i,j])
}
}
round_bf <- round(BF_matrix,3)
round_bf
## [,1] [,2] [,3] [,4]
## [1,] 1.000 0.781 35.635 1.886
## [2,] 1.281 1.000 45.656 2.416
## [3,] 0.028 0.022 1.000 0.053
## [4,] 0.530 0.414 18.899 1.000
Every prior is favored over Prior 3, which is the prior most in conflict with the likelihood. Prior 2 is favored over all the other priors. In general, the marginal likelihood under a prior decreases as the prior density becomes more diffuse.
Exercise 8 Simulate a sample of size \(n=14\) from a Bernoulli distribution with parameter \(p=0.5\).
Looking at the \(\mathsf{Stan}\) code for the other models, write a short \(\mathsf{Stan}\) Beta-Binomial model, where \(p\) has a \(\mathsf{Beta}(a,b)\) prior with \(a=3\), \(b=3\).
extract the posterior distribution with the function \(\mathsf{extract()}\);
produce some plots with the \(\mathsf{bayesplot}\) package and comment.
compute analytically the posterior distribution and compare it with the \(\mathsf{Stan}\) distribution.