1 Load and explore the dataset in the file pest_data.RDS. The dataset includes 14 variables, among them:
complaints: Number of complaints per building per month
building_id: The unique building identifier
traps: The number of traps used per month per building
date: The date at which the number of complaints is recorded
live_in_super: An indicator for whether the building has a live-in super
age_of_building: The age of the building
total_sq_foot: The total square footage of the building
average_tenant_age: The average age of the tenants per building
monthly_average_rent: The average monthly rent per building
floors: The number of floors per building
library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)
pest_data <- readRDS('pest_data.RDS')
str(pest_data)
## 'data.frame': 120 obs. of 14 variables:
## $ mus : num 0.369 0.359 0.282 0.129 0.452 ...
## $ building_id : int 37 37 37 37 37 37 37 37 37 37 ...
## $ wk_ind : int 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Date, format: "2017-01-15" "2017-02-14" ...
## $ traps : num 8 8 9 10 11 11 10 10 9 9 ...
## $ floors : num 8 8 8 8 8 8 8 8 8 8 ...
## $ sq_footage_p_floor : num 5149 5149 5149 5149 5149 ...
## $ live_in_super : num 0 0 0 0 0 0 0 0 0 0 ...
## $ monthly_average_rent: num 3847 3847 3847 3847 3847 ...
## $ average_tenant_age : num 53.9 53.9 53.9 53.9 53.9 ...
## $ age_of_building : num 47 47 47 47 47 47 47 47 47 47 ...
## $ total_sq_foot : num 41192 41192 41192 41192 41192 ...
## $ month : num 1 2 3 4 5 6 7 8 9 10 ...
## $ complaints : num 1 3 0 1 0 0 4 3 2 2 ...
summary(pest_data)
## mus building_id wk_ind date
## Min. :-0.1008 Min. : 5.0 Min. : 1.00 Min. :2017-01-15
## 1st Qu.: 0.5279 1st Qu.:26.0 1st Qu.: 3.75 1st Qu.:2017-04-07
## Median : 1.0422 Median :46.0 Median : 6.50 Median :2017-06-29
## Mean : 1.1116 Mean :49.6 Mean : 6.50 Mean :2017-06-29
## 3rd Qu.: 1.6837 3rd Qu.:70.0 3rd Qu.: 9.25 3rd Qu.:2017-09-19
## Max. : 2.8030 Max. :98.0 Max. :12.00 Max. :2017-12-11
## traps floors sq_footage_p_floor live_in_super
## Min. : 1.000 Min. : 4.0 Min. :4186 Min. :0.0
## 1st Qu.: 6.000 1st Qu.: 8.0 1st Qu.:4770 1st Qu.:0.0
## Median : 7.000 Median :10.0 Median :5097 Median :0.0
## Mean : 7.033 Mean : 9.9 Mean :4991 Mean :0.3
## 3rd Qu.: 8.000 3rd Qu.:13.0 3rd Qu.:5206 3rd Qu.:1.0
## Max. :11.000 Max. :15.0 Max. :5740 Max. :1.0
## monthly_average_rent average_tenant_age age_of_building total_sq_foot
## Min. :3029 Min. :41.14 Min. :39.0 Min. :19217
## 1st Qu.:3564 1st Qu.:45.14 1st Qu.:47.0 1st Qu.:41192
## Median :3813 Median :48.20 Median :49.0 Median :47096
## Mean :3687 Mean :49.92 Mean :49.4 Mean :49248
## 3rd Qu.:3864 3rd Qu.:53.88 3rd Qu.:51.0 3rd Qu.:59251
## Max. :4019 Max. :65.18 Max. :60.0 Max. :78093
## month complaints
## Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.75 1st Qu.: 1.000
## Median : 6.50 Median : 2.000
## Mean : 6.50 Mean : 3.658
## 3rd Qu.: 9.25 3rd Qu.: 5.250
## Max. :12.00 Max. :18.000
## number of buildings
N_buildings <- length(unique(pest_data$building_id))
N_buildings
## [1] 10
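Since a Poisson outcome has equal mean and variance, a quick moment check on the raw counts is a useful first diagnostic before any model is fit (this snippet is an addition to the original code): if the sample variance of complaints is well above its mean, that already hints at overdispersion.

```r
## quick overdispersion check: a Poisson outcome has mean == variance
mean(pest_data$complaints)
var(pest_data$complaints)
```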
2 Write the following Poisson model in Stan, for \(i=1,\ldots,120\)
\[\begin{align*} \textrm{complaints}_i & \sim \textrm{Poisson}(\lambda_i), \\ \lambda_i & = \exp{(\eta_i)} \\ \eta_i &= \alpha + \beta \, \textrm{traps}_i \end{align*}\] Is the number of complaints associated with the number of traps?
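The file simple_poisson_regression.stan is not listed in this document; a minimal sketch of what it might contain, consistent with the data list and the y_rep draws used below (the priors shown are assumptions, not necessarily those of the original file):

```stan
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector[N] traps;
}
parameters {
  real alpha;
  real beta;
}
model {
  // weakly informative priors (an assumption; the original file may differ)
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  // poisson_log parameterizes the Poisson by its log-mean eta = alpha + beta * traps
  complaints ~ poisson_log(alpha + beta * traps);
}
generated quantities {
  // replicated data for posterior predictive checks
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]);
}
```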
## compile the model
comp_model_P <- stan_model('simple_poisson_regression.stan')
## arrange data into a list
stan_dat_simple <- list(
N = nrow(pest_data),
complaints = pest_data$complaints,
traps = pest_data$traps
)
str(stan_dat_simple)
## List of 3
## $ N : int 120
## $ complaints: num [1:120] 1 3 0 1 0 0 4 3 2 2 ...
## $ traps : num [1:120] 8 8 9 10 11 11 10 10 9 9 ...
## fit the model
fit_P_real_data <- sampling(comp_model_P, data = stan_dat_simple)
##
## SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 2.2e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.094784 seconds (Warm-up)
## Chain 1: 0.086666 seconds (Sampling)
## Chain 1: 0.18145 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.08592 seconds (Warm-up)
## Chain 2: 0.098224 seconds (Sampling)
## Chain 2: 0.184144 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 9e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.095638 seconds (Warm-up)
## Chain 3: 0.082074 seconds (Sampling)
## Chain 3: 0.177712 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 1.7e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.091608 seconds (Warm-up)
## Chain 4: 0.090581 seconds (Sampling)
## Chain 4: 0.182189 seconds (Total)
## Chain 4:
## print the parameters
print(fit_P_real_data, pars = c('alpha','beta'))
## Inference for Stan model: simple_poisson_regression.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 2.58 0.01 0.15 2.27 2.48 2.58 2.69 2.88 826 1.01
## beta -0.19 0.00 0.02 -0.24 -0.21 -0.19 -0.18 -0.15 859 1.01
##
## Samples were drawn using NUTS(diag_e) at Sun May 3 18:32:21 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
mcmc_hist(as.matrix(fit_P_real_data, pars = c('alpha','beta')))
mcmc_scatter(as.matrix(fit_P_real_data, pars = c('alpha','beta')), alpha = 0.2)
It appears that there is a negative association between the number of traps (bait stations) and the number of complaints.
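The association can be quantified directly from the draws; for instance, the posterior probability that \(\beta\) is negative (an added check, not part of the original code):

```r
## posterior probability that beta < 0
beta_draws <- as.matrix(fit_P_real_data, pars = "beta")
mean(beta_draws < 0)
```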
3 Verify that the ‘Poisson’ assumption in the Poisson generalised linear model is not satisfied. Improve your model by including the indicator variable for whether the building has a live-in super and the total square footage of the building (as an offset).
## posterior predictive checking
y_rep <- as.matrix(fit_P_real_data, pars = "y_rep")
ppc_dens_overlay(y = stan_dat_simple$complaints, y_rep[1:200,])
## standardised residuals of the observed vs predicted number of complaints
mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
The standardised residuals are mostly positive, so the model tends to underestimate the number of complaints.
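A complementary check (added here, reusing the y_rep draws above) targets a statistic that a Poisson model often gets wrong for overdispersed counts, the proportion of zeros:

```r
## posterior predictive check of the proportion of zeros
prop_zero <- function(x) mean(x == 0)
ppc_stat(stan_dat_simple$complaints, y_rep, stat = prop_zero, binwidth = 0.01)
```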
We expand the model by adding the live-in super variable and the total square footage of the building (as an offset). You can then interpret the parameters as rates of complaints per square foot.
ggplot(pest_data, aes(x = log(total_sq_foot), y = log1p(complaints))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
\[\begin{align*} \textrm{complaints}_i & \sim \textrm{Poisson}(\lambda_i) \\ \lambda_i & = \exp{(\eta_i)} \\ \eta_i &= \alpha + \beta \, \textrm{traps}_i + \beta_{\rm super} \, {\rm super}_{i} + \text{log_sq_foot}_{i} \end{align*}\]
## add the two variables to the list of the data
stan_dat_simple$log_sq_foot <- log(pest_data$total_sq_foot/1e4)
stan_dat_simple$live_in_super <- pest_data$live_in_super
str(stan_dat_simple)
## List of 5
## $ N : int 120
## $ complaints : num [1:120] 1 3 0 1 0 0 4 3 2 2 ...
## $ traps : num [1:120] 8 8 9 10 11 11 10 10 9 9 ...
## $ log_sq_foot : num [1:120] 1.42 1.42 1.42 1.42 1.42 ...
## $ live_in_super: num [1:120] 0 0 0 0 0 0 0 0 0 0 ...
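The file multiple_poisson_regression.stan is likewise not listed; a sketch of what it might contain, matching the expanded data list (again, the priors are assumptions):

```stan
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector[N] traps;
  vector[N] live_in_super;
  vector[N] log_sq_foot;  // exposure term, entered with coefficient fixed to 1
}
parameters {
  real alpha;
  real beta;
  real beta_super;
}
model {
  // weakly informative priors (an assumption; the original file may differ)
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  complaints ~ poisson_log(alpha + beta * traps
                           + beta_super * live_in_super + log_sq_foot);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]
                               + beta_super * live_in_super[n] + log_sq_foot[n]);
}
```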
## compile the model
comp_model_P_mult <- stan_model('multiple_poisson_regression.stan')
fit_model_P_mult_real <- sampling(comp_model_P_mult, data = stan_dat_simple)
##
## SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 3.4e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.34 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.237967 seconds (Warm-up)
## Chain 1: 0.192576 seconds (Sampling)
## Chain 1: 0.430543 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1.7e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.233233 seconds (Warm-up)
## Chain 2: 0.197182 seconds (Sampling)
## Chain 2: 0.430415 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 1.3e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.242325 seconds (Warm-up)
## Chain 3: 0.211441 seconds (Sampling)
## Chain 3: 0.453766 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 3.3e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.33 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.232427 seconds (Warm-up)
## Chain 4: 0.220678 seconds (Sampling)
## Chain 4: 0.453105 seconds (Total)
## Chain 4:
y_rep <- as.matrix(fit_model_P_mult_real, pars = "y_rep")
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
The predictive intervals are a bit wider at larger numbers of traps, but there are still some large observed numbers of complaints that the model would consider extremely unlikely events.
4 Write the following model in Stan and compare the results with the previous model.
\[\begin{align*} \text{complaints}_i & \sim \text{Neg-Binomial}(\lambda_i, \phi) \\ \lambda_i & = \exp{(\eta_i)} \\ \eta_i &= \alpha + \beta \, {\rm traps}_i + \beta_{\rm super} \, {\rm super}_{i} + \text{log_sq_foot}_{i} \end{align*}\]
In Stan, the negative binomial mass function we’ll use is called \(\texttt{neg_binomial_2_log}(\text{reals} \, \eta, \text{reals} \, \phi)\). Like the poisson_log
function, this negative binomial mass function is parameterized in terms of its log-mean, \(\eta\), but it also has a precision \(\phi\) such that
\[ \mathbb{E}[y] \, = \lambda = \exp(\eta) \]
\[ \text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi. \]
As \(\phi\) gets larger, the term \(\lambda^2 / \phi\) approaches zero, so the variance of the negative binomial approaches \(\lambda\); i.e., the negative binomial gets closer and closer to the Poisson.
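The contents of multiple_NB_regression.stan are not shown; a sketch of what it might look like, consistent with the inv_phi draws extracted later (the priors are assumptions):

```stan
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector[N] traps;
  vector[N] live_in_super;
  vector[N] log_sq_foot;
}
parameters {
  real alpha;
  real beta;
  real beta_super;
  real<lower=0> inv_phi;  // 1/phi; inv_phi -> 0 recovers the Poisson model
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  // weakly informative priors (an assumption; the original file may differ)
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  inv_phi ~ normal(0, 1);
  complaints ~ neg_binomial_2_log(alpha + beta * traps
                                  + beta_super * live_in_super + log_sq_foot,
                                  phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(alpha + beta * traps[n]
                                      + beta_super * live_in_super[n]
                                      + log_sq_foot[n], phi);
}
```

Putting a prior on inv_phi rather than phi is convenient because inv_phi = 0 corresponds exactly to the Poisson limit.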
comp_model_NB <- stan_model('multiple_NB_regression.stan')
fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_simple)
##
## SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 5.6e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.56 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.784487 seconds (Warm-up)
## Chain 1: 0.653912 seconds (Sampling)
## Chain 1: 1.4384 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 5.2e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.52 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.819869 seconds (Warm-up)
## Chain 2: 0.672496 seconds (Sampling)
## Chain 2: 1.49236 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 4.1e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.41 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.80811 seconds (Warm-up)
## Chain 3: 0.748018 seconds (Sampling)
## Chain 3: 1.55613 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 3.4e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.34 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.802971 seconds (Warm-up)
## Chain 4: 0.828982 seconds (Sampling)
## Chain 4: 1.63195 seconds (Total)
## Chain 4:
#fitted_model_NB <- stan(file="multiple_NB_regression.stan", data=stan_dat_simple)
samps_NB <- rstan::extract(fitted_model_NB)
## predictions vs. the data
y_rep <- samps_NB$y_rep
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
## standardised residuals
mean_inv_phi <- mean(samps_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
## prediction by number of traps
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
It looks OK, but the data are clustered by building. A posterior predictive check can help us understand whether it would be a good idea to add the building information to the model.
ppc_stat_grouped(
y = stan_dat_simple$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.2
)
We’re getting plausible predictions for most building means but some are estimated better than others and some have larger uncertainties than we might expect. If we explicitly model the variation across buildings we may be able to get much better estimates.
… to be continued