Intermediate Econometrics

28/11 - 03/12/2025 - Vincenzo Gioia

Different responses

Moving beyond responses defined on the real line

  • Up to now: responses that are real numbers defined on the whole real line (or strictly positive responses coerced to the whole real line by taking their logarithm)

  • Now, responses that are not defined on the whole real line

  1. Defined only on a part of the real line
  2. Integers or categorical
  • The models will be estimated by maximum likelihood: the distribution of the response will be completely specified, and we will estimate the parameters that govern its distributional parameters

Different responses

Moving beyond responses defined on the real line

  • With respect to the linear model, the main difference is that the conditional expectation of the response is no longer a linear function of the covariates

  • Here, we will denote \(\eta = X \beta\) the linear predictor which is related to the conditional expectation

  1. \(\eta\) is a \(n\) dimensional vector
  2. \(X\) is a \(n \times p\) matrix (intercept plus \(p-1\) regressors)
  3. \(\beta\) is a \(p\) dimensional vector
  • More precisely, \(\eta_i =g(\mu_i)\), with \(\mu_i = \mathbb{E}(Y_i|x_i)\) and where \(g(\cdot)\) is the link function

  • In the linear case \(\beta_k = \frac{\partial \mbox{E}(Y_i \mid x_i)}{\partial x_{ik}}\), so the marginal effect of the \(k\)-th covariate is simply the corresponding coefficient

  • Here we have \(\frac{\partial \mu_i}{\partial x_{ik}}=\beta_k \frac{\partial g ^ {-1}}{\partial \eta_i}(\eta_i)\) so the marginal effect of the \(k\)-th covariate is now proportional (but not equal) to the corresponding coefficient

Models for binary response

Binary response

  • Binary responses can take only two mutually exclusive values (coded as 1 and 0)

  • Common examples:

  1. transport mode choice (car vs. public transport)
  2. labor force participation of women
  3. union membership
  • For this kind of response, the statistical distribution is obviously a binomial distribution with one trial (that is, a Bernoulli distribution)

  • Denoting 1 as “a success” and 0 as “a failure”, this distribution is fully characterized by a unique parameter, \(p\), which is the probability of success and is also the expected value of the variable, that is: \(\mbox{E}(Y) = \mu = p\).

  • The variance of the distribution is: \(\mbox{V}(Y) = \mu(1-\mu) = p(1-p)\).

Models for binary response

Binary response

  • Regression model for a binary response
  1. linear predictor: \(\eta = X\beta\).
  2. consider a function \(F\) which relates the linear predictor to the unique parameter of the Bernoulli distribution: \(p_i = \mu_i= F(\eta_i)\)
  • Different choices of \(F\) lead to different models for binary variables

Models for binary response

Simplest choice for \(F\): Advantages

\[p_i = F(\eta_i) = F(x^\top_i\beta)\]

  • Simplest choice: Identity function for \(F(\cdot)\) \[\implies p_i = \eta_i = x^\top_i\beta\]

  • So, the parameter of the Bernoulli distribution is assumed to be a linear function of the covariates

  • Advantages:

  1. it is very simple to estimate: it is a linear model
  2. it can be simply extended to IV estimation
  • Since the model is linear, \(\frac{\partial p_i}{\partial x_{ik}} = \beta_k\), so the estimated parameters can be interpreted as the (constant) marginal effects of the corresponding covariates on the probability of success

Models for binary response

Simplest choice for \(F\): Pitfalls

\[p_i = F(\eta_i) = F(x^\top_i\beta)\]

  • Simplest choice: Identity function for \(F(\cdot)\)

\[\implies p_i = \eta_i = x^\top_i\beta\]

  • Drawbacks:
  1. the residuals are \(e_i = y_i - \hat{p}_i\) but, as \(y_i\) is either 0 or 1, the residuals equal either \(-x^\top_i\hat\beta\) or \(1-x^\top_i\hat \beta\) and therefore depend on \(x\): the linear-probability model estimated by least squares is therefore inefficient, as the residuals are heteroscedastic, and the standard errors reported by a least-squares program are biased (some solutions: estimate the linear-probability model by GLS, or use a heteroscedasticity-robust estimator of the covariance matrix of the estimators, as in the sketch after this list)
  2. As the fitted probabilities of success are linear functions of the covariates, they are not bounded by 0 and 1 and, therefore, it is possible that the model will predict, for some observations, probabilities that would be either negative or greater than 1.
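  • A minimal sketch of the robust-covariance fix, using the sandwich and lmtest packages on simulated (hypothetical) data; the point estimates are unchanged, only the standard errors are corrected
library(sandwich)
library(lmtest)
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-0.5 + x))  # simulated binary response
lpm <- lm(y ~ x)                                     # linear-probability model
coeftest(lpm, vcov = vcovHC(lpm, type = "HC0"))      # heteroscedasticity-robust standard errors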

Models for binary response

Better choices

  • Consider a functional form \(F\) which has the following properties (these are the features of any cumulative distribution function of a continuous variable supported on the whole real line)
  1. \(F(z)\) is increasing in \(z\)
  2. \(\lim_{z\rightarrow -\infty} F(z) = 0\)
  3. \(\lim_{z\rightarrow +\infty} F(z) = 1\)


  • Two common choices for the functional form of \(F\) are:
  1. Normal (\(\Phi\)) leading to the probit regression model

\[F(z) = \Phi(z) = \displaystyle \int_{-\infty} ^ z \phi(t) dt = \int_{-\infty} ^ z \frac{1}{\sqrt{2\pi}} e ^{-\frac{1}{2}t ^ 2} dt\]

  2. Logistic (\(\Lambda\)) leading to the logistic regression model (or logit model) \[F(z) = \Lambda(z) = \displaystyle \frac{e^z}{1 + e ^ z}\]

Models for binary response

Logit/Probit models

  • The density function of the standard normal distribution: \(\phi(z) = f_Z(z) = \frac{1}{\sqrt{2\pi}} e ^{-\frac{1}{2}z ^ 2}\)
  • The density function of the standard logistic distribution: \(\lambda(z) = f_Z(z) = \frac{e^z}{(1 + e ^ z)^2}\)
  • Both density functions are symmetric around 0 and are “bell-shaped”, but they have two important differences
  1. the variance of the standard normal distribution is 1 and is equal to \(\pi ^ 2/3\) for the logistic distribution
  2. the logistic distribution has much heavier tails than the normal density
curve(dnorm(x), from = -4, to = 4)
curve(dlogis(x), from = -4, to = 4, add = TRUE, col = "red")
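  • A quick numerical check of the variance claim: the second moment of the standard logistic density should equal \(\pi^2/3 \approx 3.29\)
integrate(function(z) z^2 * dlogis(z), lower = -Inf, upper = Inf)$value  # variance of the logistic
pi^2 / 3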

Models for binary response

Logit/Probit models

  • As \(\mu_i=F(\eta_i)\) (with \(\eta = X\beta\)), the marginal effect of the \(k\)-th covariate on the probability is:

\[ \frac{\partial \mu_i}{\partial x_{ik}} = \beta_k f(\eta_i) \]

where \(f\) is the first derivative of \(F\) (the normal and logistic densities, respectively)

  • Then, the marginal effect is obtained by multiplying the coefficient by \(f(\eta_i)\) which depends on the value of the covariates for a given observation (the marginal effect is observation-dependent, but the ratio of two marginal effects for two covariates is not, as it is obviously equal to the ratio of the two corresponding coefficients)

  • As the coefficient of proportionality is the normal/logistic density, the maximum marginal effect is for \(\eta_i = 0\), which results in a probability of success of 0.5.

Models for binary response

Logit/Probit models

  • At \(\eta_i = 0\), the values of the densities are about 0.4 and 0.25 for the normal and logistic distributions, respectively

  • Therefore, a rule of thumb to interpret the coefficients is to multiply them by 0.4 (probit) or 0.25 (logit) to get an estimate of the maximum marginal effect

curve(dnorm(x), from = -4, to = 4)
curve(dlogis(x), from = -4, to = 4, add = TRUE, col = "red")
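  • The two constants used in this rule of thumb are simply the two densities evaluated at zero
c(normal = dnorm(0), logistic = dlogis(0))  # approximately 0.4 and 0.25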

Models for binary response

Better choices

  • The coefficients of the logit and probit models can therefore not be compared directly. This is because they are scaled differently, as the standard deviation of the logistic distribution is \(\pi/\sqrt{3} \approx 1.81\), compared to 1 for the standard normal distribution
  • Therefore, it would be tempting to multiply the probit coefficients by 1.81 to compare them with the logit coefficients but, empirically, the value 1.6 performs better (see the sketch below)
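  • A small sketch illustrating this choice: the logistic CDF is closely matched by a normal CDF whose argument is divided by 1.6 (which equates the slopes at zero), while the variance-based rescaling 1.81 fits the centre of the distribution less well
curve(plogis(x), from = -4, to = 4, ylab = "F(x)")
curve(pnorm(x / 1.6), add = TRUE, col = "red", lty = 2)              # empirical rescaling
curve(pnorm(x / (pi / sqrt(3))), add = TRUE, col = "blue", lty = 3)  # variance-based rescaling
legend("topleft", legend = c("logistic", "normal / 1.6", "normal / 1.81"),
       col = c("black", "red", "blue"), lty = 1:3, bty = "n")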

Models for binary response

Example: Car versus Public Transport

  • As an example, we consider the data set used by Horowitz (1993) which concerns the transport mode chosen for work trips by a sample of 842 individuals in Washington DC in the late sixties

  • The response mode is 1 for car and 0 for public transport.

  • The covariates are the differences between car and public transport in in-vehicle time (ivtime), out-vehicle time (ovtime) and monetary cost (cost)

  • A positive value indicates that the car trip is longer/more expensive than the corresponding trip by public transport

Models for binary response

Example: Car versus Public Transport

  • The dataset is mode_choice in the micsr R package
  • We make some transformations to the data
  1. We divide ivtime and ovtime by 60 and cost by 100
  2. We multiply the cost by 8.42 to obtain dollars in 2022 (the CPI for 2022 is 842 with a 100 base in 1967)
  3. The generalized cost of a trip is the sum of the monetary cost and the value of the time spent in transport. We value an hour of transport time at two-thirds of the minimum hourly wage (the minimum wage was about $1.4 in the US in the late sixties, so two-thirds of it is about $8 in 2022 dollars)
library(micsr)
data(mode_choice)
mode_choice$ivtime <- mode_choice$ivtime/60
mode_choice$ovtime <- mode_choice$ovtime/60
mode_choice$cost <- mode_choice$cost/100
mode_choice$cost <- mode_choice$cost * 8.42
mode_choice$gcost <- (mode_choice$ivtime + mode_choice$ovtime) * 8 + mode_choice$cost

Models for binary response

Example: Car versus Public Transport

  • The linear-probability model is fitted with lm(), while the probit and logit models are fitted via the glm() function
linprob <- lm(mode ~ gcost, mode_choice)
probit <- glm(mode ~ gcost, mode_choice, family = binomial(link = "probit"))
logit <- glm(mode ~ gcost, mode_choice, family = binomial(link = "logit"))

round(summary(linprob)$coefficients[2,],2)
  Estimate Std. Error    t value   Pr(>|t|) 
      0.02       0.00       9.56       0.00 
round(summary(probit)$coefficients[2,],2)
  Estimate Std. Error    z value   Pr(>|z|) 
      0.11       0.01       8.65       0.00 
round(summary(logit)$coefficients[2,],2)
  Estimate Std. Error    z value   Pr(>|z|) 
      0.21       0.02       8.52       0.00 

Models for binary response

Example: Car versus Public Transport

  • The coefficient of gcost for the linear-probability model is 0.0226, which means that a one dollar increase of the generalized cost differential will increase the probability of using the car by 2.26 percentage points.

  • If we use the previously described rule of thumb to multiply the probit/logit coefficients by 0.4/0.25 in order to have an upper limit for the marginal effect, we get 4.51 and 5.28 percentage points, which are much higher values than for the linear probability model

  • This is because the coefficient of the linear model estimates the marginal effect at the sample mean. In our sample, the mean value of the covariate is 2.9

c(coef(linprob)[2], 
      coef(probit)[2] * 0.4, 
      coef(logit)[2] * 0.25)
     gcost      gcost      gcost 
0.02255435 0.04514873 0.05280576 

Models for binary response

Example: Car versus Public Transport

  • To get comparable marginal effects for the probit/logit models, we should first compute \(\hat{\beta_1} + \hat{\beta_2} \bar{x}\) (1.15 and 2 respectively for the probit and logit models) and use these values with the relevant densities (\(\phi(1.15) = 0.206\) and \(\lambda(2) = 0.105\))

  • At the sample mean, the marginal effects are then \(\phi(1.15)\times 0.1129 = 0.023\) and \(\lambda(2) \times 0.2112 = 0.022\): they are therefore very close to the linear probability model coefficient

xmean <- mean(mode_choice$gcost)
lprobit <- coef(probit)[1] + coef(probit)[2] * xmean
llogit <- coef(logit)[1] + coef(logit)[2] * xmean

round(dnorm(lprobit) * coef(probit)[2], 3)
(Intercept) 
      0.023 
round(dlogis(llogit) * coef(logit)[2], 3)
(Intercept) 
      0.022 
coef(linprob)[2]
     gcost 
0.02255435 
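  • An alternative to the effect at the sample mean is the average marginal effect, obtained by averaging the observation-specific effects \(f(\eta_i)\hat\beta_k\) (a minimal sketch using the models fitted above)
eta_probit <- predict(probit, type = "link")
eta_logit <- predict(logit, type = "link")
c(mean(dnorm(eta_probit)) * coef(probit)[2],  # average marginal effect, probit
  mean(dlogis(eta_logit)) * coef(logit)[2])   # average marginal effect, logit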

Models for binary response

Example: Car versus Public Transport

  • We can see the scatterplot and the fitted probability curves
xlim <- c(-30, 30)
set.seed(123)
y_jitter <- jitter(as.numeric(mode_choice$mode), amount = 0.02)

plot(mode_choice$gcost, y_jitter, xlim = xlim, pch = 16,
     cex = 0.2, xlab = "gcost", ylab = "mode")

curve(coef(linprob)[1] + coef(linprob)[2] * x, add = TRUE, lty = 1 )
curve(pnorm(coef(probit)[1] + coef(probit)[2] * x), add = TRUE, lty = 2)
curve(plogis(coef(logit)[1] + coef(logit)[2] * x), add = TRUE, lty = 3)

legend("topleft", legend = c("linear", "probit", "logit"),
       lty = c(1, 2, 3), bty = "n")

Models for binary response

Example: Car versus Public Transport

  • The fitted probabilities are given by
  1. a straight line for the linear probability model
  2. an S curve for the probit and logit models. These last two curves are very similar except for low values of the covariate.

Models for binary response

Example: Airbnb

  • The data set called airbnb has been analyzed by Edelman, Luca, and Svirsky (2017)

  • Aim: analyze the presence of racial discrimination on the Airbnb platform

  • The authors create guest accounts that differ by the first name chosen. More specifically, the race of the applicant is suggested by the choice of the first name, either a “white” (Emily, Sarah, Greg) or an “African American” (Lakisha or Kareem) first name.

  • The response is acceptance and is 1 if the host gave a positive response and 0 otherwise

  • We use only three covariates:

  1. guest’s race suggested by the first name guest_race
  2. the price price (in logs)
  3. city, the cities where the experience took place (Baltimore, Dallas, Los Angeles, St. Louis and Washington, DC)

Models for binary response

Example: Airbnb

  • Note: the mean of the response is \(0.45\) which is a distinctive feature of this data set compared to the previous one (mode_choice)

  • As the mean value of the probability of success is close to 50%, we can expect that the rule of thumb, which consists of multiplying the logit/probit coefficients by 0.25/0.4, would give an estimated value for the marginal effect close to the one directly obtained from the linear probability model

airbnb <- as.data.frame(micsr.data::airbnb)
str(airbnb)
'data.frame':   6235 obs. of  10 variables:
 $ acceptance  : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 2 2 1 1 ...
 $ guest_race  : Factor w/ 2 levels "white","black": 1 2 1 1 2 1 2 1 2 2 ...
 $ guest_gender: Factor w/ 2 levels "male","female": 1 2 1 2 2 1 2 1 2 2 ...
 $ host_race   : Factor w/ 2 levels "other","black": 1 1 1 1 2 1 1 1 1 1 ...
 $ host_gender : Factor w/ 2 levels "female","male": 1 2 2 1 2 1 1 1 1 2 ...
 $ multlistings: num  0 0 0 0 0 0 1 0 0 1 ...
 $ shared      : num  0 1 0 1 1 1 1 1 1 0 ...
 $ tenreviews  : num  1 1 0 0 1 0 0 0 0 1 ...
 $ price       : int  120 74 150 110 50 250 85 62 125 119 ...
 $ city        : Factor w/ 5 levels "Baltimore","Dallas",..: 2 2 2 2 2 2 2 2 2 2 ...
mean(as.numeric(airbnb$acceptance) - 1)
[1] 0.4484362
table(airbnb$acceptance)/nrow(airbnb)

       no       yes 
0.5515638 0.4484362 

Models for binary response

Example: Airbnb

  • The dataset contains some NA values for the covariates we include in the model: we simply remove these observations
  • We also recode the outcome variable (acceptance), assigning 0 to “no” and 1 to “yes”
apply(is.na(airbnb), 2, sum)
  acceptance   guest_race guest_gender    host_race  host_gender multlistings 
           0            0            0            0            0            0 
      shared   tenreviews        price         city 
           0            0           67            5 
airbnb <- na.omit(airbnb)
airbnb$acceptance <- ifelse(airbnb$acceptance == "no", 0, 1)
str(airbnb)
'data.frame':   6168 obs. of  10 variables:
 $ acceptance  : num  1 0 1 0 1 0 1 1 0 0 ...
 $ guest_race  : Factor w/ 2 levels "white","black": 1 2 1 1 2 1 2 1 2 2 ...
 $ guest_gender: Factor w/ 2 levels "male","female": 1 2 1 2 2 1 2 1 2 2 ...
 $ host_race   : Factor w/ 2 levels "other","black": 1 1 1 1 2 1 1 1 1 1 ...
 $ host_gender : Factor w/ 2 levels "female","male": 1 2 2 1 2 1 1 1 1 2 ...
 $ multlistings: num  0 0 0 0 0 0 1 0 0 1 ...
 $ shared      : num  0 1 0 1 1 1 1 1 1 0 ...
 $ tenreviews  : num  1 1 0 0 1 0 0 0 0 1 ...
 $ price       : int  120 74 150 110 50 250 85 62 125 119 ...
 $ city        : Factor w/ 5 levels "Baltimore","Dallas",..: 2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:67] 817 834 2106 2401 2443 2470 2471 2580 2803 2967 ...
  ..- attr(*, "names")= chr [1:67] "817" "834" "2106" "2401" ...

Models for binary response

Example: Airbnb

  • We fit the models (linear probability, logit and probit models)

  • We compare the coefficients (those of the logit and probit models are multiplied by 0.25 and 0.4, respectively, to ease the comparison) and we report the ratio between the logit and probit coefficients (showing that it is close to 1.6)

  • For a 100% increase of the (log-) price, the probability of acceptance reduces by 4.16 percentage points

  • The estimated marginal effect for black guests is about -8.5 percentage points (we do not need to compute the derivatives because the covariate is a dummy)

fit_lprob <- lm(acceptance ~ guest_race + I(log(price)) + city, data = airbnb)
fit_logit <- glm(acceptance ~ guest_race + I(log(price)) + city, 
                 data = airbnb, family = binomial("logit"))
fit_probit <- glm(acceptance ~ guest_race + I(log(price)) + city, 
                  
                  data = airbnb, family = binomial("probit"))

round(cbind(coef(fit_lprob), coef(fit_logit) * 0.25, 
      coef(fit_probit)* 0.4, coef(fit_logit)/coef(fit_probit)),3)
                  [,1]   [,2]   [,3]  [,4]
(Intercept)      0.708  0.215  0.214 1.604
guest_raceblack -0.084 -0.085 -0.085 1.602
I(log(price))   -0.045 -0.047 -0.047 1.603
cityDallas       0.023  0.023  0.023 1.593
cityLos-Angeles  0.015  0.016  0.016 1.590
citySt-Louis     0.010  0.010  0.010 1.573
cityWashington  -0.037 -0.037 -0.037 1.611

Models for binary response

Example: Airbnb

  • It is better to compute directly the difference between the probabilities of acceptance, everything else being equal, which means here for a given price of the property (e.g., the average price, $182) and for the reference city (Baltimore). In the following, we compute an estimate of

\[P(Y_i=1|x_{i1} = 1, x_{i2} = 1, x_{i3} = \log(mean(price))) - P(Y_i=1|x_{i1} = 1, x_{i2} = 0, x_{i3} = \log(mean(price)))\]

  1. For the probit model: \[\Phi(0.536 - 0.213 - 0.116 \times \ln 182) - \Phi(0.536 - 0.116 \times \ln 182) \approx -0.083\]
  2. For the logit model: \[\Lambda(0.859 - 0.341 - 0.187 \times\ln 182) - \Lambda(0.859 - 0.187\times\ln 182) \approx -0.083\]
  • See next slide for the numerical values

Models for binary response

Example: Airbnb

  • At least in this example, the previous computation of the derivative gives an extremely accurate approximation of the effect of this dummy covariate
round(rbind(coef(fit_logit),  coef(fit_probit)),3)
     (Intercept) guest_raceblack I(log(price)) cityDallas cityLos-Angeles
[1,]       0.859          -0.341        -0.187      0.093           0.063
[2,]       0.536          -0.213        -0.116      0.059           0.040
     citySt-Louis cityWashington
[1,]        0.039         -0.150
[2,]        0.025         -0.093
pnorm(coef(fit_probit)[1] + coef(fit_probit)[2] + coef(fit_probit)[3] * log(182)) - 
  pnorm(coef(fit_probit)[1] + coef(fit_probit)[3] * log(182))
(Intercept) 
-0.08337756 
plogis(coef(fit_logit)[1] + coef(fit_logit)[2] + coef(fit_logit)[3] * log(182)) - 
  plogis(coef(fit_logit)[1] + coef(fit_logit)[3] * log(182))
(Intercept) 
-0.08328864 

Models for binary response

Latent variable structure

  • Provide a theoretical foundation for the probit/logit models (we will consider the case of a single covariate)

  • We observe that \(y\) is equal to 0 or 1, but we now assume that this value is related to a latent continuous variable (called \(y^*\)) which is unobserved

  1. \(y=1\) will result for “high” values of \(y ^ *\)
  2. \(y=0\) for low values of \(y ^ *\)


  • More specifically we’ll assume that the observation rule is:

\[ \left\{ \begin{array}{rcl} y = 0 & \mbox{if } & y ^ * \leq \psi \\ y = 1 & \mbox{if } & y ^ * > \psi \\ \end{array} \right. \]

where \(\psi\) is an unknown threshold

Models for binary response

Latent variable structure

  • Let’s assume that the value of \(y^ *\) is partly explained by an observable covariate \(x\)
  • The unexplained part is modelled by a random error \(\epsilon\)
  • We then have: \(y ^ * = \beta_1 + \beta_2 x + \epsilon\), so that the observation rule becomes:

\[ \left\{ \begin{array}{rcl} y = 0 & \mbox{if } & \epsilon \leq \psi - \beta_1 - \beta_2 x\\ y = 1 & \mbox{if } & \epsilon > \psi - \beta_1 - \beta_2 x \\ \end{array} \right. \]

  • This observation rule depends on \(\psi - \beta_1\) and not on the separate values of \(\psi\) and \(\beta_1\)
  • Therefore, \(\psi\) can be set to any arbitrary value, for example 0

Models for binary response

Latent variable structure

  • Then, the probability of success is: \(1 - F(-\beta_1 - \beta_2 x)\), where \(F\) is the cumulative distribution function of \(\epsilon\)
  • For example, if \(\epsilon \sim \mathcal{N} (0, \sigma^2)\), then \(\mbox{P}(Y = 1 \mid x) = 1 - \Phi(-(\beta_1 + \beta_2 x) / \sigma)\)
  1. We can see from this expression that only \(\beta_1 / \sigma\) and \(\beta_2 / \sigma\) can be identified, so that, \(\sigma\) can be set to any arbitrary value, for example 1
  2. Moreover, by the symmetry of the normal distribution, we have \(1 - \Phi(-z) = \Phi(z)\), so that the probability of success becomes \[\mbox{P}(Y = 1 \mid x) = \Phi(\beta_1 + \beta_2 x)\] which defines the probit model
  • Assuming that the distribution of \(\epsilon\) is logistic, we have a probability of success equal to \(\mbox{P}(Y = 1 \mid x) = 1 - \Lambda(- \beta_1-\beta_2 x)\) which reduces, as the logistic distribution is also symmetric, to: \[\mbox{P}(Y = 1 \mid x) = \Lambda(\beta_1 + \beta_2 x) = e^{\beta_1 + \beta_2 x} / (1 + e ^ {\beta_1 + \beta_2 x})\]
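  • A small simulation (with hypothetical parameter values) illustrating the latent variable structure: the probit fit recovers \(\beta_1/\sigma\) and \(\beta_2/\sigma\), not the original parameters
set.seed(42)
n <- 10000
x <- rnorm(n)
beta1 <- 0.5; beta2 <- 1.5; sigma <- 2
ystar <- beta1 + beta2 * x + rnorm(n, sd = sigma)  # latent variable
y <- as.numeric(ystar > 0)                         # observation rule with psi = 0
fit <- glm(y ~ x, family = binomial(link = "probit"))
rbind(probit = coef(fit), scaled_truth = c(beta1, beta2) / sigma)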

Models for binary response

Random utility model

  • Consider now that we can define a utility function for the two alternatives that correspond to the two values of the binary response
  • As an example, \(y\) equals 1 or 0 if car or public transport is chosen, and the only covariate \(x\) is the generalized cost
  • The utility of choosing a transport mode does not depend only on the generalized cost, but also on some other unobserved variables. The effect of these variables is modelled as the realization of a random variable \(\epsilon\)
  • We can therefore define the following random utility functions:

\[ \left\{ \begin{array}{rcl} U_0 &=& \beta_{01}+ \beta_2 x_0 + \epsilon_0 \\ U_1 &=& \beta_{11} + \beta_2 x_1 + \epsilon_1 \end{array} \right. \]

where \(\beta_2\) is the marginal utility of $1.

Models for binary response

Random utility model

  • The choice of the individual is deterministic: They will choose the car if the utility of this mode is greater than the utility of public transport
  • Then, we have the following observation rule:

\[ \left\{ \begin{array}{rcl} y = 0 & \mbox{if } & \epsilon_1 - \epsilon_0 \leq - (\beta_{11} - \beta_{01}) - \beta_2 (x_1 - x_0)\\ y = 1 & \mbox{if } & \epsilon_1 - \epsilon_0 > - (\beta_{11} - \beta_{01}) - \beta_2 (x_1 - x_0)\\ \end{array} \right. \]

  • Denoting \(\epsilon = \epsilon_1 - \epsilon_0\) the error difference, \(\beta_1 = \beta_{11} - \beta_{01}\) and \(x = x_1 - x_0\) the difference of generalized cost for the two modes, we have:

\[ \left\{ \begin{array}{rcl} y = 0 & \mbox{if } & \epsilon \leq - (\beta_1 + \beta_2 x)\\ y = 1 & \mbox{if } & \epsilon > - (\beta_1 + \beta_2 x)\\ \end{array} \right. \]

  • The probability of “success” (here choosing the car) is therefore \(\mbox{P}(Y = 1 \mid x) = 1 - F(- \beta_1 - \beta_2 x)\), with \(F\) the cumulative distribution function of \(\epsilon\). If the distribution is symmetric, this probability reduces once again to \(\mbox{P}(Y = 1 \mid x) = F(\beta_1 + \beta_2 x)\), and the probit or logit models are obtained by choosing the normal or the logistic distribution, respectively
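  • A small simulation (with hypothetical parameter values) of the random utility model: drawing type I extreme value (Gumbel) errors for each alternative makes the error difference logistic, so a logit fit on the covariate difference recovers \(\beta_1 = \beta_{11} - \beta_{01}\) and \(\beta_2\)
set.seed(42)
n <- 10000
x0 <- rnorm(n); x1 <- rnorm(n)               # generalized costs of the two modes
rgumbel <- function(n) -log(-log(runif(n)))  # type I extreme value draws
beta01 <- 0.2; beta11 <- 0.7; beta2 <- -1
u0 <- beta01 + beta2 * x0 + rgumbel(n)       # utility of alternative 0
u1 <- beta11 + beta2 * x1 + rgumbel(n)       # utility of alternative 1
y <- as.numeric(u1 > u0)                     # alternative 1 chosen if its utility is higher
coef(glm(y ~ I(x1 - x0), family = binomial("logit")))  # close to 0.5 and -1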

Generalized linear models

Binomial models are a case of generalized linear models

  • The generalized linear models (GLM) are a wide family of models that are intended to extend the linear model

  • These models have the following components:

  1. a random component which specifies the distribution of the response, as a member of the exponential family, and in particular the expected value \(\mbox{E}(Y_i) = \mu_i\)
  2. a systematic component: some covariates \(x_1 (=1), x_2, \ldots, x_p\) produce a linear predictor \(\eta_i = x^\top_i \beta\)
  3. the link function \(g\) which specifies the relation between the random and the systematic components: \(\eta_i = g(\mu_i)\)

Exponential family

  • The exponential family is defined by the following density function:

\[ f(y;\theta,\phi) = e ^ {\displaystyle\left(y\theta - b(\theta)\right)/\phi + c(y, \phi)} \]

with \(\theta\) being the canonical (location) parameter and \(\phi\) the dispersion (scale) parameter

Generalized Linear Models

Linear models

  • Linear models are a specific case of generalized linear models with a normal distribution and an identity link

  • We have in this case the following density function:

\[ \phi(y;\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e ^ {-\frac{1}{2}\frac{(y - \mu)^2}{\sigma ^ 2}}= e^{\frac{y\mu - 0.5 \mu ^ 2}{\sigma ^ 2}- 0.5 y ^ 2 / \sigma ^ 2 - 0.5 \ln(2\pi\sigma ^ 2)} \]

which is a member of the exponential family with \(\theta = \mu\), \(\phi = \sigma^ 2\), \(b(\theta) = \theta ^ 2/2\) and \(c(y, \phi) = -(y ^ 2 / \phi + \ln(2\pi\phi))/2\)

  • For the Gaussian model with an identity link, we have, for a given set of estimates (which leads to the proposed model): \(\hat{\mu}_i = \hat{\eta}_i = x^\top_i \hat \beta\) and the log-likelihood function is:

\[ \ln L(y, \hat{\mu}) = - \frac{n}{2}\ln(2 \pi \sigma ^ 2) - \frac{1}{2\sigma ^2} \sum_{i=1} ^ n (y_i - \hat{\mu}_i) ^ 2 \]

Generalized Linear Models

Linear models

  • For a hypothetical “perfect” or saturated model with a perfect fit, we would have \(\hat{\mu}_i = y_i\), so that the log-likelihood would be \(- \frac{n}{2}\ln(2 \pi \sigma ^ 2)\)

  • Minus two times the difference of these two values of the log likelihood function is called the scaled deviance of the proposed model: \[ D^*(y;\hat\mu) = \sum_{i=1} ^ n\frac{(y_i - \hat{\mu}_i) ^ 2}{\sigma ^ 2} \]

  • The deviance is obtained by multiplying the scaled deviance by \(\sigma ^ 2\)

\[ D(y; \hat{\mu}) = \sum_{i=1} ^ n(y_i - \hat{\mu}_i) ^ 2 \]

which is simply, for the linear model, the sum of squared residuals
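  • A quick check on the linear-probability model fitted earlier (linprob): for lm objects, deviance() returns the sum of squared residuals
c(deviance = deviance(linprob), ssr = sum(resid(linprob)^2))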

Generalized linear models

Binomial models

  • The probability mass function is given by the probability of success \(\mu\) (also denoted with \(p\) previously) if \(y = 1\) and by the probability of failure \(1 - \mu\) if \(y=0\)
  • This probability can be compactly written as \[f_Y(y; \mu) = P(Y=y; \mu) = \mu ^ y (1 - \mu) ^ {1 - y}\] or as:

\[ f(y;\mu) = e ^ {y \ln \mu + (1 - y) \ln(1 - \mu)}=e ^ {y \ln \frac{\mu}{1 - \mu} + \ln(1 - \mu)}= e^{y\theta - \ln (1 + e ^ \theta)} \]

which is a member of the exponential family with:

  1. \(\theta = \ln\frac{\mu}{1 -\mu}\)
  2. \(b(\theta) = \ln(1 + e ^ \theta)\)
  3. \(c(y, \phi) = 0\)
  4. \(\phi=1\)

Generalized linear models

Binomial models

  • The model is fully characterized once the link is specified

  • For the logit model, we have \(\mu = \frac{e ^ \eta}{1+e ^ \eta}\), so that \(\eta = \ln \frac{\mu}{1 - \mu} = g(\mu)\)

  • We then have \(\theta = \eta\), so that the logit link is called the canonical link for binomial models

  • For the saturated model, \(\hat{\mu}_i = y_i\): each observation is then predicted with probability one, so the log-likelihood of the saturated model is zero and the deviance reduces to minus twice the log-likelihood of the proposed model:

\[ D(y;\hat{\mu}) = -2 \sum_{i=1} ^ n \left(y_i \ln \hat{\mu}_i + (1 - y_i) \ln(1 - \hat{\mu}_i)\right) \]

  • The null model is a model with only an intercept: in this case, \(\hat{\mu}_i = \hat{\mu}_0\) and the maximum likelihood estimate of \(\mu_0\) is \(\sum_{i=1} ^ n y_i / n\), i.e., the share of successes in the sample

  • The deviance of this model is called the null deviance
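  • A quick check of the deviance and null deviance formulas, using the logit model fitted earlier on the mode_choice data
mu_hat <- fitted(logit)
y <- mode_choice$mode
c(-2 * sum(y * log(mu_hat) + (1 - y) * log(1 - mu_hat)), deviance(logit))        # deviance
c(-2 * sum(y * log(mean(y)) + (1 - y) * log(1 - mean(y))), logit$null.deviance)  # null deviance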

Generalized linear models

Binomial models

  • An alternative to the deviance as a measure of the fit of the model is the generalized Pearson statistic, defined as:

\[ X ^ 2 = \sum_{i=1} ^ n \frac{(y_i - \hat{\mu}_i) ^ 2}{{\hat V} (\hat{\mu}_i)}= \sum_{i=1} ^ n \frac{(y_i - \hat{\mu}_i) ^ 2}{\hat{\mu}_i(1 - \hat{\mu}_i)} \]

Generalized linear models

Residuals under the linear models

  • In the linear model, residuals have several interesting properties:
  1. they are homoscedastic (or at least they may be homoscedastic if the variance of the conditional distribution of the response is constant)
  2. they have an intuitive meaning, as they are the difference between the actual and the fitted values of the response, the latter being an estimate of the conditional mean of the response
  3. they are related to the value of the objective function, which is the sum of square residuals

Generalized linear models

Residuals for GLMs: Pearson’s residuals

  • The most obvious definition of the residuals for binomial models is the response residuals, which are simply the difference between the response and the prediction of the model (the fitted probability of success \(\hat{\mu}\))

  • However, these residuals (\(y_i - \hat{\mu}_i\)) are necessarily heteroscedastic, as the variance of \(Y_i\) is \(\mu_i(1 - \mu_i)\)

  • Scaling the response residuals by their standard deviation leads to Pearson’s residuals: \[\frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i(1 - \hat{\mu}_i)}}\]

  • The sum of squares of Pearson’s residuals is the generalized Pearson statistic

Generalized linear models

Residuals for GLMs: Deviance residuals

  • The deviance residuals are such that the sum of their squares equals the deviance statistic \(D\)
  • They are defined by: \[(2 y_i - 1) \sqrt{-2\left(y_i \ln \hat{\mu}_i + (1 - y_i) \ln (1 - \hat{\mu}_i)\right)}\]

the term \(2 y_i - 1\) gives a positive sign to the residuals of observations for which \(y_i = 1\) and a negative sign for those with \(y_i = 0\), as for the two other types of residuals
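  • A quick check that the sums of squared Pearson and deviance residuals reproduce the generalized Pearson statistic and the deviance (using the logit model fitted earlier)
mu_hat <- fitted(logit)
y <- mode_choice$mode
pearson <- (y - mu_hat) / sqrt(mu_hat * (1 - mu_hat))           # Pearson residuals by hand
c(sum(pearson^2), sum(residuals(logit, type = "pearson")^2))    # generalized Pearson statistic
c(sum(residuals(logit, type = "deviance")^2), deviance(logit))  # deviance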

Generalized linear models

Estimation with glm()

  • We have already seen that the R routine to fit a logit/probit model is based on the glm() function. When fitting a generalized linear model, three models are considered:
  1. the saturated model, for which there is one parameter for every observation and a perfect fit; therefore the log-likelihood, the deviance and the number of degrees of freedom are 0
  2. the null model, with only one estimated coefficient and \(n-1\) degrees of freedom
  3. the proposed model, with \(p\) estimated parameters, and therefore \(n - p\) degrees of freedom.

Generalized linear models

Estimation with glm()

  • Let’s see the summary output: it reports the deviance of the null model and of the proposed model (the latter is called the residual deviance), along with their respective degrees of freedom
summary(logit)

Call:
glm(formula = mode ~ gcost, family = binomial(link = "logit"), 
    data = mode_choice)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.39048    0.09977  13.937   <2e-16 ***
gcost        0.21122    0.02478   8.524   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 741.33  on 841  degrees of freedom
Residual deviance: 647.39  on 840  degrees of freedom
AIC: 651.39

Number of Fisher Scoring iterations: 5

Generalized linear models

Estimation with glm()

  • The residual deviance can be extracted by accessing the glm object
logit$deviance
[1] 647.3874
logit$df.residual
[1] 840
# Equivalently 
deviance(logit)
[1] 647.3874
df.residual(logit)
[1] 840

Generalized linear models

Estimation with glm()

  • The null deviance can also be extracted by accessing the glm object
  • We can see that it is equivalent to fitting a model including only an intercept
logit$null.deviance
[1] 741.332
logit$df.null
[1] 841
summary(glm(mode ~ 1, mode_choice, family = "binomial"))

Call:
glm(formula = mode ~ 1, family = "binomial", data = mode_choice)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.65576    0.09392   17.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 741.33  on 841  degrees of freedom
Residual deviance: 741.33  on 841  degrees of freedom
AIC: 743.33

Number of Fisher Scoring iterations: 3

Generalized linear models

Estimation with glm()

  • The residuals can be extracted from the fitted model using resid
  • The resid method for glm objects has a type argument which can take one of the following three values
  1. “response”
  2. “pearson”
  3. “deviance”
  • However, the most interesting plot to consider is based on the binned residuals (not shown here)
head(resid(logit, type = "response"))
          1           2           3           4           5           6 
 0.03651330 -0.55944482  0.23717556 -0.66782129  0.04603174 -0.80628781 
head(resid(logit, type = "pearson"))
         1          2          3          4          5          6 
 0.1946716 -1.1268821  0.5575999 -1.4178955  0.2196654 -2.0401710 
head(resid(logit, type = "deviance"))
         1          2          3          4          5          6 
 0.2727512 -1.2804059  0.7358361 -1.4846428  0.3070012 -1.8118398 

Generalized linear models

Estimation with glm()

  • The fitted values of the model can be expressed on the scale of the linear predictor or the response
head(predict(logit, type = "link"))
        1         2         3         4         5         6 
3.2728820 0.2389092 1.1682273 0.6983475 3.0312992 1.4260673 
head(predict(logit, type = "response"))
        1         2         3         4         5         6 
0.9634867 0.5594448 0.7628244 0.6678213 0.9539683 0.8062878 
head(predict(logit, type = "link", newdata = data.frame(gcost = 5)))
       1 
2.446591 
head(predict(logit, type = "response", newdata = data.frame(gcost = 5)))
        1 
0.9203118 

Generalized linear models

Model estimation

  • The stats::glm function uses an iteratively reweighted least squares (IRLS) algorithm to fit GLMs
  • Probit and logit models are usually estimated by maximum likelihood
  • Given the log-likelihood, its derivatives (up to the second order) are obtained and used in algorithms (such as Fisher scoring or Newton-Raphson) to compute the maximum likelihood estimates of the model parameters
  • The second-order derivative is used to compute the (estimated) standard errors of the regression coefficient estimators, using the asymptotic distribution of the maximum likelihood estimator
  • We do not go into the details, but it is important to know that other estimators of the variance-covariance matrix of the regression coefficients can be used

Generalized linear models

Model evaluation

  • Once several models are estimated, the evaluation and the selection process of one of them can be based on several indicators
  1. The value of the objective function, which is the log-likelihood
  2. Closely related to the log-likelihood is the deviance, which is minus twice the log-likelihood (the saturated log-likelihood being zero for binary response models)
  • These measures typically show a sizeable improvement when moving from the constrained to the unconstrained model

  • However, such a comparison is misleading, because adding further covariates, even irrelevant ones, necessarily improves the fit of the model

  • Therefore, it is important to consider indicators that penalize highly parametrized models. The two most popular indicators are the Akaike and the Bayesian information criteria (AIC and BIC):

  1. \(\mbox{AIC} = - 2 \ln L + 2 p\)
  2. \(\mbox{BIC} = - 2 \ln L + p \ln n\).
  • They are obtained by augmenting the deviance by a term which is a multiple of the number of fitted parameters

  • The rule is to select the model with the lowest value of the criterion

Generalized linear models

Model estimation and evaluation

  • Let’s consider the example of the mode_choice:
  1. we assign to \(y\) and to \(X\), respectively, the outcome and the model matrix
  2. we implement the log-likelihood
  3. we set the starting values
  4. we optimize the log-likelihood using numerical optimization (minimizing the negative log-likelihood)
  5. we analyze the estimates and compare them with those obtained via the glm function
  6. we extract the standard errors using the observed information matrix (the negative of the Hessian matrix) and compare them with those obtained via the glm function
  7. we evaluate the log-likelihood at the maximum
  8. we obtain the deviance and the AIC

Generalized linear models

Log-likelihood and needed quantities

  1. we assign to \(y\) and to \(X\), respectively, the outcome and the model matrix
  2. we implement the log-likelihood
y <- mode_choice$mode
X <- model.matrix(~ gcost, data = mode_choice)

loglik_logit <- function(beta, X, y) {
  eta <- X %*% beta
  p <- exp(eta) / (1 + exp(eta))
  sum(y * log(p) + (1 - y) * log(1 - p))
}

Generalized linear models

Maximum likelihood estimates

  1. we set the starting values
  2. we optimize the log-likelihood using numerical optimization (minimizing the negative log-likelihood)
  3. we analyze the estimates and compare them with those obtained via the glm function
beta_start <- rep(0, ncol(X))

opt <- optim(
  par = beta_start,
  fn  = function(b) -loglik_logit(b, X, y),
  hessian = TRUE,
  method = "BFGS"
)

beta_hat <- opt$par
cbind(beta_hat, logit$coefficients)
             beta_hat          
(Intercept) 1.3904764 1.3904762
gcost       0.2112228 0.2112231

Generalized linear models

Standard Errors

  1. we extract the standard errors using the observed information matrix (the negative of the Hessian matrix) and compare them with those obtained via the glm function
H_obs <- opt$hessian
vcov_obs <- solve(H_obs)
se_obs <- sqrt(diag(vcov_obs))

cbind(se_obs, summary(logit)$coefficients[,2])
                se_obs           
(Intercept) 0.09977194 0.09977098
gcost       0.02478110 0.02478069
p_hat <- 1 / (1 + exp(-X %*% beta_hat))
W <- diag(as.vector(p_hat * (1 - p_hat)))

Fisher <- t(X) %*% W %*% X
vcov_fisher <- solve(Fisher)
se_fisher <- sqrt(diag(vcov_fisher))

cbind(se_fisher, summary(logit)$coefficients[,2])
             se_fisher           
(Intercept) 0.09977194 0.09977098
gcost       0.02478121 0.02478069

Generalized linear models

Log-likelihood, deviance and Information Criteria

  1. we evaluate the log-likelihood at the maximum
  2. we obtain the deviance and the AIC
cbind(logLik(logit), loglik_logit(beta_hat, X, y))
          [,1]      [,2]
[1,] -323.6937 -323.6937
cbind(-2*loglik_logit(beta_hat, X, y), logit$deviance)
         [,1]     [,2]
[1,] 647.3874 647.3874
cbind(- 2*loglik_logit(beta_hat, X, y) + 4, AIC(logit))
         [,1]     [,2]
[1,] 651.3874 651.3874

Generalized linear models

Example: Model comparison

  • Let’s fit the logit and probit regression models to the mode choice data and compare the specifications including
  1. the generalized cost (gcost)
  2. the monetary cost (cost) and the in- and out-vehicle times (ivtime and ovtime)
logit2 <- glm(mode ~ cost + ivtime + ovtime, data = mode_choice, family = binomial("logit"))
probit2 <- glm(mode ~ cost + ivtime + ovtime, data = mode_choice, family = binomial("probit"))

cbind(summary(logit)$coefficients[,1:2], summary(probit)$coefficients[,1:2])
             Estimate Std. Error  Estimate Std. Error
(Intercept) 1.3904762 0.09977098 0.8211603 0.05634293
gcost       0.2112231 0.02478069 0.1128718 0.01305510
cbind(summary(logit2)$coefficients[,1:2], summary(probit2)$coefficients[,1:2])
             Estimate Std. Error   Estimate Std. Error
(Intercept) 1.0618581  0.1949694 0.66376405 0.10919704
cost        0.1562312  0.0355831 0.08553217 0.01953444
ivtime      0.5445625  0.4545989 0.30822412 0.23822344
ovtime      4.7595607  0.9582451 2.38034964 0.49523659

Generalized linear models

Example: Model comparison

  • Let’s fit the logit and probit regression models to the mode choice data and compare the specifications including
  1. the generalized cost (gcost)
  2. the monetary cost (cost) and the in- and out-vehicle times (ivtime and ovtime)
cbind(logLik(logit), logLik(logit2), logLik(probit), logLik(probit2))
          [,1]      [,2]      [,3]      [,4]
[1,] -323.6937 -317.3219 -324.2697 -318.7685
cbind(AIC(logit), AIC(logit2), AIC(probit), AIC(probit2))
         [,1]     [,2]     [,3]    [,4]
[1,] 651.3874 642.6438 652.5393 645.537
cbind(BIC(logit), BIC(logit2), BIC(probit), BIC(probit2))
         [,1]     [,2]     [,3]     [,4]
[1,] 660.8589 661.5869 662.0109 664.4802
cbind(logit$deviance, logit2$deviance, probit$deviance, probit2$deviance)
         [,1]     [,2]     [,3]    [,4]
[1,] 647.3874 634.6438 648.5393 637.537
cbind(logit$null.deviance, logit2$null.deviance, probit$null.deviance, probit2$null.deviance)
        [,1]    [,2]    [,3]    [,4]
[1,] 741.332 741.332 741.332 741.332

Generalized linear models

Testing: Tests of nested models

  • To test hypotheses involving nested models, three asymptotically equivalent tests are commonly used:
  1. Likelihood Ratio (LR) Test: The LR test is based on the comparison of the maximized log-likelihoods of the restricted (R) and unrestricted (U) models:

\[LR = 2\left(\ell(\hat \beta^U) - \ell(\hat \beta^R)\right)\]

  2. Wald Test: The Wald test relies only on the unrestricted estimator and tests whether the parameter restrictions are satisfied:

\[W = (\hat \beta^U - \beta_0)^\top \mbox{Var}(\hat \beta^U)^{-1} (\hat \beta^U - \beta_0)\]

where \(\beta_0\) satisfies the null restrictions (e.g., \(\beta_0=0\))

  3. Score Test: The Score test is based only on the restricted model and evaluates whether the slope of the log-likelihood at the restricted estimate is significantly different from zero:

\[\mbox{Score} = S(\hat \beta^R)^\top I(\hat \beta^R)^{-1} S(\hat \beta^R)\]

where \(S(\beta)=\partial \ell (\beta)/\partial \beta\) is the score vector and \(I(\beta)\) is the Fisher information matrix (the expected value of the negative Hessian)

  • Under the null hypothesis, all three test statistics converge in distribution to a \(\chi^2(q)\), where \(q\) is the number of restrictions

Generalized linear models

Testing: Tests of nested models

  • We estimated a model with the generalized cost (\(g\)) as a unique covariate, which was computed as: \(g_i = c_i + 8(i_i + o_i)\), where \(c\), \(i\), and \(o\) are the differences in monetary cost, in-vehicle time and out-vehicle time, based on the hypothesis that time value was $8 per hour

  • The unconstrained model for the probit case is: \[ P(Y_i = 1) = \Phi(\beta_0 + \beta_c c_i + \beta_i i_i + \beta_o o_i) \]

  • The constrained model implies the two following restrictions: \(H_0: \beta_o = \beta_i = 8 \beta_c\). It is more convenient to rewrite the model so that, under \(H_0\), a subset of the parameters equals 0:

\[ \begin{array}{rcl} P(Y_i = 1) &=& \Phi\left(\beta_0 + \beta_c \left(c_i + 8 (i_i + o_i)\right) + (\beta_i - 8\beta_c)i_i + (\beta_o - 8 \beta_c) o_i\right)\\ &=& \Phi(\beta_0 + \beta_c g_i + \beta_i'i_i + \beta_o'o_i) \end{array} \]

where \(\beta_i' = (\beta_i - 8\beta_c)\) and \(\beta_o' = (\beta_o - 8\beta_c)\) are the reduced form parameters of the binomial regression with the generalized cost, the in-vehicle and out-vehicle time as covariates

  • With this parametrization, the set of hypotheses is simply \(\beta_i' = \beta_o' = 0\)

Generalized linear models

Testing: Tests of nested models

  • The two statistics are very close and, in both cases, we reject the null hypothesis of the constrained (generalized cost) specification
library(lmtest)
lrtest(probit, probit2)
Likelihood ratio test

Model 1: mode ~ gcost
Model 2: mode ~ cost + ivtime + ovtime
  #Df  LogLik Df  Chisq Pr(>Chisq)   
1   2 -324.27                        
2   4 -318.77  2 11.002   0.004082 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::linearHypothesis(probit2, 
                      c("ivtime = 8 * cost", "ovtime = 8 * cost"))

Linear hypothesis test:
- 8 cost  + ivtime = 0
- 8 cost  + ovtime = 0

Model 1: restricted model
Model 2: mode ~ cost + ivtime + ovtime

  Res.Df Df  Chisq Pr(>Chisq)   
1    840                        
2    838  2 10.986   0.004115 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
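  • The Wald statistic can also be reproduced by hand from the quadratic form given earlier; a minimal sketch, where the restriction matrix encodes ivtime - 8 cost = 0 and ovtime - 8 cost = 0 (the result should match the chi-squared statistic reported by car::linearHypothesis above)
R <- rbind(c(0, -8, 1, 0),   # ivtime - 8 * cost = 0
           c(0, -8, 0, 1))   # ovtime - 8 * cost = 0
b <- coef(probit2)
V <- vcov(probit2)
W <- drop(t(R %*% b) %*% solve(R %*% V %*% t(R)) %*% (R %*% b))
c(chisq = W, p_value = pchisq(W, df = 2, lower.tail = FALSE))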

Generalized linear models

Testing: Conditional moment tests

  • We can use three tests (included in the micsr package) to assess
  1. the normality of the latent error term (under the probit specification)
  2. the homoscedasticity of the latent error term
  3. whether non-linear terms or relevant variables have been omitted
  • Our probit model seems to be correctly specified, as none of the three hypotheses is rejected
cmtest(probit2, test = "normality") 

    Conditional Expectation Test for Normality

data:  mode ~ cost + ivtime + ovtime
chisq = 4.6997, df = 2, p-value = 0.09538
cmtest(probit2, test = "heterosc")

    Heteroscedasticity Test

data:  mode ~ cost + ivtime + ovtime
chisq = 4.129, df = 3, p-value = 0.2479
cmtest(probit2, test = "reset") 

    Reset test

data:  mode ~ cost + ivtime + ovtime
chisq = 3.2182, df = 2, p-value = 0.2001