Regression aims to describe a dependent variable \(y\) through some relevant predictors \(x\), linking them through a function \(f(\cdot)\):
\[y = f(x)+\varepsilon,\] where \(\varepsilon\) is the noise implicit in the modelling. Checking the assumptions of a regression model is not just a theoretical matter: it is closely related to scientific reproducibility, transparency and the power of visualization. Although it is impossible to write down an exhaustive list of ‘what to do for checking a model’, I collect here some initial steps for visualizing and inspecting the sign of your variables, possible correlations between them, and the inclusion of further predictors. The suggested steps refer to the classical linear model, but they contain principles valid for statistical modelling in general.
Before fitting a model
Examine the distribution of each explanatory variable and of the dependent variable, looking for skewness and the presence of outliers. When a distribution is skewed, consider a symmetrizing transformation.
Examine the scatterplot matrix involving all the explanatory variables, checking for non-linearity. If some pairwise plot shows non-linearity, consider a possible transformation.
Examine the range of the variables.
Examine the degree of correlation between the explanatory variables.
After fitting a model
Plot the residuals against the fitted values and check for patterns in the residuals (homoscedasticity).
Examine Cook’s distance for possible influential outliers (a minimal sketch of both sets of checks is given below).
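As a minimal sketch of how these pre- and post-fit checks might look in R, here they are applied to the built-in cars dataset, used purely for illustration (the same steps are carried out on the nihills data in the remainder of these notes):
# Illustrative pre-fit checks on the built-in cars data (speed, dist)
hist(cars$speed); hist(cars$dist)    # distributions, skewness, outliers
pairs(cars)                          # pairwise scatterplots, non-linearity
sapply(cars, range)                  # ranges of the variables
cor(cars)                            # correlations between the variables
# Illustrative post-fit checks
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), residuals(fit)); abline(h = 0, lty = "dashed")  # residuals vs fitted
plot(cooks.distance(fit), type = "h")                             # Cook's distance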
The dataset nihills of the DAAG package contains data from the 2007 calendar for the Northern Ireland Mountain Running Association. The number of observations (races) is \(n = 23\), and for each of them we record the following variables: the distance in miles (dist), the amount of climb in feet (climb), the record time in hours for males (time) and the record time in hours for females (timef).
library(DAAG)
data(nihills)
n <- nrow(nihills)
Objective: we want to regress the times on some predictors. The data structure is shown below.
 | dist | climb | time | timef |
---|---|---|---|---|
Binevenagh | 7.5 | 1740 | 0.8583333 | 1.0644444 |
Slieve Gullion | 4.2 | 1110 | 0.4666667 | 0.6230556 |
Glenariff Mountain | 5.9 | 1210 | 0.7030556 | 0.8869444 |
Donard & Commedagh | 6.8 | 3300 | 1.0386111 | 1.2141667 |
McVeigh Classic | 5.0 | 1200 | 0.5411111 | 0.6375000 |
Tollymore Mountain | 4.8 | 950 | 0.4833333 | 0.5886111 |
Slieve Martin | 4.3 | 1600 | 0.5505556 | 0.7016667 |
Moughanmore | 3.0 | 1500 | 0.4636111 | 0.6475000 |
Hen & Cock | 2.5 | 1500 | 0.4497222 | 0.6075000 |
Annalong Horseshoe | 12.0 | 5080 | 1.9491667 | 2.4805556 |
Monument Race | 4.0 | 1000 | 0.4716667 | 0.5947222 |
Loughshannagh Horseshoe | 4.3 | 1700 | 0.6469444 | 0.8822222 |
Rocky | 4.0 | 1300 | 0.5230556 | 0.6652778 |
Meelbeg Meelmore | 3.5 | 1800 | 0.4544444 | 0.6086111 |
Donard Forest | 4.5 | 1400 | 0.5186111 | 0.6433333 |
Slieve Donard | 5.5 | 2790 | 0.9483333 | 1.2086111 |
Flagstaff to Carling | 11.0 | 3000 | 1.4569444 | 2.0344444 |
Slieve Bearnagh | 4.0 | 2690 | 0.6877778 | 0.7991667 |
Seven Sevens | 18.9 | 8775 | 3.9027778 | 5.9855556 |
Lurig Challenge | 4.0 | 1000 | 0.4347222 | 0.5755556 |
Scrabo Hill Race | 2.9 | 750 | 0.3247222 | 0.4091667 |
Slieve Gallion | 4.6 | 1440 | 0.6361111 | 0.7494444 |
BARF Turkey Trot | 5.7 | 1430 | 0.7130556 | 0.9383333 |
Suppose for this first example we use only the predictor dist to regress the dependent variable time. Let us have a quick look at the data:
# Scatterplot: dist against time
plot(nihills$dist, nihills$time, pch=19, xlab="distance", ylab="time")
# Alternatively
# with(nihills, plot(dist, time, pch=19))
Apart from one, all the recorded times are below 2 hours. As an initial attempt, let’s fit a simple linear model, plot the fitted values and manually compute the \(R^{2}\): \[time_i =\beta_0+ \beta_1 dist_i+\varepsilon_i, \ \ i=1,\ldots,n, \quad \varepsilon_i\sim N(0,\sigma^2).\]
# Fit the linear model 1: time = beta_0+beta_1*dist + epsilon
lm1 <- lm(time ~ dist, data=nihills)
#See the output
summary(lm1)
##
## Call:
## lm(formula = time ~ dist, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42812 -0.12193 -0.00250 0.09232 0.43028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.32530 0.07629 -4.264 0.000345 ***
## dist 0.20094 0.01120 17.934 3.28e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1935 on 21 degrees of freedom
## Multiple R-squared: 0.9387, Adjusted R-squared: 0.9358
## F-statistic: 321.6 on 1 and 21 DF, p-value: 3.278e-14
##plot of the residuals
par(mfrow=c(2,2))
plot(lm1)
##plot of the fitted values
par(mfrow=c(1,1))
with(nihills, plot(dist, time, pch=19))
abline(coef(lm1), col="red", lty="solid")
# or
#curve(predict(lm1, data.frame(dist=x)), col="red", lty="solid", lwd=2, add=TRUE)
text(13,3,expression(time==hat(beta)[0]+hat(beta)[1]*dist), col="red")
points(nihills$dist, predict(lm1), col="red", pch=19, cex=0.8)
# or
#points(nihills$dist, fitted(lm1), col="red", pch=19, cex=0.8)
nihills[17,]
## dist climb time timef
## Flagstaff to Carling 11 3000 1.456944 2.034444
segments(nihills[17,]$dist,nihills[17,]$time,
nihills[17,]$dist,fitted(lm1)[17], lty="dashed")
# or
#segments(nihills[17,"dist"],nihills[17,"time"],
# nihills[17,"dist"],fitted(lm1)[17], lty="dashed")
# or
#segments(nihills[17,1],nihills[17,3],
# nihills[17,1],fitted(lm1)[17], lty="dashed")
# Compute the R^2 by hand
# R_sq
Tot_SS <- with(nihills, sum((time-mean(time))^2))
Res_SS <- with(nihills, sum((predict(lm1)-time)^2))
Mod_SS <- with(nihills, sum((predict(lm1)-mean(time))^2))
R_sq <- 1-Res_SS/Tot_SS
R_sq
## [1] 0.9387112
# Similarly
Mod_SS/Tot_SS
## [1] 0.9387112
# In the simple linear regression (one predictor) the R-square is
# equal to the square of the correlation coefficient
with(nihills, cor(time, dist))^2
## [1] 0.9387112
The value computed here coincides with the one reported in the R summary output.
The fit seems good enough. However, the relationship between time and dist in the first plot above does not seem to be purely linear, and the residual plots suggest a lack of fit for a few points.
Exercise 1: Obtain the quantities in the output of the summary function by hand.
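A possible starting point (a sketch of one approach, not a full solution; the helper objects X, y, beta_hat and sigma_hat are introduced only for this hint) uses the design-matrix formulas:
# Hint: OLS estimates and residual standard error from the design matrix
X <- model.matrix(~ dist, data = nihills)
y <- nihills$time
beta_hat <- solve(t(X) %*% X, t(X) %*% y)                 # coefficient estimates
sigma_hat <- sqrt(sum((y - X %*% beta_hat)^2) / (n - 2))  # residual standard error
beta_hat
sigma_hat
# Standard errors, t values and p-values follow from sigma_hat^2 * solve(t(X) %*% X)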
Checking the model fit is of vital importance, and many graphical tools are available for this purpose.
We may start with a scatterplot matrix, focusing now on the male results, i.e. the dependent variable time in the nihills dataset.
# Scatterplot matrix
library(PerformanceAnalytics)
chart.Correlation(nihills[,c("dist", "climb", "time")])
There seems to be a linear relationship between the variables climb and dist, apart from a possible outlier. Now we may fit a first multiple linear model, with the dependent variable time, denoted by \(y\), explained by dist and climb:
\[ time_{i} = \beta_0+\beta_1 dist_i+\beta_2 climb_i+\varepsilon_i,\ \ \ i=1,\ldots,n, \ \ \varepsilon_i\sim \mathcal{N}(0,\sigma^2).\]
# Fit the Model 2: time= beta0+beta1*dist+beta2*climb
lm2 <- lm(time ~ dist + climb, data = nihills)
# see the output
summary(lm2)
##
## Call:
## lm(formula = time ~ dist + climb, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19857 -0.04824 0.01701 0.05539 0.21083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.286e-01 4.025e-02 -5.679 1.47e-05 ***
## dist 1.008e-01 1.382e-02 7.293 4.72e-07 ***
## climb 2.298e-04 2.893e-05 7.941 1.31e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0973 on 20 degrees of freedom
## Multiple R-squared: 0.9852, Adjusted R-squared: 0.9838
## F-statistic: 667.6 on 2 and 20 DF, p-value: < 2.2e-16
We may also produce some diagnostic plots:
# Plot the residuals
par(mfrow = c(2,2))
plot(lm2)
From the first plot (top left) and the third plot (bottom left) it emerges clearly that the variability is not constant, while the race Seven Sevens is an outlier according to the QQ plot (top right) and to the bottom-right plot.
Let us see the estimates:
# Extract the estimates
summary(lm2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2285705228 4.024640e-02 -5.679279 1.473367e-05
## dist 0.1007603621 1.381667e-02 7.292669 4.724521e-07
## climb 0.0002297608 2.893385e-05 7.940899 1.306900e-07
# or only the coefficient estimates
coef(lm2)
## (Intercept) dist climb
## -0.2285705228 0.1007603621 0.0002297608
# R-squared
summary(lm2)$adj.r.squared
## [1] 0.9837661
According to the summary, all the coefficients are statistically significant, and the adjusted \(R^2\) is 0.9838. The estimated equation is then:
\[ \hat{time}_{i}= -0.2286 + 0.1008\, dist_i+ 2.3\times 10^{-4}\, climb_i.\]
For each additional mile of dist, the predicted time increases by roughly 0.1 hours (about 6 minutes).
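As a small check of this interpretation (an illustrative calculation with two hypothetical races, whose distances and climb are invented for this example), we can compare the predicted times of two races that differ by one mile at the same climb:
# Predicted times for two hypothetical races differing by one mile of dist
new_races <- data.frame(dist = c(5, 6), climb = c(1500, 1500))
pred <- predict(lm2, newdata = new_races)
diff(pred)        # about 0.10 hours
diff(pred) * 60   # about 6 minutes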
Plotting the residuals against a continuous explanatory variable is a good check for highlighting any remaining structure in the residuals.
# Plot the residuals against the explanatory variable dist
par(mfrow=c(1,1))
plot(nihills$dist, lm2$residuals)
abline(h=0, lty="dashed")
The following model extends the simple linear regression model lm1, modelling the explanatory variable dist as a polynomial of degree two.
\[time_i =\beta_0+ \beta_1dist_i+\beta_2dist^2_i+\varepsilon_i, \ \ i=1,\ldots,n, \quad \varepsilon_i\sim N(0,\sigma^2).\]
Fit the model in R and comment.
# Fit the Model 3: time= beta0+beta1*dist+beta2*dist^2
lm3 <- lm(time ~ dist + I(dist^2), data=nihills)
summary(lm3)
##
## Call:
## lm(formula = time ~ dist + I(dist^2), data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168138 -0.082280 -0.002996 0.050920 0.260472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.278699 0.100906 2.762 0.012 *
## dist 0.026388 0.027061 0.975 0.341
## I(dist^2) 0.008728 0.001315 6.640 1.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1108 on 20 degrees of freedom
## Multiple R-squared: 0.9809, Adjusted R-squared: 0.979
## F-statistic: 512.8 on 2 and 20 DF, p-value: < 2.2e-16
# Plot the residuals and the fitted values.
par(mfrow=c(2,2))
plot(lm3)
par(mfrow=c(1,1))
with(nihills, plot(dist, time, pch=19))
curve(predict(lm3, data.frame(dist=x)), col="red", lty="solid", lwd=2, add=TRUE)
text(10,3,expression(time==hat(beta)[0]+hat(beta)[1]*dist+hat(beta)[2]*dist^2), col="red")
points(nihills$dist, predict(lm3), col="red", pch=19, cex=0.8)
It is often convenient to consider a suitable transformation of the data. However, statistical modelling is not an exact science, and the motivation for transforming data may vary across datasets and situations. In this case, it is worth summarizing the following points.
The range of time is large, \((\min(y), \max(y))= (0.32, 3.9)\) (see the DAAG book for details about range values), and taking a look at the data distribution in R gives some clues:
# See the summary of data
summary(nihills)
## dist climb time timef
## Min. : 2.500 Min. : 750 Min. :0.3247 Min. :0.4092
## 1st Qu.: 4.000 1st Qu.:1205 1st Qu.:0.4692 1st Qu.:0.6158
## Median : 4.500 Median :1500 Median :0.5506 Median :0.7017
## Mean : 5.778 Mean :2098 Mean :0.8358 Mean :1.1107
## 3rd Qu.: 5.800 3rd Qu.:2245 3rd Qu.:0.7857 3rd Qu.:1.0014
## Max. :18.900 Max. :8775 Max. :3.9028 Max. :5.9856
# Look at the histograms of time, climb and dist
par(mfrow=c(1,3))
hist(nihills$time, probability=TRUE, breaks=15)
hist(nihills$climb, probability=TRUE, breaks=15)
hist(nihills$dist, probability=TRUE, breaks=15)
Remember that a linear model relies on the assumption:
\[ time_i \sim \mathcal{N}(\beta_0+\beta_1dist_i+\beta_2climb_i, \sigma^2), \ \ i=1,\ldots,n\]
According to the (left) histogram above, the distribution has a long right tail, and a lot of values are concentrated close to zero. Furthermore, the extreme point in the right tail —the Seven Sevens race— strongly influences the estimation of the regression line: it has large leverage. We may need a more symmetric distribution, such as that of \(\log(y_i)\). It is also reasonable to take the logarithm of the explanatory variables.
# Transform the variable in the log-scale
#(add the variables to the nihills data.frame)
nihills$log_time<-log(nihills$time)
nihills$log_climb<-log(nihills$climb)
nihills$log_dist<-log(nihills$dist)
# See the scatterplot matrix
chart.Correlation(nihills[,c("log_time","log_climb", "log_dist")])
We assume then:
\[ \log(time_i) = \beta_0+\beta_1\log(dist_i)+\beta_2\log(climb_i) + \varepsilon_i, \ \ i=1,\ldots,n, \] and it is clear that the effects of the explanatory variables on the response are no longer linear on the original scale; indeed \[ time_i = e^{\beta_0}\, dist_i^{\beta_1}\, climb_i^{\beta_2}\, e^{\varepsilon_i}, \ \ i=1,\ldots,n, \] so the predictors act multiplicatively on time, and \(\beta_1\) is approximately the percentage change in time associated with a \(1\%\) increase in dist.
We fit the model with the transformed variables
# Fit the model 4
lm4 <- lm(log_time ~ log_dist + log_climb, data=nihills)
and we print the summary and plot the residuals of the model.
# See the output
summary(lm4)
##
## Call:
## lm(formula = log_time ~ log_dist + log_climb, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17223 -0.04229 -0.02538 0.05222 0.13150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.96113 0.27387 -18.11 7.09e-14 ***
## log_dist 0.68136 0.05518 12.35 8.19e-11 ***
## log_climb 0.46576 0.04530 10.28 1.98e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0766 on 20 degrees of freedom
## Multiple R-squared: 0.9831, Adjusted R-squared: 0.9814
## F-statistic: 582.7 on 2 and 20 DF, p-value: < 2.2e-16
# Plot the residuals
par(mfrow=c(2,2))
plot(lm4)
# R-squared (adjusted)
summary(lm4)$adj.r.squared
## [1] 0.981441
The residuals do not show any particular misfit, and from the scatterplot matrix above we notice that the data distributions are now less concentrated towards zero and the outliers are likely to be less influential. In terms of adjusted \(R^{2}\), the log model lm4 behaves analogously to the model lm2: \(R^2= 0.981\) against \(R^2=0.984\). Notice that both these multiple linear models—the one with untransformed data and the other with the log transformation—provide a better fit than the simple linear model lm1 with one predictor, whose adjusted \(R^2\) was about 0.94.
\(R^2\) and adjusted \(R^2\) represent the easiest way to compare models or, at least, to assess whether a given model fits the data. But other useful statistics exist, such as the Akaike Information Criterion (AIC) and some related statistics, which are designed to choose, from among a small number of alternatives, the model with the best predictive power. The AIC is defined as:
\[\mbox{AIC}=-2\log L(\beta_0,\beta_1,\beta_2; y)+k \cdot edf,\]
where \(\log L(\beta_0,\beta_1,\beta_2; y)\) is the log-likelihood of the multiple linear model, \(edf\) is the number of equivalent degrees of freedom of the fitted model and \(k\) is the weight associated with \(edf\). The AIC and related statistics may be seen as the sum of two components: the first is a pure measure of fit, the second a measure of model complexity. For an equal (or similar) measure of fit, these statistics favor the most parsimonious models. The R function extractAIC computes this generalized criterion: k equal to 2 yields the AIC statistic, while k equal to \(\log(n)\) yields the BIC criterion.
# AIC
AIC <- rbind(extractAIC(lm1)[2],extractAIC(lm2)[2],
extractAIC(lm3)[2],extractAIC(lm4)[2])
# BIC
BIC <- rbind(extractAIC(lm1, k=log(n))[2],extractAIC(lm2,k=log(n))[2],
extractAIC(lm3,k=log(n))[2],extractAIC(lm4,k=log(n))[2])
cbind(AIC, BIC)
## [,1] [,2]
## [1,] -73.64315 -71.37216
## [2,] -104.39068 -100.98419
## [3,] -98.42631 -95.01983
## [4,] -115.39756 -111.99108
The lower the AIC, the better the expected predictive power of the model. The linear model on the log scale, lm4, yields the lowest AIC (and the lowest BIC).
Once we have fitted a model and computed some useful statistics for checking the goodness of fit—AIC, BIC, \(R^{2}\), t-tests for single coefficients—it is of interest to assess how precisely the model predicts the observed data. One method is to produce coverage intervals for the predicted values of the dependent variable \(\log(time)\).
# Evaluate the prediction accuracy for the log-model:
# coverage intervals for predicted values
coverage_log <- exp(predict(lm4, interval="confidence"))
coverage_log[1:5,]
## fit lwr upr
## Binevenagh 0.8932015 0.8452499 0.9438735
## Slieve Gullion 0.4880304 0.4672452 0.5097402
## Glenariff Mountain 0.6404213 0.6041025 0.6789236
## Donard & Commedagh 1.1256934 1.0677265 1.1868074
## McVeigh Classic 0.5699144 0.5439268 0.5971437
freq_coverage_log <- mean(nihills$time >= coverage_log[,2] &
nihills$time < coverage_log[,3])
freq_coverage_log
## [1] 0.4782609
However, it is worth noting that this assessment of predictive accuracy has some limitations:
it is crucially dependent on the model assumptions—independence of the data points, homogeneity of variance, normality;
it applies only to the population from which these data have been sampled, and checking the accuracy on new, out-of-sample races could be appropriate (a minimal leave-one-out sketch is given below).
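As a minimal sketch of such an out-of-sample check (leave-one-out cross validation on the same races, so an illustration rather than a check on genuinely new data; the helper object loo_pred is introduced only here), we can refit the model leaving out one race at a time:
# Leave-one-out predictions for the log-model
loo_pred <- sapply(1:n, function(i) {
  fit_i <- lm(log_time ~ log_dist + log_climb, data = nihills[-i, ])
  predict(fit_i, newdata = nihills[i, ])
})
# Out-of-sample root mean squared error on the log scale
sqrt(mean((nihills$log_time - loo_pred)^2))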
Once we propose competing nested models, an analysis of deviance is required to assess whether the more complex model better describes the data at hand. Let us set up a test for the nihills races, where the dependent variable is still \(\log(time)\), and the possible explanatory variables are \(\log(dist)\) and \(\log(climb)\). Two nested competing models are:
\[ \begin{cases} \mathcal{M}_1: & \log(y_i)=\beta_0+\beta_1\log(dist_i)+\varepsilon_i, \ \ i=1,\ldots,n\\ \mathcal{M}_2: & \log(y_i)=\beta_0+\beta_1\log(dist_i)+\beta_2\log(climb_i)+\varepsilon_i, \ \ i=1,\ldots,n \end{cases} \]
We can then set up the following test:
\[ \begin{cases} H_{0}: & \beta_{2}=0\\ H_{1}: & \beta_2 \ne 0 \end{cases} \]
using as a test statistic the following ratio:
\[ F= \frac{ \frac{RSS_1-RSS_2}{p_2-p_1}}{\frac{RSS_2}{n-p_2}},\] where \(p_{1}\) is the number of parameters in \(\mathcal{M}_1\) and \(p_2\) the number of parameters in \(\mathcal{M}_2\), with \(p_2>p_1\) (in this case, \(3>2\)). High values of the statistic \(F\) suggest selecting the more complex model in place of the simpler one.
# Model selection:
# Compare the summary of model 1 with anova()
summary(lm1)
##
## Call:
## lm(formula = time ~ dist, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42812 -0.12193 -0.00250 0.09232 0.43028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.32530 0.07629 -4.264 0.000345 ***
## dist 0.20094 0.01120 17.934 3.28e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1935 on 21 degrees of freedom
## Multiple R-squared: 0.9387, Adjusted R-squared: 0.9358
## F-statistic: 321.6 on 1 and 21 DF, p-value: 3.278e-14
anova(lm1)
## Analysis of Variance Table
##
## Response: time
## Df Sum Sq Mean Sq F value Pr(>F)
## dist 1 12.0446 12.0446 321.64 3.278e-14 ***
## Residuals 21 0.7864 0.0374
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Computing the F-value by hand
m0<- lm(time ~ 1, data = nihills)
m1<- lm(time ~ dist, data = nihills)
F<-(sum(residuals(m0)^2)-sum(residuals(m1)^2))/(sum(residuals(m1)^2)/(n-2))
F
## [1] 321.6401
pf(F, 1,21,lower.tail=FALSE)
## [1] 3.27848e-14
# Model selection for model 4:
anova(lm4)
## Analysis of Variance Table
##
## Response: log_time
## Df Sum Sq Mean Sq F value Pr(>F)
## log_dist 1 6.2174 6.2174 1059.7 < 2.2e-16 ***
## log_climb 1 0.6202 0.6202 105.7 1.981e-09 ***
## Residuals 20 0.1173 0.0059
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table above first computes the \(F\) statistic comparing the null model \(\mathcal{M}_0\), with the intercept only, and \(\mathcal{M}_1\), with a corresponding p-value \(< 2.2\times 10^{-16}\) (\(\mathcal{M}_{1}\) preferred to \(\mathcal{M}_{0}\), with a difference of degrees of freedom df \(= 2-1 = 1\)); then the test suggests accepting \(\mathcal{M}_2\) in place of \(\mathcal{M}_1\), since the p-value is \(\approx 0\). Here the difference of degrees of freedom df is always 1, while Residuals reports the number of degrees of freedom of the final model \(\mathcal{M}_2\), \(n-p_2= 23-3=20\).
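The same comparison between \(\mathcal{M}_1\) and \(\mathcal{M}_2\) can also be carried out explicitly by fitting both nested models and passing them to anova(); a short sketch (the model objects m_log1 and m_log2 are introduced only for this check):
# Explicit comparison of the two nested log-models
m_log1 <- lm(log_time ~ log_dist, data = nihills)
m_log2 <- lm(log_time ~ log_dist + log_climb, data = nihills)
anova(m_log1, m_log2)
# F by hand, following the formula above (p1 = 2, p2 = 3)
RSS1 <- sum(residuals(m_log1)^2)
RSS2 <- sum(residuals(m_log2)^2)
((RSS1 - RSS2)/(3 - 2)) / (RSS2/(n - 3))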
A common feature in regression modelling is that some predictors may be linearly related to combinations of one or more of the other explanatory variables, e.g. \(x_{1}=a+bx_2\). We would like to avoid this behaviour, detecting and breaking up any multicollinearity among our predictors.
The variance inflation factor (VIF) measures the effect of correlation with the other variables in increasing the standard error of a regression coefficient. Suppose we have the following linear model:
\[y_{i}= \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_{3}x_{i3}+\varepsilon_{i}, \ \ i=1,\ldots, n.\]
One may prove that the estimated variance of \(\hat{\beta}_{j}, \ j=1,\ldots,3\), can be expressed as:
\[ \widehat{Var}(\hat{\beta}_{j})=\frac{\hat{\sigma}^{2}}{(n-1)\widehat{Var}(x_j)}\cdot\frac{1}{1-R^{2}_{j}},\]
where \(R^{2}_{j}\) is the multiple \(R^{2}\) for the regression of \(x_{j}\) on the other covariates (a regression that does not involve the response variable Y), and \(\frac{1}{1-R^{2}_{j}}\) is the VIF. As reported by Wikipedia:
It reflects all other factors that influence the uncertainty in the coefficient estimates. The VIF equals 1 when the vector \(x_j\) is orthogonal to each column of the design matrix for the regression of \(x_j\) on the other covariates. By contrast, the VIF is greater than 1 when the vector \(x_j\) is not orthogonal to all columns of the design matrix for the regression of \(x_j\) on the other covariates. Finally, note that the VIF is invariant to the scaling of the variables (that is, we could scale each variable \(x_j\) by a constant \(c_j\) without changing the VIF).
A rule of thumb is that if \(VIF(\hat{\beta}_{j})>10\), then multicollinearity is high.
# Compute the variance inflation factor: vif()
vif(lm4)
## log_dist log_climb
## 2.5543 2.5543
#by hand
lm_log_dist_climb<- lm(log_dist ~ log_climb, data=nihills)
1/(1-summary(lm_log_dist_climb)$r.squared)
## [1] 2.554308
In this case there is no suggestion of multicollinearity between log(dist) and log(climb).
Let us set up the following naive exercise, including the further predictor:
\[logLC= 4+3\log(dist)-2\, climb.\]
Let’s fit the model, and then check for the VIFs.
# Naive exercise: include in the model the
# predictor logLC = 4 + 3*log(dist) - 2*climb
nihills$logLC = 4 + 3*nihills$log_dist - 2*nihills$climb
lm5 <- lm(log_time ~ log_dist+log_climb+logLC ,data=nihills)
# Compute the vif
vif(lm5)
## log_dist log_climb logLC
## 2.9491 8.0909 9.2378
# by hand
lm_log1<- lm(log_dist ~ log_climb + logLC, data=nihills)
1/(1-summary(lm_log1)$r.squared)
## [1] 2.94907
lm_log2<- lm(log_climb ~ log_dist+logLC, data=nihills)
1/(1-summary(lm_log2)$r.squared)
## [1] 8.090903
lm_log3<- lm(logLC ~ log_dist+log_climb, data=nihills)
1/(1-summary(lm_log3)$r.squared)
## [1] 9.237841
The new predictor, which is a linear combination of \(\log(dist)\) and \(climb\), has a VIF approaching the threshold value of 10.
For illustration purposes, let’s now transform the variables and consider:
\[ \begin{cases} x_s = & \log(climb)+\log(dist), \\ x_d = & \log(climb)-\log(dist), \end{cases} \]
and fit the model
\[\log(y_i)= \beta_0+ \beta_1 x_{is}+\beta_2 x_{id}+\varepsilon_i, \ \ i=1,\ldots,n.\]
# Another example: generate
# x1 = log(climb) + log(dist)
# x2 = log(climb) - log(dist)
#Estimate the model (log_time=beta0 + beta1*x1 +beta2*x2 )
nihills$sum <- nihills$log_climb + nihills$log_dist
nihills$diff <- nihills$log_climb - nihills$log_dist
lm6 <- lm(log_time ~ sum + diff, data=nihills)
# Compute the vif
vif(lm6)
## sum diff
## 1.1006 1.1006
Almost no correlation here: the new predictors are nearly orthogonal!
Ridge regression is a penalized likelihood framework which turns out to be useful for alleviating the effects of multicollinearity. Recall that it corresponds to introducing the penalty \[\lambda\sum_{j=1}^p \beta_j^2=\lambda\boldsymbol{\beta}^T\boldsymbol{\beta}\]
in estimating the coefficient vector \(\boldsymbol{\beta}\). With a large \(\lambda\) the penalty term dominates and all (or almost all) coefficients are shrunk toward 0; \(\lambda\) is usually chosen by \(k\)-fold cross validation.
It is possible to implement ridge regression in R with the function lm.ridge() in the MASS package.
#Ridge regression
library(MASS)
# Tuning parameter = 0 implies least square estimates
lm_ridge<- lm.ridge(log_time ~ log_dist + log_climb, lambda=0, data=nihills)
coef(lm_ridge)
## log_dist log_climb
## -4.9611313 0.6813596 0.4657575
# Check with lm 4
coef(lm4)
## (Intercept) log_dist log_climb
## -4.9611313 0.6813596 0.4657575
# select lambda by GCV in the model with logLC
grid.ridge<-lm.ridge(log_time ~ log_dist + log_climb + logLC,
lambda=seq(0.1,10,0.001), data=nihills)
lambda_selected<-grid.ridge$lambda[which(grid.ridge$GCV==min(grid.ridge$GCV))]
# or see the lambda in the last line of the following
# select(lm.ridge(log_time ~ log_dist + log_climb + logLC,
# lambda=seq(0.1,10,0.001), data=nihills))
lm_ridge_GCV <- lm.ridge(log_time ~ log_dist + log_climb + logLC,
lambda=lambda_selected, data=nihills)
coef(lm_ridge_GCV)
## log_dist log_climb logLC
## -4.097312e+00 6.295220e-01 3.467162e-01 -2.569142e-05
# Comparison with the simple linear model lm5
coef(lm5)
## (Intercept) log_dist log_climb logLC
## -4.278904e+00 6.500840e-01 3.695947e-01 -2.035603e-05
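As mentioned above, \(\lambda\) is often chosen by \(k\)-fold cross validation. A minimal sketch with the glmnet package (an additional package not used elsewhere in these notes, assumed to be installed; alpha = 0 corresponds to ridge regression, and the object x_mat is a helper introduced only here) could look like this:
# Ridge regression with lambda chosen by k-fold cross validation (glmnet)
library(glmnet)
x_mat <- as.matrix(nihills[, c("log_dist", "log_climb", "logLC")])
cv_ridge <- cv.glmnet(x_mat, nihills$log_time, alpha = 0, nfolds = 5)
cv_ridge$lambda.min                 # selected tuning parameter
coef(cv_ridge, s = "lambda.min")    # coefficients at the selected lambda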
LASSO is another regularization technique involving the penalization:
\[\lambda \sum_{j=1}^p |\beta_j|.\]
We may implement LASSO regression through the lasso2 package, via the function l1ce():
# Lasso regression
library(lasso2)
# Fit the lasso model and see the output
lm.lasso <- l1ce(log_time ~ log_dist + log_climb + logLC, data=nihills)
summary(lm.lasso)$coef
## Value Std. Error Z score Pr(>|Z|)
## (Intercept) -2.304966e+00 2.141078e+00 -1.0765445 0.2816838
## log_dist 3.583981e-01 2.287124e-01 1.5670250 0.1171088
## log_climb 1.757895e-01 3.187876e-01 0.5514315 0.5813379
## logLC -7.166288e-06 5.617907e-05 -0.1275615 0.8984960
coef(lm.lasso)
## (Intercept) log_dist log_climb logLC
## -2.304966e+00 3.583981e-01 1.757895e-01 -7.166288e-06
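The lasso penalty can also be tuned by cross validation; a minimal sketch under the same assumptions as the ridge example above (the glmnet package, with alpha = 1 for the lasso):
# Lasso regression with lambda chosen by k-fold cross validation (glmnet)
library(glmnet)
x_mat <- as.matrix(nihills[, c("log_dist", "log_climb", "logLC")])
cv_lasso <- cv.glmnet(x_mat, nihills$log_time, alpha = 1, nfolds = 5)
cv_lasso$lambda.min
coef(cv_lasso, s = "lambda.min")    # some coefficients may be shrunk exactly to zero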