Regression aims to describe a dependent variable \(y\) through some relevant predictors \(x\), linking them through a function \(f(\cdot)\):
\[y = f(x)+\varepsilon,\] where \(\varepsilon\) is the noise implicit in the modelling. Checking the assumptions of a regression model is not just a theoretical matter: it is closely related to scientific reproducibility, transparency and the power of visualization. Although it is impossible to write down an exhaustive list of ‘what to do for checking a model’, I collect here some initial steps for visualizing and inspecting the sign of your variables, possible correlations between them, and the inclusion of further predictors. The suggested steps refer to the classical linear model, but they contain principles valid for statistical modelling in general.
Before fitting a model
Examine the distribution of each explanatory variable and of the dependent variable, looking for skewness and the presence of outliers. When a distribution is skewed, consider a symmetrizing transformation.
Examine the scatterplot matrix involving all the explanatory variables, checking for non-linearity. If some pairwise plot shows non-linearity, consider a possible transformation.
Examine the range of the variables.
Examine the degree of correlation between the explanatory variables.
After fitting a model
Plot the residuals against the fitted values and check for patterns in the residuals (homoscedasticity).
Examine Cook’s distance for possible influential outliers (a minimal sketch of both sets of checks is given below).
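As a minimal sketch of how these pre- and post-fit checks might look in R, here they are applied to the built-in cars dataset, used purely for illustration (the same steps are carried out on the nihills data in the remainder of these notes):
# Illustrative pre-fit checks on the built-in cars data (speed, dist)
hist(cars$speed); hist(cars$dist)    # distributions, skewness, outliers
pairs(cars)                          # pairwise scatterplots, non-linearity
sapply(cars, range)                  # ranges of the variables
cor(cars)                            # correlations between the variables
# Illustrative post-fit checks
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), residuals(fit)); abline(h = 0, lty = "dashed")  # residuals vs fitted
plot(cooks.distance(fit), type = "h")                             # Cook's distance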
The dataset nihills of the DAAG package contains data from the 2007 calendar for the Northern Ireland Mountain Running Association. The number of observations (races) is \(n = 23\), and for each of them we record the following variables: the distance in miles (dist), the amount of climb in feet (climb), the record time in hours for males (time) and the record time in hours for females (timef).
library(DAAG)
data(nihills)
n <- nrow(nihills)
Objective: we want to regress the times on some predictors. The data structure is shown below.
 | dist | climb | time | timef |
---|---|---|---|---|
Binevenagh | 7.5 | 1740 | 0.8583333 | 1.0644444 |
Slieve Gullion | 4.2 | 1110 | 0.4666667 | 0.6230556 |
Glenariff Mountain | 5.9 | 1210 | 0.7030556 | 0.8869444 |
Donard & Commedagh | 6.8 | 3300 | 1.0386111 | 1.2141667 |
McVeigh Classic | 5.0 | 1200 | 0.5411111 | 0.6375000 |
Tollymore Mountain | 4.8 | 950 | 0.4833333 | 0.5886111 |
Slieve Martin | 4.3 | 1600 | 0.5505556 | 0.7016667 |
Moughanmore | 3.0 | 1500 | 0.4636111 | 0.6475000 |
Hen & Cock | 2.5 | 1500 | 0.4497222 | 0.6075000 |
Annalong Horseshoe | 12.0 | 5080 | 1.9491667 | 2.4805556 |
Monument Race | 4.0 | 1000 | 0.4716667 | 0.5947222 |
Loughshannagh Horseshoe | 4.3 | 1700 | 0.6469444 | 0.8822222 |
Rocky | 4.0 | 1300 | 0.5230556 | 0.6652778 |
Meelbeg Meelmore | 3.5 | 1800 | 0.4544444 | 0.6086111 |
Donard Forest | 4.5 | 1400 | 0.5186111 | 0.6433333 |
Slieve Donard | 5.5 | 2790 | 0.9483333 | 1.2086111 |
Flagstaff to Carling | 11.0 | 3000 | 1.4569444 | 2.0344444 |
Slieve Bearnagh | 4.0 | 2690 | 0.6877778 | 0.7991667 |
Seven Sevens | 18.9 | 8775 | 3.9027778 | 5.9855556 |
Lurig Challenge | 4.0 | 1000 | 0.4347222 | 0.5755556 |
Scrabo Hill Race | 2.9 | 750 | 0.3247222 | 0.4091667 |
Slieve Gallion | 4.6 | 1440 | 0.6361111 | 0.7494444 |
BARF Turkey Trot | 5.7 | 1430 | 0.7130556 | 0.9383333 |
Suppose for this first example we use only the predictor dist to regress the dependent variable time. Let us have a quick look at the data:
# Scatterplot: dist against time
plot(nihills$dist, nihills$time, pch=19, xlab="distance", ylab="time")
# Alternatively
# with(nihills, plot(dist, time, pch=19))
Apart from one, all the recorded times are below 2 hours. As an initial attempt, let’s fit a simple linear model, plot the fitted values and manually compute the \(R^{2}\): \[time_i =\beta_0+ \beta_1 dist_i+\varepsilon_i, \ \ i=1,\ldots,n, \quad \varepsilon_i\sim N(0,\sigma^2).\]
# Fit the linear model 1: time = beta_0+beta_1*dist + epsilon
lm1 <- lm(time ~ dist, data=nihills)
#See the output
summary(lm1)
##
## Call:
## lm(formula = time ~ dist, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42812 -0.12193 -0.00250 0.09232 0.43028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.32530 0.07629 -4.264 0.000345 ***
## dist 0.20094 0.01120 17.934 3.28e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1935 on 21 degrees of freedom
## Multiple R-squared: 0.9387, Adjusted R-squared: 0.9358
## F-statistic: 321.6 on 1 and 21 DF, p-value: 3.278e-14
##plot of the residuals
par(mfrow=c(2,2))
plot(lm1)
##plot of the fitted values
par(mfrow=c(1,1))
with(nihills, plot(dist, time, pch=19))
abline(coef(lm1), col="red", lty="solid")
# or
#curve(predict(lm1, data.frame(dist=x)), col="red", lty="solid", lwd=2, add=TRUE)
text(13,3,expression(time==hat(beta)[0]+hat(beta)[1]*dist), col="red")
points(nihills$dist, predict(lm1), col="red", pch=19, cex=0.8)
# or
#points(nihills$dist, fitted(lm1), col="red", pch=19, cex=0.8)
nihills[17,]
## dist climb time timef
## Flagstaff to Carling 11 3000 1.456944 2.034444
segments(nihills[17,]$dist,nihills[17,]$time,
nihills[17,]$dist,fitted(lm1)[17], lty="dashed")
# or
#segments(nihills[17,"dist"],nihills[17,"time"],
# nihills[17,"dist"],fitted(lm1)[17], lty="dashed")
# or
#segments(nihills[17,1],nihills[17,3],
# nihills[17,1],fitted(lm1)[17], lty="dashed")
# Compute the R^2 by hand
# R_sq
Tot_SS <- with(nihills, sum((time-mean(time))^2))
Res_SS <- with(nihills, sum((predict(lm1)-time)^2))
Mod_SS <- with(nihills, sum((predict(lm1)-mean(time))^2))
R_sq <- 1-Res_SS/Tot_SS
R_sq
## [1] 0.9387112
# Similarly
Mod_SS/Tot_SS
## [1] 0.9387112
# In the simple linear regression (one predictor) the R-square is
# equal to the square of the correlation coefficient
with(nihills, cor(time, dist))^2
## [1] 0.9387112
The value computed here coincides with the one reported in the R summary output.
The fit seems good enough. However, the relationship between time and dist in the first plot above does not seem to be purely linear, and the residual plots suggest a lack of fit for a few points.
Exercise 1: Obtain the quantities in the output of the summary function by hand.
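A possible starting point (a sketch of one approach, not a full solution; the helper objects X, y, beta_hat and sigma_hat are introduced only for this hint) uses the design-matrix formulas:
# Hint: OLS estimates and residual standard error from the design matrix
X <- model.matrix(~ dist, data = nihills)
y <- nihills$time
beta_hat <- solve(t(X) %*% X, t(X) %*% y)                 # coefficient estimates
sigma_hat <- sqrt(sum((y - X %*% beta_hat)^2) / (n - 2))  # residual standard error
beta_hat
sigma_hat
# Standard errors, t values and p-values follow from sigma_hat^2 * solve(t(X) %*% X)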
Checking the model fit is of vital importance, and many graphical tools are available for this purpose.
We may start with a scatterplot matrix, focusing now on the male results, i.e. the dependent variable time in the nihills dataset.
# Scatterplot matrix
library(PerformanceAnalytics)
chart.Correlation(nihills[,c("dist", "climb", "time")])
There seems to be a linear relationship between the variables climb and dist, apart from a possible outlier. Now we may fit a first multiple linear model, with the dependent variable time, denoted by \(y\), explained by dist and climb:
\[ time_{i} = \beta_0+\beta_1 dist_i+\beta_2 climb_i+\varepsilon_i,\ \ \ i=1,\ldots,n, \ \ \varepsilon_i\sim \mathcal{N}(0,\sigma^2).\]
# Fit the Model 2: time= beta0+beta1*dist+beta2*climb
lm2 <- lm(time ~ dist + climb, data = nihills)
# see the output
summary(lm2)
##
## Call:
## lm(formula = time ~ dist + climb, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19857 -0.04824 0.01701 0.05539 0.21083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.286e-01 4.025e-02 -5.679 1.47e-05 ***
## dist 1.008e-01 1.382e-02 7.293 4.72e-07 ***
## climb 2.298e-04 2.893e-05 7.941 1.31e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0973 on 20 degrees of freedom
## Multiple R-squared: 0.9852, Adjusted R-squared: 0.9838
## F-statistic: 667.6 on 2 and 20 DF, p-value: < 2.2e-16
We may also produce some diagnostic plots:
# Plot the residuals
par(mfrow = c(2,2))
plot(lm2)
From the first plot (top left) and the third plot (bottom left) it emerges clearly that the variability is not constant, while the race Seven Sevens is an outlier according to the QQ plot (top right) and to the bottom-right plot.
Let us see the estimates:
# Extract the estimates
summary(lm2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2285705228 4.024640e-02 -5.679279 1.473367e-05
## dist 0.1007603621 1.381667e-02 7.292669 4.724521e-07
## climb 0.0002297608 2.893385e-05 7.940899 1.306900e-07
# or only the coefficient estimates
coef(lm2)
## (Intercept) dist climb
## -0.2285705228 0.1007603621 0.0002297608
# R-squared
summary(lm2)$adj.r.squared
## [1] 0.9837661
According to the summary, all the coefficients are statistically significant, and the adjusted \(R^2\) is 0.9838. The estimated equation is then:
\[ \hat{time}_{i}= -0.2286 + 0.1008\, dist_i+ 2.3\times 10^{-4}\, climb_i.\]
For each additional mile of dist, the predicted time increases by roughly 0.1 hours (about 6 minutes).
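As a small check of this interpretation (an illustrative calculation with two hypothetical races, whose distances and climb are invented for this example), we can compare the predicted times of two races that differ by one mile at the same climb:
# Predicted times for two hypothetical races differing by one mile of dist
new_races <- data.frame(dist = c(5, 6), climb = c(1500, 1500))
pred <- predict(lm2, newdata = new_races)
diff(pred)        # about 0.10 hours
diff(pred) * 60   # about 6 minutes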
Plotting the residuals against a continuous explanatory variable is a good check for highlighting any remaining structure in the residuals.
# Plot the residuals against the explanatory variable dist
par(mfrow=c(1,1))
plot(nihills$dist, lm2$residuals)
abline(h=0, lty="dashed")
The following model extends the simple linear regression model lm1, modelling the explanatory variable dist as a polynomial of degree two.
\[time_i =\beta_0+ \beta_1dist_i+\beta_2dist^2_i+\varepsilon_i, \ \ i=1,\ldots,n, \quad \varepsilon_i\sim N(0,\sigma^2).\]
Fit the model in R and comment.
# Fit the Model 3: time= beta0+beta1*dist+beta2*dist^2
lm3 <- lm(time ~ dist + I(dist^2), data=nihills)
summary(lm3)
##
## Call:
## lm(formula = time ~ dist + I(dist^2), data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168138 -0.082280 -0.002996 0.050920 0.260472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.278699 0.100906 2.762 0.012 *
## dist 0.026388 0.027061 0.975 0.341
## I(dist^2) 0.008728 0.001315 6.640 1.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1108 on 20 degrees of freedom
## Multiple R-squared: 0.9809, Adjusted R-squared: 0.979
## F-statistic: 512.8 on 2 and 20 DF, p-value: < 2.2e-16
# Plot the residuals and the fitted values.
par(mfrow=c(2,2))
plot(lm3)
par(mfrow=c(1,1))
with(nihills, plot(dist, time, pch=19))
curve(predict(lm3, data.frame(dist=x)), col="red", lty="solid", lwd=2, add=TRUE)
text(10,3,expression(time==hat(beta)[0]+hat(beta)[1]*dist+hat(beta)[2]*dist^2), col="red")
points(nihills$dist, predict(lm3), col="red", pch=19, cex=0.8)
It is often convenient to consider a suitable transformation of the data. However, statistical modelling is not an exact science, and the motivation for transforming data may vary across datasets and situations. In this case, it is worth summarizing the following points.
The range of time is large, \((\min(y), \max(y))= (0.32, 3.9)\) (see the DAAG book for details about range values), and taking a look at the data distribution in R gives some clues:
# See the summary of data
summary(nihills)
## dist climb time timef
## Min. : 2.500 Min. : 750 Min. :0.3247 Min. :0.4092
## 1st Qu.: 4.000 1st Qu.:1205 1st Qu.:0.4692 1st Qu.:0.6158
## Median : 4.500 Median :1500 Median :0.5506 Median :0.7017
## Mean : 5.778 Mean :2098 Mean :0.8358 Mean :1.1107
## 3rd Qu.: 5.800 3rd Qu.:2245 3rd Qu.:0.7857 3rd Qu.:1.0014
## Max. :18.900 Max. :8775 Max. :3.9028 Max. :5.9856
# Look at the histograms of time, climb and dist
par(mfrow=c(1,3))
hist(nihills$time, probability=TRUE, breaks=15)
hist(nihills$climb, probability=TRUE, breaks=15)
hist(nihills$dist, probability=TRUE, breaks=15)
Remember that a linear model relies on the assumption:
\[ time_i \sim \mathcal{N}(\beta_0+\beta_1dist_i+\beta_2climb_i, \sigma^2), \ \ i=1,\ldots,n\]
According to the (left) histogram above, the distribution has a long right tail, and a lot of values are concentrated close to zero. Furthermore, the extreme point in the right tail —the Seven Sevens race— strongly influences the estimation of the regression line: it has large leverage. We may need a more symmetric distribution, such as that of \(\log(y_i)\). It is also reasonable to take the logarithm of the explanatory variables.
# Transform the variable in the log-scale
#(add the variables to the nihills data.frame)
nihills$log_time<-log(nihills$time)
nihills$log_climb<-log(nihills$climb)
nihills$log_dist<-log(nihills$dist)
# See the scatterplot matrix
chart.Correlation(nihills[,c("log_time","log_climb", "log_dist")])
We assume then:
\[ \log(time_i) = \beta_0+\beta_1\log(dist_i)+\beta_2\log(climb_i) + \varepsilon_i, \ \ i=1,\ldots,n, \] and it is clear that the effects of the explanatory variables on the response are no longer linear on the original scale; indeed \[ time_i = e^{\beta_0}\, dist_i^{\beta_1}\, climb_i^{\beta_2}\, e^{\varepsilon_i}, \ \ i=1,\ldots,n, \] so the predictors act multiplicatively on time, and \(\beta_1\) is approximately the percentage change in time associated with a \(1\%\) increase in dist.
We fit the model with the transformed variables
# Fit the model 4
lm4 <- lm(log_time ~ log_dist + log_climb, data=nihills)
and we print the summary and plot the residuals of the model.
# See the output
summary(lm4)
##
## Call:
## lm(formula = log_time ~ log_dist + log_climb, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17223 -0.04229 -0.02538 0.05222 0.13150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.96113 0.27387 -18.11 7.09e-14 ***
## log_dist 0.68136 0.05518 12.35 8.19e-11 ***
## log_climb 0.46576 0.04530 10.28 1.98e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0766 on 20 degrees of freedom
## Multiple R-squared: 0.9831, Adjusted R-squared: 0.9814
## F-statistic: 582.7 on 2 and 20 DF, p-value: < 2.2e-16
# Plot the residuals
par(mfrow=c(2,2))
plot(lm4)
# R-squared (adjusted)
summary(lm4)$adj.r.squared
## [1] 0.981441
The residuals do not show any particular misfit, and from the scatterplot matrix above we notice that the data distributions are now less concentrated towards zero and the outliers are likely to be less influential. In terms of adjusted \(R^{2}\), the log model lm4 behaves analogously to the model lm2: \(R^2= 0.981\) against \(R^2=0.984\). Notice that both these multiple linear models—the one with untransformed data and the other with the log transformation—provide a better fit than the simple linear model lm1 with one predictor, whose adjusted \(R^2\) was about 0.94.
\(R^2\) and adjusted \(R^2\) represent the easiest way to compare models or, at least, to assess whether a given model fits the data. But other useful statistics exist, such as the Akaike Information Criterion (AIC) and some related statistics, which are designed to choose, from among a small number of alternatives, the model with the best predictive power. The AIC is defined as:
\[\mbox{AIC}=-2\log L(\beta_0,\beta_1,\beta_2; y)+k \cdot edf,\]
where \(\log L(\beta_0,\beta_1,\beta_2; y)\) is the log-likelihood of the multiple linear model, \(edf\) is the number of equivalent degrees of freedom of the fitted model and \(k\) is the weight associated with \(edf\). The AIC and related statistics may be seen as the sum of two components: the first is a pure measure of fit, the second a measure of model complexity. For an equal (or similar) measure of fit, these statistics favor the most parsimonious models. The R function extractAIC computes this generalized criterion: k equal to 2 yields the AIC statistic, while k equal to \(\log(n)\) yields the BIC criterion.
# AIC
AIC <- rbind(extractAIC(lm1)[2],extractAIC(lm2)[2],
extractAIC(lm3)[2],extractAIC(lm4)[2])
# BIC
BIC <- rbind(extractAIC(lm1, k=log(n))[2],extractAIC(lm2,k=log(n))[2],
extractAIC(lm3,k=log(n))[2],extractAIC(lm4,k=log(n))[2])
cbind(AIC, BIC)
## [,1] [,2]
## [1,] -73.64315 -71.37216
## [2,] -104.39068 -100.98419
## [3,] -98.42631 -95.01983
## [4,] -115.39756 -111.99108
The lower the AIC, the better the expected predictive power of the model. The linear model on the log scale, lm4, yields the lowest AIC (and the lowest BIC).
Once we have fitted a model and computed some useful statistics for checking the goodness of fit—AIC, BIC, \(R^{2}\), t-tests for single coefficients—it is of interest to assess how precisely the model predicts the observed data. One method is to produce coverage intervals for the predicted values of the dependent variable \(\log(time)\).
# Evaluate the prediction accuracy for the log-model:
# coverage intervals for predicted values
coverage_log <- exp(predict(lm4, interval="confidence"))
coverage_log[1:5,]
## fit lwr upr
## Binevenagh 0.8932015 0.8452499 0.9438735
## Slieve Gullion 0.4880304 0.4672452 0.5097402
## Glenariff Mountain 0.6404213 0.6041025 0.6789236
## Donard & Commedagh 1.1256934 1.0677265 1.1868074
## McVeigh Classic 0.5699144 0.5439268 0.5971437
freq_coverage_log <- mean(nihills$time >= coverage_log[,2] &
nihills$time < coverage_log[,3])
freq_coverage_log
## [1] 0.4782609
However, it is worth noting that this assessment of predictive accuracy has some limitations:
it is crucially dependent on the model assumptions—independence of the data points, homogeneity of variance, normality;
it applies only to the population from which these data have been sampled, and checking the accuracy on new, out-of-sample races could be appropriate (a minimal leave-one-out sketch is given below).
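As a minimal sketch of such an out-of-sample check (leave-one-out cross validation on the same races, so an illustration rather than a check on genuinely new data; the helper object loo_pred is introduced only here), we can refit the model leaving out one race at a time:
# Leave-one-out predictions for the log-model
loo_pred <- sapply(1:n, function(i) {
  fit_i <- lm(log_time ~ log_dist + log_climb, data = nihills[-i, ])
  predict(fit_i, newdata = nihills[i, ])
})
# Out-of-sample root mean squared error on the log scale
sqrt(mean((nihills$log_time - loo_pred)^2))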
Once we propose competing nested models, an analysis of deviance is required to assess whether the more complex model better describes the data at hand. Let us set up a test for the nihills races, where the dependent variable is still \(\log(time)\), and the possible explanatory variables are \(\log(dist)\) and \(\log(climb)\). Two nested competing models are:
\[ \begin{cases} \mathcal{M}_1: & \log(y_i)=\beta_0+\beta_1\log(dist_i)+\varepsilon_i, \ \ i=1,\ldots,n\\ \mathcal{M}_2: & \log(y_i)=\beta_0+\beta_1\log(dist_i)+\beta_2\log(climb_i)+\varepsilon_i, \ \ i=1,\ldots,n \end{cases} \]
We can then set up the following test:
\[ \begin{cases} H_{0}: & \beta_{2}=0\\ H_{1}: & \beta_2 \ne 0 \end{cases} \]
using as a test statistic the following ratio:
\[ F= \frac{ \frac{RSS_1-RSS_2}{p_2-p_1}}{\frac{RSS_2}{n-p_2}},\] where \(p_{1}\) is the number of parameters in \(\mathcal{M}_1\) and \(p_2\) the number of parameters in \(\mathcal{M}_2\), with \(p_2>p_1\) (in this case, \(3>2\)). High values of the statistic \(F\) suggest selecting the more complex model in place of the simpler one.
# Model selection:
# Compare the summary of model 1 with anova()
summary(lm1)
##
## Call:
## lm(formula = time ~ dist, data = nihills)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42812 -0.12193 -0.00250 0.09232 0.43028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.32530 0.07629 -4.264 0.000345 ***
## dist 0.20094 0.01120 17.934 3.28e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1935 on 21 degrees of freedom
## Multiple R-squared: 0.9387, Adjusted R-squared: 0.9358
## F-statistic: 321.6 on 1 and 21 DF, p-value: 3.278e-14
anova(lm1)
## Analysis of Variance Table
##
## Response: time
## Df Sum Sq Mean Sq F value Pr(>F)
## dist 1 12.0446 12.0446 321.64 3.278e-14 ***
## Residuals 21 0.7864 0.0374
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Computing the F-value by hand
m0<- lm(time ~ 1, data = nihills)
m1<- lm(time ~ dist, data = nihills)
F<-(sum(residuals(m0)^2)-sum(residuals(m1)^2))/(sum(residuals(m1)^2)/(n-2))
F
## [1] 321.6401
pf(F, 1,21,lower.tail=FALSE)
## [1] 3.27848e-14
# Model selection for model 4:
anova(lm4)
## Analysis of Variance Table
##
## Response: log_time
## Df Sum Sq Mean Sq F value Pr(>F)
## log_dist 1 6.2174 6.2174 1059.7 < 2.2e-16 ***
## log_climb 1 0.6202 0.6202 105.7 1.981e-09 ***
## Residuals 20 0.1173 0.0059
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table above first computes the \(F\) statistic comparing the null model \(\mathcal{M}_0\), with the intercept only, and \(\mathcal{M}_1\), with a corresponding p-value \(< 2.2\times 10^{-16}\) (\(\mathcal{M}_{1}\) preferred to \(\mathcal{M}_{0}\), with a difference of degrees of freedom df \(= 2-1 = 1\)); then the test suggests accepting \(\mathcal{M}_2\) in place of \(\mathcal{M}_1\), since the p-value is \(\approx 0\). Here the difference of degrees of freedom df is always 1, while Residuals reports the number of degrees of freedom of the final model \(\mathcal{M}_2\), \(n-p_2= 23-3=20\).
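The same comparison between \(\mathcal{M}_1\) and \(\mathcal{M}_2\) can also be carried out explicitly by fitting both nested models and passing them to anova(); a short sketch (the model objects m_log1 and m_log2 are introduced only for this check):
# Explicit comparison of the two nested log-models
m_log1 <- lm(log_time ~ log_dist, data = nihills)
m_log2 <- lm(log_time ~ log_dist + log_climb, data = nihills)
anova(m_log1, m_log2)
# F by hand, following the formula above (p1 = 2, p2 = 3)
RSS1 <- sum(residuals(m_log1)^2)
RSS2 <- sum(residuals(m_log2)^2)
((RSS1 - RSS2)/(3 - 2)) / (RSS2/(n - 3))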
A common feature in regression modelling is that some predictors may be linearly related to combinations of one or more of the other explanatory variables, e.g. \(x_{1}=a+bx_2\). We would like to avoid this behaviour, detecting and breaking up any multicollinearity among our predictors.
The variance inflation factor (VIF) measures the effect of correlation with the other variables in increasing the standard error of a regression coefficient. Suppose we have the following linear model:
\[y_{i}= \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_{3}x_{i3}+\varepsilon_{i}, \ \ i=1,\ldots, n.\]
One may prove that the estimated variance of \(\hat{\beta}_{j}, \ j=1,\ldots,3\), can be expressed as:
\[ \widehat{Var}(\hat{\beta}_{j})=\frac{\hat{\sigma}^{2}}{(n-1)\widehat{Var}(x_j)}\cdot\frac{1}{1-R^{2}_{j}},\]
where \(R^{2}_{j}\) is the multiple \(R^{2}\) for the regression of \(x_{j}\) on the other covariates (a regression that does not involve the response variable Y), and \(\frac{1}{1-R^{2}_{j}}\) is the VIF. As reported by Wikipedia:
It reflects all other factors that influence the uncertainty in the coefficient estimates. The VIF equals 1 when the vector \(x_j\) is orthogonal to each column of the design matrix for the regression of \(x_j\) on the other covariates. By contrast, the VIF is greater than 1 when the vector \(x_j\) is not orthogonal to all columns of the design matrix for the regression of \(x_j\) on the other covariates. Finally, note that the VIF is invariant to the scaling of the variables (that is, we could scale each variable \(x_j\) by a constant \(c_j\) without changing the VIF).
A rule of thumb is that if \(VIF(\hat{\beta}_{j})>10\), then multicollinearity is high.
# Compute the variance inflation factor: vif()
vif(lm4)
## log_dist log_climb
## 2.5543 2.5543
#by hand
lm_log_dist_climb<- lm(log_dist ~ log_climb, data=nihills)
1/(1-summary(lm_log_dist_climb)$r.squared)
## [1] 2.554308
In this case there is no suggestion of multicollinearity between log(dist) and log(climb).
Let us set up the following naive exercise, including the further predictor:
\[logLC= 4+3\log(dist)-2\, climb.\]
Let’s fit the model, and then check for the VIFs.
# Naive exercise: include in the model the
# predictor logLC = 4 + 3*log(dist) - 2*climb
nihills$logLC = 4 + 3*nihills$log_dist - 2*nihills$climb
lm5 <- lm(log_time ~ log_dist+log_climb+logLC ,data=nihills)
# Compute the vif
vif(lm5)
## log_dist log_climb logLC
## 2.9491 8.0909 9.2378
# by hand
lm_log1<- lm(log_dist ~ log_climb + logLC, data=nihills)
1/(1-summary(lm_log1)$r.squared)
## [1] 2.94907
lm_log2<- lm(log_climb ~ log_dist+logLC, data=nihills)
1/(1-summary(lm_log2)$r.squared)
## [1] 8.090903
lm_log3<- lm(logLC ~ log_dist+log_climb, data=nihills)
1/(1-summary(lm_log3)$r.squared)
## [1] 9.237841
The new predictor, which is a linear combination of \(\log(dist)\) and \(climb\), has a VIF approaching the threshold value of 10.
For illustration purposes, let’s now transform the variables and consider:
\[ \begin{cases} x_s = & \log(climb)+\log(dist), \\ x_d = & \log(climb)-\log(dist), \end{cases} \]
and fit the model
\[\log(y_i)= \beta_0+ \beta_1 x_{is}+\beta_2 x_{id}+\varepsilon_i, \ \ i=1,\ldots,n.\]
# Another example: generate
# x1 = log(climb) + log(dist)
# x2 = log(climb) - log(dist)
#Estimate the model (log_time=beta0 + beta1*x1 +beta2*x2 )
nihills$sum <- nihills$log_climb + nihills$log_dist
nihills$diff <- nihills$log_climb - nihills$log_dist
lm6 <- lm(log_time ~ sum + diff, data=nihills)
# Compute the vif
vif(lm6)
## sum diff
## 1.1006 1.1006
Almost no correlation here: the new predictors are nearly orthogonal!
Ridge regression is a penalized likelihood framework which turns out to be useful for alleviating the effects of multicollinearity. Recall that it corresponds to introducing the penalty \[\lambda\sum_{j=1}^p \beta_j^2=\lambda\boldsymbol{\beta}^T\boldsymbol{\beta}\]
in estimating the coefficient vector \(\boldsymbol{\beta}\). With a large \(\lambda\) the penalty term dominates and all (or almost all) coefficients are shrunk toward 0; \(\lambda\) is usually chosen by \(k\)-fold cross validation.
It is possible to implement ridge regression in R with the function lm.ridge() in the MASS package.
#Ridge regression
library(MASS)
# Tuning parameter = 0 implies least square estimates
lm_ridge<- lm.ridge(log_time ~ log_dist + log_climb, lambda=0, data=nihills)
coef(lm_ridge)
## log_dist log_climb
## -4.9611313 0.6813596 0.4657575
# Check with lm 4
coef(lm4)
## (Intercept) log_dist log_climb
## -4.9611313 0.6813596 0.4657575
# select lambda by GCV in the model with logLC
grid.ridge<-lm.ridge(log_time ~ log_dist + log_climb + logLC,
lambda=seq(0.1,10,0.001), data=nihills)
lambda_selected<-grid.ridge$lambda[which(grid.ridge$GCV==min(grid.ridge$GCV))]
# or see the lambda in the last line of the following
# select(lm.ridge(log_time ~ log_dist + log_climb + logLC,
# lambda=seq(0.1,10,0.001), data=nihills))
lm_ridge_GCV <- lm.ridge(log_time ~ log_dist + log_climb + logLC,
lambda=lambda_selected, data=nihills)
coef(lm_ridge_GCV)
## log_dist log_climb logLC
## -4.097312e+00 6.295220e-01 3.467162e-01 -2.569142e-05
# Comparison with the simple linear model lm5
coef(lm5)
## (Intercept) log_dist log_climb logLC
## -4.278904e+00 6.500840e-01 3.695947e-01 -2.035603e-05
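As mentioned above, \(\lambda\) is often chosen by \(k\)-fold cross validation. A minimal sketch with the glmnet package (an additional package not used elsewhere in these notes, assumed to be installed; alpha = 0 corresponds to ridge regression, and the object x_mat is a helper introduced only here) could look like this:
# Ridge regression with lambda chosen by k-fold cross validation (glmnet)
library(glmnet)
x_mat <- as.matrix(nihills[, c("log_dist", "log_climb", "logLC")])
cv_ridge <- cv.glmnet(x_mat, nihills$log_time, alpha = 0, nfolds = 5)
cv_ridge$lambda.min                 # selected tuning parameter
coef(cv_ridge, s = "lambda.min")    # coefficients at the selected lambda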
LASSO is another regularization technique involving the penalization:
\[\lambda \sum_{j=1}^p |\beta_j|.\]
We may implement LASSO regression through the lasso2 package, via the function l1ce():
# Lasso regression
library(lasso2)
# Fit the lasso model and see the output
lm.lasso <- l1ce(log_time ~ log_dist + log_climb + logLC, data=nihills)
summary(lm.lasso)$coef
## Value Std. Error Z score Pr(>|Z|)
## (Intercept) -2.304966e+00 2.141078e+00 -1.0765445 0.2816838
## log_dist 3.583981e-01 2.287124e-01 1.5670250 0.1171088
## log_climb 1.757895e-01 3.187876e-01 0.5514315 0.5813379
## logLC -7.166288e-06 5.617907e-05 -0.1275615 0.8984960
coef(lm.lasso)
## (Intercept) log_dist log_climb logLC
## -2.304966e+00 3.583981e-01 1.757895e-01 -7.166288e-06
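The lasso penalty can also be tuned by cross validation; a minimal sketch under the same assumptions as the ridge example above (the glmnet package, with alpha = 1 for the lasso):
# Lasso regression with lambda chosen by k-fold cross validation (glmnet)
library(glmnet)
x_mat <- as.matrix(nihills[, c("log_dist", "log_climb", "logLC")])
cv_lasso <- cv.glmnet(x_mat, nihills$log_time, alpha = 1, nfolds = 5)
cv_lasso$lambda.min
coef(cv_lasso, s = "lambda.min")    # some coefficients may be shrunk exactly to zero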