First of all, install and load the required libraries : rms, Hmisc and epiDisplay.

library(rms)
library(Hmisc)
library(epiDisplay)

Simple Linear Regression

Let’s simulate data from a sample of n=100 points along with population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve coming out of the page, with its base along one of the vertical lines of points. The equal variance assumption (homoscedasticity) dictates that the series of Gaussian curves for all the different x values have equal variances.

n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)

Plot:

plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)     
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)

Simple linear regression is used when:

Only 2 variables are of interest
One variable is a response (continuous scale) and one is a predictor
The mean of the dependent variable is a quantity of interest [otherwise explore for example quantile regression]
No adjustment is needed for confounding or other between-subject variation
The investigator is interested in assessing the strength of the relationship between x and y in real data units, or in predicting y from x
A linear relationship is assumed (visual inspection is strongly recommended…)
Not when one only needs to test for association (use Pearson or Spearman’s correlation in case)

Interval Estimation: evaluating the uncertainty about predictions

Estimation of the confidence intervals (CI) for predictions depend on what you want to predict, if at individual level or at mean population level.

x1 <- c( 1,  3,  5,  6,  7,  9,  11)
y  <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f  <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p  <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +    
  geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))

Example usages:

Is a child of age x smaller than predicted for her age? Use the individual level, p2 (wider bands)
What is the best estimate of the population mean blood pressure for patients on treatment A? Use the mean population level, p1 (narrower bands)

Assessing the Goodness of Fit

It is crucial to verify the the assumptions underlying a linear regression model:

In a scatterplot the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
Easier to see this with a plot of residuals vs estimated values
In this plot there should be no systematic patterns (no trend in central tendency, no change in spread of points with x)
Trend in central tendency indicates failure of linearity
qqnorm plot of residuals is a useful tool

Here an example: we fit a linear regression model where x and y should instead have been log transformed:

n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))

This plot depicts non-constant variance of the residuals, which might call for transforming y.

Now, we fit a linear model that should have been quadratic (functional form of X):

x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))

Finally, we fit a correct model:

y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))

qqnorm(resid(f)); qqline(resid(f))

These plots shows the ideal situation of white noise (no trend, constant variance). The qq plot demonstrates approximate normality of residuals, for a sample of size n = 50.

Application example of simple linear regression: Hookworm & blood loss

The dataset concerns the relationship between hookworm and blood loss from a study conducted in 1970.

data(Suwit)
summary(Suwit)

##        id            worm            bloss      
##  Min.   : 1.0   Min.   :  32.0   Min.   : 5.03  
##  1st Qu.: 4.5   1st Qu.: 167.0   1st Qu.:14.02  
##  Median : 8.0   Median : 525.0   Median :33.80  
##  Mean   : 8.0   Mean   : 552.4   Mean   :33.45  
##  3rd Qu.:11.5   3rd Qu.: 796.5   3rd Qu.:41.70  
##  Max.   :15.0   Max.   :1929.0   Max.   :86.65

des(Suwit)

## HW and Blood loss SEAJTMH 1970; 
##  No. of observations =  15 
##   Variable      Class           Description   
## 1 id            numeric                       
## 2 worm          numeric         No. of worms  
## 3 bloss         numeric         Blood loss/day

summ(Suwit)

## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
## 
##   Var. name obs. mean   median  s.d.   min.   max.  
## 1 id        15   8      8       4.47   1      15    
## 2 worm      15   552.4  525     513.9  32     1929  
## 3 bloss     15   33.45  33.8    24.85  5.03   86.65

attach(Suwit)

The file is clean and ready for analysis (this happens here only for didactic purposes: in real life, you will usually spend a couple of hours - at minimum, if not days..- to clean datasets). For example, with this small sample size it is somewhat straightforward to verify that there is no repetition of ‘id’ and no missing values. The records have been sorted in ascending order of ‘worm’ (number of worms) ranging from 32 in the first subject to 1929 in the last one. Blood loss (‘bloss’) is however, not sorted. The 13th record has the highest blood loss of 86 ml per day, which is very high. The objective of this analysis is to predict blood loss using worm.

First of all, give a look to the data:

plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel")

A linear model using the above two variables seems reasonable.

lm1 <- lm(bloss ~ worm, data=Suwit)
lm1

## 
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
## 
## Coefficients:
## (Intercept)         worm  
##    10.84733      0.04092

Displaying the model by typing ‘lm1’ gives limited information (essentially, the estimated regression line coefficients). To get more information, one can look at the attributes of this model, its summary and attributes of its summary.

attr(lm1, "names")

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

summary(lm1)

## 
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.846 -10.812   0.750   4.356  34.390 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.847327   5.308569   2.043   0.0618 .  
## worm         0.040922   0.007147   5.725 6.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.74 on 13 degrees of freedom
## Multiple R-squared:  0.716,  Adjusted R-squared:  0.6942 
## F-statistic: 32.78 on 1 and 13 DF,  p-value: 6.99e-05

The first section of summary shows the formula that was ‘called’. The second section gives the distribution of residuals. The pattern is clearly not symmetric. The maximum is too far on the right (34.38) compared to the minimum (-15.84) and the first quartile is further left (-10.81) of the median (0.75) than the third quartile (4.35) is. Otherwise, the median is close to zero. The third section gives coefficients of the intercept and the effect of ‘worm’ on blood loss. The intercept is 10.8 meaning that when there are no worms, the blood loss is estimated to be 10.8 ml per day. This is however, not significantly different from zero as the P value is 0.0618. The coefficient of ‘worm’ is 0.04 indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero. When there are many worms, the level of blood loss can be very substantial. The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model. The adjusted value is 0.6942. (The calculation of Rsquared is discussed in the analysis of variance section below). The last section describes more details of the residuals and hypothesis testing on the effect of ‘worm’ using the F-statistic. The P value from this section (0.0000699) is equal to that tested by the t-distribution in the coefficient section. This F-test more commonly appears in the analysis of variance table.

Analysis of variance table, R-squared and adjusted R-squared

summary(aov(lm1))

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## worm         1   6192    6192   32.78 6.99e-05 ***
## Residuals   13   2455     189                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above analysis of variance (aov) table breaks down the degrees of freedom, sum of squares and mean square of the outcome (blood loss) by sources (in this case there only two: worm + residuals).

The so-called ‘square’ is actually the square of difference between the value and the mean. The total sum of squares of blood loss is therefore:

SST <- sum((bloss-mean(bloss))^2)
SST

## [1] 8647.044

The sum of squares from residuals is:

SSR <- sum(residuals(lm1)^2)
SSR

## [1] 2455.468

See also the analysis of variance table. The sum of squares of worm or sum of squares of difference between the fitted values and the grand mean is:

SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW

## [1] 6191.577

The latter two sums add up to the first one. The R-squared is the proportion of sum of squares of the fitted values to the total sum of squares.

SSW/SST

## [1] 0.7160339

Instead of sum of squares, one may consider the mean square as the level of variation. In such a case, the number of worms can reduce the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.

resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj      <- (var(bloss)- resid.msq)/var(bloss)
Radj

## [1] 0.6941903

This is the adjusted R-squared shown in summary(lm1) in the above section.

F-test

When the mean square of ‘worm’ is divided by the mean square of residuals, the result is:

F <- SSW/resid.msq; F

## [1] 32.78011

Using this F value with the two corresponding degrees of freedom (from ‘worm’ and residuals) the P value for testing the effect of ‘worm’ can be computed.

pf(F, df1=1, df2=13, lower.tail=FALSE)

## [1] 6.990401e-05

The function pf is used to compute a P value from a given F value together with the two values of the degrees of freedom. The last argument ‘lower.tail’ is set to FALSE to obtain the right margin of the area under the curve of the F distribution. In summary, both the regression and analysis of variance give the same conclusion; that number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.

Regression line, fitted values and residuals

A regression line can be added to the scatter plot with the following command:

plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")

The regression line has an intercept of 10.8 and a slope of 0.04. The expected value is the value of blood loss estimated from the regression line with a specific value of ‘worm’.A residual is the difference between the observed and expected value. The residuals can be drawn by adding the red segments.

The actual values of the residuals can be checked from the specific attribute of the defined linear model.

residuals(lm1) -> lm1.res
hist(lm1.res)

Checking the normality of residuals

Histogram of the residuals is not somehow convincing about their normal distribution shape. However, with such a small sample size, it is difficult to draw any conclusion. A better way to check normality is to plot the residuals against the expected normal score or (residual-mean) / standard deviation. A reasonably straight line would indicate normality. Moreover, a test for normality could be calculated.

a <- qqnorm(lm1.res)

shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))

If the residuals were perfectly normally distributed, the text symbols would have formed along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive) whereas the smallest value (7th) is not large enough (negative). However, the P value from the Shapiro-Wilk test is 0.08 suggesting that the possibility of residuals being normally distributed cannot be rejected.

Finally, the residuals are plotted against the fitted values to see if there is a pattern.

plot(fitted(lm1), lm1.res, xlab="Fitted values")

plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")

There is no obvious pattern. The residuals are quite independent of the expected values. With this and the above findings from the qqnorm command we may conclude that the residuals are randomly and normally distributed.

The above two diagnostic plots for the model ‘lm1’ can also be obtained from:

par(mfrow=c(1,2))
plot(lm1, which=1:2)

detach(Suwit)

Final conclusion

From the analysis, it is clear that blood loss is associated with number of hookworms. On average, each additional worm is associated with an increase of 0.04 ml of blood loss. The remaining uncertainty of blood loss, apart from hookworm, is explained by random variation or other factors that were not measured.

Simple Linear Regression (just a recap!)