First of all, install and load the required libraries : rms, Hmisc and epiDisplay.
library(rms)
library(Hmisc)
library(epiDisplay)
Let’s simulate data from a sample of n=100 points along with population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve coming out of the page, with its base along one of the vertical lines of points. The equal variance assumption (homoscedasticity) dictates that the series of Gaussian curves for all the different x values have equal variances.
n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)
Plot:
plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)
Simple linear regression is used when:
Estimation of the confidence intervals (CI) for predictions depend on what you want to predict, if at individual level or at mean population level.
x1 <- c( 1, 3, 5, 6, 7, 9, 11)
y <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +
geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))
Example usages:
Is a child of age x smaller than predicted for her age? Use the individual level, p2 (wider bands)
What is the best estimate of the population mean blood pressure for patients on treatment A? Use the mean population level, p1 (narrower bands)
It is crucial to verify the the assumptions underlying a linear regression model:
In a scatterplot the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
Easier to see this with a plot of residuals vs estimated values
In this plot there should be no systematic patterns (no trend in central tendency, no change in spread of points with x)
Trend in central tendency indicates failure of linearity
qqnorm plot of residuals is a useful tool
Here an example: we fit a linear regression model where x and y should instead have been log transformed:
n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))
This plot depicts non-constant variance of the residuals, which might call for transforming y.
Now, we fit a linear model that should have been quadratic (functional form of X):
x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
Finally, we fit a correct model:
y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
qqnorm(resid(f)); qqline(resid(f))
These plots shows the ideal situation of white noise (no trend, constant variance). The qq plot demonstrates approximate normality of residuals, for a sample of size n = 50.
The dataset concerns the relationship between hookworm and blood loss from a study conducted in 1970.
data(Suwit)
summary(Suwit)
## id worm bloss
## Min. : 1.0 Min. : 32.0 Min. : 5.03
## 1st Qu.: 4.5 1st Qu.: 167.0 1st Qu.:14.02
## Median : 8.0 Median : 525.0 Median :33.80
## Mean : 8.0 Mean : 552.4 Mean :33.45
## 3rd Qu.:11.5 3rd Qu.: 796.5 3rd Qu.:41.70
## Max. :15.0 Max. :1929.0 Max. :86.65
des(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
## Variable Class Description
## 1 id numeric
## 2 worm numeric No. of worms
## 3 bloss numeric Blood loss/day
summ(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
##
## Var. name obs. mean median s.d. min. max.
## 1 id 15 8 8 4.47 1 15
## 2 worm 15 552.4 525 513.9 32 1929
## 3 bloss 15 33.45 33.8 24.85 5.03 86.65
attach(Suwit)
The file is clean and ready for analysis (this happens here only for didactic purposes: in real life, you will usually spend a couple of hours - at minimum, if not days..- to clean datasets). For example, with this small sample size it is somewhat straightforward to verify that there is no repetition of ‘id’ and no missing values. The records have been sorted in ascending order of ‘worm’ (number of worms) ranging from 32 in the first subject to 1929 in the last one. Blood loss (‘bloss’) is however, not sorted. The 13th record has the highest blood loss of 86 ml per day, which is very high. The objective of this analysis is to predict blood loss using worm.
First of all, give a look to the data:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel")
A linear model using the above two variables seems reasonable.
lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Coefficients:
## (Intercept) worm
## 10.84733 0.04092
Displaying the model by typing ‘lm1’ gives limited information (essentially, the estimated regression line coefficients). To get more information, one can look at the attributes of this model, its summary and attributes of its summary.
attr(lm1, "names")
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(lm1)
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.846 -10.812 0.750 4.356 34.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.847327 5.308569 2.043 0.0618 .
## worm 0.040922 0.007147 5.725 6.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.74 on 13 degrees of freedom
## Multiple R-squared: 0.716, Adjusted R-squared: 0.6942
## F-statistic: 32.78 on 1 and 13 DF, p-value: 6.99e-05
The first section of summary shows the formula that was ‘called’. The second section gives the distribution of residuals. The pattern is clearly not symmetric. The maximum is too far on the right (34.38) compared to the minimum (-15.84) and the first quartile is further left (-10.81) of the median (0.75) than the third quartile (4.35) is. Otherwise, the median is close to zero. The third section gives coefficients of the intercept and the effect of ‘worm’ on blood loss. The intercept is 10.8 meaning that when there are no worms, the blood loss is estimated to be 10.8 ml per day. This is however, not significantly different from zero as the P value is 0.0618. The coefficient of ‘worm’ is 0.04 indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero. When there are many worms, the level of blood loss can be very substantial. The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model. The adjusted value is 0.6942. (The calculation of Rsquared is discussed in the analysis of variance section below). The last section describes more details of the residuals and hypothesis testing on the effect of ‘worm’ using the F-statistic. The P value from this section (0.0000699) is equal to that tested by the t-distribution in the coefficient section. This F-test more commonly appears in the analysis of variance table.
summary(aov(lm1))
## Df Sum Sq Mean Sq F value Pr(>F)
## worm 1 6192 6192 32.78 6.99e-05 ***
## Residuals 13 2455 189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above analysis of variance (aov) table breaks down the degrees of freedom, sum of squares and mean square of the outcome (blood loss) by sources (in this case there only two: worm + residuals).
The so-called ‘square’ is actually the square of difference between the value and the mean. The total sum of squares of blood loss is therefore:
SST <- sum((bloss-mean(bloss))^2)
SST
## [1] 8647.044
The sum of squares from residuals is:
SSR <- sum(residuals(lm1)^2)
SSR
## [1] 2455.468
See also the analysis of variance table. The sum of squares of worm or sum of squares of difference between the fitted values and the grand mean is:
SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
## [1] 6191.577
The latter two sums add up to the first one. The R-squared is the proportion of sum of squares of the fitted values to the total sum of squares.
SSW/SST
## [1] 0.7160339
Instead of sum of squares, one may consider the mean square as the level of variation. In such a case, the number of worms can reduce the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.
resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj <- (var(bloss)- resid.msq)/var(bloss)
Radj
## [1] 0.6941903
This is the adjusted R-squared shown in summary(lm1) in the above section.
When the mean square of ‘worm’ is divided by the mean square of residuals, the result is:
F <- SSW/resid.msq; F
## [1] 32.78011
Using this F value with the two corresponding degrees of freedom (from ‘worm’ and residuals) the P value for testing the effect of ‘worm’ can be computed.
pf(F, df1=1, df2=13, lower.tail=FALSE)
## [1] 6.990401e-05
The function pf is used to compute a P value from a given F value together with the two values of the degrees of freedom. The last argument ‘lower.tail’ is set to FALSE to obtain the right margin of the area under the curve of the F distribution. In summary, both the regression and analysis of variance give the same conclusion; that number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.
A regression line can be added to the scatter plot with the following command:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")
The regression line has an intercept of 10.8 and a slope of 0.04. The expected value is the value of blood loss estimated from the regression line with a specific value of ‘worm’.A residual is the difference between the observed and expected value. The residuals can be drawn by adding the red segments.
The actual values of the residuals can be checked from the specific attribute of the defined linear model.
residuals(lm1) -> lm1.res
hist(lm1.res)
Histogram of the residuals is not somehow convincing about their normal distribution shape. However, with such a small sample size, it is difficult to draw any conclusion. A better way to check normality is to plot the residuals against the expected normal score or (residual-mean) / standard deviation. A reasonably straight line would indicate normality. Moreover, a test for normality could be calculated.
a <- qqnorm(lm1.res)
shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))
If the residuals were perfectly normally distributed, the text symbols would have formed along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive) whereas the smallest value (7th) is not large enough (negative). However, the P value from the Shapiro-Wilk test is 0.08 suggesting that the possibility of residuals being normally distributed cannot be rejected.
Finally, the residuals are plotted against the fitted values to see if there is a pattern.
plot(fitted(lm1), lm1.res, xlab="Fitted values")
plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")
There is no obvious pattern. The residuals are quite independent of the expected values. With this and the above findings from the qqnorm command we may conclude that the residuals are randomly and normally distributed.
The above two diagnostic plots for the model ‘lm1’ can also be obtained from:
par(mfrow=c(1,2))
plot(lm1, which=1:2)
detach(Suwit)
From the analysis, it is clear that blood loss is associated with number of hookworms. On average, each additional worm is associated with an increase of 0.04 ml of blood loss. The remaining uncertainty of blood loss, apart from hookworm, is explained by random variation or other factors that were not measured.