First of all, install and load the required libraries: rms, Hmisc, epiDisplay and ggplot2.
library(rms)
library(Hmisc)
library(epiDisplay)
library(ggplot2)
Let’s simulate data from a sample of n=100 points along with population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve coming out of the page, with its base along one of the vertical lines of points. The equal variance assumption (homoscedasticity) dictates that the series of Gaussian curves for all the different x values have equal variances.
n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)
Plot:
plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)
Simple linear regression is used when a single continuous predictor x models a continuous outcome y.
The confidence intervals (CI) for the estimates depend on what you are estimating: a prediction for an individual or the mean response of the population.
x1 <- c( 1, 3, 5, 6, 7, 9, 11)
y <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +
  geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))
Example usages:
Is a child of age x smaller than expected for her age? Use the individual-level intervals, p2 (wider bands).
What is the best estimate of the population mean blood pressure for patients on treatment A? Use the population-mean intervals, p1 (narrower bands).
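To compare the two types of intervals at a single point, Predict can be called directly (an illustrative use of the model f fitted above; the value x1=5 is arbitrary):
Predict(f, x1=5, conf.type='mean')
Predict(f, x1=5, conf.type='individual')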
It is crucial to verify the assumptions underlying a linear regression model:
In a scatterplot, the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
It is easier to see this with a plot of residuals vs. fitted values
In this plot there should be no systematic pattern (no trend in central tendency, no change in the spread of points with x)
A trend in central tendency indicates a failure of linearity
A qqnorm plot of residuals is also a useful tool
Here is an example: we fit a linear regression model where x and y should instead have been log-transformed:
n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))
This plot shows non-constant variance of the residuals, which might call for transforming y.
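Since these data were generated on the log scale, a minimal check of this idea is to refit the model after log-transforming both variables; the residual plot should then lose its pattern (flog is just an illustrative name):
flog <- ols(log(y) ~ log(x))
plot(fitted(flog), resid(flog))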
Now, we fit a linear model that should have been quadratic (functional form of X):
x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
Finally, we fit a correct model:
y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
qqnorm(resid(f)); qqline(resid(f))
These plots show the ideal situation of white noise (no trend, constant variance). The qq plot demonstrates approximate normality of the residuals for a sample of size n = 50.
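As a complement to the visual checks, a formal normality test can be applied to the residuals (shown only as an illustration; with n = 50 such tests have limited power):
shapiro.test(resid(f))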
The Suwit dataset concerns the relationship between hookworm infection and blood loss, from a study conducted in 1970.
data(Suwit)
summary(Suwit)
## id worm bloss
## Min. : 1.0 Min. : 32.0 Min. : 5.03
## 1st Qu.: 4.5 1st Qu.: 167.0 1st Qu.:14.02
## Median : 8.0 Median : 525.0 Median :33.80
## Mean : 8.0 Mean : 552.4 Mean :33.45
## 3rd Qu.:11.5 3rd Qu.: 796.5 3rd Qu.:41.70
## Max. :15.0 Max. :1929.0 Max. :86.65
des(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
## Variable Class Description
## 1 id numeric
## 2 worm numeric No. of worms
## 3 bloss numeric Blood loss/day
summ(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
##
## Var. name obs. mean median s.d. min. max.
## 1 id 15 8 8 4.47 1 15
## 2 worm 15 552.4 525 513.9 32 1929
## 3 bloss 15 33.45 33.8 24.85 5.03 86.65
attach(Suwit)
The file is clean and ready for analysis (this happens here only for didactic purposes: in real life you will usually spend hours, if not days, cleaning datasets). For example, with this small sample size it is straightforward to verify that there is no repetition of ‘id’ and no missing values. The records have been sorted in ascending order of ‘worm’ (number of worms), ranging from 32 in the first subject to 1929 in the last one. Blood loss (‘bloss’), however, is not sorted. The 13th record has the highest blood loss, 86 ml per day, which is very high. The objective of this analysis is to estimate the expected value of blood loss using worm.
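These basic checks can also be done programmatically (an illustrative sketch using base R, with Suwit attached):
anyDuplicated(id) # 0 indicates no repeated id values
sum(is.na(Suwit)) # 0 indicates no missing values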
First of all, have a look at the data:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel")
A linear model using the above two variables seems reasonable.
lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Coefficients:
## (Intercept) worm
## 10.84733 0.04092
Displaying the model by typing ‘lm1’ gives limited information (essentially, the estimated regression coefficients). To get more information, one can look at the attributes of this model and at its summary.
attr(lm1, "names")
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(lm1)
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.846 -10.812 0.750 4.356 34.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.847327 5.308569 2.043 0.0618 .
## worm 0.040922 0.007147 5.725 6.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.74 on 13 degrees of freedom
## Multiple R-squared: 0.716, Adjusted R-squared: 0.6942
## F-statistic: 32.78 on 1 and 13 DF, p-value: 6.99e-05
The first section of the summary shows the formula that was ‘called’.

The second section gives the distribution of the residuals. The pattern is clearly not symmetric: the maximum is much further to the right (34.39) than the minimum is to the left (-15.85), and the first quartile (-10.81) is further from the median (0.75) than the third quartile (4.36) is. The median itself, however, is close to zero.

The third section gives the coefficients: the intercept and the effect of ‘worm’ on blood loss. The intercept is 10.8, meaning that when there are no worms the blood loss is estimated to be 10.8 ml per day; this is, however, not significantly different from zero (P = 0.0618). The coefficient of ‘worm’ is 0.04, indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero, and when there are many worms the blood loss can be very substantial. The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model; the adjusted value is 0.6942. (The calculation of R-squared is discussed in the analysis of variance section below.)

The last section describes further details of the residuals and the hypothesis test on the effect of ‘worm’ using the F-statistic. The P value from this section (0.0000699) is equal to that from the t-test in the coefficients section (with a single predictor, F is simply the square of t: 5.725^2 ≈ 32.78). This F-test more commonly appears in the analysis of variance table.
summary(aov(lm1))
## Df Sum Sq Mean Sq F value Pr(>F)
## worm 1 6192 6192 32.78 6.99e-05 ***
## Residuals 13 2455 189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above analysis of variance (aov) table breaks down the degrees of freedom, sums of squares and mean squares of the outcome (blood loss) by source (here there are only two: worm and residuals).
The so-called ‘square’ is actually the squared difference between the value and the mean. The total sum of squares of blood loss is therefore:
SST <- sum((bloss-mean(bloss))^2)
SST
## [1] 8647.044
The sum of squares from residuals is:
SSR <- sum(residuals(lm1)^2)
SSR
## [1] 2455.468
See also the analysis of variance table. The sum of squares of worm, i.e. the sum of squared differences between the fitted values and the grand mean, is:
SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
## [1] 6191.577
The latter two sums add up to the first one. The R-squared is the ratio of the sum of squares of the fitted values to the total sum of squares.
SSW/SST
## [1] 0.7160339
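We can also verify the decomposition numerically (an illustrative check):
SSW + SSR # equals SST, up to rounding
all.equal(SSW + SSR, SST)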
Instead of sums of squares, one may consider the mean square as the measure of variation. In that case, the number of worms reduces the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.
resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj <- (var(bloss)- resid.msq)/var(bloss)
Radj
## [1] 0.6941903
This is the adjusted R-squared shown by summary(lm1) in the section above.
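Equivalently, the adjusted R-squared follows from the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n = 15 observations and p = 1 predictor (a quick illustrative check):
n.obs <- 15; p <- 1
1 - (1 - SSW/SST) * (n.obs - 1)/(n.obs - p - 1)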
When the mean square of ‘worm’ is divided by the mean square of residuals, the result is:
F <- SSW/resid.msq; F
## [1] 32.78011
Using this F value with the two corresponding degrees of freedom (from ‘worm’ and residuals) the P value for testing the effect of ‘worm’ can be computed.
pf(F, df1=1, df2=13, lower.tail=FALSE)
## [1] 6.990401e-05
The function pf computes a P value from a given F value together with the two degrees of freedom. The last argument, ‘lower.tail’, is set to FALSE to obtain the right-tail area under the curve of the F distribution. In summary, both the regression and the analysis of variance lead to the same conclusion: the number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.
A regression line can be added to the scatter plot with the following command:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")
The regression line has an intercept of 10.8 and a slope of 0.04. The expected value is the value of blood loss estimated from the regression line at a specific value of ‘worm’. A residual is the difference between the observed and the expected value. The residuals are drawn by adding the red segments.
The actual values of the residuals can be extracted from the corresponding component of the fitted model object.
residuals(lm1) -> lm1.res
hist(lm1.res)
The histogram of the residuals is not entirely convincing about their normality; however, with such a small sample size it is difficult to draw any conclusion. A better way to check normality is to plot the residuals (or the standardized residuals, (residual - mean) / standard deviation) against the expected normal scores: a reasonably straight line indicates normality. In addition, a formal test of normality can be computed.
a <- qqnorm(lm1.res)
shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))
If the residuals were perfectly normally distributed, the text symbols would fall along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive), whereas the smallest (7th) is not extreme enough (negative). However, the P value from the Shapiro-Wilk test is 0.08, so the hypothesis that the residuals are normally distributed cannot be rejected.
Finally, the residuals are plotted against the fitted values to see if there is a pattern.
plot(fitted(lm1), lm1.res, xlab="Fitted values")
plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")
There is no obvious pattern: the residuals appear independent of the fitted values. Together with the findings from the qqnorm plot above, we may conclude that the residuals are approximately randomly and normally distributed.
The above two diagnostic plots for the model ‘lm1’ can also be obtained from:
par(mfrow=c(1,2))
plot(lm1, which=1:2)
detach(Suwit)
From the analysis it is clear that blood loss is associated with the number of hookworms: on average, each additional worm is associated with an increase of 0.04 ml of blood loss per day. The remaining variation in blood loss, beyond that explained by hookworms, is attributed to random variation or to other factors that were not measured.
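To attach a measure of uncertainty to this slope, a 95% confidence interval can be computed (an illustrative use of base R's confint):
confint(lm1, "worm")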
First of all, install and load the required libraries: rms, Hmisc, epiDisplay and psych.
library(rms)
library(Hmisc)
library(epiDisplay)
library(psych)
Remember that linear regression always means linearity in the parameters, irrespective of linearity in the explanatory variables (i.e. of the functional forms selected for the covariates). Y is linearly related to X (in the parameters) if the rate of change of Y with respect to X (dY/dX) is independent of the value of X. A function Y = b1*x1 + b2*x2 is said to be linear in a parameter, say b1, if b1 appears with a power of 1 only and is not multiplied or divided by any other parameter (e.g. b1*b2, or b2/b1).
This is a different concept from linearity in the functional form of the covariates, which is not required. Moreover, some regression models may look nonlinear in the parameters but are inherently (intrinsically) linear, because suitable transformations can make them linear in the parameters.
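For example (an illustrative sketch with simulated data): the multiplicative model y = a * x^b * e is nonlinear in the parameters as written, but taking logarithms gives log(y) = log(a) + b*log(x) + log(e), which is linear in the parameters:
set.seed(1)
x <- runif(50, 1, 10)
y <- 2 * x^1.5 * exp(rnorm(50, 0, .2)) # multiplicative lognormal error
f <- lm(log(y) ~ log(x)) # intrinsically linear after the log transform
coef(f) # intercept estimates log(2), slope estimates 1.5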
Just as an example, imagine that we want to investigate the effect of the total CSF polymorph count on the CSF/blood glucose ratio in patients with either bacterial or viral meningitis.
require(Hmisc)
getHdata(abm)
As a first step in every statistical analysis, we should visualize the data of interest.
In this example, we use a nonparametric regression approach to explore the relationship between the two variables: plsmo (from Hmisc) draws a loess nonparametric smoother.
with(ABM, {
glratio <- gl / bloodgl
tpolys <- polys * whites / 100
plsmo(tpolys, glratio, xlab='Total Polymorphs in CSF',
ylab='CSF/Blood Glucose Ratio',
xlim=quantile(tpolys, c(.05,.95), na.rm=TRUE),
ylim=quantile(glratio, c(.05,.95), na.rm=TRUE))
scat1d(tpolys); scat1d(glratio, side=4) })
Moreover, we can also use a super smoother (supsmu) to relate age to the probability of bacterial meningitis given that a patient has bacterial or viral meningitis (a binary outcome; see later in this block for logistic regression examples), with a rug plot showing the age distribution:
with(ABM, {
plsmo(age, abm, 'supsmu', bass=7,
xlab='Age at Admission, Years',
ylab='Proportion Bacterial Meningitis')
scat1d(age) })
We can use these nonparametric approaches to help find a suitable functional form for a candidate predictor (when switching to parametric regression models), which could be very different from a linear effect. For example, we can decide to use splines to model the effect of a continuous predictor on an outcome; a minimal sketch follows. For those interested in splines, see: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3
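As an illustrative sketch (on simulated data; rcs and ols are from rms), a restricted cubic spline for a continuous predictor can be fitted as follows:
set.seed(3)
x <- runif(100)
y <- log(x + 1) + rnorm(100, 0, .1) # a mildly nonlinear true relationship
f <- ols(y ~ rcs(x, 4)) # restricted cubic spline with 4 knots
f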
Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the levels of a third, categorical variable, i.e. to check for the presence of an interaction.
(This is just a didactic example; usually more than 3 variables are involved in real studies.)
A small survey on blood pressure was carried out. The objective is to assess the hypertensive effect of subjects adding table salt to their meals. The gender of the subjects was also recorded.
data(BP)
attach(BP)
des(BP)
## cross-sectional survey on BP & risk factors
## No. of observations = 100
## Variable Class Description
## 1 id integer id
## 2 sex factor sex
## 3 sbp integer Systolic BP
## 4 dbp integer Diastolic BP
## 5 saltadd factor Salt added on table
## 6 birthdate Date
Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in saltadd. The frequencies of the categorical variables sex and saltadd are now inspected.
describe(data.frame(sex, saltadd))
## vars n mean sd median trimmed mad min max range skew kurtosis se
## sex* 1 100 1.55 0.5 2 1.56 0 1 2 1 -0.20 -1.98 0.05
## saltadd* 2 80 1.54 0.5 2 1.55 0 1 2 1 -0.15 -2.00 0.06
The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study).
age.in.days <- as.Date("2001-03-12") - birthdate
There is a leap year every four years; therefore, an average year has 365.25 days.
class(age.in.days)
## [1] "difftime"
age <- as.numeric(age.in.days)/365.25
The function as.numeric is needed to convert age from the difftime class into a plain number; otherwise modelling would not be possible.
describeBy(sbp,saltadd)
##
## Descriptive statistics by group
## group: no
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 37 137.46 29.62 132 136.39 23.72 80 201 121 0.44 -0.42 4.87
## ------------------------------------------------------------
## group: yes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 163.02 39.39 171 163.94 45.96 80 224 144 -0.25 -1.13 6.01
The missing-value group has the highest median and mean systolic blood pressure (not shown in the output above). In order to create a new variable with three levels, type:
saltadd1 <- saltadd
levels(saltadd1) <- c("no", "yes", "missing")
saltadd1[is.na(saltadd)] <- "missing"
summary(saltadd1)
## no yes missing
## 37 43 20
summary(aov(age ~ saltadd1))
## Df Sum Sq Mean Sq F value Pr(>F)
## saltadd1 2 115 57.42 0.448 0.64
## Residuals 97 12422 128.06
Since there is not enough evidence that the missing group is important, and for additional reasons of simplicity, we will assume the values are MCAR (missing completely at random), ignore this group, and continue the analysis with the original saltadd variable consisting of only two levels. Before doing this, however, a simple regression model and regression line are first fitted.
lm1 <- lm(sbp ~ age)
summary(lm1)
##
## Call:
## lm(formula = sbp ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.231 -21.801 -3.679 15.127 105.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.1465 14.8942 4.374 3.05e-05 ***
## age 1.8422 0.2997 6.147 1.71e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.56 on 98 degrees of freedom
## Multiple R-squared: 0.2782, Adjusted R-squared: 0.2709
## F-statistic: 37.78 on 1 and 98 DF, p-value: 1.712e-08
Although the R-squared is not very high, the P value is small, indicating an important association of age with systolic blood pressure. A scatterplot of age against systolic blood pressure is now drawn, with the regression line added using the abline function. This function accepts many different argument forms, including a regression object: if the object has a coef method returning a vector of length 1, the value is taken as the slope of a line through the origin; otherwise the first two values are taken as the intercept and slope, as is the case for lm1.
plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg")
abline(lm1)
Subsequent exploration of the residuals suggests no significant deviation from normality and no pattern; the details can be obtained with the techniques already discussed and are omitted here. The next step is to adjust for salt habit, and later to provide different plot symbols for the two groups. Note that the sample size is now smaller (we omit the observations with missing saltadd, reducing the analysis to the complete cases):
lm2 <- lm(sbp ~ age + saltadd)
summary(lm2)
##
## Call:
## lm(formula = sbp ~ age + saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.369 -24.972 -0.125 22.456 64.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.1291 15.7645 4.005 0.000142 ***
## age 1.5526 0.3118 4.979 3.81e-06 ***
## saltaddyes 22.9094 6.9340 3.304 0.001448 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.83 on 77 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.3158
## F-statistic: 19.23 on 2 and 77 DF, p-value: 1.679e-07
On average, a one-year increment in age is associated with an increase in systolic blood pressure of 1.5 mmHg, and adding table salt is associated with a significantly higher systolic blood pressure, by approximately 23 mmHg.
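To see how these coefficients translate into predictions, expected SBP values can be computed for both groups at a given age (an illustrative sketch; the age of 40 is arbitrary):
new <- data.frame(age = 40, saltadd = c("no", "yes"))
predict(lm2, newdata = new) # expected SBP at age 40, without and with added salt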
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
Note that the red dots, corresponding to those who added table salt, tend to lie higher than the blue circles.
The final task is to draw the two separate regression lines, one for each group. The intercept for non-salt users is the first coefficient, and that for salt users is the first plus the third. The slope is the same for both groups.
a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b, col = "blue")
abline(a = a1, b, col = "red")
Note that the X-axis does not start at zero, so the intercepts lie outside the plot frame. The red line is for the red points of the salt adders and the blue line for the blue points of the non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure independently of added salt.
But look at the distributions of the points of the two colours: the red points are higher than the blue ones, but mainly in the right half of the graph. To fit lines with different slopes, a new model with an interaction term is needed.
Therefore the next step is to estimate a model with different slopes (i.e. different ‘b’ values for the abline arguments) for the two lines. The model needs an interaction term between saltadd and age.
lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)
##
## Call:
## lm(formula = sbp ~ age * saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.885 -24.891 -1.142 21.228 66.867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.0066 20.3981 3.824 0.000267 ***
## age 1.2419 0.4128 3.009 0.003558 **
## saltaddyes -12.2540 31.4574 -0.390 0.697965
## age:saltaddyes 0.7199 0.6282 1.146 0.255441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.77 on 76 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3445, Adjusted R-squared: 0.3186
## F-statistic: 13.31 on 3 and 76 DF, p-value: 4.528e-07
For the intercept of the salt users, the second and fourth terms are zero (since age is zero), but the third must be kept. This term is negative, so the intercept of the salt users is lower than that of the non-users.
a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]
For the slope of the non-salt users, the second coefficient alone is enough, since the first and third coefficients do not change with age and the fourth term is zero when saltadd is ‘no’. The slope for the salt users includes both the second and the fourth coefficients, since saltaddyes equals 1.
b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]
plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))
Note that as.numeric(saltadd) converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.
This model suggests that at young ages the systolic blood pressures of the two groups are not very different, as the two lines are close together on the left of the plot: at the age of 25, for example, the difference is 5.7 mmHg. Increasing age increases the difference between the two groups, and at 70 years of age the difference is as large as 38 mmHg. In this sense, age modifies the effect of adding table salt on blood pressure.
On the other hand, the slope of age is 1.24 mmHg per year among those who did not add salt, but becomes 1.24 + 0.72 = 1.96 mmHg per year among the salt adders. Thus, salt adding modifies the effect of age. Note that ‘interaction’ is the statistical term, whereas ‘effect modification’ is the equivalent epidemiological term.
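These numbers can be reproduced directly from the fitted coefficients (an illustrative check):
cf <- coef(lm3)
cf[3] + cf[4] * c(25, 70) # group difference at ages 25 and 70 (about 5.7 and 38 mmHg)
cf[2] + c(0, cf[4]) # slopes for non-adders and salt adders (about 1.24 and 1.96)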
Note also that the coefficient of the interaction term age:saltaddyes is not statistically significant!
This means that the difference between the two slopes may be due to chance (or that the study is underpowered to detect the interaction).
This was in fact just a didactic example to show how to introduce an interaction term in a regression model.