First of all, install and load the required libraries: rms, Hmisc, epiDisplay and ggplot2.
library(rms)
library(Hmisc)
library(epiDisplay)
library(ggplot2)
Let’s simulate data from a sample of n=100 points along with population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve coming out of the page, with its base along one of the vertical lines of points. The equal variance assumption (homoscedasticity) dictates that the series of Gaussian curves for all the different x values have equal variances.
n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)
Plot:
plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)
Simple linear regression is used when a single continuous predictor x models a continuous outcome y.
The confidence intervals (CI) for the estimates depend on what you are estimating: a prediction for an individual or the mean response of the population.
x1 <- c( 1, 3, 5, 6, 7, 9, 11)
y <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +
  geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))
Example usages:
Is a child of age x smaller than expected for her age? Use the individual-level intervals, p2 (wider bands).
What is the best estimate of the population mean blood pressure for patients on treatment A? Use the population-mean intervals, p1 (narrower bands).
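To compare the two types of intervals at a single point, Predict can be called directly (an illustrative use of the model f fitted above; the value x1=5 is arbitrary):
Predict(f, x1=5, conf.type='mean')
Predict(f, x1=5, conf.type='individual')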
It is crucial to verify the assumptions underlying a linear regression model:
In a scatterplot, the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
It is easier to see this with a plot of residuals vs. fitted values
In this plot there should be no systematic pattern (no trend in central tendency, no change in the spread of points with x)
A trend in central tendency indicates a failure of linearity
A qqnorm plot of residuals is also a useful tool
Here is an example: we fit a linear regression model where x and y should instead have been log-transformed:
n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))
This plot shows non-constant variance of the residuals, which might call for transforming y.
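Since these data were generated on the log scale, a minimal check of this idea is to refit the model after log-transforming both variables; the residual plot should then lose its pattern (flog is just an illustrative name):
flog <- ols(log(y) ~ log(x))
plot(fitted(flog), resid(flog))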
Now, we fit a linear model that should have been quadratic (functional form of X):
x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
Finally, we fit a correct model:
y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
qqnorm(resid(f)); qqline(resid(f))
These plots show the ideal situation of white noise (no trend, constant variance). The qq plot demonstrates approximate normality of the residuals for a sample of size n = 50.
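As a complement to the visual checks, a formal normality test can be applied to the residuals (shown only as an illustration; with n = 50 such tests have limited power):
shapiro.test(resid(f))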
The Suwit dataset concerns the relationship between hookworm infection and blood loss, from a study conducted in 1970.
data(Suwit)
summary(Suwit)
## id worm bloss
## Min. : 1.0 Min. : 32.0 Min. : 5.03
## 1st Qu.: 4.5 1st Qu.: 167.0 1st Qu.:14.02
## Median : 8.0 Median : 525.0 Median :33.80
## Mean : 8.0 Mean : 552.4 Mean :33.45
## 3rd Qu.:11.5 3rd Qu.: 796.5 3rd Qu.:41.70
## Max. :15.0 Max. :1929.0 Max. :86.65
des(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
## Variable Class Description
## 1 id numeric
## 2 worm numeric No. of worms
## 3 bloss numeric Blood loss/day
summ(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
##
## Var. name obs. mean median s.d. min. max.
## 1 id 15 8 8 4.47 1 15
## 2 worm 15 552.4 525 513.9 32 1929
## 3 bloss 15 33.45 33.8 24.85 5.03 86.65
attach(Suwit)
The file is clean and ready for analysis (this happens here only for didactic purposes: in real life you will usually spend hours, if not days, cleaning datasets). For example, with this small sample size it is straightforward to verify that there is no repetition of ‘id’ and no missing values. The records have been sorted in ascending order of ‘worm’ (number of worms), ranging from 32 in the first subject to 1929 in the last one. Blood loss (‘bloss’), however, is not sorted. The 13th record has the highest blood loss, 86 ml per day, which is very high. The objective of this analysis is to estimate the expected value of blood loss using worm.
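These basic checks can also be done programmatically (an illustrative sketch using base R, with Suwit attached):
anyDuplicated(id) # 0 indicates no repeated id values
sum(is.na(Suwit)) # 0 indicates no missing values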
First of all, have a look at the data:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel")
A linear model using the above two variables seems reasonable.
lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Coefficients:
## (Intercept) worm
## 10.84733 0.04092
Displaying the model by typing ‘lm1’ gives limited information (essentially, the estimated regression coefficients). To get more information, one can look at the attributes of this model and at its summary.
attr(lm1, "names")
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(lm1)
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.846 -10.812 0.750 4.356 34.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.847327 5.308569 2.043 0.0618 .
## worm 0.040922 0.007147 5.725 6.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.74 on 13 degrees of freedom
## Multiple R-squared: 0.716, Adjusted R-squared: 0.6942
## F-statistic: 32.78 on 1 and 13 DF, p-value: 6.99e-05
The first section of the summary shows the formula that was ‘called’.

The second section gives the distribution of the residuals. The pattern is clearly not symmetric: the maximum is much further to the right (34.39) than the minimum is to the left (-15.85), and the first quartile (-10.81) is further from the median (0.75) than the third quartile (4.36) is. The median itself, however, is close to zero.

The third section gives the coefficients: the intercept and the effect of ‘worm’ on blood loss. The intercept is 10.8, meaning that when there are no worms the blood loss is estimated to be 10.8 ml per day; this is, however, not significantly different from zero (P = 0.0618). The coefficient of ‘worm’ is 0.04, indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero, and when there are many worms the blood loss can be very substantial. The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model; the adjusted value is 0.6942. (The calculation of R-squared is discussed in the analysis of variance section below.)

The last section describes further details of the residuals and the hypothesis test on the effect of ‘worm’ using the F-statistic. The P value from this section (0.0000699) is equal to that from the t-test in the coefficients section (with a single predictor, F is simply the square of t: 5.725^2 ≈ 32.78). This F-test more commonly appears in the analysis of variance table.
summary(aov(lm1))
## Df Sum Sq Mean Sq F value Pr(>F)
## worm 1 6192 6192 32.78 6.99e-05 ***
## Residuals 13 2455 189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above analysis of variance (aov) table breaks down the degrees of freedom, sums of squares and mean squares of the outcome (blood loss) by source (here there are only two: worm and residuals).
The so-called ‘square’ is actually the squared difference between the value and the mean. The total sum of squares of blood loss is therefore:
SST <- sum((bloss-mean(bloss))^2)
SST
## [1] 8647.044
The sum of squares from residuals is:
SSR <- sum(residuals(lm1)^2)
SSR
## [1] 2455.468
See also the analysis of variance table. The sum of squares of worm, i.e. the sum of squared differences between the fitted values and the grand mean, is:
SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
## [1] 6191.577
The latter two sums add up to the first one. The R-squared is the ratio of the sum of squares of the fitted values to the total sum of squares.
SSW/SST
## [1] 0.7160339
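We can also verify the decomposition numerically (an illustrative check):
SSW + SSR # equals SST, up to rounding
all.equal(SSW + SSR, SST)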
Instead of sums of squares, one may consider the mean square as the measure of variation. In that case, the number of worms reduces the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.
resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj <- (var(bloss)- resid.msq)/var(bloss)
Radj
## [1] 0.6941903
This is the adjusted R-squared shown by summary(lm1) in the section above.
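Equivalently, the adjusted R-squared follows from the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n = 15 observations and p = 1 predictor (a quick illustrative check):
n.obs <- 15; p <- 1
1 - (1 - SSW/SST) * (n.obs - 1)/(n.obs - p - 1)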
When the mean square of ‘worm’ is divided by the mean square of residuals, the result is:
F <- SSW/resid.msq; F
## [1] 32.78011
Using this F value with the two corresponding degrees of freedom (from ‘worm’ and residuals) the P value for testing the effect of ‘worm’ can be computed.
pf(F, df1=1, df2=13, lower.tail=FALSE)
## [1] 6.990401e-05
The function pf computes a P value from a given F value together with the two degrees of freedom. The last argument, ‘lower.tail’, is set to FALSE to obtain the right-tail area under the curve of the F distribution. In summary, both the regression and the analysis of variance lead to the same conclusion: the number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.
A regression line can be added to the scatter plot with the following command:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")
The regression line has an intercept of 10.8 and a slope of 0.04. The expected value is the value of blood loss estimated from the regression line at a specific value of ‘worm’. A residual is the difference between the observed and the expected value. The residuals are drawn by adding the red segments.
The actual values of the residuals can be extracted from the corresponding component of the fitted model object.
residuals(lm1) -> lm1.res
hist(lm1.res)
The histogram of the residuals is not entirely convincing about their normality; however, with such a small sample size it is difficult to draw any conclusion. A better way to check normality is to plot the residuals (or the standardized residuals, (residual - mean) / standard deviation) against the expected normal scores: a reasonably straight line indicates normality. In addition, a formal test of normality can be computed.
a <- qqnorm(lm1.res)
shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))
If the residuals were perfectly normally distributed, the text symbols would fall along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive), whereas the smallest (7th) is not extreme enough (negative). However, the P value from the Shapiro-Wilk test is 0.08, so the hypothesis that the residuals are normally distributed cannot be rejected.
Finally, the residuals are plotted against the fitted values to see if there is a pattern.
plot(fitted(lm1), lm1.res, xlab="Fitted values")
plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")
There is no obvious pattern: the residuals appear independent of the fitted values. Together with the findings from the qqnorm plot above, we may conclude that the residuals are approximately randomly and normally distributed.
The above two diagnostic plots for the model ‘lm1’ can also be obtained from:
par(mfrow=c(1,2))
plot(lm1, which=1:2)
detach(Suwit)
From the analysis it is clear that blood loss is associated with the number of hookworms: on average, each additional worm is associated with an increase of 0.04 ml of blood loss per day. The remaining variation in blood loss, beyond that explained by hookworms, is attributed to random variation or to other factors that were not measured.
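To attach a measure of uncertainty to this slope, a 95% confidence interval can be computed (an illustrative use of base R's confint):
confint(lm1, "worm")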
First of all, install and load the required libraries: rms, Hmisc, epiDisplay and psych.
library(rms)
library(Hmisc)
library(epiDisplay)
library(psych)
Remember that linear regression always means linearity in the parameters, irrespective of linearity in the explanatory variables (i.e. of the functional forms selected for the covariates). Y is linearly related to X (in the parameters) if the rate of change of Y with respect to X (dY/dX) is independent of the value of X. A function Y = b1*x1 + b2*x2 is said to be linear in a parameter, say b1, if b1 appears with a power of 1 only and is not multiplied or divided by any other parameter (e.g. b1*b2, or b2/b1).
This is a different concept from linearity in the functional form of the covariates, which is not required. Moreover, some regression models may look nonlinear in the parameters but are inherently (intrinsically) linear, because suitable transformations can make them linear in the parameters.
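For example (an illustrative sketch with simulated data): the multiplicative model y = a * x^b * e is nonlinear in the parameters as written, but taking logarithms gives log(y) = log(a) + b*log(x) + log(e), which is linear in the parameters:
set.seed(1)
x <- runif(50, 1, 10)
y <- 2 * x^1.5 * exp(rnorm(50, 0, .2)) # multiplicative lognormal error
f <- lm(log(y) ~ log(x)) # intrinsically linear after the log transform
coef(f) # intercept estimates log(2), slope estimates 1.5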
Just as an example, imagine that we want to investigate the effect of the total CSF polymorph count on the CSF/blood glucose ratio in patients with either bacterial or viral meningitis.
require(Hmisc)
getHdata(abm)
As a first step in every statistical analysis, we should visualize the data of interest.
In this example, we use a nonparametric regression approach to explore the relationship between the two variables: plsmo (from Hmisc) draws a loess nonparametric smoother.
with(ABM, {
glratio <- gl / bloodgl
tpolys <- polys * whites / 100
plsmo(tpolys, glratio, xlab='Total Polymorphs in CSF',
ylab='CSF/Blood Glucose Ratio',
xlim=quantile(tpolys, c(.05,.95), na.rm=TRUE),
ylim=quantile(glratio, c(.05,.95), na.rm=TRUE))
scat1d(tpolys); scat1d(glratio, side=4) })
Moreover, we can also use a super smoother (supsmu) to relate age to the probability of bacterial meningitis given that a patient has bacterial or viral meningitis (a binary outcome; see later in this block for logistic regression examples), with a rug plot showing the age distribution:
with(ABM, {
plsmo(age, abm, 'supsmu', bass=7,
xlab='Age at Admission, Years',
ylab='Proportion Bacterial Meningitis')
scat1d(age) })
We can use these nonparametric approaches to help find a suitable functional form for a candidate predictor (when switching to parametric regression models), which could be very different from a linear effect. For example, we can decide to use splines to model the effect of a continuous predictor on an outcome; a minimal sketch follows. For those interested in splines, see: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3
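As an illustrative sketch (on simulated data; rcs and ols are from rms), a restricted cubic spline for a continuous predictor can be fitted as follows:
set.seed(3)
x <- runif(100)
y <- log(x + 1) + rnorm(100, 0, .1) # a mildly nonlinear true relationship
f <- ols(y ~ rcs(x, 4)) # restricted cubic spline with 4 knots
f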
Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the levels of a third, categorical variable, i.e. to check for the presence of an interaction.
(This is just a didactic example; usually more than 3 variables are involved in real studies.)
A small survey on blood pressure was carried out. The objective is to assess the hypertensive effect of subjects adding table salt to their meals. The gender of the subjects was also recorded.
data(BP)
attach(BP)
des(BP)
## cross-sectional survey on BP & risk factors
## No. of observations = 100
## Variable Class Description
## 1 id integer id
## 2 sex factor sex
## 3 sbp integer Systolic BP
## 4 dbp integer Diastolic BP
## 5 saltadd factor Salt added on table
## 6 birthdate Date
Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in saltadd. The frequencies of the categorical variables sex and saltadd are now inspected.
describe(data.frame(sex, saltadd))
## vars n mean sd median trimmed mad min max range skew kurtosis se
## sex* 1 100 1.55 0.5 2 1.56 0 1 2 1 -0.20 -1.98 0.05
## saltadd* 2 80 1.54 0.5 2 1.55 0 1 2 1 -0.15 -2.00 0.06
The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study).
age.in.days <- as.Date("2001-03-12") - birthdate
There is a leap year every four years; therefore, an average year has 365.25 days.
class(age.in.days)
## [1] "difftime"
age <- as.numeric(age.in.days)/365.25
The function as.numeric is needed to convert age from the difftime class into a plain number; otherwise modelling would not be possible.
describeBy(sbp,saltadd)
##
## Descriptive statistics by group
## group: no
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 37 137.46 29.62 132 136.39 23.72 80 201 121 0.44 -0.42 4.87
## ------------------------------------------------------------
## group: yes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 163.02 39.39 171 163.94 45.96 80 224 144 -0.25 -1.13 6.01
The missing-value group has the highest median and mean systolic blood pressure (not shown in the output above). In order to create a new variable with three levels, type:
saltadd1 <- saltadd
levels(saltadd1) <- c("no", "yes", "missing")
saltadd1[is.na(saltadd)] <- "missing"
summary(saltadd1)
## no yes missing
## 37 43 20
summary(aov(age ~ saltadd1))
## Df Sum Sq Mean Sq F value Pr(>F)
## saltadd1 2 115 57.42 0.448 0.64
## Residuals 97 12422 128.06
Since there is not enough evidence that the missing group is important, and for additional reasons of simplicity, we will assume the values are MCAR (missing completely at random), ignore this group, and continue the analysis with the original saltadd variable consisting of only two levels. Before doing this, however, a simple regression model and regression line are first fitted.
lm1 <- lm(sbp ~ age)
summary(lm1)
##
## Call:
## lm(formula = sbp ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.231 -21.801 -3.679 15.127 105.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.1465 14.8942 4.374 3.05e-05 ***
## age 1.8422 0.2997 6.147 1.71e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.56 on 98 degrees of freedom
## Multiple R-squared: 0.2782, Adjusted R-squared: 0.2709
## F-statistic: 37.78 on 1 and 98 DF, p-value: 1.712e-08
Although the R-squared is not very high, the P value is small, indicating an important association of age with systolic blood pressure. A scatterplot of age against systolic blood pressure is now drawn, with the regression line added using the abline function. This function accepts many different argument forms, including a regression object: if the object has a coef method returning a vector of length 1, the value is taken as the slope of a line through the origin; otherwise the first two values are taken as the intercept and slope, as is the case for lm1.
plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg")
abline(lm1)
Subsequent exploration of the residuals suggests no significant deviation from normality and no pattern; the details can be obtained with the techniques already discussed and are omitted here. The next step is to adjust for salt habit, and later to provide different plot symbols for the two groups. Note that the sample size is now smaller (we omit the observations with missing saltadd, reducing the analysis to the complete cases):
lm2 <- lm(sbp ~ age + saltadd)
summary(lm2)
##
## Call:
## lm(formula = sbp ~ age + saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.369 -24.972 -0.125 22.456 64.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.1291 15.7645 4.005 0.000142 ***
## age 1.5526 0.3118 4.979 3.81e-06 ***
## saltaddyes 22.9094 6.9340 3.304 0.001448 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.83 on 77 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.3158
## F-statistic: 19.23 on 2 and 77 DF, p-value: 1.679e-07
On average, a one-year increment in age is associated with an increase in systolic blood pressure of 1.5 mmHg, and adding table salt is associated with a significantly higher systolic blood pressure, by approximately 23 mmHg.
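To see how these coefficients translate into predictions, expected SBP values can be computed for both groups at a given age (an illustrative sketch; the age of 40 is arbitrary):
new <- data.frame(age = 40, saltadd = c("no", "yes"))
predict(lm2, newdata = new) # expected SBP at age 40, without and with added salt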
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
Note that the red dots, corresponding to those who added table salt, tend to lie higher than the blue circles.
The final task is to draw the two separate regression lines, one for each group. The intercept for non-salt users is the first coefficient, and that for salt users is the first plus the third. The slope is the same for both groups.
a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b, col = "blue")
abline(a = a1, b, col = "red")
Note that the X-axis does not start at zero, so the intercepts lie outside the plot frame. The red line is for the red points of the salt adders and the blue line for the blue points of the non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure independently of added salt.
But look at the distributions of the points of the two colours: the red points are higher than the blue ones, but mainly in the right half of the graph. To fit lines with different slopes, a new model with an interaction term is needed.
Therefore the next step is to estimate a model with different slopes (i.e. different ‘b’ values for the abline arguments) for the two lines. The model needs an interaction term between saltadd and age.
lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)
##
## Call:
## lm(formula = sbp ~ age * saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.885 -24.891 -1.142 21.228 66.867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.0066 20.3981 3.824 0.000267 ***
## age 1.2419 0.4128 3.009 0.003558 **
## saltaddyes -12.2540 31.4574 -0.390 0.697965
## age:saltaddyes 0.7199 0.6282 1.146 0.255441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.77 on 76 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3445, Adjusted R-squared: 0.3186
## F-statistic: 13.31 on 3 and 76 DF, p-value: 4.528e-07
For the intercept of the salt users, the second and fourth terms are zero (since age is zero), but the third must be kept. This term is negative, so the intercept of the salt users is lower than that of the non-users.
a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]
For the slope of the non-salt users, the second coefficient alone is enough, since the first and third coefficients do not change with age and the fourth term is zero when saltadd is ‘no’. The slope for the salt users includes both the second and the fourth coefficients, since saltaddyes equals 1.
b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]
plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))
Note that as.numeric(saltadd) converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.
This model suggests that at young ages the systolic blood pressures of the two groups are not very different, as the two lines are close together on the left of the plot: at the age of 25, for example, the difference is 5.7 mmHg. Increasing age increases the difference between the two groups, and at 70 years of age the difference is as large as 38 mmHg. In this sense, age modifies the effect of adding table salt on blood pressure.
On the other hand, the slope of age is 1.24 mmHg per year among those who did not add salt, but becomes 1.24 + 0.72 = 1.96 mmHg per year among the salt adders. Thus, salt adding modifies the effect of age. Note that ‘interaction’ is the statistical term, whereas ‘effect modification’ is the equivalent epidemiological term.
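These numbers can be reproduced directly from the fitted coefficients (an illustrative check):
cf <- coef(lm3)
cf[3] + cf[4] * c(25, 70) # group difference at ages 25 and 70 (about 5.7 and 38 mmHg)
cf[2] + c(0, cf[4]) # slopes for non-adders and salt adders (about 1.24 and 1.96)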
Note also that the coefficient of the interaction term age:saltaddyes is not statistically significant!
This means that the difference between the two slopes may be due to chance (or that the study is underpowered to detect the interaction).
This was in fact just a didactic example to show how to introduce an interaction term in a regression model.