First of all, install and load the required libraries.

library(rms)
library(Hmisc)
library(epiDisplay)
library(psych)

Visualizing relationships/functional forms

Remember that linear regression always means linearity in the parameters, irrespective of linearity in explanatory variables (i.e. the functional forms selected for the covariates). Y is linearly related to X (in the parameters) if the rate of change of Y with respect to X (dY/dX) is independent of the value of X. A function Y=BX=\(b_{1}\)\(x_{1}\) + \(b_{2}\)\(x_{2}\) is said to be linear (in the parameters), say, \(b_{1}\), if \(b_{1}\) appears with a power of 1 only and is not multiplied or divided by any other parameter (for eg \(b_{1}\) x \(b_{2}\) , or \(b_{2}\) / \(b_{1}\)).

This is a different concept with respect to the linearity in the functional form of the covariates, that is not instead required.

Moreover, some regression models may look non linear in the parameters but are inherently or intrinsically linear.

This is because with suitable transformations they can be made linear in parameters.

Just as an example, imagine that we want to investigate the effect of total CSF polymorph count on blood glucose ratio in patients with either bacterial or viral meningitis.

require(Hmisc)
getHdata(abm)

As a first step in every statistical analysis, we should visualize data of interest.

In this example, we use a nonparametric regression approach to explore the relationship between the two variables: plsmo is a loess nonparametric smoother.

with(ABM, {
  glratio <- gl / bloodgl
  tpolys <- polys * whites / 100
  plsmo(tpolys, glratio, xlab='Total Polymorphs in CSF',
       ylab='CSF/Blood Glucose Ratio',    
       xlim=quantile(tpolys,  c(.05,.95), na.rm=TRUE),
       ylim=quantile(glratio, c(.05,.95), na.rm=TRUE))
 scat1d(tpolys); scat1d(glratio, side=4) })

Moreover, we can use also a Super smoother relating age to the probability of bacterial meningitis given a patient has bacterial or viral meningitis (a binary outcome, see later in this block for logistic regression examples), with a rug plot showing the age distribution:

with(ABM, {
  plsmo(age, abm, 'supsmu', bass=7,     
        xlab='Age at Admission, Years',
        ylab='Proportion Bacterial Meningitis')
  scat1d(age) })

We can use these nonparametric approaches to help in finding a suitable functional form for the candidate predictor (switching to parametric regression models), that could be very different from the linear effect. For example, we can decide to use splines to model the effect of a continuous predictor on an outcome. For whom interested in splines see: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3

Multiple linear regression (MLR)

Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the different levels of another third, categorical variable, i.e. to verify for the presence of an interaction. (This is just a didactic example, usually more than 3 variables are involved in MLR models).

Example: Systolic blood pressure

A small survey on blood pressure was carried out. The objective is to see the hypertensive effect of subjects putting additional table salt on their meal.Gender of subjects is also measured.

data(BP)
attach(BP)
des(BP)

## cross-sectional survey on BP & risk factors 
##  No. of observations =  100 
##   Variable      Class           Description        
## 1 id            integer         id                 
## 2 sex           factor          sex                
## 3 sbp           integer         Systolic BP        
## 4 dbp           integer         Diastolic BP       
## 5 saltadd       factor          Salt added on table
## 6 birthdate     Date

Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in saltadd. The frequencies of the categorical variables sex and saltadd are now inspected.

describe(data.frame(sex, saltadd))

##          vars   n mean  sd median trimmed mad min max range  skew kurtosis   se
## sex*        1 100 1.55 0.5      2    1.56   0   1   2     1 -0.20    -1.98 0.05
## saltadd*    2  80 1.54 0.5      2    1.55   0   1   2     1 -0.15    -2.00 0.06

The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study).

age.in.days <- as.Date("2001-03-12") - birthdate

There is a leap year in every four years. Therefore, an average year will have 365.25 days.

class(age.in.days)

## [1] "difftime"

age <- as.numeric(age.in.days)/365.25

The function as.numeric is needed to transform the units of age (difftime); otherwise modelling would not be possible.

describeBy(sbp,saltadd)

## 
##  Descriptive statistics by group 
## group: no
##    vars  n   mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 37 137.46 29.62    132  136.39 23.72  80 201   121 0.44    -0.42 4.87
## ------------------------------------------------------------ 
## group: yes
##    vars  n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## X1    1 43 163.02 39.39    171  163.94 45.96  80 224   144 -0.25    -1.13 6.01

Recoding missing values into another category

The missing value group has the highest median and average systolic blood pressure. In order to create a new variable with three levels type:

saltadd1 <- saltadd
levels(saltadd1) <- c("no", "yes", "missing")
saltadd1[is.na(saltadd)] <- "missing"
summary(saltadd1)

##      no     yes missing 
##      37      43      20

summary(aov(age ~ saltadd1))

##             Df Sum Sq Mean Sq F value Pr(>F)
## saltadd1     2    115   57.42   0.448   0.64
## Residuals   97  12422  128.06

Since there is not enough evidence that the missing group is important and for additional reasons of simplicity, we will assume MCAR (missing completely at random) and we will ignore this group and continue the analysis with the original saltadd variable consisting of only two levels. Before doing this however, a simple regression model and regression line are first fitted.

lm1 <- lm(sbp ~ age)
summary(lm1)

## 
## Call:
## lm(formula = sbp ~ age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.231 -21.801  -3.679  15.127 105.395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  65.1465    14.8942   4.374 3.05e-05 ***
## age           1.8422     0.2997   6.147 1.71e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.56 on 98 degrees of freedom
## Multiple R-squared:  0.2782, Adjusted R-squared:  0.2709 
## F-statistic: 37.78 on 1 and 98 DF,  p-value: 1.712e-08

Although the R-squared is not very high, the p value is small indicating important influence of age on systolic blood pressure. A scatterplot of age against systolic blood pressure is now shown with the regression line added using the abline function. This function can accept many different argument forms, including a regression object. If this object has a coef method, and it returns a vector of length 1, then the value is taken to be the slope of a line through the origin, otherwise the first two values are taken to be the intercept and slope, as is the case for lm1.

plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg")
abline(lm1)

Subsequent exploration of residuals suggests a non-significant deviation from normality and no pattern. Details of this can be adopted from the techniques already discussed and are omitted here. The next step is to provide different plot patterns for different groups of salt habits. Note that here the sample size is lower (we decided to omit missing values for saltadd, reducing the analysis to the complete cases):

lm2 <- lm(sbp ~ age + saltadd)
summary(lm2)

## 
## Call:
## lm(formula = sbp ~ age + saltadd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.369 -24.972  -0.125  22.456  64.805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.1291    15.7645   4.005 0.000142 ***
## age           1.5526     0.3118   4.979 3.81e-06 ***
## saltaddyes   22.9094     6.9340   3.304 0.001448 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.83 on 77 degrees of freedom
##   (20 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared:  0.3331, Adjusted R-squared:  0.3158 
## F-statistic: 19.23 on 2 and 77 DF,  p-value: 1.679e-07

On the average, a one year increment of age is associated with an increase in systolic blood pressure by 1.5 mmHg. Adding table salt increases systolic blood pressure significantly by approximately 23 mmHg.

plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)

Note that the red dots corresponding to those who added table salt are higher than the blue circles.

The final task is to draw two separate regression lines for each group. We now have two regression lines to draw, one for each group. The intercept for non-salt users will be the first coefficient and for salt users will be the first plus the third. The slope for both groups is the same.

a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]

plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b, col = "blue")
abline(a = a1, b, col = "red")

Note that X-axis does not start at zero. Thus the intercepts are out of the plot frame. The red line is for the red points of salt adders and the blue line is for the blue points of non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure independently from added salt.

But look at the distributions of the points of the two colours: the red points are higher than the blue ones but mainly on the right half of the graph. To fit lines with different slopes, a new model with interaction term is created.

Therefore the next step is to estimate a model with different slopes (or different ‘b’ for the abline arguments) for the different lines. The model needs an interaction term between addsalt and age.

lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)

## 
## Call:
## lm(formula = sbp ~ age * saltadd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.885 -24.891  -1.142  21.228  66.867 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     78.0066    20.3981   3.824 0.000267 ***
## age              1.2419     0.4128   3.009 0.003558 ** 
## saltaddyes     -12.2540    31.4574  -0.390 0.697965    
## age:saltaddyes   0.7199     0.6282   1.146 0.255441    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.77 on 76 degrees of freedom
##   (20 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared:  0.3445, Adjusted R-squared:  0.3186 
## F-statistic: 13.31 on 3 and 76 DF,  p-value: 4.528e-07

For the intercept of the salt users, the second term and the fourth are all zero (since age is zero) but the third should be kept as such. This term is negative. The intercept of salt users is therefore lower than that of the non-users.

a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]

For the slope of the non-salt users, the second coefficient alone is enough since the first and the third are not involved with each unit of increment of age and the fourth term has saltadd being 0. The slope for the salt users group includes the second and the fourth coefficients since saltaddyes is 1.

b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]

plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))

Note that as.numeric(saltadd) converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.

This model suggests that at the young age, the systolic blood pressure of two groups are not much different as the two lines are close together on the left of the plot. For example, at the age of 25, the difference is 5.7mmHg. Increasing age increases the difference between the two groups. At 70 years of age, the difference is as great as 38mmHg. In this aspect, age modifies the effect of adding table salt on blood pressure.

On the other hand the slope of age is 1.24mmHg per year among those who did not add salt but becomes 1.24+0.72 = 1.96mmHg among the salt adders. Thus, salt adding modifies the effect of age. Note that interaction is a statistical term whereas effect modification is the equivalent epidemiological term.

Note also that the coefficient of the interaction term age:saltaddyes is not statistically significant !!!

This means that the two slopes just differ by chance (or that we are low-powered to detect a significant interaction..)

This was in fact just a didactic example to show how to introduce an interaction term in a regression model.

Linear Regression basics (II)

Visualizing relationships/functional forms

Multiple linear regression (MLR)

Example: Systolic blood pressure

Recoding missing values into another category