Confounding

Identifying and addressing confounding in observational data
Published

October 24, 2024

Dummy dataset

To demonstrate confounding in the relationship between coffee consumption and cardiovascular disease (CVD), using tobacco as a confounder, we first generate a dummy dataset with three variables: coffee consumption, CVD status, and tobacco use. We will assume that tobacco use affects both coffee consumption and the likelihood of CVD.

# Settings
set.seed(123)  # for reproducibility
n <- 500       # number of iterations

# Generate tobacco use (confounder)
tobacco <- rbinom(n, 1, 0.4)  # 40% smokers

# Generate coffee consumption based on tobacco use (confounder effect)
# Smokers are more likely to drink coffee
coffee <- rbinom(n, 1, prob = 0.2 + 0.5 * tobacco)  

# Generate CVD based on tobacco use and coffee consumption
# Tobacco is a stronger predictor of CVD
CVD <- rbinom(n, 1, prob = 0.1 + 0.3 * tobacco + 0.05 * coffee)

# Create the data frame
data <- data.frame(coffee, tobacco, CVD)

# Examine the first few rows of the dataset
head(data)
  coffee tobacco CVD
1      0       0   0
2      1       1   1
3      0       0   0
4      1       1   1
5      1       1   1
6      0       0   0

Visualising confounding

To visually check for confounding in the relationship between coffee consumption and cardiovascular disease (CVD), we can use several types of plots in R. These plots help illustrate how the confounder (tobacco) influences both the exposure (coffee consumption) and the outcome (CVD).

Correlation heatmap

The pairwise correlations between the variables can be visualised in a heatmap:

# Load the required package
library(ggcorrplot)
Loading required package: ggplot2
# Compute the correlation matrix
corr_matrix <- cor(data)

# Plot the correlation heatmap
ggcorrplot(
  corr_matrix, method = "circle",
  lab = TRUE, title = "Correlation Matrix: Coffee, Tobacco, and CVD")

Interpretation:

  • Coffee and Tobacco have a moderate positive correlation, as expected, since smokers are more likely to drink coffee in this simulated dataset.
  • Tobacco and CVD show a stronger positive correlation, reflecting that smoking increases the risk of cardiovascular disease.
  • Coffee and CVD have a weaker positive correlation, which may be confounded by tobacco use.

Scatterplot with Grouping (Exposure vs. Outcome with Confounder Grouping)

A scatterplot or barplot that groups the data by the confounder (tobacco) can visually show the confounding effect. In this case, we can plot the relationship between coffee and CVD, with separate lines or bars for smokers and non-smokers.

library(ggplot2)

# Create a bar plot showing CVD rate by coffee consumption, stratified by tobacco use
ggplot(data, aes(x = factor(coffee), fill = factor(CVD))) +
  geom_bar(position = "fill") +
  facet_wrap(~ tobacco, labeller = labeller(tobacco = c("0" = "Non-smokers", "1" = "Smokers"))) +
  labs(x = "Coffee Consumption", fill = "CVD Status", y = "Proportion") +
  scale_fill_manual(values = c("0" = "lightblue", "1" = "red")) +
  ggtitle("CVD Proportion by Coffee Consumption (Stratified by Tobacco Use)")

This plot shows how the relationship between coffee and CVD changes when stratified by tobacco use. If there is confounding, the relationship between coffee and CVD should look different for smokers and non-smokers.

Vice versa, we can plot the relationship between tobacco use and CVD, stratified by coffee consumption. If there is confounding, the relationship between tobacco and CVD should look similar for coffee consumers and non-consumers.

# Create a bar plot showing CVD rate by coffee consumption, stratified by tobacco use
ggplot(data, aes(x = factor(tobacco), fill = factor(CVD))) +
  geom_bar(position = "fill") +
  facet_wrap(~ coffee, labeller = labeller(coffee = c("0" = "No coffee", "1" = "Coffee"))) +
  labs(x = "Tobacco use", fill = "CVD Status", y = "Proportion") +
  scale_fill_manual(values = c("0" = "lightblue", "1" = "red")) +
  ggtitle("CVD Proportion by Tobacco use (Stratified by Coffee Consumption)")

Interaction Plot (Coffee, CVD, and Tobacco)

An interaction plot can help visually assess whether there is an interaction effect between coffee consumption and tobacco use on the likelihood of CVD. It shows the predicted probabilities or the mean CVD rate across different groups.

# Calculate mean CVD rate by coffee and tobacco groups
mean_cvd <- aggregate(CVD ~ coffee + tobacco, data = data, FUN = mean)

# Interaction plot
ggplot(mean_cvd, aes(x = factor(coffee), y = CVD, color = factor(tobacco), group = tobacco)) +
  geom_line(linewidth = 1.2) +
  geom_point(linewidth = 3) +
  labs(x = "Coffee Consumption", y = "Mean CVD Rate", color = "Tobacco Use") +
  ggtitle("Interaction Plot: Coffee Consumption, Tobacco, and Mean CVD Rate")
Warning in geom_point(linewidth = 3): Ignoring unknown parameters: `linewidth`

This plot illustrates how the relationship between coffee consumption and CVD varies by tobacco use. If tobacco is a confounder, the slopes for smokers and non-smokers should differ, suggesting that the effect of coffee on CVD is not the same for both groups.

Model-based identification of confounding

We can analyze the relationship between coffee and CVD without adjusting for tobacco use and then include tobacco use to observe its confounding effect.

First, we can see if there is a relationship between coffee consumption and CVD without considering tobacco as a confounder.

# Unadjusted model (coffee vs CVD)
unadjusted_model <- glm(CVD ~ coffee, data = data, family = "binomial")
summary(unadjusted_model)

Call:
glm(formula = CVD ~ coffee, family = "binomial", data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.8036     0.1665  -10.83  < 2e-16 ***
coffee        1.2051     0.2219    5.43 5.64e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 536.85  on 499  degrees of freedom
Residual deviance: 506.09  on 498  degrees of freedom
AIC: 510.09

Number of Fisher Scoring iterations: 4

This will give us the log odds ratio for the effect of coffee consumption on CVD without considering tobacco.

Now, we adjust for tobacco use to see if the relationship between coffee and CVD changes after accounting for the confounder.

# Adjusted model (coffee vs CVD, adjusting for tobacco)
adjusted_model <- glm(CVD ~ coffee + tobacco, data = data, family = "binomial")
summary(adjusted_model)

Call:
glm(formula = CVD ~ coffee + tobacco, family = "binomial", data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.3239     0.2057 -11.298  < 2e-16 ***
coffee        0.4377     0.2582   1.696     0.09 .  
tobacco       1.7344     0.2666   6.507 7.68e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 536.85  on 499  degrees of freedom
Residual deviance: 459.82  on 497  degrees of freedom
AIC: 465.82

Number of Fisher Scoring iterations: 4

By comparing the coefficients in both models, we can observe the confounding effect of tobacco on the coffee-CVD relationship.

Approaches to Handle Confounding

Stratification by tobacco use

We can stratify the analysis by tobacco use to compare the effect of coffee on CVD within each group (smokers and non-smokers).

# Stratified analysis for non-smokers
non_smokers <- subset(data, tobacco == 0)
strat_model_ns <- glm(CVD ~ coffee, data = non_smokers, family = "binomial")
summary(strat_model_ns)

Call:
glm(formula = CVD ~ coffee, family = "binomial", data = non_smokers)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.2889     0.2237 -10.230   <2e-16 ***
coffee        0.3079     0.4385   0.702    0.483    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 196.09  on 304  degrees of freedom
Residual deviance: 195.62  on 303  degrees of freedom
AIC: 199.62

Number of Fisher Scoring iterations: 5
# Stratified analysis for smokers
smokers <- subset(data, tobacco == 1)
strat_model_sm <- glm(CVD ~ coffee, data = smokers, family = "binomial")
summary(strat_model_sm)

Call:
glm(formula = CVD ~ coffee, family = "binomial", data = smokers)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.6419     0.2763  -2.323   0.0202 *
coffee        0.5103     0.3250   1.570   0.1164  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 266.58  on 194  degrees of freedom
Residual deviance: 264.06  on 193  degrees of freedom
AIC: 268.06

Number of Fisher Scoring iterations: 4

Based on the stratified results, we can calculate the pooled Mantel-Haenszel odds ratio.

# Load required package
library(epiR)
Loading required package: survival
Package epiR 2.0.76 is loaded
Type help(epi.about) for summary information
Type browseVignettes(package = 'epiR') to learn how to use epiR for applied epidemiological analyses
# Combine tobacco as a stratifying variable in a 2x2x2 table format
mantel_haenszel_table <- table(data$coffee, data$CVD, data$tobacco)
mantel_haenszel_table
, ,  = 0

   
      0   1
  0 217  22
  1  58   8

, ,  = 1

   
      0   1
  0  38  20
  1  73  64
# Apply Mantel-Haenszel Odds Ratio (MH OR)
epi.2by2(mantel_haenszel_table, method = "cross.sectional")
             Outcome +    Outcome -      Total               Prev risk *
Exposed +          255           42        297    85.86 (81.37 to 89.61)
Exposed -          131           72        203    64.53 (57.53 to 71.10)
Total              386          114        500    77.20 (73.27 to 80.81)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Prev risk ratio (crude)                        1.33 (1.19, 1.49)
Prev risk ratio (M-H)                          1.10 (0.99, 1.22)
Prev risk ratio (crude:M-H)                    1.21
Prev odds ratio (crude)                        3.34 (2.16, 5.16)
Prev odds ratio (M-H)                          1.56 (0.94, 2.59)
Prev odds ratio (crude:M-H)                    2.14
Attrib prev in the exposed (crude) *           21.33 (13.64, 29.01)
Attrib prev in the exposed (M-H) *             7.02 (-5.41, 19.45)
Attrib prev (crude:M-H)                        3.04
-------------------------------------------------------------------
 Woolf test of homogeneity of PRs: chi2(1) = 1.684 Pr>chi2 = 0.194
 Woolf test of homogeneity of ORs: chi2(1) = 0.089 Pr>chi2 = 0.766
 Test that M-H adjusted OR = 1:  chi2(1) = 2.879 Pr>chi2 = 0.045
 Wald confidence limits
 M-H: Mantel-Haenszel; CI: confidence interval
 * Outcomes per 100 population units 

Regression adjustment

As done earlier, logistic regression models can adjust for tobacco use. This method is often more convenient than stratification.

# The adjusted model shown earlier adjusts for tobacco use
summary(adjusted_model)

Call:
glm(formula = CVD ~ coffee + tobacco, family = "binomial", data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.3239     0.2057 -11.298  < 2e-16 ***
coffee        0.4377     0.2582   1.696     0.09 .  
tobacco       1.7344     0.2666   6.507 7.68e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 536.85  on 499  degrees of freedom
Residual deviance: 459.82  on 497  degrees of freedom
AIC: 465.82

Number of Fisher Scoring iterations: 4
# Odds ratio based on the adjusted model
cbind(
  OR = exp(coef(adjusted_model)),
  exp(confint(adjusted_model)))
Waiting for profiling to be done...
                    OR      2.5 %    97.5 %
(Intercept) 0.09788812 0.06405821 0.1438391
coffee      1.54916632 0.93204191 2.5696811
tobacco     5.66535568 3.39174231 9.6687771

Note how the adjusted OR for coffee consumption is similar to the Mantel-Haenszel corrected OR.

Classification tree

Classification trees provide a hierarchical structure of relationships between variables. By examining the tree structure, we can see which variables are driving the splits (important predictors of the outcome). If a variable that could be a potential confounder (e.g., age, smoking status) is frequently chosen to split the data, it may indicate that this variable plays an important role in explaining the outcome.

Random forests, an extension of decision trees, consist of an ensemble of decision trees where each tree is trained on a random subset of the data and variables. Random forests rank variables by their importance in predicting the outcome. If a potential confounder (e.g., smoking) has high importance, it suggests that this variable influences both the outcome (e.g., CVD) and the exposure (e.g., coffee consumption), signaling confounding.

library(rpart)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin
# Fit classification tree
fitTree <- rpart(factor(CVD) ~ coffee + tobacco, data = data)
print(fitTree)
n= 500 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 500 114 0 (0.7720000 0.2280000) *
# Fit random forest
rf <- randomForest(factor(CVD) ~ coffee + tobacco, data = data, ntree = 1000)

# Show variable importance
importance(rf)
        MeanDecreaseGini
coffee          6.203028
tobacco        17.299934

G-computation

G-computation (or the G-formula) is a causal inference method derived from the counterfactual framework, often used to estimate the causal effect of an exposure on an outcome, adjusting for confounders. G-computation models the outcome (e.g., cardiovascular disease, CVD) under different hypothetical scenarios of the exposure (e.g., coffee consumption) and then averages these over the distribution of confounders (e.g., tobacco use).

The G-formula provides a direct way to adjust for confounding by estimating the expected outcome for both treated and untreated groups if everyone in the population had the same level of exposure, regardless of their actual exposure.

Steps for G-Computation:

  • Fit an outcome model: Model the relationship between the confounder (tobacco), the exposure (coffee), and the outcome (CVD).
  • Predict counterfactual outcomes: Use the model to predict the probability of CVD for everyone in the dataset assuming two different scenarios:
    • Everyone drank coffee (coffee = 1).
    • No one drank coffee (coffee = 0).
  • Estimate the causal effect: Compare the average predicted outcomes under the two scenarios to estimate the causal effect of coffee consumption on CVD.
# Fit a logistic regression model for the outcome
outcome_model <- glm(CVD ~ coffee + tobacco, data = data, family = binomial)
summary(outcome_model)

Call:
glm(formula = CVD ~ coffee + tobacco, family = binomial, data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.3239     0.2057 -11.298  < 2e-16 ***
coffee        0.4377     0.2582   1.696     0.09 .  
tobacco       1.7344     0.2666   6.507 7.68e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 536.85  on 499  degrees of freedom
Residual deviance: 459.82  on 497  degrees of freedom
AIC: 465.82

Number of Fisher Scoring iterations: 4
# Create a copy of the data with coffee set to 1 (everyone drinks coffee)
data_coffee1 <- data
data_coffee1$coffee <- 1

# Predict the probability of CVD for everyone if all drank coffee
pred_coffee1 <- predict(outcome_model, newdata = data_coffee1, type = "response")

# Create a copy of the data with coffee set to 0 (no one drinks coffee)
data_coffee0 <- data
data_coffee0$coffee <- 0

# Predict the probability of CVD for everyone if no one drank coffee
pred_coffee0 <- predict(outcome_model, newdata = data_coffee0, type = "response")

# Calculate the mean predicted probability of CVD under both scenarios
mean_pred_coffee1 <- mean(pred_coffee1)
mean_pred_coffee0 <- mean(pred_coffee0)

# Calculate the average causal effect (risk difference)
causal_effect <- mean_pred_coffee1 - mean_pred_coffee0
causal_effect
[1] 0.06703175

R session info

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Belgium.utf8  LC_CTYPE=English_Belgium.utf8   
[3] LC_MONETARY=English_Belgium.utf8 LC_NUMERIC=C                    
[5] LC_TIME=English_Belgium.utf8    

time zone: Europe/Brussels
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomForest_4.7-1.2 rpart_4.1.23         epiR_2.0.76         
[4] survival_3.6-4       ggcorrplot_0.1.4.1   ggplot2_3.5.1       

loaded via a namespace (and not attached):
 [1] gtable_0.3.5            xfun_0.45               htmlwidgets_1.6.4      
 [4] lattice_0.22-6          vctrs_0.6.5             tools_4.4.1            
 [7] generics_0.1.3          curl_5.2.1              tibble_3.2.1           
[10] proxy_0.4-27            fansi_1.0.6             pkgconfig_2.0.3        
[13] Matrix_1.7-0            KernSmooth_2.23-24      data.table_1.15.4      
[16] uuid_1.2-0              lifecycle_1.0.4         flextable_0.9.6        
[19] compiler_4.4.1          farver_2.1.2            stringr_1.5.1          
[22] textshaping_0.4.0       munsell_0.5.1           httpuv_1.6.15          
[25] fontquiver_0.2.1        fontLiberation_0.1.0    htmltools_0.5.8.1      
[28] class_7.3-22            yaml_2.3.9              later_1.3.2            
[31] pillar_1.9.0            crayon_1.5.3            gfonts_0.2.0           
[34] openssl_2.2.0           classInt_0.4-10         BiasedUrn_2.0.12       
[37] mime_0.12               fontBitstreamVera_0.1.1 zip_2.3.1              
[40] tidyselect_1.2.1        digest_0.6.36           stringi_1.8.4          
[43] sf_1.0-16               dplyr_1.1.4             reshape2_1.4.4         
[46] pander_0.6.5            labeling_0.4.3          splines_4.4.1          
[49] fastmap_1.2.0           grid_4.4.1              colorspace_2.1-0       
[52] cli_3.6.3               magrittr_2.0.3          crul_1.5.0             
[55] utf8_1.2.4              e1071_1.7-14            withr_3.0.0            
[58] promises_1.3.0          gdtools_0.3.7           scales_1.3.0           
[61] lubridate_1.9.3         timechange_0.3.0        officer_0.6.6          
[64] rmarkdown_2.27          ragg_1.3.2              askpass_1.2.0          
[67] zoo_1.8-12              shiny_1.8.1.1           evaluate_0.24.0        
[70] knitr_1.48              rlang_1.1.4             Rcpp_1.0.12            
[73] xtable_1.8-4            glue_1.7.0              DBI_1.2.3              
[76] httpcode_0.3.0          xml2_1.3.6              rstudioapi_0.16.0      
[79] jsonlite_1.8.8          R6_2.5.1                plyr_1.8.9             
[82] systemfonts_1.1.0       units_0.8-5