---
title: "Multibias example: Evans"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Multibias example: Evans}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(multibias)
```
We are interested in quantifying the effect of smoking (SMK) on coronary heart disease (CHD) using data from the [Evans County Heart Study](https://en.wikipedia.org/wiki/Evans_County_Heart_Study). Let's inspect the dataframe we have for 609 participants aged 40 and older.
```{r, eval = TRUE}
head(evans)
```
This is clearly not the most robust data, but, for purposes of demonstrating `multibias`, let's proceed by pretending that our data was missing information on AGE. Let's inspect the resulting effect estimate, adjusted for hypertension (HPT).
```{r, eval = TRUE}
biased_model <- glm(CHD ~ SMK + HPT,
family = binomial(link = "logit"),
data = evans
)
or <- round(exp(coef(biased_model)[2]), 2)
or_ci_low <- round(
exp(coef(biased_model)[2] - 1.96 * summary(biased_model)$coef[2, 2]), 2
)
or_ci_high <- round(
exp(coef(biased_model)[2] + 1.96 * summary(biased_model)$coef[2, 2]), 2
)
print(paste0("Biased Odds Ratio: ", or))
print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")"))
```
This estimate is around what we'd expect. In the IJE article [The association between tobacco smoking and coronary heart disease](https://doi.org/10.1093/ije/dyv124) the author notes: "Cigarette smokers have about twice as much coronary heart disease as non-smokers, whether measured by deaths, prevalence, or the incidence of new events." However, we also know from studies like [this BMJ meta-analysis](https://doi.org/10.1136/bmj.j5855) that the effect estimate can vary greatly depending on the degree of cigarette consumption.
Can we anticipate whether this odds ratio without age-adjustment is biased towards or away from the null? Let's consider the association of the uncontrolled confounder with the exposure and outcome.
```{r}
cor(evans$SMK, evans$AGE)
cor(evans$CHD, evans$AGE)
```
In our data, AGE has a negative association with SMK (older people are **less** likely to be smokers) and a positive association with CHD (older people are **more** likely to have CHD). These opposite associations must be biasing the odds ratio towards the null, creating a distortion where those who are less likely to smoke are more likely to experience the outcome.
We'll treat AGE as a binary indicator of over (1) or under (0) age 60. To adjust for the uncontrolled confounding from AGE, let's refer to the appropriate bias model for a binary uncontrolled confounder: logit(P(U=1)) = α0 + α1X + α2Y + α2+jCj.
To derive the necessary bias parameters, let's make the following assumption:
* the odds of an age>60 is half as likely in smokers than non-smokers
* the odds of an age>60 is 2.5x in those with CHD than those without CHD
* the odds of an age>60 is 2x in those with HPT than those without HPT
To convert these relationships as parameters in the model, we'll log-transform them from odds ratios. For the model intercept, we can use the following reasoning: what is the probability that a non-smoker (X=0) without CHD (Y=0) and HPT (C=0) is over age 60 in this population? We'll assume this is a 25% probability. We'll use the inverse logit function `qlogis()` from the `stats` package to convert this from a probability to the intercept coefficient of the logistic regression model.
```{r, eval = TRUE}
u_0 <- qlogis(0.25)
u_x <- log(0.5)
u_y <- log(2.5)
u_c <- log(2)
```
Now let's plug these bias parameters into `adjust_uc()` along with our `data_observed` object to obtain our bias-adjusted odds ratio.
```{r, eval = TRUE}
df_obs <- data_observed(
data = evans,
exposure = "SMK",
outcome = "CHD",
confounders = "HPT"
)
set.seed(1234)
adjust_uc(
df_obs,
u_model_coefs = c(u_0, u_x, u_y, u_c)
)
```
We get an odds ratio of 2.3 (95% CI: 1.3, 4.1). This matches our expectation that the bias adjustment would pull the odds ratio away from the null. How does this result compare to the result we would get if age **wasn't** missing in the data and was incorporated in the outcome regression?
```{r, eval = TRUE}
full_model <- glm(CHD ~ SMK + HPT + AGE,
family = binomial(link = "logit"),
data = evans
)
or <- round(exp(coef(full_model)[2]), 2)
or_ci_low <- round(
exp(coef(biased_model)[2] - 1.96 * summary(full_model)$coef[2, 2]), 2
)
or_ci_high <- round(
exp(coef(biased_model)[2] + 1.96 * summary(full_model)$coef[2, 2]), 2
)
print(paste0("Odds Ratio: ", or))
print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")"))
```
It turns out that our bias-adjusted odds ratio using `multibias` is close to this complete-data odds ratio of 2.3.