---
title: "Units of measurement"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Units of measurement}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, setup, include = FALSE}
source(system.file("extdata", "vignettes", "helpers.R", package = "ricu"))

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(ricu)
library(ggplot2)
```

```{r, assign-demo, echo = FALSE}
demo <- c("mimic_demo", "eicu_demo")
```

```{r, assign-srcs, echo = FALSE}
srcs <- c("mimic", "eicu", "aumc", "hirid", "miiv")
```

```{r, demo-miss, echo = FALSE, eval = !srcs_avail(demo), results = "asis"}
demo_missing_msg(demo, "uom.html")
knitr::opts_chunk$set(eval = FALSE)
```

One challenge when working with electronic healthcare (EHR) data from different hospitals, which is accentuated if the intensive care units (ICUs) collecting the data are located in different geographic regions, is the use of different measurement unit systems. In part, this can be attributed to the use of imperial units in English speaking countries (e.g. patient weight being reported in lbs instead of kg), but more subtle differences in practice are involved as well, such as reporting lab test results in mg/dL instead of the SI unit mmol/L. While discrepancies of the former type are easy to resolve, harmonization of different notions of concentration is slightly more involved due to the conversion factor being substance-specific.

## `ricu` data concepts

Data concepts of type `num_cncpt` can specify an expected units as character vector (where the string in position 1 is added as per-column attribute `units`). When using concepts of this type, unit conversion to the specified target unit is not handled automatically but is up to the user via a callback function. Unit mismatches are reported as messages during concept loading^[Recently, experimental support for automatic unit conversion has been added via the `unt_cncpt` class. Such a concept will attempt to convert data columns using the [units](https://cran.r-project.org/package=units) package and therefore requires that both the source- and target-units are recognized and convertible (see `units::ud_are_convertible()`).].

```{r, demo-load, eval = srcs_avail("mimic_demo")}
(dat <- load_concepts(c("lact", "map"), "mimic_demo"))
```

Messages raised during loading of lactate values, for example, indicate that 0.62% of retrieved values are specified in mEq/L instead of mmol/L (which requires an identity transformation for unit conversion), while the remaining discrepancies are false positives (both mmole/L and MMOLL can be assumed to mean mmol/L). For mean arterial bloop pressure values, the target unit is specified as mmHg (with the alternative spelling mm Hg being accepted as well), however, due to the data organization in eICU^[Both the `vitalperiodic` and `vitalaperiodic` tables in eICU are layed out in *wide* format (whereas most other tables are in *long* format), and therefore no unit columns are available. This also explains the substantial degree of missingness (in terms of values) reported, as such a *wide* data organization scheme coupled with differing measurement intervals per variable will inevitably lead to some degree of sparsity.], no explicit measurement units are specified for this variable, in turn causing the large percentage of missing unit values reported.

Several utility functions are exported from `ricu` for helping with creating callback functions that handle unit conversion. Data items corresponding to the bilirubin concept for the European datasets HiRID and AUMCdb, for example, have a callback entry specified as ``convert_unit(binary_op(`*`, 0.058467), "mg/dL")``. This creates a callback functions which applies ``binary_op(`*`, 0.058467)`` to the column specified as `val_var` and replaces existing values in the column identified by `unit_var` with the value `"mg/dl"`. In case the loaded data already is comprised of a mix of units, a regular expression passed as `rgx` can be specified, which will be used to identify the rows on which to operate. Finally, the function `binary_op` turns a binary function into an unary function by fixing the second operand.

As a first sanity check we will slightly modify data loading in order to be warned about item IDs that do not appear in the data. For this we chop up data items loaded from the dictionary such that items of type `sel_itm` which may contain several IDs, are split into separate items and regular expressions in `rgx_itm` items are taken apart such that aggregate expressions such as `(foo|bar)baz` are turned into `foobaz` and `barbaz`.

```{r, sep-itms, include = FALSE, eval = TRUE}
assign_id <- function(id, itm, nme) `[[<-`(itm, nme, id)

rep_itm <- function(itm, nme = "ids", fun = identity) {
  as_item(lapply(fun(itm[[nme]]), assign_id, itm, nme))
}

unlst <- function(x) unlist(x, recursive = FALSE, use.names = FALSE)

split_ind <- function(x, s) {
  unlst(Map(substr, s, c(1L, x + 1L), c(x - 1L, nchar(s))))
}

greg <- function(...) {
  res <- lapply(gregexpr(...), `attributes<-`, NULL)
  res[vapply(res, identical, logical(1L), -1L)] <- list(integer())
  res
}

find_paren <- function(x) {

  sta <- greg("\\(", x)
  end <- greg("\\)", x)

  res <- Map(rep, 0L, nchar(x))
  res <- Map(`[<-`, res, sta, 1L)

  res <- Map(`[<-`, res, end, lapply(Map(`[`, res, end), `-`, 1L))
  res <- lapply(res, cumsum)

  res <- lapply(res, `==`, 0L)
  res <- Map(`[<-`, res, Map(c, sta, end), NA_integer_)

  res
}

expand_lst <- function(x) {
  apply(do.call(expand.grid, x), 1L, paste0, collapse = "")
}

seq_nna <- function(x) {
  nna <- !is.na(x)
  x[nna] <- seq_len(sum(nna))
  x
}

rep_paren <- function(x) {

  rln <- lapply(find_paren(x), rle)
  sqs <- lapply(lapply(rln, `[[`, "values"), seq_nna)
  rln <- Map(`[[<-`, rln, "values", sqs)
  rln <- lapply(rln , inverse.rle)

  res <- Map(split, strsplit(x, ""), rln)
  res <- lapply(res, lapply, paste0, collapse = "")

  res <- lapply(res, lapply, split_rgx)
  res <- lapply(res, expand_lst)

  unlst(res)
}

split_pipe <- function(x) {

  ind <- find_paren(x)

  spt <- greg("\\|", x)
  tdo <- lengths(spt) > 0L

  spt <- Map(`[`, spt, Map(`[`, ind, spt))
  res <- Map(split_ind, spt, x)

  res[tdo] <- lapply(res[tdo], rep_paren)

  unlst(res)
}

split_rgx <- function(rgx) unlst(split_pipe(rgx))

split_items <- function(x) {

  if (is_concept(x)) {
    return(new_concept(lapply(x, split_items)))
  }

  if (inherits(x, "sel_itm")) {
    return(rep_itm(x, "ids"))
  }

  if (inherits(x, "rgx_itm")) {
    return(rep_itm(x, "regex", split_rgx))
  }

  if (inherits(x, "itm")) {
    return(as_item(x))
  }

  if (inherits(x, "rec_cncpt")) {
    x$items <- as_concept(lapply(x$items, split_items))
  } else if (inherits(x, "cncpt")) {
    x$items <- do.call("c", lapply(x$items, split_items))
  }

  x
}

empty_items <- list()

report_empty <- function(dict, ...) {

  itm_load_report_empty <- quote({

    res <- returnValue()

    if (!nrow(res) && !inherits(x, "nul_itm")) {

      i <- 1

      while (!identical(.GlobalEnv, env <- parent.frame(i)) &&
             !inherits(cncpt <- get0("x", envir = env), "cncpt")) {
        i <- i + 1
      }

      if (inherits(x, "rgx_itm")) {
        id <- x$regex
      } else {
        id <- x$ids
      }

      add_empty <- list(name = cncpt$name, id = id, source = x$src,
                        table = x$table)
      empty_items <<- c(empty_items, list(add_empty))
    }

    res
  })

  trace(do_itm_load, exit = itm_load_report_empty, print = FALSE)
  on.exit(untrace(do_itm_load))

  load_concepts(split_items(dict), ...)

  empty_items
}
```

Next, for the actual data loading, `report_empty()` substitutes the internally called function `do_itm_load()` with a modified version that takes note of the offending IDs alongside concept, table and data source names whenever zero rows are returned for a given `itm` object.

```{r, itm-check, cache = TRUE, eval = srcs_avail(demo)}
<<assign-demo>>
concepts <- c("map", "lact", "bili", "gcs", "abx")

dict <- load_dictionary(demo, concepts)

empty_items <- report_empty(dict, merge = FALSE, verbose = FALSE)
```

```{r, itm-print, eval = length(get0("empty_items")) > 0L, echo = FALSE}
knitr::kable(
  data.table::rbindlist(empty_items),
  caption = paste(
    "For the given concepts, the listed item IDs do not return any data when",
    "the respective tables and data sources are queried."
  )
)
```

If no table is printed given the modified data loading above, every single ID that is part of a `sel_itm` or an `rgx_itm` actually returns some data. For other types of data items this means that for every single item as a whole, some data was returned. There are limitations to this type of sanity check though: It might be the case that one of the supplied IDs associated with a `num_cncpt` concept returns data in an unexpected unit of measurement which may cause the range filter to remove all of that data again. In such a scenario though, this will be reported (if `TRUE` is passed as `verbose` argument to `load_concept()`). Paying attention to the output produced by `load_concept()` should help spot such issues, albeit no longer at item resolution but only at concept level.

Next, we will investigate the number of measurements available per concept and stay day. For each stay ID and concept we calculate the number of measurements and note the stay duration. From this we can visualize how the number of measurements per day is distributed over the datasets alongside the percentage of patients that have at least one measurement available.

```{r, full-miss, echo = FALSE, eval = srcs_avail(demo) && (!srcs_avail(srcs) || quick_build()), results = "asis"}
demo_instead_full_msg(demo, srcs, "uom.html")
```

```{r, demo-srcs, eval = FALSE, echo = FALSE}
srcs <- demo
```

```{r, itm-funs, eval = FALSE, echo = FALSE}
count_meas <- function(x) {
  x[!is.na(get(data_var(x))), list(count = .N), by = c(id_vars(x))]
}

meas_day <- function(x, los) {
  merge(x, los)[, count := count / los_icu]
}

quants <- function(x) {
  setNames(
    as.list(quantile(x, c(0.05, 0.25, 0.5, 0.75, 0.95))),
    c("min", "lwr", "med", "upr", "max")
  )
}

meas_stats <- function(x, concept) {
  x[, c(list(concept = concept, n_pat = .N), quants(count / los_icu)),
    by = "source"]
}

```

```{r, itm-counts, eval = FALSE, echo = FALSE}

los <- load_concepts("los_icu", srcs, verbose = FALSE)
los <- los[los_icu > 0, ]

concepts <- c("map", "lact", "crea", "bili", "plt")

dat <- load_concepts(concepts, srcs, merge = FALSE, verbose = FALSE)

counts <- lapply(dat, count_meas)
counts <- lapply(counts, merge, los)
counts <- Map(meas_stats, counts, names(counts))
counts <- do.call(rbind, counts)
counts <- merge(counts, los[, list(total_pat = .N), by = "source"],
                by = "source")

head(counts)
```

```{r, itm-load, cache = TRUE, ref.label = c("itm-funs", if (srcs_avail(srcs) && !quick_build()) "assign-srcs" else "demo-srcs", "itm-counts")}
```

```{r, count-plot, echo = FALSE, fig.width = 6}
boxplots <- ggplot(counts, aes(concept)) +
  geom_boxplot(
     aes(ymin = min, lower = lwr, middle = med, upper = upr, ymax = max,
         color = source),
     stat = "identity"
  ) +
  coord_flip() +
  theme_bw() +
  xlab("Concept name") + ylab("Measurement count per ICU day")

pat_perc <- ggplot(counts, aes(concept)) +
  geom_col(aes(y = n_pat / total_pat, fill = source), position = "dodge2") +
  coord_flip() +
  theme_bw() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ylab("Percentage of patients")

if (requireNamespace("cowplot", quietly = TRUE)) {
  cowplot::plot_grid(
    cowplot::plot_grid(
      boxplots + theme(legend.position = "none"),
      pat_perc + theme(legend.position = "none"),
      nrow = 1L, rel_widths = c(0.7, 0.3)
    ),
    cowplot::get_legend(boxplots + theme(legend.position = "bottom")),
    ncol = 1L, rel_heights = c(0.9, 0.1)
  )
} else {
  boxplots
  pat_perc
}
```

Finally, we compare the densities we obtain by looking concept values per dataset, as visualized in the following plot.

```{r, uom-hist, echo = FALSE, fig.width = 6}
filter_quants <- function(x, lwr, upr) {

  do_filter <- function(x) {
    qq <- quantile(x, probs = c(lwr, upr), na.rm = TRUE)
    x[!is.na(x) & x >= qq[1L] & x <= qq[2L]]
  }

  x[, list(val = do_filter(get(data_var(x)))), by = "source"]
}

for (x in dat) {

  feat <- data_var(x)

  x <- filter_quants(x, lwr = 0.025, upr = 0.975)

  title <- x[, list(val = median(val)), by = "source"]
  title <- paste0(title$val, " (", title$source, ")", collapse = ", ")
  title <- paste0(feat, ": ", title)

  print(
    ggplot(x, aes(x = val, fill = source)) +
      geom_density(alpha = 0.5) +
      xlab(feat) + theme_bw() + ggtitle(title)
  )
}
```

When extending the `ricu` dictionary to both new data sources and new data concepts, it might be worthwhile to visually inspect the returned data in a fashion similar to the above in order to have a high-level confirmation that measurement units roughly agree.