Preparing data for finalfit

Ewen Harrison

This vignette shows you how to import and prepare any dataset for use with finalfit. The demonstration uses the boot::melanoma dataset; use ?boot::melanoma to see the help page with the data description. I will use library(tidyverse) methods. First I’ll write_csv() the data to disk, just to demonstrate reading it back in.

Read data

Note the various options in read_csv(), including providing column names, variable types, the missing data identifier, etc.

library(readr)

# Save example
write_csv(boot::melanoma, "boot.csv")

# Read data
melanoma = read_csv("boot.csv")
#> Rows: 205 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (7): time, status, sex, age, year, thickness, ulcer
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
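As a minimal sketch of two of those options, the example below reads a small inline CSV rather than boot.csv. The three rows and the -99 missing-value code are invented for illustration and are not part of the melanoma data; na and show_col_types are standard read_csv() arguments.

```r
library(readr)

# Hypothetical data: -99 stands in for a study-specific missing-value code
csv_text = "time,status,sex\n10,3,1\n30,NA,0\n-99,2,1"

demo = read_csv(
  I(csv_text),                # I() marks the string as literal data, not a file path
  na = c("", "NA", "-99"),    # strings to treat as missing
  show_col_types = FALSE      # silence the column specification message
)

demo
```

Any value listed in na is converted to NA on import, so it will be counted correctly as missing in later checks.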

Column types

Note the output shows how the columns/variables have been parsed. For full details see ?readr::cols().
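If the guessed types are not what you want, they can be stated explicitly through col_types. The following is only a sketch: the inline data and the visit_date column are hypothetical (the melanoma data contains no date column), and they are there to show a continuous, a categorical and a date parser side by side.

```r
library(readr)

# Hypothetical data with one continuous, one categorical and one date column
csv_text = "age,sex,visit_date\n54,1,1970-03-12\n42,0,1968-11-02"

demo = read_csv(
  I(csv_text),
  col_types = cols(
    age        = col_double(),                        # continuous
    sex        = col_factor(levels = c("0", "1")),    # categorical, levels given as text
    visit_date = col_date(format = "%Y-%m-%d")        # date in year-month-day format
  )
)

demo
```

Note that col_factor() only sets the levels as they appear in the file; descriptive labels still need to be added afterwards.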

Continuous data

Categorical data

Dates and times

Check data

ff_glimpse() provides a convenient overview of all data in a tibble or data frame. It is particularly important that factors are correctly specified, so ff_glimpse() separates variables into continuous and categorical. As expected, no factors are yet specified in the melanoma dataset.

library(finalfit)
ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd    min
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1   10.0
#> status       status    <dbl> 205         0             0.0    1.8    0.6    1.0
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5    0.0
#> age             age    <dbl> 205         0             0.0   52.5   16.7    4.0
#> year           year    <dbl> 205         0             0.0 1969.9    2.6 1962.0
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0    0.1
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5    0.0
#>           quartile_25 median quartile_75    max
#> time           1525.0 2005.0      3042.0 5565.0
#> status            1.0    2.0         2.0    3.0
#> sex               0.0    0.0         1.0    1.0
#> age              42.0   54.0        65.0   95.0
#> year           1968.0 1970.0      1972.0 1977.0
#> thickness         1.0    1.9         3.6   17.4
#> ulcer             0.0    0.0         1.0    1.0
#> 
#> $Categorical
#> data frame with 0 columns and 205 rows

If you wish to see the variables in the order in which they appear in the data frame or tibble, missing_glimpse() or tibble::glimpse() are useful.

missing_glimpse(melanoma)
#>               label var_type   n missing_n missing_percent
#> time           time    <dbl> 205         0             0.0
#> status       status    <dbl> 205         0             0.0
#> sex             sex    <dbl> 205         0             0.0
#> age             age    <dbl> 205         0             0.0
#> year           year    <dbl> 205         0             0.0
#> thickness thickness    <dbl> 205         0             0.0
#> ulcer         ulcer    <dbl> 205         0             0.0

Specify factors

Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe.

library(dplyr)
melanoma %>% 
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3), 
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>% 
    ff_label("Status"),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>% 
    ff_label("Sex"),
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>% 
    ff_label("Ulcer")
  ) -> melanoma

ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd    min
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1   10.0
#> status       status    <dbl> 205         0             0.0    1.8    0.6    1.0
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5    0.0
#> age             age    <dbl> 205         0             0.0   52.5   16.7    4.0
#> year           year    <dbl> 205         0             0.0 1969.9    2.6 1962.0
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0    0.1
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5    0.0
#>           quartile_25 median quartile_75    max
#> time           1525.0 2005.0      3042.0 5565.0
#> status            1.0    2.0         2.0    3.0
#> sex               0.0    0.0         1.0    1.0
#> age              42.0   54.0        65.0   95.0
#> year           1968.0 1970.0      1972.0 1977.0
#> thickness         1.0    1.9         3.6   17.4
#> ulcer             0.0    0.0         1.0    1.0
#> 
#> $Categorical
#>                label var_type   n missing_n missing_percent levels_n
#> status.factor Status    <fct> 205         0             0.0        3
#> sex.factor       Sex    <fct> 205         0             0.0        2
#> ulcer.factor   Ulcer    <fct> 205         0             0.0        2
#>                                                                             levels
#> status.factor "Died from melanoma", "Alive", "Died from other causes", "(Missing)"
#> sex.factor                                           "Male", "Female", "(Missing)"
#> ulcer.factor                                      "Present", "Absent", "(Missing)"
#>               levels_count   levels_percent
#> status.factor  57, 134, 14 27.8, 65.4,  6.8
#> sex.factor         79, 126           39, 61
#> ulcer.factor       90, 115           44, 56
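One further check that can be useful after recoding is to cross-tabulate each original variable against its new factor. This is a sketch of one possible approach using dplyr::count(), not a finalfit function; it rebuilds sex.factor from boot::melanoma so that it is self-contained.

```r
library(dplyr)

# Recreate the sex factor exactly as above, then count each code/label pair
check = boot::melanoma %>% 
  mutate(
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female"))
  ) %>% 
  count(sex, sex.factor)

check
```

Each numeric code should pair with exactly one label. An extra row, or a row with NA in sex.factor, would point to a code that is missing from levels.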

Everything looks good and you are ready to start analysis.