Introduction to matchmaker

The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:

Installation

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

Example

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

Each of these functions have four manditory options:

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

  1. construct your dictionary in a spreadsheet program based on your data
  2. read in your data and dictionary to data frames in R
  3. match!
library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

Data

This is the top of our data set, generated for example purposes

id date readmission treated facility age_group lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
ef267c 2019-07-08 NA 0 C 10 unk high inc NA u
e80a37 2019-07-07 y 0 3 10 inc unk norm y oui
b72883 2019-07-07 y 1 8 30 inc norm inc oui
c9ee86 2019-07-09 n 1 4 40 inc inc unk y oui
40bc7a 2019-07-12 n 1 6 0 norm unk norm NA n
46566e 2019-07-14 y NA B 50 unk unk inc NA NA

Dictionary

The dictionary looks like this:

options values grp orders
y Yes readmission 1
n No readmission 2
u Unknown readmission 3
.missing Missing readmission 4
0 Yes treated 1
1 No treated 2
.missing Missing treated 3
1 Facility 1 facility 1
2 Facility 2 facility 2
3 Facility 3 facility 3
4 Facility 4 facility 4
5 Facility 5 facility 5
6 Facility 6 facility 6
7 Facility 7 facility 7
8 Facility 8 facility 8
9 Facility 9 facility 9
10 Facility 10 facility 10
.default Unknown facility 11
0 0-9 age_group 1
10 10-19 age_group 2
20 20-29 age_group 3
30 30-39 age_group 4
40 40-49 age_group 5
50 50+ age_group 6
high High .regex ^lab_result_ 1
norm Normal .regex ^lab_result_ 2
inc Inconclusive .regex ^lab_result_ 3
y yes .global Inf
n no .global Inf
u unknown .global Inf
unk unknown .global Inf
oui yes .global Inf
.missing missing .global Inf

Matching

Special Keywords

In addition to strict one-to-one matching, there are a few reserved keywords that will help with correcting data. There are reserved keywords for the ‘from’, ‘to’, and ‘by’ columns in the dictionary.

Keys (from column)

The from column of the dictionary will contain the keys that you want to match in your current data set. These are expected to match exactly with the exception of three reserved keywords that start with a full stop:

* Any NA values in the keys will be interpreted as “NA” because it’s a common mistake to import the value “NA” to missing in R. If you intend for NA to indicate missing data, replace it with: dictionary[[1]][is.na(dictionary[[1]])] <- ".missing"

Values (to column)

The values will replace their respective keys exactly as they are presented with one exception. There is currently one recognised keyword that can be placed in the to column of your dictionary:

Keyword demonstration

For example, let’s say you have the following data set of people asked if they like ice cream:

Missing data and default values

You could contstruct a dictionary that has a 1:1 relationship between the keys that looks like this:

my_dict1
keys values
yes Yes
Y Yes
n No
N No
NO No
.missing .na
.default (invalid)

Once you read in the file (either via read.csv() or readxl::read_excel() if you use excel), you can use it as a dictionary. This dictionary will do three things:

  1. convert iterations of yes/no into Yes and No
  2. convert blank or NA values to explicit missing data.
  3. convert all other values to “(invalid)”

Fuzzy matching

Now we have nice, predictable values, but let’s say Luke really didn’t like ice cream. If he responded “NOOOOOOO” instead of “n”, then the dictionary we specified would convert it to “(invalid)”:

We can fix this if we use pattern matching. Here we are selecting from any valid spelling of yes/no with trailing letters so that it capture’s Luke’s extreme objection to ice cream. To do this we add the .regex prefix (note the space after .regex):

my_dict2
keys values
.regex ^[Yy][Ee]?[Ss]*$ Yes
.regex ^[Nn][Oo]*$ No
.missing .na
.default (invalid)

The drawback to fuzzy matching is that it will convert things that match the pattern, so be very careful when constructing your keys.

Working with Data Frames

When using the match_df() function, you would construct the dictionary same as you would above, with two extra columns that specify the column name in the data frame and the order the resulting values should be (if the column is a factor).

As with match_vec(), all the same keywords apply, but now there are also two keywords for the columns:

Matching columns with .regex

Before you use regex, you should be aware of three special symbols that will help anchor your words and prevent any unintended matching.

  1. The carrot (^) should be placed at the beginning of a pattern to show that it’s the beginning of the word. For example, lab will match both lab_result and granite_slab, but ^lab will only match lab_result
  2. The dollar ($) should be placed at the end of a pattern to show that it’s the end of a word. For example, date will match both admission_date and date_of_onset, but date$ will only match admission_date$.
  3. The dot (.) matches any character. Because it’s common in column names imported by R, it’s a good idea to wrap it in square brackets ([.]) to tell R that you actually mean a dot. For example, ^lab.r$ will match lab.r, lab_r, and labor, but ^lab[.]r$ will only match lab.r.

The best strategy is to use at least one anchor to prevent it greedily selecting columns to match.

In our example from the top, there are three columns that all start with lab_result_, so we use the .regex ^lab_result keyword:

Using .global to clean up all character/factor columns

We’ve actually seen the .global keyword in use already. Let’s take one more look at the results from above:

Notice above how there are rules for “high”, “norm”, and “inc”, but not for “unk”, which was turned into “unknown”? This is because of the global keywords:

The “unk” keyword was defined in our global dictionary and has been used to translate “unk” to “unknown”.

Of course, be very careful with this one.

Warnings

Internally, the match_vec() function can be quite noisy with warnings for various reasons. Thus, by default, the match_df() function will keep these quiet, but you can have them printed to your console if you use the warn = TRUE option: