This R package is designed to block records for data deduplication
and record linkage (also known as entity resolution) using approximate
nearest neighbours (ANN) algorithms and graphs (via the igraph package).
It supports R packages that bind to specific ANN algorithms, including
mlpack (see mlpack::lsh and mlpack::knn).
The package can be used with the reclin2 package via the
blocking::pair_ann function.
Install the blocking package from GitHub with:
# install.packages("remotes") # uncomment if needed
remotes::install_github("ncn-foreigners/blocking")
Load the packages for the examples:
library(blocking)
library(reclin2)
#> Loading required package: data.table
#> data.table 1.17.0 using 6 threads (see ?getDTthreads). Latest news: r-datatable.com
Generate simple data with three groups (df_example) and
reference data (df_base).
df_example <- data.frame(txt = c(
  "jankowalski",
  "kowalskijan",
  "kowalskimjan",
  "kowaljan",
  "montypython",
  "pythonmonty",
  "cyrkmontypython",
  "monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))

df_example
#>               txt
#> 1     jankowalski
#> 2     kowalskijan
#> 3    kowalskimjan
#> 4        kowaljan
#> 5     montypython
#> 6     pythonmonty
#> 7 cyrkmontypython
#> 8           monty
df_base
#>           txt
#> 1 montypython
#> 2 kowalskijan
#> 3       other
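Deduplication using the blocking function. The printed output reports
the method used (here nnd, which refers to the NN descent algorithm),
the number of blocks created (here 2), the number of columns used for
blocking, that is, the number of features created with the text2vec
package (here 28), and the reduction ratio (here 0.5714).
blocking_result <- blocking(x = df_example$txt)
blocking_result
#> ========================================================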
#> Blocking based on the nnd method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4
#> 2
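The 28 columns reported in the output above can be reproduced by hand. The following base R sketch assumes that the features are the distinct overlapping character 2-grams of the input strings; this assumption matches the printed count, but the package's own text processing (done through text2vec) is not shown here. The helper `char_ngrams` is hypothetical and not part of the blocking package.

```r
# Same strings as df_example$txt, repeated so the example is self-contained
txt <- c("jankowalski", "kowalskijan", "kowalskimjan", "kowaljan",
         "montypython", "pythonmonty", "cyrkmontypython", "monty")

# Hypothetical helper: all overlapping character n-grams of a single string
char_ngrams <- function(s, n = 2) {
  starts <- seq_len(max(0, nchar(s) - n + 1))
  substring(s, starts, starts + n - 1)
}

# Vocabulary of distinct 2-grams across all records (assumed feature set)
vocab <- sort(unique(unlist(lapply(txt, char_ngrams))))
length(vocab)
#> [1] 28
```

Each record then becomes a row in a records-by-features matrix, and the ANN search runs on those rows.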
The table with blocking results contains the row numbers of the paired
records (x, y), the block number (block) and the distance (dist):
blocking_result$result
#>        x     y block       dist
#>    <int> <int> <num>      <num>
#> 1:     1     2     1 0.10000002
#> 2:     2     3     1 0.14188367
#> 3:     2     4     1 0.28286284
#> 4:     5     6     2 0.08333331
#> 5:     5     7     2 0.13397455
#> 6:     5     8     2 0.27831215
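The reduction ratio printed earlier can be checked against this table. The sketch below assumes the usual record-linkage definition, one minus the share of within-block pairs among all possible pairs; it reproduces the printed value of 0.5714, but it is not taken from the package's source. The pairs are copied by hand from the output so that the example runs with base R only.

```r
# Pairs copied from blocking_result$result
pairs <- data.frame(
  x     = c(1, 2, 2, 5, 5, 5),
  y     = c(2, 3, 4, 6, 7, 8),
  block = c(1, 1, 1, 2, 2, 2)
)
n_records <- 8  # number of rows in df_example

# Number of distinct records that fall into each block
block_sizes <- tapply(seq_len(nrow(pairs)), pairs$block, function(i) {
  length(unique(c(pairs$x[i], pairs$y[i])))
})

# Comparisons needed within blocks versus comparisons without blocking
pairs_within <- sum(choose(block_sizes, 2))  # 6 + 6
pairs_total  <- choose(n_records, 2)         # 28

reduction_ratio <- 1 - pairs_within / pairs_total
round(reduction_ratio, 4)
#> [1] 0.5714
```

Both blocks hold four records, so blocking leaves 12 of the 28 possible comparisons.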
Deduplication using the pair_ann function, which integrates with the
reclin2 package. Its result can be passed directly into a reclin2
pipeline:
pair_ann(x = df_example, on = "txt") |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#>   Total number of pairs: 8 pairs
#>
#> Key: <.y>
#>       .y    .x       txt.x           txt.y
#>    <int> <int>      <char>          <char>
#> 1:     2     1 jankowalski     kowalskijan
#> 2:     3     1 jankowalski    kowalskimjan
#> 3:     3     2 kowalskijan    kowalskimjan
#> 4:     4     1 jankowalski        kowaljan
#> 5:     4     2 kowalskijan        kowaljan
#> 6:     6     5 montypython     pythonmonty
#> 7:     7     5 montypython cyrkmontypython
#> 8:     8     5 montypython           monty
Linking records using the same function, where df_base is
the “register” and df_example is the reference (data).
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#>   Total number of pairs: 8 pairs
#>
#> Key: <.y>
#>       .y    .x       txt.x           txt.y
#>    <int> <int>      <char>          <char>
#> 1:     1     2 kowalskijan     jankowalski
#> 2:     2     2 kowalskijan     kowalskijan
#> 3:     3     2 kowalskijan    kowalskimjan
#> 4:     4     2 kowalskijan        kowaljan
#> 5:     5     1 montypython     montypython
#> 6:     6     1 montypython     pythonmonty
#> 7:     7     1 montypython cyrkmontypython
#> 8:     8     1 montypython           monty
See the section Data Integration (Statistical Matching and Record Linkage)
in the Official Statistics Task View.
Packages that allow blocking:
reclin2 (pair_blocking and pair_minsim functions),
fastLink (blockData function).
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.