Read required packages.
Read the RLdata500
data (taken from the RecordLinkage
package).
fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id |
---|---|---|---|---|---|---|---|---|
CARSTEN | MEIER | 1949 | 7 | 22 | 1 | 34 | ||
GERD | BAUER | 1968 | 7 | 27 | 2 | 51 | ||
ROBERT | HARTMANN | 1930 | 4 | 30 | 3 | 115 | ||
STEFAN | WOLFF | 1957 | 9 | 2 | 4 | 189 | ||
RALF | KRUEGER | 1966 | 1 | 13 | 5 | 72 | ||
JUERGEN | FRANKE | 1929 | 7 | 4 | 6 | 142 |
This dataset contains 500 rows with 450 entities.
Now we create a new column that concatenates the information in each row.
RLdata500[, id_count :=.N, ent_id] ## how many times given unit occurs
RLdata500[, bm:=sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd:=sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt:=tolower(paste0(fname_c1,fname_c2,lname_c1,lname_c2,by,bm,bd))]
head(RLdata500)
fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt |
---|---|---|---|---|---|---|---|---|---|---|
CARSTEN | MEIER | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 | ||
GERD | BAUER | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 | ||
ROBERT | HARTMANN | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 | ||
STEFAN | WOLFF | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 | ||
RALF | KRUEGER | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 | ||
JUERGEN | FRANKE | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 |
In the next step we use the newly created column in the
blocking
function. If we specify verbose, we get
information about the progress.
df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 500, 500, t: 429) =====
#> ===== creating graph =====
Results are as follows:
rnndescent
we have created 133 blocks,df_blocks
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 133.
#> Number of columns used for blocking: 429.
#> Reduction ratio: 0.9917.
#> ========================================================
#> Distribution of the size of the blocks:
#> 2 3 4 5 6 7 8 9 10 11 17
#> 47 34 18 12 8 5 3 3 1 1 1
Structure of the object is as follows:
result
– a data.table
with identifiers and
block IDs,method
– the method used,deduplication
– whether deduplication was applied,representation
– whether shingles or vectors were
used,metrics
– standard metrics and based on the
igraph::compare
methods for comparing graphs (here
NULL),confusion
– confusion matrix (here NULL),colnames
– column names used for the comparison,graph
– an igraph
object mainly for
visualisation.str(df_blocks,1)
#> List of 8
#> $ result :Classes 'data.table' and 'data.frame': 367 obs. of 4 variables:
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> $ method : chr "nnd"
#> $ deduplication : logi TRUE
#> $ representation: chr "shingles"
#> $ metrics : NULL
#> $ confusion : NULL
#> $ colnames : chr [1:429] "86" "ap" "av" "bf" ...
#> $ graph :Class 'igraph' hidden list of 10
#> - attr(*, "class")= chr "blocking"
Plot connections.
The resulting data.table
has four columns:
x
– reference dataset (i.e. RLdata500
) –
this may not contain all units of RLdata500
,y
- query (each row of RLdata500
) – this
may not contain all units of RLdata500
,block
– the block ID,dist
– distance between objects.x | y | block | dist |
---|---|---|---|
1 | 64 | 33 | 0.4737987 |
2 | 43 | 1 | 0.0807453 |
2 | 486 | 1 | 0.4102322 |
3 | 450 | 88 | 0.4326335 |
4 | 234 | 12 | 0.5256584 |
5 | 128 | 2 | 0.5133357 |
Create long data.table
with information on blocks and
units from original dataset.
df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)
rec_id | block |
---|---|
1 | 33 |
2 | 1 |
3 | 88 |
4 | 12 |
5 | 2 |
6 | 33 |
We add block information to the final dataset.
fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt | block_id |
---|---|---|---|---|---|---|---|---|---|---|---|
CARSTEN | MEIER | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 | 33 | ||
GERD | BAUER | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 | 1 | ||
ROBERT | HARTMANN | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 | 88 | ||
STEFAN | WOLFF | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 | 12 | ||
RALF | KRUEGER | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 | 2 | ||
JUERGEN | FRANKE | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 | 33 |
We can check in how many blocks the same entities
(ent_id
) are observed. In our example, all the same
entities are in the same blocks.
uniq_blocks | N |
---|---|
1 | 450 |
We can visualise the distances between units stored in the
df_blocks$result
data set. Clearly we have a mixture of two
groups: matches (close to 0) and non-matches (close to 1).
hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
main = "Distances calculated between units")
Finally, we can visualise the result based on the information whether block contains matches or not.
df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]
plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8),
main = "Distribution of distances between\nclusters type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red", xlim = c(0, 0.8))