PackageIntroduction

library(validata)
library(tidyselect)

Distinct

Confirm Distinct

In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.

sample_data1 has 125 row with 3 ID type columns and 3 value columns.

head(sample_data1)
#> # A tibble: 6 x 6
#>   ID_COL1 ID_COL2 ID_COL3     VAL1   VAL2   VAL3
#>     <dbl>   <dbl>   <dbl>    <dbl>  <dbl>  <dbl>
#> 1    2413    1034    1014 -0.0639  -1.16  -0.302
#> 2    2413    1034    1322  0.363    1.62   0.165
#> 3    2413    1034    2999 -0.00466  1.23   0.819
#> 4    2413    1034    3544  1.83    -2.58  -0.525
#> 5    2413    1034    9901  0.837   -0.442 -0.341
#> 6    2413    1122    1014 -0.894   -1.11   0.768

Let’s use confirm_distinct iteratively to find the uniquely identifying columns of sample_data1.

sample_data1 %>% 
  confirm_distinct(ID_COL1)
#> database has 120 duplicates at ID_COL1
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3

Here we can conclude that the combination of 3 ID columns is the primary key for the data.

Determine Distinct

These steps can be automated with the wrapper function determine distinct.

sample_data1 %>% 
  determine_distinct(matches("ID"))

Mapping

confirm_mapping tells you the mapping between two columns in a data frame:

Confirm mapping

confirm_mapping gives the option to view which type of mapping is associated with each individual row.

sample_data1 %>% 
  confirm_mapping(ID_COL1, ID_COL2, view = F)
#> many - many mapping between ID_COL1 and ID_COL2

Determine mapping

sample_data1 %>% 
  determine_mapping(everything())

Overlap

The overlap functions give a venn style description of the values in 2 columns. This is especially useful before performing a join function, and you want to confirm that the dataframes have matching keys.

Confirm Overlap

confirm_overlap is different from the other confirm functions in that it takes 2 vectors as arguments, instead of a data frame. This is to allow the user to test overlap between different dataframes, or arbitrary vectors if necessary


confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 x 5
#>   only_in_iris_Sepal.W… only_in_iris_Petal.… shared_names total_names pct_shared
#>                   <int>                <int>        <int>       <int> <chr>     
#> 1                    12                   32           11          55 20%

confirm_overlap returns a summary data frame invisibly allowing you to access individual elements using the helper functions.

print(iris_overlap)
#> # A tibble: 55 x 4
#>        x iris_Sepal.Width iris_Petal.Length both_flags
#>    <dbl>            <dbl>             <dbl>      <dbl>
#>  1   3.5                1                 1          2
#>  2   3                  1                 1          2
#>  3   3.2                1                 0          1
#>  4   3.1                1                 0          1
#>  5   3.6                1                 1          2
#>  6   3.9                1                 1          2
#>  7   3.4                1                 0          1
#>  8   2.9                1                 0          1
#>  9   3.7                1                 1          2
#> 10   4                  1                 1          2
#> # … with 45 more rows

Find the elements unique to the first column

iris_overlap %>% 
  co_find_only_in_1() %>% 
  head()
#> # A tibble: 6 x 1
#>   iris_Sepal.Width
#>              <dbl>
#> 1              3.2
#> 2              3.1
#> 3              3.4
#> 4              2.9
#> 5              2.3
#> 6              2.8

Find the elements unique to the second column

iris_overlap %>% 
  co_find_only_in_2() %>% 
  head()
#> # A tibble: 6 x 1
#>   iris_Petal.Length
#>               <dbl>
#> 1               1.4
#> 2               1.3
#> 3               1.5
#> 4               1.7
#> 5               1.6
#> 6               1.1

Find the elements shared by both columns

iris_overlap %>% 
  co_find_in_both() %>% 
  head()
#> # A tibble: 6 x 1
#>       x
#>   <dbl>
#> 1   3.5
#> 2   3  
#> 3   3.6
#> 4   3.9
#> 5   3.7
#> 6   4

Determine Overlap

determine_overlap takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.

Note that the overlap functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots

string length

confirm string length

Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths.

iris %>% 
  confirm_strlen(Species) -> species_len
#>  Species_chr_len  n percent
#>                6 50   33.3%
#>                9 50   33.3%
#>               10 50   33.3%

output is a dataframe

head(species_len)
#> # A tibble: 6 x 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>             <int>
#> 1          5.1         3.5          1.4         0.2 setosa                6
#> 2          4.9         3            1.4         0.2 setosa                6
#> 3          4.7         3.2          1.3         0.2 setosa                6
#> 4          4.6         3.1          1.5         0.2 setosa                6
#> 5          5           3.6          1.4         0.2 setosa                6
#> 6          5.4         3.9          1.7         0.4 setosa                6

choose string length

A helped function for the output of confirm_strlen that filters the database for chosen string lengths.

species_len %>% 
  choose_strlen(len = 6) %>% 
  head()
#> # A tibble: 6 x 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>             <int>
#> 1          5.1         3.5          1.4         0.2 setosa                6
#> 2          4.9         3            1.4         0.2 setosa                6
#> 3          4.7         3.2          1.3         0.2 setosa                6
#> 4          4.6         3.1          1.5         0.2 setosa                6
#> 5          5           3.6          1.4         0.2 setosa                6
#> 6          5.4         3.9          1.7         0.4 setosa                6

diagnose

Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set.

iris %>% 
  diagnose()
#> # A tibble: 5 x 6
#>   variables    types   missing_count missing_percent unique_count unique_rate
#>   <chr>        <chr>           <int>           <dbl>        <int>       <dbl>
#> 1 Sepal.Length numeric             0               0           35       0.233
#> 2 Sepal.Width  numeric             0               0           23       0.153
#> 3 Petal.Length numeric             0               0           43       0.287
#> 4 Petal.Width  numeric             0               0           22       0.147
#> 5 Species      factor              0               0            3       0.02