library(autodb)
#>
#> Attaching package: 'autodb'
#> The following object is masked from 'package:stats':
#>
#> decompose
To avoid confusion, this vignette uses consistent terminology: a database consists of relations, which have records with the same attributes. This is in line with the terms used in the relational model; in statistics, we would instead talk about tables, which have rows with values for the same variables, with a column for each variable.
For the original data set, which is a flat table, we refer to a data frame instead, which also consists of records with attributes.
When we talk about the type of values an attribute can take, we talk about a class, as in the relational model and R, rather than a value type, as in many other programming languages.
autodb
is described in the DESCRIPTION file in the
following way:
Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.)
Let us begin there.
Database normalisation works on the principle that our data should only express a piece of information in a single place. This is done to reduce the chance that a modification will accidentally change or remove other pieces of information.
Data cleaning works on the principle that we remove duplicate observations and structural errors, and have a clear understanding of any associations between different observations, or parts of observations. This is done to ensure a high quality for the data used in analysis, but I have found that it is equally important for making sure that I understand the domain I am hoping to model.
Normalisation is helpful for both of these areas. I usually use the original, flat, “tidy” data for the analysis itself. However, for checking the data’s consistency, and for using the data to learn about the domain, normalised data is often helpful, whether it’s used for the analysis or not.
I usually do data checking under the following conditions:
It’s in this context that I would usually have to do the normalisation by hand, and I would like to have a tool that does it semi-automatically. This package is intended to be that tool, using a minimal number of additional libraries.
Good database design is a hard and involved process, which requires a strong grasp of the domain that the database will hold information on. Please don’t try to use this package to do it on autopilot. In particular, the conditions described above mean we do things differently from good practice for normalised relational databases:
merge function.

autodb gets a given data frame to third normal form:
every attribute depends on the whole key(s), and non-key attributes
depend on nothing but the key(s). This was chosen because there is an
existing algorithm, Bernstein’s synthesis, for normalising to third
normal form, and because it’s the highest normal form that is attainable
with arbitrary original data frames. There is an additional enhancement
available as an option: see the section on avoidable attributes for more
details.
For most of these simple examples, we use the
ChickWeight
dataset, included with base R.
summary(ChickWeight)
#> weight Time Chick Diet
#> Min. : 35.0 Min. : 0.00 13 : 12 1:220
#> 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
#> Median :103.0 Median :10.00 20 : 12 3:120
#> Mean :121.8 Mean :10.72 10 : 12 4:118
#> 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
#> Max. :373.0 Max. :21.00 19 : 12
#> (Other):506
autodb
The simplest way to do normalisation is with the autodb
function:
db <- autodb(ChickWeight)
db
#> database with 2 relations
#> 4 attributes: weight, Time, Chick, Diet
#> relation Chick: Chick, Diet; 50 records
#> key 1: Chick
#> relation Time_Chick: Time, Chick, weight; 578 records
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
There is no plotting method within the autodb
package
itself. Instead, there are functions to write normalised objects as
inputs to the Graphviz visualisation software, using the DOT
language.
The generic function to do this is gv
, which calls the
gv.database
method on our database, db
:
db_text <- gv(db)
cat(db_text)
#> digraph {
#> rankdir = "LR"
#> node [shape=plaintext];
#>
#> "Chick" [label = <
#> <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#> <TR><TD COLSPAN="3">Chick (50 records)</TD></TR>
#> <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#> <TR><TD PORT="TO_diet">Diet</TD><TD></TD><TD PORT="FROM_diet">factor</TD></TR>
#> </TABLE>>];
#> "Time_Chick" [label = <
#> <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#> <TR><TD COLSPAN="3">Time_Chick (578 records)</TD></TR>
#> <TR><TD PORT="TO_time">Time</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_time">numeric</TD></TR>
#> <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#> <TR><TD PORT="TO_weight">weight</TD><TD></TD><TD PORT="FROM_weight">numeric</TD></TR>
#> </TABLE>>];
#>
#> "Time_Chick":FROM_chick -> "Chick":TO_chick;
#> }
This can be saved to a file for passing to Graphviz elsewhere, or can
be passed into an R function that renders Graphviz code. This vignette
uses grViz
from the DiagrammeR
package. If
you’re writing a document in Quarto, you can call Graphviz directly
using the dot
engine; this approach avoids loading 2MB of
Javascript for HTML widgets into the rendered file.
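For example, one way to save the DOT code for rendering outside of R (the file name here is just for illustration):
# write the DOT source to a file; render it with Graphviz later,
# e.g. dot -Tpng chickweight_db.dot -o chickweight_db.png
writeLines(db_text, "chickweight_db.dot")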
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  # render a database, schema, or data frame as a Graphviz plot
  show <- function(x) DiagrammeR::grViz(gv(x))
  maybe_plot <- function(x) DiagrammeR::grViz(gv(x))
} else {
  # fall back to text output, and skip optional plots entirely
  show <- print
  maybe_plot <- function(x) invisible(NULL)
}
maybe_plot(db)
Each relation is represented as a box, with the top row giving the relation name and the number of records, and the other rows detailing the attributes. In the middle is a grid of cells: each column of cells represents a (candidate) key for the relation, where attributes in that key have their cell filled in black.
We can see that ChickWeight
was split into two
relations, with a weight for each unique combination of Chick number and
time, and the Chick number by itself determining the diet. We can also
see the number of records in each relation, and the classes of the
attributes in those relations.
gv
also has methods for data.frame
, and for
database_schema
, which is introduced later.
Having looked at the final result first, we now look at the
individual steps. The first step is to find the (non-trivial and
minimal) dependencies present in the given data frame. There are various
ways to do this; by default, the package uses FDHitsSep, a depth-first
search algorithm. We run this using the discover
function,
setting progress
to TRUE
to see the steps
taken:
deps <- discover(ChickWeight, progress = TRUE)
#> formatting numerical/complex variables with 7 significant digits
#> simplifying data types
#> calculating single-attribute PLIs
#> sampling difference sets
#> 7 initial diffsets
#>
#> dependant weight
#> dependant Time
#> dependant Chick
#> dependant Diet
#>
#> FDHitsSep complete
#> 9 final diffsets
#> 10 nodes visited
#> 7 partitions cached
deps
#> 2 functional dependencies
#> 4 attributes: weight, Time, Chick, Diet
#> Time, Chick -> weight
#> Chick -> Diet
After simplifying the data to something quicker to iterate over – specifically, by converting all attributes to have integer values – the algorithm takes each attribute in turn as a possible dependant, and finds sets of the other attributes that determine it.
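As a quick sanity check, we can verify one of the discovered dependencies by hand: Chick -> Diet holds exactly when each chick is associated with a single diet (a small sketch, not part of the package workflow):
# number of distinct diets per chick; the dependency holds if all are 1
diets_per_chick <- tapply(
  ChickWeight$Diet,
  ChickWeight$Chick,
  function(x) length(unique(x))
)
all(diets_per_chick == 1)
#> [1] TRUE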
The result is a list of functional dependencies, in the format
determinant set -> dependant
, with an attribute named
attrs_order
that gives all the attribute names in their
original order. Each of these three parts has its own (generic)
extraction function:
detset(deps)
#> [[1]]
#> [1] "Time" "Chick"
#>
#> [[2]]
#> [1] "Chick"
dependant(deps)
#> [1] "weight" "Diet"
attrs_order(deps)
#> [1] "weight" "Time" "Chick" "Diet"
The former two, especially, are useful for filtering, as we’ll see later.
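For example, here’s a quick sketch of keeping only the dependencies with single-attribute determinant sets:
deps[lengths(detset(deps)) == 1]
#> 1 functional dependency
#> 4 attributes: weight, Time, Chick, Diet
#> Chick -> Diet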
Now that we have a list of discovered dependencies, we can construct a database schema, where the relation schemas are normalised to third normal form. This is done using a version of Bernstein’s synthesis.
schema <- synthesise(deps)
schema
#> 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
Like the database before, we can also plot this database schema:
maybe_plot(schema)
This is similar to the database plot given before, but some information that requires the data frame itself is not present. We do have class information about the attributes, extracted from the data frame during dependency discovery, but we have no record counts for the individual relation schemas. However, we do have automatically-generated names for the individual relations.
Additionally, at this point we have no connections between the relation schemas, since Bernstein’s synthesis doesn’t supply information about foreign key references. We could use this database schema to build a database, but we’d rather add the foreign key references first.
Let’s look at a different data set for a moment, to see some cases where we don’t want to use the dependencies as given. We’ll use the Titanic data set, also provided with base R. This data is in array form, so we first convert it to data frame form:
knitr::kable(as.data.frame(Titanic))
Class | Sex | Age | Survived | Freq |
---|---|---|---|---|
1st | Male | Child | No | 0 |
2nd | Male | Child | No | 0 |
3rd | Male | Child | No | 35 |
Crew | Male | Child | No | 0 |
1st | Female | Child | No | 0 |
2nd | Female | Child | No | 0 |
3rd | Female | Child | No | 17 |
Crew | Female | Child | No | 0 |
1st | Male | Adult | No | 118 |
2nd | Male | Adult | No | 154 |
3rd | Male | Adult | No | 387 |
Crew | Male | Adult | No | 670 |
1st | Female | Adult | No | 4 |
2nd | Female | Adult | No | 13 |
3rd | Female | Adult | No | 89 |
Crew | Female | Adult | No | 3 |
1st | Male | Child | Yes | 5 |
2nd | Male | Child | Yes | 11 |
3rd | Male | Child | Yes | 13 |
Crew | Male | Child | Yes | 0 |
1st | Female | Child | Yes | 1 |
2nd | Female | Child | Yes | 13 |
3rd | Female | Child | Yes | 14 |
Crew | Female | Child | Yes | 0 |
1st | Male | Adult | Yes | 57 |
2nd | Male | Adult | Yes | 14 |
3rd | Male | Adult | Yes | 75 |
Crew | Male | Adult | Yes | 192 |
1st | Female | Adult | Yes | 140 |
2nd | Female | Adult | Yes | 80 |
3rd | Female | Adult | Yes | 76 |
Crew | Female | Adult | Yes | 20 |
This is a simple set of data, with a single count observation,
Freq
, for each combination of the four determining
attributes. In other words, the relation is already normalised, so we
only expect one relation in the normalised database.
If we use autodb
again, we get the following
database:
show(autodb(as.data.frame(Titanic)))
Oops! The search found some functional dependencies where the count could be used to determine another attribute. These are clearly spurious: frequency count can’t causally determine age, for example. However, the algorithm still finds these dependencies, because the counts are unique often enough to make these dependencies hold in the given data.
There are two approaches we can take to eliminate these spurious
dependencies: not letting them be detected in the first place, and
removing them before using synthesise
.
To stop them being detected, we can put constraints on what is
discovered by discover
: we can ask for certain attributes
to not be considered as determinants, or we can exclude attributes that
inherit from certain classes. In this example, we could exclude
Freq
from being considered:
titanic_deps_freqonly <- discover(as.data.frame(Titanic), exclude = "Freq")
titanic_deps_freqonly
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq
Alternatively, we could exclude all attributes that inherit from “numeric”:
identical(titanic_deps_freqonly, discover(as.data.frame(Titanic), exclude_class = "numeric"))
#> [1] TRUE
These can both be used as arguments to autodb
too:
show(autodb(as.data.frame(Titanic), exclude = "Freq"))
Excluding numeric attributes as determinants is often useful, because we expect non-integer numbers to be measurements, not parts of a primary key.
Alternatively, we could remove unwanted dependencies before using
decompose
. Here are the found dependencies, if we don’t
exclude anything:
titanic_deps <- discover(as.data.frame(Titanic))
titanic_deps
#> 3 functional dependencies
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Sex, Survived, Freq -> Age
#> Class, Survived, Freq -> Age
#> Class, Sex, Age, Survived -> Freq
We can remove the unwanted dependencies, where Age is the dependant, by keeping only those where Freq is the dependant:
titanic_deps[dependant(titanic_deps) == "Freq"]
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq
Getting back to our ChickWeight
example, we now have a
database schema, consisting of a list of relation
schemas.
However, we have no information about how these relation schemas are
linked to each other. In particular, we have no information about
foreign keys. We can add this information using
autoref
:
linked_schema <- autoref(schema)
linked_schema
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
We could also have used normalise
, instead of
synthesise
and autoref
separately:
normalise(deps)
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
Plotting this updated database schema shows the same relation schemas as before, linked together by foreign key references:
show(linked_schema)
Finally, once we have our normalised database schema, we can apply it
to our original data frame, or a new one with the same structure. This
results in a normalised database, as we got from using
autodb
:
db2 <- decompose(ChickWeight, linked_schema)
show(db2)
We now have the record counts added.
We can reverse the process of turning a data frame into a database
with the rejoin
function. This may not be identical to
ChickWeight
, since the rows may have been rearranged.
However, we can use the convenience function df_equiv
to
check for equivalence under row reordering:
rejoined <- rejoin(db)
summary(rejoined)
#> weight Time Chick Diet
#> Min. : 35.0 Min. : 0.00 13 : 12 1:220
#> 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
#> Median :103.0 Median :10.00 20 : 12 3:120
#> Mean :121.8 Mean :10.72 10 : 12 4:118
#> 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
#> Max. :373.0 Max. :21.00 19 : 12
#> (Other):506
identical(rejoined, ChickWeight)
#> [1] FALSE
df_equiv(rejoined, ChickWeight)
#> [1] TRUE
When rejoined, the relation attributes will be in the original order. However, the record order might have changed.
Included in the package is a 447-by-25 data frame called
nudge
:
show(nudge)
This is the data set for a meta-analysis, looking at the effectiveness of “nudge” interventions. Measurements are taken within a three-layer hierarchy: publications contain studies, which contain effect size measurements. We expect this hierarchy to be apparent in the normalisation.
Getting full dependency information for this relation can take a long time, so here we search for a reduced set, not considering numeric or sample size attributes as determinants:
nudge_deps <- discover(
nudge,
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
)
nudge_schema <- normalise(nudge_deps, remove_avoidable = TRUE)
show(nudge_schema)
We can see a relation, with many attributes, determined by the effect
size ID number, es_id
. This contains all of the numeric
measurements, as expected in the relation for the lowest level in the
hierarchy. As also expected, this has a foreign key reference to a
relation for study-level information, determined by the study ID number,
study_id
.
However, the publication-level relation this refers to is not
determined by the publication ID, publication_id
, as we
would expect; neither is it determined by the publication reference.
Instead, it is solely determined by the publication’s title: to use the
ID, we would need to supplement it with other information, like the
publication year, or other choices that come from the middle column of
relations, which look likely to be spurious.
This suggests that some publication ID numbers have been erroneously assigned to several publications, which we can easily test:
nudge_database <- decompose(nudge, nudge_schema)
nudge_title_relation <- records(nudge_database)$title
nudge_pid_duplicates <- unique(nudge_title_relation$publication_id[
duplicated(nudge_title_relation$publication_id)
])
knitr::kable(subset(nudge_title_relation, publication_id %in% nudge_pid_duplicates))
title | publication_id | reference | year | |
---|---|---|---|---|
44 | Enhanced active choice: A new method to motivate behavior change | 95 | Keller et al. (2011) | 2011 |
130 | Nudging product choices: The effect of position change on snack bar choice | 95 | Keller et al. (2015) | 2015 |
We’d also expect reference
to determine the publication,
but this is also not the case: it needs additional information, such as
binary_outcome
.
This means that there are publications that share a reference:
nudge_reference_duplicates <- unique(nudge_title_relation$reference[
duplicated(nudge_title_relation$reference)
])
knitr::kable(subset(nudge_title_relation, reference %in% nudge_reference_duplicates))
title | publication_id | reference | year | |
---|---|---|---|---|
214 | Nudge vs superbugs: A behavioural economics trial to reduce the overprescribing of antibiotics | 18 | BETA (2018) | 2018 |
399 | Energy labels that make cents | 19 | BETA (2018) | 2018 |
BETA is the Behavioural Economics Team of the Australian Government, so it’s not surprising that they’d have multiple publications/reports per year. Duplicate references are not necessarily an error, but they would be awkward if the references were to be used.
There are other, clearly spurious, relations involving
reference
. One of these combines it with
type_experiment
to determine location
. Since
reference
is a publication-level attribute, and
location
is a study-level attribute, this can’t be
right.
(As far as I’m aware, the publication ID and reference errors mentioned above would not have affected the meta-analysis results.)
The publication ID and the reference clearly have issues, so we need to decide what to do about them. Here, I’ll show what is usually the correct approach, which is to remove the spurious functional dependencies. We’ll look at the other options later.
Specifically, we remove all functional dependencies that we’d consider to be spurious: those where publication ID or reference are part of a multiple-attribute determinant set, and the determinant set isn’t just those two:
nudge_deps_filtered <- nudge_deps[
lengths(detset(nudge_deps)) == 1 |
vapply(
detset(nudge_deps),
\(ds) length(setdiff(ds, c("publication_id", "reference"))) != 1,
logical(1)
)
]
nudge_schema_filtered <- normalise(nudge_deps_filtered, remove_avoidable = TRUE)
show(nudge_schema_filtered)
We now, finally, have the study location in a non-spurious relation. If we want to, we could also add the dependencies we fail to find, due to excluding the sample size attributes from determinant sets:
nudge_deps_size <- discover(nudge[, startsWith(names(nudge), "n_")])
nudge_deps_size
#> 3 functional dependencies
#> 4 attributes: n_study, n_comparison, n_control, n_intervention
#> n_control, n_intervention -> n_comparison
#> n_comparison, n_intervention -> n_control
#> n_comparison, n_control -> n_intervention
nudge_deps_final <- c(nudge_deps_filtered, nudge_deps_size)
nudge_schema_final <- normalise(nudge_deps_final, remove_avoidable = TRUE)
nudge_database_final <- decompose(nudge, nudge_schema_final)
show(nudge_schema_final)
The new schema confirms what we might expect, that the total sample size, and the sizes of control and treatment arms, only have two degrees of freedom: knowing two of them determines the other.
We’d expect this to be because the total sample size is the sum of the other two. However, the functional dependency can’t determine this. Indeed, if we check for this more specific condition, we find that it doesn’t hold:
knitr::kable(unique(subset(
nudge,
n_comparison != n_control + n_intervention,
c(
es_id,
reference,
title,
n_study,
n_comparison,
n_control,
n_intervention
)
)))
es_id | reference | title | n_study | n_comparison | n_control | n_intervention | |
---|---|---|---|---|---|---|---|
237 | 180 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |
302 | 181 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |
Looking through the paper, the arm sample sizes aren’t given explicitly, but the numbers here are consistent with the parts of the study that have two treatments.
Automatic search for functional dependencies cannot check domain-specific constraints like the summation constraint above, but it can give schemas that suggest constraints to check for, such as the two-degrees relation above.
Going back to the final database schema, removing dependencies has
revealed extra information about study locations, via the new
title_type_experiment
relation: studies of the same
experiment type in a publication always have the same location. Looking
at the resulting database shows that this removes many entries of what
would be redundant location information if kept in the study
relation:
show(nudge_database_final)
While this is not a dependency we could expect to hold if more data was collected, it’s a reasonable dependency for the given data set, which won’t be added to.
The additional relation mentioned at the end of the previous section is also notable for what we’ll now discuss: alternative ways to get rid of spurious structure. How can we do this?

- We could remove the offending attributes – for example, publication_id above, since it doesn’t do its intended job of unique publication identification – and re-do the entire process (re-running discover can be expensive; we don’t always want to throw out everything to do with an attribute, e.g. reference -> year, or title -> reference);
- We could add the offending attributes to exclude, to disallow them in determinant sets, and re-do the entire process (same issues as above, except that we keep title -> reference, since an excluded reference can still be a dependant).

Alternatively, we could remove the offending relation schemas themselves. What happens if we remove the offending schemas?
nudge_schema_relfiltered <- nudge_schema[
!grepl("publication_id_", names(nudge_schema), fixed = TRUE) &
!grepl("_publication_id", names(nudge_schema), fixed = TRUE) &
!grepl("reference_", names(nudge_schema), fixed = TRUE) &
!grepl("_reference", names(nudge_schema), fixed = TRUE)
]
Subsetting a database schema also removes any foreign key references involving removed schemas, so the resulting schema is still valid. However, any foreign key reference chains with removed schemas in the middle are broken. Amongst others, in this case we lose the reference between the three relations for the three hierarchical data levels:
show(nudge_schema_relfiltered)
Re-running autoref
has no effect, because the
es_id
section doesn’t contain any key for the
title
section:
identical(autoref(nudge_schema_relfiltered), nudge_schema_relfiltered)
#> [1] TRUE
Because the database subschema including the es_id
relation only has publication_id
for distinguishing
publications, effect sizes for the two publications with the same ID
cannot be uniquely associated with their publication: removing schemas
irreversibly separates two parts of the database schema.
Furthermore, compared to nudge_schema_filtered
, there is
a relation schema missing: directly removing schemas loses the structure
information that results in the title_type_experiment
relation. How does this happen?
The relation isn’t present in the original schema, nudge_schema, because the full set of functional dependencies makes its implied dependency transitive. Specifically, nudge_deps contains functional dependencies equivalent to this small example:
example_fds <- functional_dependency(
list(
list("title", "reference"),
list(c("reference", "type_experiment"), "location"),
list(c("title", "type_experiment"), "location")
),
c("title", "reference", "type_experiment", "location")
)
example_fds
#> 3 functional dependencies
#> 4 attributes: title, reference, type_experiment, location
#> title -> reference
#> reference, type_experiment -> location
#> title, type_experiment -> location
The first two dependencies make the third transitive, so it’s ignored
when constructing the relation schemas: it’s implied, and enforced, by
the title
and reference_type_experiment
relation schemas.
show(normalise(example_fds, ensure_lossless = FALSE))
However, when reference, type_experiment -> location
is removed as spurious,
title, type_experiment -> location
is no longer
transitive, so it now appears as its own relation schema.
normalise(example_fds[-2], ensure_lossless = FALSE)
show(normalise(example_fds[-2], ensure_lossless = FALSE))
When we instead remove the reference_type_experiment
schema, this implicitly also throws away the
title, type_experiment -> location
dependency that it
co-implies. We throw away non-spurious information that we don’t intend
to.
Between the risk of irreversibly breaking the database schema’s structure, and the risk of unintentionally removing additional structure information, I hope it’s clear why directly removing relations should be avoided.
Larger datasets can often have entry errors, without an easy way to remove or deal with them. For this reason, we might be interested in “approximate” functional dependencies, which hold after removing a bounded number of violating records.
Suppose that we normalise nudge
again, without any
manual dependency removal, but allowing approximate dependencies. We
could cheat, knowing that the questionable data example found above
showed there to be two questionable records: one for a duplicated
publication ID, and one for a duplicated reference. Since
nudge
has 447 records, we can get rid of the resulting
questionable dependencies by setting the accuracy to allow two
discrepant records.
The accuracy
argument for discover
expects
a number between zero and one, determining what proportion of a data
frame’s records need to satisfy a given dependency for the algorithm to
consider it valid. By default, it’s equal to one, so only exact
dependencies are returned. If we set it to less than one, we need to use
the slower DFD search algorithm, since approximate dependency search is
not implemented for the default FDHitsSep algorithm.
show(normalise(discover(
nudge,
accuracy = 1 - 2/nrow(nudge),
method = "DFD",
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
)))
Currently decompose
doesn’t account for approximate
dependencies, resulting in invalid databases, so here we just work with
the database schema.
Compare this to the database schema we arrived at manually: notably, this approximate schema allows approximation to be approximately determined by some publication- and study-level attributes, rather than functionally determined at the effect-size level.

Lowering the accuracy results in more dependencies being found, and
so in more relations. The number of relations can get very large. For
example, suppose we instead set the accuracy to 0.99
,
returning any approximate dependencies that hold in at least 443
records.
nudge_approx_database_schema <- discover(
nudge,
accuracy = 0.99,
method = "DFD",
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
) |>
normalise()
show(nudge_approx_database_schema)
This is a little overwhelming, so we’ll use a utility function called
reduce
. This returns only the named “main” relation, and
any relations for which it is a descendant via foreign key references.
Reducing the approximate schema, with the effect size table as the main
table, gives us this set of relations:
show(reduce(nudge_approx_database_schema, "es_id"))
Questionable intermediate relations aside, we can see that there is now a publication-level relation with the publication ID as a key, since it determines the publication attributes once we remove one of the duplicate records we discovered before.
reduce
can also be used on databases, where it considers
each relation with the largest number of records as a main table. It
needs to be used with caution: while the relations it removes are often
spurious, it’s possible to find databases where relations not linked to
the main relation by foreign key references are required to rejoin to
the original data frame, so reduction can remove necessary relations.
Its intent is mostly to make glancing at database plots more
manageable.
The next normal form after third normal form (3NF) is Boyce-Codd normal form (BCNF). Ensuring BCNF is enforced by the database is trickier, as in some cases it can’t be enforced with just relations and foreign key constraints.
However, the package includes an option to convert to enhanced third normal form, also known as LTK form, which can be so enforced. This enhancement is tangential to BCNF, and could also be used to enhance schemas in BCNF.
In brief, the standard normal forms only put constraints on the attributes present in each relation, considered one relation at a time. The enhancement is a constraint on the attributes present in a relation, while considering their presence in other relations. If an attribute in a relation can be removed, and still be determined from that relation by joining it to others, then the attribute is “avoidable”, and can be removed. If the attribute is in any of the relation’s keys, those keys are replaced by keys that use the attributes not being removed. This removes attributes from relations without removing any information from the database as a whole.
For example, we can take this simple example from Chapter 6 of The Theory of Relational Databases, by David Maier:
avoid_deps <- functional_dependency(
list(
list("A", "B"),
list("B", "A"),
list(c("A", "C"), "D"),
list(c("A", "C"), "E"),
list(c("B", "D"), "C")
),
attrs_order = c("A", "B", "C", "D", "E")
)
avoid_deps
#> 5 functional dependencies
#> 5 attributes: A, B, C, D, E
#> A -> B
#> B -> A
#> A, C -> D
#> A, C -> E
#> B, D -> C
avoid_schema <- normalise(avoid_deps)
show(avoid_schema)
Attributes A
and B
are equivalent, since
relation A
has them both as a key. In other words, relation
A
is a simple lookup relation. Because of this, we could
remove B
from relation A_C
, and replace the
key B, D
with A, D
, which is equivalent when
accounting for relation A
.
We can have this removal of avoidable attributes done automatically,
using the remove_avoidable
flag for
normalise
:
avoid_schema_removed <- normalise(
avoid_deps,
remove_avoidable = TRUE
)
show(avoid_schema_removed)
This schema is now in LTK form, with no remaining avoidable
attributes. We could have also removed A
from relation
A_C
instead of B
, so this process may not have
a unique result. The package’s implementation prefers to remove
attributes that appear later in the original relation.
When autodb
, or autoref
, generates foreign
key references for relations decomposed from a single table, it
naturally tends to a structure where there is a single low-level data
relation on the left, and higher-level relations are “downstream” with
respect to references, with the number of records strictly decreasing as
we follow references to higher-level data relations.
The nice thing about this is that, while the schema is created constructively, we can view it as a series of decompositions: start with the original table, split it up into two or more sub-tables, and repeat. Rejoining the database would then do this in reverse.
Unfortunately, not every set of functional dependencies allows this. Consider the following:
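The functional dependencies for this example aren’t shown here; a set with the behaviour described below might look like the following (an assumed reconstruction, chosen to match the relations in the mock-up further down):
# assumed example: a determines b and c, which in turn determine d and e,
# while d and e together determine f
disconnected_deps <- functional_dependency(
  list(
    list("a", "b"),
    list("a", "c"),
    list("b", "d"),
    list("c", "e"),
    list(c("d", "e"), "f")
  ),
  attrs_order = c("a", "b", "c", "d", "e", "f")
)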
Bernstein synthesis, and automatic foreign keys, give the following schema:
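Under the assumed dependencies above, that schema would be produced by:
show(normalise(disconnected_deps))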
The schema is not connected, because d_e
’s key is not
present in any other relation. In terms of a series of decompositions,
we could think of the original table being split into
{a,b,c,d,e}
and {d,e,f}
first, with a
reference between them, and d
and e
being
split off later, destroying the reference.
In short, while Boyce-Codd normal form may be the first normal form that can’t always be satisfied with just relations, if we also want to keep all the relations related to each other with foreign key references for consistency, then third normal form also can’t always be satisfied.
One way to solve this, which is a planned addition, is to allow virtual relations, also known as views. These are relations whose data is not stored, because it’s generated from other relations that do store their data (real relations).
Usually, views are used to allow easy “look-up” of commonly-queried aggregates. Here, we could use them to maintain the connected schema structure. Here is a mock-up of what the resulting schema might look like:
if (requireNamespace("DiagrammeR", quietly = TRUE)) DiagrammeR::grViz("digraph {
rankdir = \"LR\"
node [shape=plaintext];
\"a\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">a</TD></TR>
<TR><TD PORT=\"TO_a\">a</TD><TD PORT=\"FROM_a\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\"></TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\"></TD></TR>
</TABLE>>];
\"b\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">b</TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\"></TD></TR>
</TABLE>>];
\"c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">c</TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\"></TD></TR>
</TABLE>>];
\"d_e\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">d_e</TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_f\">f</TD><TD PORT=\"FROM_f\"></TD></TR>
</TABLE>>];
\"view\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
<TR><TD>e</TD></TR>
<TR><TD>f</TD></TR>
</TABLE>>];
\"a\":FROM_b -> \"b\":TO_b;
\"a\":FROM_c -> \"c\":TO_c;
\"view\":TITLE -> \"a\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view\":TITLE -> \"d_e\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
}")
Note that virtual relations have different rules to real relations:

- view does have a as a key, but this key can be inferred from the real relations it’s a merging / joining of, so it does not need to be stated explicitly.
- view contains {a, b}, but does not introduce a redundant instance of {a} -> b in addition to the one in relation a.

This addition of virtual relations allows databases to always be connected as a tree. This is useful for visualisation, and for planning rejoins, since those can always be done by merging tables along foreign key references.
We could explicitly show all of the decomposition steps as virtual relations, but it’s not clear that it’s worth the bother:
if (requireNamespace("DiagrammeR", quietly = TRUE)) DiagrammeR::grViz("digraph {
rankdir = \"LR\"
node [shape=plaintext];
\"a\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">a</TD></TR>
<TR><TD PORT=\"TO_a\">a</TD><TD PORT=\"FROM_a\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\"></TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\"></TD></TR>
</TABLE>>];
\"b\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">b</TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\"></TD></TR>
</TABLE>>];
\"c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">c</TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\"></TD></TR>
</TABLE>>];
\"d_e\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">d_e</TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_f\">f</TD><TD PORT=\"FROM_f\"></TD></TR>
</TABLE>>];
\"view\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
<TR><TD>e</TD></TR>
<TR><TD>f</TD></TR>
</TABLE>>];
\"view2\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
<TR><TD>e</TD></TR>
</TABLE>>];
\"view3\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
</TABLE>>];
\"a\":FROM_b -> \"b\":TO_b;
\"a\":FROM_c -> \"c\":TO_c;
\"view\":TITLE -> \"view2\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view\":TITLE -> \"d_e\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view2\":TITLE -> \"view3\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view2\":TITLE -> \"c\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view3\":TITLE -> \"a\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view3\":TITLE -> \"b\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
}")
Another use of the virtual relations mentioned above is that they make BCNF always achievable. For example, take the following standard example of a set of functional dependencies that cannot be expressed in BCNF using just relations:
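The functional dependencies themselves aren’t shown above; the standard example being referred to is presumably {a, b} -> c together with c -> a. Assuming that, we can reconstruct the schema object used in the code below:
# assumed reconstruction of the standard BCNF counter-example
bcnf_example_deps <- functional_dependency(
  list(
    list(c("a", "b"), "c"),
    list("c", "a")
  ),
  attrs_order = c("a", "b", "c")
)
schema <- normalise(bcnf_example_deps)
show(schema)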
The above schema is in third normal form. However, it is not in the higher Boyce-Codd normal form (BCNF), because c -> a is present in relation a_b, but not in terms of a key. This results in some redundancy, but also means that the two representations of c -> a can hold sets of data that aren’t mutually coherent:
db <- schema |>
create() |>
insert(data.frame(c = 1:2, a = 1:2), relations = "c") |>
insert(data.frame(a = 1:2, b = 1L, c = 2:1), relations = "a_b")
knitr::kable(records(db)$a_b)
a | b | c |
---|---|---|
1 | 1 | 2 |
2 | 1 | 1 |
knitr::kable(records(db)$c)
c | a |
---|---|
1 | 1 |
2 | 2 |
The standard way to fix this schema to satisfy BCNF is by splitting
relation a_b
on the violated functional dependency, namely
c -> a
:
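A sketch of what that split might look like, built manually under the same assumptions as above, with relation and attribute names matching the real relations in the mock-up further down:
# assumed reconstruction: split a_b on c -> a, leaving relations c and b_c
bcnf_split_schema <- database_schema(
  relation_schema(
    list(
      c = list(c("a", "c"), list("c")),
      b_c = list(c("b", "c"), list(c("b", "c")))
    ),
    c("a", "b", "c")
  ),
  list(list("b_c", "c", "c", "c"))
)
show(bcnf_split_schema)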
The resulting schema is now in BCNF, but we have a different problem:
the dependency {a, b} -> c
has disappeared! Fixing this
can’t be done with just real relations, but we can solve it by adding a
virtual relation. Here is a mock-up of what the resulting schema might
look like:
if (requireNamespace("DiagrammeR", quietly = TRUE)) DiagrammeR::grViz("digraph {
rankdir = \"LR\"
node [shape=plaintext];
\"c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">c</TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_a\">a</TD><TD PORT=\"FROM_a\"></TD></TR>
</TABLE>>];
\"b_c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">b_c</TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
</TABLE>>];
\"view\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">view</TD></TR>
<TR><TD>a</TD><TD BGCOLOR=\"black\"></TD></TR>
<TR><TD>b</TD><TD BGCOLOR=\"black\"></TD></TR>
<TR><TD>c</TD><TD></TD></TR>
</TABLE>>];
\"b_c\":FROM_c -> \"c\":TO_c;
\"view\":TITLE -> \"b_c\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view\":TITLE -> \"c\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
}")
Some features to note:

- The key of view is required, since it can’t be inferred from the real relations. This is how we preserve {a, b} -> c when decomposing to BCNF.
- view references both of the real relations, to show that it’s the result of combining both of them. This means that view “references” relation c both directly, and via b_c. For real relations, the direct reference would be removed as being transitive, but in this case both give necessary information. If virtual relations get implemented, virtual-to-real “references” would be displayed differently, to make this clear.
- view contains both a and c, as a_b did before. Unlike for a_b, this is not an extra appearance of c -> a, so is not a violation of BCNF. This is because, since view is virtual, its a and c values are exactly the same as those in relation c, so there is no incoherency possible.

This addresses how to decompose data with NA values. See the “Handling missing values” section for details.
Relational theory is based on data being given as relations, which
cannot have duplicate records. However, it’s expected that an R user
might pass in a data frame with duplicates: for example, R comes with
some datasets, such as iris
, that have duplicates if we
don’t include the row names.
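For example:
# iris has duplicate rows once row names are ignored
any(duplicated(iris))
#> [1] TRUE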
At the moment, these duplicates are kept when searching for dependencies. This can affect results for approximate dependencies, since it changes how many records must be removed for a given dependency to be satisfied. However, the duplicates are still removed when the data frame is decomposed into a database. I’m not yet certain whether these duplicates are best handled by removing them before beginning the search, or by simply returning an error if there are duplicate records.
Strictly speaking, autodb
does not search for functional
dependencies, because these can’t account for missing values. Instead,
it searches for a weaker variant, which has been called a
literal functional dependency (LFD) in the literature.
The functional dependency X -> Y
holds if, for any
two possible records whose values in X
are equal,
their values in Y
are also equal. This requires
attribute values to take the equality operator, ==
in R. Additionally, they must take it under binary logic: whether two
values are equal can only be true or false, not missing. This disallows
the presence of missing values (NA
and NaN
),
since testing for equality with a missing value returns a missing
result.
The literal functional dependency X -> Y
holds, or
the functional dependency X -> Y
holds literally, if,
for any two possible records whose values in X
are
identical, their values in Y
are also
identical, where values are identical if both are non-missing
and equal, or both are missing. In other words, missing values are
treated as not equal to non-missing values, but as equal to each other.
This requires attribute classes to take the identity operator,
identical()
with default optional parameter values in R.
Again, they must take it under binary logic, but the identity operator
cannot return non-binary results anyway: In R, for example,
identical()
can only return TRUE
or
FALSE
, never NA
. There is, therefore, no
constraint on missing values in attribute classes, or on attribute
classes in general.
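The difference is easy to see directly in R (a quick illustration, not package code):
NA == NA          # equality with a missing value is missing
#> [1] NA
identical(NA, NA) # identity treats two missing values as the same
#> [1] TRUE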
LFDs are more generic than standard functional dependencies: since practically every class takes the identity operator, they make essentially no assumptions about the attribute classes present in a data set.
There are other FD variants that handle missing values, but, unlike most of them, LFDs still satisfy Armstrong’s axioms, allowing them to be used in normalisation in the same way as normal FDs. For example, they still respect transitivity: if X -> Y and Y -> Z literally, then X -> Z literally. This allows us to construct a database schema with Bernstein synthesis as usual, with LFDs instead of FDs, and still get something coherent.
However, ignoring the special status of missing values in this way ignores important structural information. In the relational model, and ideally in relational databases, we want to have no missing values at all, partly to avoid awkward questions about what trinary logic means with respect to filtering or joining tables. For example, take the following data frame:
df_nas <- data.frame(
patient = c(1L, 2L, 3L, 4L),
trial_entry_date = as.Date(c("2022/05/02", "2022/06/06", "2022/04/01", "2022/03/19")),
trial_exit_date = as.Date(c(NA, NA, "2022/10/07", NA))
)
knitr::kable(df_nas)
patient | trial_entry_date | trial_exit_date |
---|---|---|
1 | 2022-05-02 | NA |
2 | 2022-06-06 | NA |
3 | 2022-04-01 | 2022-10-07 |
4 | 2022-03-19 | NA |
autodb
currently treats NA
as just another
value, which results in the initial data frame not being split in the
database schema:
show(autodb(df_nas))
In this case, a missing exit date represents no exit date, i.e. that the patient is still in the trial, and we would prefer to move exit information to a separate relation, containing only patients with an exit date:
ideal_db <- decompose(
df_nas,
database_schema(
relation_schema(
list(
patient = list(c("patient", "trial_entry_date"), list("patient")),
patient_exit = list(c("patient", "trial_exit_date"), list("patient"))
),
names(df_nas)
),
list(list("patient_exit", "patient", "patient", "patient"))
)
)
records(ideal_db)$patient_exit <- subset(records(ideal_db)$patient_exit, !is.na(trial_exit_date))
show(ideal_db)
This makes more of the data’s structure apparent, and enforceable, in
the database structure, since it makes it clear that the exit date is
optional. However, removing missing values in this way isn’t handled by
autodb
itself, so it has to be done by manually creating
the schema and inserting/removing records, as shown above.
Sometimes, the structure for some attributes depends on whether other attributes are missing. Here are two possible reasons:
For example, the data below records values, each of which can be either a single point value, or an interval with an associated distribution, which may have some distribution parameters. This format is common when listing model parameters.
df_options <- data.frame(
id = 1:20,
value = c(2.3, 2.3, 5.7, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
lower_bound = c(NA_real_, NA_real_, NA_real_, 2.4, 0, 1, 0, 5.6, 2.4, 5.3, 5.3, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 5.6, 2.4),
upper_bound = c(NA_real_, NA_real_, NA_real_, 7.1, 10, 10, 13.1, 25.8, 10, 13.1, 10, 25.8, 25.8, 25.8, 25.8,13.1, 13.1, 25.8, 25.8, 25.8),
interval_distribution = factor(c(NA, NA, NA, "uniform", "uniform", "uniform", "uniform", "uniform", "Beta", "Beta", "Beta", "Beta", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "PERT", "PERT", "PERT", "PERT")),
param1 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 2, 2, 2.1, 2, 2, 2, 1, 2, 2),
param2 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 2, 2, 2, 1, 1, 1, NA, NA, NA, NA)
)
knitr::kable(df_options)
id | value | lower_bound | upper_bound | interval_distribution | param1 | param2 |
---|---|---|---|---|---|---|
1 | 2.3 | NA | NA | NA | NA | NA |
2 | 2.3 | NA | NA | NA | NA | NA |
3 | 5.7 | NA | NA | NA | NA | NA |
4 | NA | 2.4 | 7.1 | uniform | NA | NA |
5 | NA | 0.0 | 10.0 | uniform | NA | NA |
6 | NA | 1.0 | 10.0 | uniform | NA | NA |
7 | NA | 0.0 | 13.1 | uniform | NA | NA |
8 | NA | 5.6 | 25.8 | uniform | NA | NA |
9 | NA | 2.4 | 10.0 | Beta | 1.0 | 1 |
10 | NA | 5.3 | 13.1 | Beta | 1.0 | 2 |
11 | NA | 5.3 | 10.0 | Beta | 1.0 | 2 |
12 | NA | 2.4 | 25.8 | Beta | 2.0 | 2 |
13 | NA | 2.4 | 25.8 | Kumaraswamy | 2.0 | 2 |
14 | NA | 2.4 | 25.8 | Kumaraswamy | 2.1 | 1 |
15 | NA | 2.4 | 25.8 | Kumaraswamy | 2.0 | 1 |
16 | NA | 2.4 | 13.1 | Kumaraswamy | 2.0 | 1 |
17 | NA | 2.4 | 13.1 | PERT | 2.0 | NA |
18 | NA | 2.4 | 25.8 | PERT | 1.0 | NA |
19 | NA | 5.6 | 25.8 | PERT | 2.0 | NA |
20 | NA | 2.4 | 25.8 | PERT | 2.0 | NA |
autodb
has no way to know about these relationships,
because it doesn’t treat missing values as a special case, as described
above. The resulting schema therefore ignores them:
db_options <- autodb(df_options)
show(db_options)
However, from looking at the data ourselves, we can see a clearly-implied structure:
As a general hack for finding these sorts of relationships, I like taking each attribute with missing values, and adding a companion attribute that states whether the value is present or missing:
df_options_presence <- df_options[vapply(df_options, anyNA, logical(1))]
df_options_presence[] <- lapply(df_options_presence, Negate(is.na))
names(df_options_presence) <- paste0(names(df_options_presence), "_present")
df_options_with_presence <- cbind(df_options, df_options_presence)
This is not always practical, because adding columns rapidly increases the running time for the dependency search. However, when it’s practical, it helps to make the structure more apparent in the database schema:
db_options_with_presence <- autodb(df_options_with_presence)
show(db_options_with_presence)
Some of the new relations just show that an attribute determines its
associated presence attribute, which is always true. Of more interest is
the value_present
relation, which shows that the values,
bounds, and interval distributions inform the presence of each
other:
knitr::kable(records(db_options_with_presence)$value_present)
value_present | lower_bound_present | upper_bound_present | interval_distribution_present | |
---|---|---|---|---|
1 | TRUE | FALSE | FALSE | FALSE |
4 | FALSE | TRUE | TRUE | TRUE |
In the interval_distribution
relation, we can also see
how the interval distribution determines how many parameters are
required:
knitr::kable(records(db_options_with_presence)$interval_distribution)
interval_distribution | value_present | param1_present | param2_present | |
---|---|---|---|---|
1 | NA | TRUE | FALSE | FALSE |
4 | uniform | FALSE | FALSE | FALSE |
9 | Beta | FALSE | TRUE | TRUE |
13 | Kumaraswamy | FALSE | TRUE | TRUE |
17 | PERT | FALSE | TRUE | FALSE |
This shows additional structure, specifically an existence constraint:
if param2
is present, then param1
is also
present. If we notice a situation like this, we can attempt further
investigation by splitting the data, and passing each part into
autodb
separately. Splitting on whether param1
is present, we get the following database for records where it is
absent:
db_options_with_presence_p1absent <- autodb(subset(
df_options_with_presence,
!param1_present
))
show(db_options_with_presence_p1absent)
This only contains records for point estimates, or intervals with a
uniform distribution, so the distribution value is now equivalent to its
presence/absence, or that of the value and bounds. Due to the existence
constraint mentioned above, param2
is constant, since it’s
always absent:
knitr::kable(records(db_options_with_presence_p1absent)$constants)
param1 | param2 | param1_present | param2_present |
---|---|---|---|
NA | NA | FALSE | FALSE |
We get the following database for records where param1
is present:
show(autodb(subset(
df_options_with_presence,
param1_present
)))
In this case, whether param2
is present is determined by
the distribution used, as we’d expect.
Since this data splitting is currently a manual process, there are practical limits to how far we can take this. In the future, I’d like to add an option to automate it, but this could greatly increase the required run time.
Automatic data normalisation, such as done by autodb
, is
useful, but it isn’t magic. This vignette covers some common issues that
this approach can’t address.
Since the process is agnostic to what kind of data is stored in each
attribute, it can’t find any constraints that depend on the actual data
values. A simple case is that, for example, it won’t state whether an
attribute’s values are within a given range, since that would require
making use of the knowledge that the attribute has a number class, such
as integer
or numeric
.
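Such constraints have to be checked separately. For example, a one-line range check for ChickWeight that discover has no way to suggest:
# a domain-specific check: weights are positive, times are non-negative
all(ChickWeight$weight > 0 & ChickWeight$Time >= 0)
#> [1] TRUE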
Another example of the process being agnostic to data values is that it can’t make any inference based on attribute classes. A common example of this is when we have entities of a given type referring to each other.
For example, suppose we have a dataset of publication citations,
where each row contains the citer and citee IDs, plus supplementary
information about both publications. The resulting database given by
autodb
has the following schema, after giving the relations
appropriate names:
show(database_schema(
relation_schema(
list(
citation = list(c("citer_id", "citee_id"), list(c("citer_id", "citee_id"))),
citer = list(c("citer_id", "citer_title", "citer_author", "citer_year"), list("citer_id")),
citee = list(c("citee_id", "citee_title", "citee_author", "citee_year"), list("citee_id"))
),
c(
"citer_id", "citer_title", "citer_author", "citer_year",
"citee_id", "citee_title", "citee_author", "citee_year"
)
),
list(
list("citation", "citer_id", "citer", "citer_id"),
list("citation", "citee_id", "citee", "citee_id")
)
))
Of course, citers and citees are both publications of the same type, so they shouldn’t have separate relations:
show(database_schema(
relation_schema(
list(
citation = list(c("citer", "citee"), list(c("citer", "citee"))),
publication = list(c("id", "title", "author", "year"), list("id"))
),
c("citer", "citee", "id", "title", "author", "year")
),
list(
list("citation", "citer", "publication", "id"),
list("citation", "citee", "publication", "id")
)
))
We make this improvement because citer_id
and
citee_id
values are, semantically, the same class of
object, and the same goes for authors, titles, etc. Semantic types are
not something that can be inferred by just looking at the data
classes.
If we don’t account for this semantic identity, the separate citer and citee information relations in the original schema can both hold information for the same publication. If their information is the same, then there is unwanted redundancy. If their information is different, there is a data incoherency not prevented by the schema. Neither situation is desirable.
Currently, the only way to account for this semantic identity is to create or modify the schema manually, as above.
merge.data.frame
Rejoining databases, and checking relations satisfy foreign key
constraints, is done using merge.data.frame
. This means
that data classes that don’t work properly in merge
aren’t
guaranteed to work properly in autodb
.
Any such issues come from how certain data classes are handled during
merges in the base language, so they are issues with R, rather than with
autodb
, and I have no plans to fix them. If
autodb
seems to have odd failures, check that used data
classes behave correctly in merges.
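One quick way to do such a check is to merge a small data frame of the relevant class with itself and confirm the rows match up as expected (a sketch; the column names are arbitrary):
# a single row merged with itself on its key should give exactly one row
x <- data.frame(
  k = as.POSIXct("2024-01-01 12:00:00", tz = "UTC"),
  v = 1
)
merge(x, x, by = "k")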
For example, in older versions of R, the built-in POSIXct date/time
class didn’t have values merged properly, because the merge didn’t
account for differences in time zone / daylight saving time. This would
result in, for example, the nycflights13::weather
data set
violating the foreign key constraints of its own discovered schema,
since one foreign key used a POSIXct attribute.
A more complex example, that still applies and probably always will,
is a merge where two attributes being merged on have different classes.
In general, this is allowed: since autodb
is written for R,
a dynamically-typed language, it follows SQLite in not constraining the
user much when it comes to data classes in schemas. For primitive
classes, R’s class coercion usually makes things work as you’d
expect.
However, merging a factor attribute with a non-factor, non-character attribute should be handled with caution if the latter contains values not in the factor’s level set.
For example, we can define these data frames:
df_badmerge_int <- cbind(
expand.grid(
a = c(NA, 0L, 1L),
b = c(NA, FALSE, TRUE)
),
row = 1:9
)
df_badmerge_factor <- df_badmerge_int
df_badmerge_factor$a <- as.factor(df_badmerge_factor$a)
knitr::kable(df_badmerge_int)
a | b | row |
---|---|---|
NA | NA | 1 |
0 | NA | 2 |
1 | NA | 3 |
NA | FALSE | 4 |
0 | FALSE | 5 |
1 | FALSE | 6 |
NA | TRUE | 7 |
0 | TRUE | 8 |
1 | TRUE | 9 |
df_badmerge_logical <- df_badmerge_int
df_badmerge_logical$a <- as.logical(df_badmerge_logical$a)
names(df_badmerge_logical)[[3]] <- "row2"
knitr::kable(df_badmerge_logical)
a | b | row2 |
---|---|---|
NA | NA | 1 |
FALSE | NA | 2 |
TRUE | NA | 3 |
NA | FALSE | 4 |
FALSE | FALSE | 5 |
TRUE | FALSE | 6 |
NA | TRUE | 7 |
FALSE | TRUE | 8 |
TRUE | TRUE | 9 |
We can then merge the data frame with logical a
with the
other two, keeping the row
attributes to track which
records were merged.
Whichever other data frame we merge with, the two sets of
a
values have different classes, so R does coercion. When
merging with just a
, this gives the result we’d expect, for
both other data frames and regardless of merge order. For the integer
version, the logical values are coerced to integers:
knitr::kable(merge(
df_badmerge_int[, c("a", "row")],
df_badmerge_logical[, c("a", "row2")]
))
a | row | row2 |
---|---|---|
0 | 2 | 2 |
0 | 2 | 5 |
0 | 2 | 8 |
0 | 5 | 2 |
0 | 5 | 5 |
0 | 5 | 8 |
0 | 8 | 2 |
0 | 8 | 5 |
0 | 8 | 8 |
1 | 6 | 6 |
1 | 6 | 3 |
1 | 6 | 9 |
1 | 3 | 6 |
1 | 3 | 3 |
1 | 3 | 9 |
1 | 9 | 6 |
1 | 9 | 3 |
1 | 9 | 9 |
NA | 1 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 7 | 1 |
NA | 7 | 7 |
NA | 7 | 4 |
NA | 4 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
knitr::kable(merge(
df_badmerge_logical[, c("a", "row2")],
df_badmerge_int[, c("a", "row")]
))
a | row2 | row |
---|---|---|
FALSE | 2 | 2 |
FALSE | 2 | 5 |
FALSE | 2 | 8 |
FALSE | 5 | 2 |
FALSE | 5 | 5 |
FALSE | 5 | 8 |
FALSE | 8 | 2 |
FALSE | 8 | 5 |
FALSE | 8 | 8 |
TRUE | 6 | 6 |
TRUE | 6 | 3 |
TRUE | 6 | 9 |
TRUE | 3 | 6 |
TRUE | 3 | 3 |
TRUE | 3 | 9 |
TRUE | 9 | 6 |
TRUE | 9 | 3 |
TRUE | 9 | 9 |
NA | 1 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 7 | 1 |
NA | 7 | 7 |
NA | 7 | 4 |
NA | 4 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
For the factor version, the logical values are coerced to factors,
but they don’t match any of the given levels, so they all become
NA
:
knitr::kable(merge(
df_badmerge_factor[, c("a", "row")],
df_badmerge_logical[, c("a", "row2")]
))
a | row | row2 |
---|---|---|
NA | 7 | 7 |
NA | 7 | 4 |
NA | 7 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
NA | 4 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 1 | 1 |
knitr::kable(merge(
df_badmerge_logical[, c("a", "row2")],
df_badmerge_factor[, c("a", "row")]
))
a | row2 | row |
---|---|---|
NA | 7 | 7 |
NA | 7 | 4 |
NA | 7 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
NA | 4 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 1 | 1 |
However, we see unexpected behaviour with the factor version, when
also merging on another attribute, b
: the merge result now
depends on the input order. With the factor version first, the result is
similar to before:
knitr::kable(merge(
df_badmerge_factor,
df_badmerge_logical
))
#> Warning in `[<-.factor`(`*tmp*`, ri, value = c(NA, FALSE, TRUE, NA, FALSE, :
#> invalid factor level, NA generated
a | b | row | row2 |
---|---|---|---|
NA | FALSE | 4 | 4 |
NA | FALSE | 4 | 5 |
NA | FALSE | 4 | 6 |
NA | NA | 1 | 1 |
NA | NA | 1 | 2 |
NA | NA | 1 | 3 |
NA | TRUE | 7 | 7 |
NA | TRUE | 7 | 8 |
NA | TRUE | 7 | 9 |
With the logical version first, however, only the logical
a
values that are NA
before coercion are kept,
rather than all of them:
knitr::kable(merge(
df_badmerge_logical,
df_badmerge_factor
))
a | b | row2 | row |
---|---|---|---|
NA | FALSE | 4 | 4 |
NA | NA | 1 | 1 |
NA | TRUE | 7 | 7 |
In autodb
, this could be a problem if two relations in a
foreign key constraint contain the same attribute, but one of them has
it store factors in records, due to separate insertions. Letting an
attribute’s class vary across relations should, therefore, be done with
caution.
Bernstein’s synthesis is guaranteed to minimise the number of relations created for a given set of functional dependencies, and removing avoidable attributes can reduce the number of attributes in those relations. However, there can still be redundant keys. For example, we can take the following set of functional dependencies:
fds_redkey <- functional_dependency(
list(
list("a", "b"),
list("d", "c"),
list(c("b", "d"), "a"),
list("a", "c"),
list(c("b", "c"), "d")
),
letters[1:4]
)
fds_redkey
#> 5 functional dependencies
#> 4 attributes: a, b, c, d
#> a -> b
#> d -> c
#> b, d -> a
#> a -> c
#> b, c -> d
Normalising gives the following relations:
schema_redkey <- normalise(fds_redkey, remove_avoidable = TRUE)
show(schema_redkey)
These relations have some redundancy: relation a
implies
{b, d} -> c
, but relation d
implies that
{d} -> c
. This isn’t resolved by removing avoidable
attributes, because d
still needs to be in relation
a
: we just need to remove {b, d}
as a key.
However, this is resolved if we instead use this set of functional
dependencies, which is equivalent to the previous set:
fds_redkey_fix <- functional_dependency(
list(
list("a", "b"),
list("d", "c"),
list(c("b", "c"), "a"),
list("a", "d")
),
letters[1:4]
)
fds_redkey_fix
#> 4 functional dependencies
#> 4 attributes: a, b, c, d
#> a -> b
#> d -> c
#> b, c -> a
#> a -> d
schema_redkey_fix <- normalise(fds_redkey_fix, remove_avoidable = TRUE)
show(schema_redkey_fix)
Unfortunately, there’s no way in the package to find better sets like this.
Normal forms only refer to one relation at a time: referring to a database as being in a given normal form just means that each of its relations is in that normal form individually. This doesn’t address the addition of foreign key references, but, even if the references are sensible, schemas that are in higher normal forms may not be ones that we’d want to use.
For example, take this database schema, whose relation schemas are in third normal form:
dup_db <- autodb(ChickWeight)
show(dup_db)
If we now create copies of these relations, with the intention that copies always contain the same data, then all relations are still in third normal form, and so we’d say this database is also in third normal form:
show(dup_db[c(1, 1, 2, 2, 2)])
However, no one would claim that this is a good database design,
since there is clearly a large amount of data redundancy. Higher normal
forms would not change this. In fact, since the copies do not reference
each other, it is also trivial to insert data to make this database
incoherent. For example, Chick
and Chick.1
could contain different diet assignments.