library(autodb)
#>
#> Attaching package: 'autodb'
#> The following object is masked from 'package:stats':
#>
#> decompose
To avoid confusion, this vignette uses consistent terminology: a database consists of relations, which have records with the same attributes. This is in line with the terms used in the relational model; in statistics, we would instead talk about tables, which have rows with values for the same variables, with a column for each variable.
For the original data set, which is a flat table, we refer to a data frame instead, which also consists of records with attributes.
When we talk about the type of values an attribute can take, we talk about a class, as in the relational model and R, rather than a value type, as in many other programming languages.
autodb
is described in the DESCRIPTION file in the
following way:
Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.)
Let us begin there.
Database normalisation works on the principle that our data should only express a piece of information in a single place. This is done to reduce the chance that a modification will accidentally change or remove other pieces of information.
Data cleaning works on the principle that we remove duplicate observations and structural errors, and have a clear understanding of any associations between different observations, or parts of observations. This is done to ensure a high quality for the data used in analysis, but I have found that it is equally important for making sure that I understand the domain I am hoping to model.
Normalisation is helpful for both of these areas. I usually use the original, flat, “tidy” data for the analysis itself. However, for checking the data’s consistency, and for using the data to learn about the domain, normalised data is often helpful, whether it’s used for the analysis or not.
I usually do data checking under the following conditions:
It’s in this context that I would usually have to do the normalisation by hand, and I would like to have a tool that does it semi-automatically. This package is intended to be that tool, using a minimal number of additional libraries.
Good database design is a hard and involved process, which requires a strong grasp of the domain that the database will hold information on. Please don’t try to use this package to do it on autopilot. In particular, the conditions described above mean we do things differently from good practice for normalised relational databases:
merge function.

autodb gets a given data frame to third normal form:
every attribute depends on the whole key(s), and non-key attributes
depend on nothing but the key(s). This was chosen because there is an
existing algorithm, Bernstein’s synthesis, for normalising to third
normal form, and because it’s the highest normal form that is attainable
with arbitrary original data frames. There is an additional enhancement
available as an option: see the section on avoidable attributes for more
details.
For most of these simple examples, we use the
ChickWeight
dataset, included with base R.
summary(ChickWeight)
#> weight Time Chick Diet
#> Min. : 35.0 Min. : 0.00 13 : 12 1:220
#> 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
#> Median :103.0 Median :10.00 20 : 12 3:120
#> Mean :121.8 Mean :10.72 10 : 12 4:118
#> 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
#> Max. :373.0 Max. :21.00 19 : 12
#> (Other):506
autodb
The simplest way to do normalisation is with the autodb
function:
db <- autodb(ChickWeight)
db
#> database with 2 relations
#> 4 attributes: weight, Time, Chick, Diet
#> relation Chick: Chick, Diet; 50 records
#> key 1: Chick
#> relation Time_Chick: Time, Chick, weight; 578 records
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
There is no plotting method within the autodb
package
itself. Instead, there are functions to write normalised objects as
inputs to the Graphviz visualisation software, using the DOT
language.
The generic function to do this is gv
, which calls the
gv.database
method on our database, db
:
db_text <- gv(db)
cat(db_text)
#> digraph {
#> rankdir = "LR"
#> node [shape=plaintext];
#>
#> "Chick" [label = <
#> <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#> <TR><TD COLSPAN="3">Chick (50 records)</TD></TR>
#> <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#> <TR><TD PORT="TO_diet">Diet</TD><TD></TD><TD PORT="FROM_diet">factor</TD></TR>
#> </TABLE>>];
#> "Time_Chick" [label = <
#> <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#> <TR><TD COLSPAN="3">Time_Chick (578 records)</TD></TR>
#> <TR><TD PORT="TO_time">Time</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_time">numeric</TD></TR>
#> <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#> <TR><TD PORT="TO_weight">weight</TD><TD></TD><TD PORT="FROM_weight">numeric</TD></TR>
#> </TABLE>>];
#>
#> "Time_Chick":FROM_chick -> "Chick":TO_chick;
#> }
This can be saved to a file for passing to Graphviz elsewhere, or can
be passed into an R function that renders Graphviz code. This vignette
uses grViz
from the DiagrammeR
package. If
you’re writing a document in Quarto, you can call Graphviz directly
using the dot
engine; this approach avoids loading 2MB of
Javascript for HTML widgets into the rendered file.
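For example, one way to save the DOT code for rendering outside of R (the file name here is just for illustration):
# write the DOT source to a file; render it with Graphviz later,
# e.g. dot -Tpng chickweight_db.dot -o chickweight_db.png
writeLines(db_text, "chickweight_db.dot")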
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  # render a database, schema, or data frame as a Graphviz plot
  show <- function(x) DiagrammeR::grViz(gv(x))
  maybe_plot <- function(x) DiagrammeR::grViz(gv(x))
} else {
  # fall back to text output, and skip optional plots entirely
  show <- print
  maybe_plot <- function(x) invisible(NULL)
}
maybe_plot(db)
Each relation is represented as a box, with the top row giving the relation name and the number of records, and the other rows detailing the attributes. In the middle is a grid of cells: each column of cells represents a (candidate) key for the relation, where attributes in that key have their cell filled in black.
We can see that ChickWeight
was split into two
relations, with a weight for each unique combination of Chick number and
time, and the Chick number by itself determining the diet. We can also
see the number of records in each relation, and the classes of the
attributes in those relations.
gv
also has methods for data.frame
, and for
database_schema
, which is introduced later.
Having looked at the final result first, we now look at the
individual steps. The first step is to find the (non-trivial and
minimal) dependencies present in the given data frame. There are various
ways to do this; by default, the package uses FDHitsSep, a depth-first
search algorithm. We run this using the discover
function,
setting progress
to TRUE
to see the steps
taken:
deps <- discover(ChickWeight, progress = TRUE)
#> formatting numerical/complex variables with 7 significant digits
#> simplifying data types
#> calculating single-attribute PLIs
#> sampling difference sets
#> 7 initial diffsets
#>
#> dependant weight
#> dependant Time
#> dependant Chick
#> dependant Diet
#>
#> FDHitsSep complete
#> 9 final diffsets
#> 10 nodes visited
#> 7 partitions cached
deps
#> 2 functional dependencies
#> 4 attributes: weight, Time, Chick, Diet
#> Time, Chick -> weight
#> Chick -> Diet
After simplifying the data to something quicker to iterate over – specifically, by converting all attributes to have integer values – the algorithm takes each attribute in turn as a possible dependant, and finds sets of the other attributes that determine it.
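As a quick sanity check, we can verify one of the discovered dependencies by hand: Chick -> Diet holds exactly when each chick is associated with a single diet (a small sketch, not part of the package workflow):
# number of distinct diets per chick; the dependency holds if all are 1
diets_per_chick <- tapply(
  ChickWeight$Diet,
  ChickWeight$Chick,
  function(x) length(unique(x))
)
all(diets_per_chick == 1)
#> [1] TRUE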
The result is a list of functional dependencies, in the format
determinant set -> dependant
, with an attribute named
attrs_order
that gives all the attribute names in their
original order. Each of these three parts has its own (generic)
extraction function:
detset(deps)
#> [[1]]
#> [1] "Time" "Chick"
#>
#> [[2]]
#> [1] "Chick"
dependant(deps)
#> [1] "weight" "Diet"
attrs_order(deps)
#> [1] "weight" "Time" "Chick" "Diet"
The former two, especially, are useful for filtering, as we’ll see later.
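For example, here’s a quick sketch of keeping only the dependencies with single-attribute determinant sets:
deps[lengths(detset(deps)) == 1]
#> 1 functional dependency
#> 4 attributes: weight, Time, Chick, Diet
#> Chick -> Diet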
Now that we have a list of discovered dependencies, we can construct a database schema, where the relation schemas are normalised to third normal form. This is done using a version of Bernstein’s synthesis.
schema <- synthesise(deps)
schema
#> 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
Like the database before, we can also plot this database schema:
maybe_plot(schema)
This is similar to the database plot given before, but some information that requires the data frame itself is not present. We do have class information about the attributes, extracted from the data frame during dependency discovery, but we have no record counts for the individual relation schemas. However, we do have automatically-generated names for the individual relations.
Additionally, at this point we have no connections between the relation schemas, since Bernstein’s synthesis doesn’t supply information about foreign key references. We could use this database schema to build a database, but we’d rather add the foreign key references first.
Let’s look at a different data set for a moment, to see some cases where we don’t want to use the dependencies as given. We’ll use the Titanic data set, also provided with base R. This data is in array form, so we first convert it to data frame form:
knitr::kable(as.data.frame(Titanic))
Class | Sex | Age | Survived | Freq |
---|---|---|---|---|
1st | Male | Child | No | 0 |
2nd | Male | Child | No | 0 |
3rd | Male | Child | No | 35 |
Crew | Male | Child | No | 0 |
1st | Female | Child | No | 0 |
2nd | Female | Child | No | 0 |
3rd | Female | Child | No | 17 |
Crew | Female | Child | No | 0 |
1st | Male | Adult | No | 118 |
2nd | Male | Adult | No | 154 |
3rd | Male | Adult | No | 387 |
Crew | Male | Adult | No | 670 |
1st | Female | Adult | No | 4 |
2nd | Female | Adult | No | 13 |
3rd | Female | Adult | No | 89 |
Crew | Female | Adult | No | 3 |
1st | Male | Child | Yes | 5 |
2nd | Male | Child | Yes | 11 |
3rd | Male | Child | Yes | 13 |
Crew | Male | Child | Yes | 0 |
1st | Female | Child | Yes | 1 |
2nd | Female | Child | Yes | 13 |
3rd | Female | Child | Yes | 14 |
Crew | Female | Child | Yes | 0 |
1st | Male | Adult | Yes | 57 |
2nd | Male | Adult | Yes | 14 |
3rd | Male | Adult | Yes | 75 |
Crew | Male | Adult | Yes | 192 |
1st | Female | Adult | Yes | 140 |
2nd | Female | Adult | Yes | 80 |
3rd | Female | Adult | Yes | 76 |
Crew | Female | Adult | Yes | 20 |
This is a simple set of data, with a single count observation,
Freq
, for each combination of the four determining
attributes. In other words, the relation is already normalised, so we
only expect one relation in the normalised database.
If we use autodb
again, we get the following
database:
show(autodb(as.data.frame(Titanic)))
Oops! The search found some functional dependencies where the count could be used to determine another attribute. These are clearly spurious: frequency count can’t causally determine age, for example. However, the algorithm still finds these dependencies, because the counts are unique often enough to make these dependencies hold in the given data.
There are two approaches we can take to eliminate these spurious
dependencies: not letting them be detected in the first place, and
removing them before using synthesise
.
To stop them being detected, we can put constraints on what is
discovered by discover
: we can ask for certain attributes
to not be considered as determinants, or we can exclude attributes that
inherit from certain classes. In this example, we could exclude
Freq
from being considered:
titanic_deps_freqonly <- discover(as.data.frame(Titanic), exclude = "Freq")
titanic_deps_freqonly
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq
Alternatively, we could exclude all attributes that inherit from “numeric”:
identical(titanic_deps_freqonly, discover(as.data.frame(Titanic), exclude_class = "numeric"))
#> [1] TRUE
These can both be used as arguments to autodb
too:
show(autodb(as.data.frame(Titanic), exclude = "Freq"))
Excluding numeric attributes as determinants is often useful, because we expect non-integer numbers to be measurements, not parts of a primary key.
Alternatively, we could remove unwanted dependencies before using
decompose
. Here are the found dependencies, if we don’t
exclude anything:
titanic_deps <- discover(as.data.frame(Titanic))
titanic_deps
#> 3 functional dependencies
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Sex, Survived, Freq -> Age
#> Class, Survived, Freq -> Age
#> Class, Sex, Age, Survived -> Freq
We can remove the unwanted dependencies, where Age is the dependant, by keeping only those where Freq is the dependant:
titanic_deps[dependant(titanic_deps) == "Freq"]
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq
Getting back to our ChickWeight
example, we now have a
database schema, consisting of a list of relation
schemas.
However, we have no information about how these relation schemas are
linked to each other. In particular, we have no information about
foreign keys. We can add this information using
autoref
:
linked_schema <- autoref(schema)
linked_schema
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
We could also have used normalise
, instead of
synthesise
and autoref
separately:
normalise(deps)
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
Plotting this updated database schema shows the same relation schemas as before, linked together by foreign key references:
show(linked_schema)
Finally, once we have our normalised database schema, we can apply it
to our original data frame, or a new one with the same structure. This
results in a normalised database, as we got from using
autodb
:
db2 <- decompose(ChickWeight, linked_schema)
show(db2)
We now have the record counts added.
We can reverse the process of turning a data frame into a database
with the rejoin
function. This may not be identical to
ChickWeight
, since the rows may have been rearranged.
However, we can use the convenience function df_equiv
to
check for equivalence under row reordering:
rejoined <- rejoin(db)
summary(rejoined)
#> weight Time Chick Diet
#> Min. : 35.0 Min. : 0.00 13 : 12 1:220
#> 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
#> Median :103.0 Median :10.00 20 : 12 3:120
#> Mean :121.8 Mean :10.72 10 : 12 4:118
#> 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
#> Max. :373.0 Max. :21.00 19 : 12
#> (Other):506
identical(rejoined, ChickWeight)
#> [1] FALSE
df_equiv(rejoined, ChickWeight)
#> [1] TRUE
When rejoined, the relation attributes will be in the original order. However, the record order might have changed.
Included in the package is a 447-by-25 data frame called
nudge
:
show(nudge)
This is the data set for a meta-analysis, looking at the effectiveness of “nudge” interventions. Measurements are taken within a three-layer hierarchy: publications contain studies, which contain effect size measurements. We expect this hierarchy to be apparent in the normalisation.
Getting full dependency information for this relation can take a long time, so here we search for a reduced set, not considering numeric or sample size attributes as determinants:
nudge_deps <- discover(
nudge,
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
)
nudge_schema <- normalise(nudge_deps, remove_avoidable = TRUE)
show(nudge_schema)
We can see a relation, with many attributes, determined by the effect
size ID number, es_id
. This contains all of the numeric
measurements, as expected in the relation for the lowest level in the
hierarchy. As also expected, this has a foreign key reference to a
relation for study-level information, determined by the study ID number,
study_id
.
However, the publication-level relation this refers to is not
determined by the publication ID, publication_id
, as we
would expect; neither is it determined by the publication reference.
Instead, it is solely determined by the publication’s title: to use the
ID, we would need to supplement it with other information, like the
publication year, or other choices that come from the middle column of
relations, which look likely to be spurious.
This suggests that some publication ID numbers have been erroneously assigned to several publications, which we can easily test:
nudge_database <- decompose(nudge, nudge_schema)
nudge_title_relation <- records(nudge_database)$title
nudge_pid_duplicates <- unique(nudge_title_relation$publication_id[
duplicated(nudge_title_relation$publication_id)
])
knitr::kable(subset(nudge_title_relation, publication_id %in% nudge_pid_duplicates))
title | publication_id | reference | year | |
---|---|---|---|---|
44 | Enhanced active choice: A new method to motivate behavior change | 95 | Keller et al. (2011) | 2011 |
130 | Nudging product choices: The effect of position change on snack bar choice | 95 | Keller et al. (2015) | 2015 |
We’d also expect reference
to determine the publication,
but this is also not the case: it needs additional information, such as
binary_outcome
.
This means that there are publications that share a reference:
nudge_reference_duplicates <- unique(nudge_title_relation$reference[
duplicated(nudge_title_relation$reference)
])
knitr::kable(subset(nudge_title_relation, reference %in% nudge_reference_duplicates))
title | publication_id | reference | year | |
---|---|---|---|---|
214 | Nudge vs superbugs: A behavioural economics trial to reduce the overprescribing of antibiotics | 18 | BETA (2018) | 2018 |
399 | Energy labels that make cents | 19 | BETA (2018) | 2018 |
BETA is the Behavioural Economics Team of the Australian Government, so it’s not surprising that they’d have multiple publications/reports per year. Duplicate references are not necessarily an error, but they would be awkward if the references were to be used.
There are other, clearly spurious, relations involving
reference
. One of these combines it with
type_experiment
to determine location
. Since
reference
is a publication-level attribute, and
location
is a study-level attribute, this can’t be
right.
(As far as I’m aware, the publication ID and reference errors mentioned above would not have affected the meta-analysis results.)
The publication ID and the reference clearly have issues, so we need to decide what to do about them. Here, I’ll show what is usually the correct approach, which is to remove the spurious functional dependencies. We’ll look at the other options later.
Specifically, we remove all functional dependencies that we’d consider to be spurious: those where publication ID or reference are part of a multiple-attribute determinant set, and the determinant set isn’t just those two:
nudge_deps_filtered <- nudge_deps[
lengths(detset(nudge_deps)) == 1 |
vapply(
detset(nudge_deps),
\(ds) length(setdiff(ds, c("publication_id", "reference"))) != 1,
logical(1)
)
]
nudge_schema_filtered <- normalise(nudge_deps_filtered, remove_avoidable = TRUE)
show(nudge_schema_filtered)
We now, finally, have the study location in a non-spurious relation. If we want to, we could also add the dependencies we fail to find, due to excluding the sample size attributes from determinant sets:
nudge_deps_size <- discover(nudge[, startsWith(names(nudge), "n_")])
nudge_deps_size
#> 3 functional dependencies
#> 4 attributes: n_study, n_comparison, n_control, n_intervention
#> n_control, n_intervention -> n_comparison
#> n_comparison, n_intervention -> n_control
#> n_comparison, n_control -> n_intervention
nudge_deps_final <- c(nudge_deps_filtered, nudge_deps_size)
nudge_schema_final <- normalise(nudge_deps_final, remove_avoidable = TRUE)
nudge_database_final <- decompose(nudge, nudge_schema_final)
show(nudge_schema_final)
The new schema confirms what we might expect, that the total sample size, and the sizes of control and treatment arms, only have two degrees of freedom: knowing two of them determines the other.
We’d expect this to be because the total sample size is the sum of the other two. However, the functional dependency can’t determine this. Indeed, if we check for this more specific condition, we find that it doesn’t hold:
knitr::kable(unique(subset(
nudge,
n_comparison != n_control + n_intervention,
c(
es_id,
reference,
title,
n_study,
n_comparison,
n_control,
n_intervention
)
)))
es_id | reference | title | n_study | n_comparison | n_control | n_intervention | |
---|---|---|---|---|---|---|---|
237 | 180 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |
302 | 181 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |
Looking through the paper, the arm sample sizes aren’t given explicitly, but the numbers here are consistent with the parts of the study that have two treatments.
Automatic search for functional dependencies cannot check domain-specific constraints like the summation constraint above, but it can give schemas that suggest constraints to check for, such as the two-degrees relation above.
Going back to the final database schema, removing dependencies has
revealed extra information about study locations, via the new
title_type_experiment
relation: studies of the same
experiment type in a publication always have the same location. Looking
at the resulting database shows that this removes many entries of what
would be redundant location information if kept in the study
relation:
show(nudge_database_final)
While this is not a dependency we could expect to hold if more data was collected, it’s a reasonable dependency for the given data set, which won’t be added to.
The additional relation mentioned at the end of the previous section is also notable for what we’ll now discuss: alternative ways to get rid of spurious structure. How can we do this?

- We could remove the offending attributes – for example, publication_id above, since it doesn’t do its intended job of unique publication identification – and re-do the entire process (re-running discover can be expensive; we don’t always want to throw out everything to do with an attribute, e.g. reference -> year, or title -> reference);
- We could add the offending attributes to exclude, to disallow them in determinant sets, and re-do the entire process (same issues as above, except that we keep title -> reference, since an excluded reference can still be a dependant).

Alternatively, we could remove the offending relation schemas themselves. What happens if we remove the offending schemas?
nudge_schema_relfiltered <- nudge_schema[
!grepl("publication_id_", names(nudge_schema), fixed = TRUE) &
!grepl("_publication_id", names(nudge_schema), fixed = TRUE) &
!grepl("reference_", names(nudge_schema), fixed = TRUE) &
!grepl("_reference", names(nudge_schema), fixed = TRUE)
]
Subsetting a database schema also removes any foreign key references involving removed schemas, so the resulting schema is still valid. However, any foreign key reference chains with removed schemas in the middle are broken. Amongst others, in this case we lose the reference between the three relations for the three hierarchical data levels:
show(nudge_schema_relfiltered)
Re-running autoref
has no effect, because the
es_id
section doesn’t contain any key for the
title
section:
identical(autoref(nudge_schema_relfiltered), nudge_schema_relfiltered)
#> [1] TRUE
Because the database subschema including the es_id
relation only has publication_id
for distinguishing
publications, effect sizes for the two publications with the same ID
cannot be uniquely associated with their publication: removing schemas
irreversibly separates two parts of the database schema.
Furthermore, compared to nudge_schema_filtered
, there is
a relation schema missing: directly removing schemas loses the structure
information that results in the title_type_experiment
relation. How does this happen?
The relation isn’t present in the original schema, nudge_schema, because the full set of functional dependencies makes its implied dependency transitive. Specifically, nudge_deps contains functional dependencies equivalent to this small example:
example_fds <- functional_dependency(
list(
list("title", "reference"),
list(c("reference", "type_experiment"), "location"),
list(c("title", "type_experiment"), "location")
),
c("title", "reference", "type_experiment", "location")
)
example_fds
#> 3 functional dependencies
#> 4 attributes: title, reference, type_experiment, location
#> title -> reference
#> reference, type_experiment -> location
#> title, type_experiment -> location
The first two dependencies make the third transitive, so it’s ignored
when constructing the relation schemas: it’s implied, and enforced, by
the title
and reference_type_experiment
relation schemas.
show(normalise(example_fds, ensure_lossless = FALSE))
However, when reference, type_experiment -> location
is removed as spurious,
title, type_experiment -> location
is no longer
transitive, so it now appears as its own relation schema.
normalise(example_fds[-2], ensure_lossless = FALSE)
show(normalise(example_fds[-2], ensure_lossless = FALSE))
When we instead remove the reference_type_experiment
schema, this implicitly also throws away the
title, type_experiment -> location
dependency that it
co-implies. We throw away non-spurious information that we don’t intend
to.
Between the risk of irreversibly breaking the database schema’s structure, and the risk of unintentionally removing additional structure information, I hope it’s clear why directly removing relations should be avoided.
Larger datasets can often have entry errors, without an easy way to remove or deal with them. For this reason, we might be interested in “approximate” functional dependencies, which hold after removing a bounded number of violating records.
Suppose that we normalise nudge
again, without any
manual dependency removal, but allowing approximate dependencies. We
could cheat, knowing that the questionable data example found above
showed there to be two questionable records: one for a duplicated
publication ID, and one for a duplicated reference. Since
nudge
has 447 records, we can get rid of the resulting
questionable dependencies by setting the accuracy to allow two
discrepant records.
The accuracy
argument for discover
expects
a number between zero and one, determining what proportion of a data
frame’s records need to satisfy a given dependency for the algorithm to
consider it valid. By default, it’s equal to one, so only exact
dependencies are returned. If we set it to less than one, we need to use
the slower DFD search algorithm, since approximate dependency search is
not implemented for the default FDHitsSep algorithm.
show(normalise(discover(
nudge,
accuracy = 1 - 2/nrow(nudge),
method = "DFD",
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
)))
Currently decompose
doesn’t account for approximate
dependencies, resulting in invalid databases, so here we just work with
the database schema.
Compare this to the database schema we arrived at manually: notably, this approximate schema allows approximation to be approximately determined by some publication- and study-level attributes, rather than functionally determined at the effect-size level.

Lowering the accuracy results in more dependencies being found, and
so in more relations. The number of relations can get very large. For
example, suppose we instead set the accuracy to 0.99
,
returning any approximate dependencies that hold in at least 443
records.
nudge_approx_database_schema <- discover(
nudge,
accuracy = 0.99,
method = "DFD",
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
) |>
normalise()
show(nudge_approx_database_schema)
This is a little overwhelming, so we’ll use a utility function called
reduce
. This returns only the named “main” relation, and
any relations for which it is a descendant via foreign key references.
Reducing the approximate schema, with the effect size table as the main
table, gives us this set of relations:
show(reduce(nudge_approx_database_schema, "es_id"))
Questionable intermediate relations aside, we can see that there is now a publication-level relation with the publication ID as a key, since it determines the publication attributes once we remove one of the duplicate records we discovered before.
reduce
can also be used on databases, where it considers
each relation with the largest number of records as a main table. It
needs to be used with caution: while the relations it removes are often
spurious, it’s possible to find databases where relations not linked to
the main relation by foreign key references are required to rejoin to
the original data frame, so reduction can remove necessary relations.
Its intent is mostly to make glancing at database plots more
manageable.
The next normal form after third normal form (3NF) is Boyce-Codd normal form (BCNF). Ensuring BCNF is enforced by the database is trickier, as in some cases it can’t be enforced with just relations and foreign key constraints.
However, the package includes an option to convert to enhanced third normal form, also known as LTK form, which can be so enforced. This enhancement is tangential to BCNF, and could also be used to enhance schemas in BCNF.
In brief, the standard normal forms only put constraints on the attributes present in each relation, considered one relation at a time. The enhancement is a constraint on the attributes present in a relation, while considering their presence in other relations. If an attribute in a relation can be removed, and still be determined from that relation by joining it to others, then the attribute is “avoidable”, and can be removed. If the attribute is in any of the relation’s keys, those keys are replaced by keys that use the attributes not being removed. This removes attributes from relations without removing any information from the database as a whole.
For example, we can take this simple example from Chapter 6 of The Theory of Relational Databases, by David Maier:
avoid_deps <- functional_dependency(
list(
list("A", "B"),
list("B", "A"),
list(c("A", "C"), "D"),
list(c("A", "C"), "E"),
list(c("B", "D"), "C")
),
attrs_order = c("A", "B", "C", "D", "E")
)
avoid_deps
#> 5 functional dependencies
#> 5 attributes: A, B, C, D, E
#> A -> B
#> B -> A
#> A, C -> D
#> A, C -> E
#> B, D -> C
avoid_schema <- normalise(avoid_deps)
show(avoid_schema)
Attributes A
and B
are equivalent, since
relation A
has them both as a key. In other words, relation
A
is a simple lookup relation. Because of this, we could
remove B
from relation A_C
, and replace the
key B, D
with A, D
, which is equivalent when
accounting for relation A
.
We can have this removal of avoidable attributes done automatically,
using the remove_avoidable
flag for
normalise
:
avoid_schema_removed <- normalise(
avoid_deps,
remove_avoidable = TRUE
)
show(avoid_schema_removed)
This schema is now in LTK form, with no remaining avoidable
attributes. We could have also removed A
from relation
A_C
instead of B
, so this process may not have
a unique result. The package’s implementation prefers to remove
attributes that appear later in the original relation.
When autodb
, or autoref
, generates foreign
key references for relations decomposed from a single table, it
naturally tends to a structure where there is a single low-level data
relation on the left, and higher-level relations are “downstream” with
respect to references, with the number of records strictly decreasing as
we follow references to higher-level data relations.
The nice thing about this is that, while the schema is created constructively, we can view it as a series of decompositions: start with the original table, split it up into two or more sub-tables, and repeat. Rejoining the database would then do this in reverse.
Unfortunately, not every set of functional dependencies allows this. Consider the following:
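The functional dependencies for this example aren’t shown here; a set with the behaviour described below might look like the following (an assumed reconstruction, chosen to match the relations in the mock-up further down):
# assumed example: a determines b and c, which in turn determine d and e,
# while d and e together determine f
disconnected_deps <- functional_dependency(
  list(
    list("a", "b"),
    list("a", "c"),
    list("b", "d"),
    list("c", "e"),
    list(c("d", "e"), "f")
  ),
  attrs_order = c("a", "b", "c", "d", "e", "f")
)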
Bernstein synthesis, and automatic foreign keys, give the following schema:
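Under the assumed dependencies above, that schema would be produced by:
show(normalise(disconnected_deps))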
The schema is not connected, because d_e
’s key is not
present in any other relation. In terms of a series of decompositions,
we could think of the original table being split into
{a,b,c,d,e}
and {d,e,f}
first, with a
reference between them, and d
and e
being
split off later, destroying the reference.
In short, while Boyce-Codd normal form may be the first normal form that can’t always be satisfied with just relations, if we also want to keep all the relations related to each other with foreign key references for consistency, then third normal form also can’t always be satisfied.
One way to solve this, which is a planned addition, is to allow virtual relations, also known as views. These are relations whose data is not stored, because it’s generated from other relations that do store their data (real relations).
Usually, views are used to allow easy “look-up” of commonly-queried aggregates. Here, we could use them to maintain the connected schema structure. Here is a mock-up of what the resulting schema might look like:
if (requireNamespace("DiagrammeR", quietly = TRUE)) DiagrammeR::grViz("digraph {
rankdir = \"LR\"
node [shape=plaintext];
\"a\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">a</TD></TR>
<TR><TD PORT=\"TO_a\">a</TD><TD PORT=\"FROM_a\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\"></TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\"></TD></TR>
</TABLE>>];
\"b\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">b</TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\"></TD></TR>
</TABLE>>];
\"c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">c</TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\"></TD></TR>
</TABLE>>];
\"d_e\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">d_e</TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_f\">f</TD><TD PORT=\"FROM_f\"></TD></TR>
</TABLE>>];
\"view\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
<TR><TD>e</TD></TR>
<TR><TD>f</TD></TR>
</TABLE>>];
\"a\":FROM_b -> \"b\":TO_b;
\"a\":FROM_c -> \"c\":TO_c;
\"view\":TITLE -> \"a\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view\":TITLE -> \"d_e\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
}")
Note that virtual relations have different rules to real relations:

- view does have a as a key, but this key can be inferred from the real relations it’s a merging / joining of, so it does not need to be stated explicitly.
- view contains {a, b}, but does not introduce a redundant instance of {a} -> b in addition to the one in relation a.

This addition of virtual relations allows databases to always be connected as a tree. This is useful for visualisation, and for planning rejoins, since those can always be done by merging tables along foreign key references.
We could explicitly show all of the decomposition steps as virtual relations, but it’s not clear that it’s worth the bother:
if (requireNamespace("DiagrammeR", quietly = TRUE)) DiagrammeR::grViz("digraph {
rankdir = \"LR\"
node [shape=plaintext];
\"a\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">a</TD></TR>
<TR><TD PORT=\"TO_a\">a</TD><TD PORT=\"FROM_a\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\"></TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\"></TD></TR>
</TABLE>>];
\"b\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">b</TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\"></TD></TR>
</TABLE>>];
\"c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">c</TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\"></TD></TR>
</TABLE>>];
\"d_e\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">d_e</TD></TR>
<TR><TD PORT=\"TO_d\">d</TD><TD PORT=\"FROM_d\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_e\">e</TD><TD PORT=\"FROM_e\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_f\">f</TD><TD PORT=\"FROM_f\"></TD></TR>
</TABLE>>];
\"view\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
<TR><TD>e</TD></TR>
<TR><TD>f</TD></TR>
</TABLE>>];
\"view2\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
<TR><TD>e</TD></TR>
</TABLE>>];
\"view3\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"1\">view</TD></TR>
<TR><TD>a</TD></TR>
<TR><TD>b</TD></TR>
<TR><TD>c</TD></TR>
<TR><TD>d</TD></TR>
</TABLE>>];
\"a\":FROM_b -> \"b\":TO_b;
\"a\":FROM_c -> \"c\":TO_c;
\"view\":TITLE -> \"view2\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view\":TITLE -> \"d_e\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view2\":TITLE -> \"view3\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view2\":TITLE -> \"c\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view3\":TITLE -> \"a\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view3\":TITLE -> \"b\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
}")
Another use of the virtual relations mentioned above is that they make BCNF always achievable. For example, take the following standard example of a set of functional dependencies that cannot be expressed in BCNF using just relations:
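The functional dependencies themselves aren’t shown above; the standard example being referred to is presumably {a, b} -> c together with c -> a. Assuming that, we can reconstruct the schema object used in the code below:
# assumed reconstruction of the standard BCNF counter-example
bcnf_example_deps <- functional_dependency(
  list(
    list(c("a", "b"), "c"),
    list("c", "a")
  ),
  attrs_order = c("a", "b", "c")
)
schema <- normalise(bcnf_example_deps)
show(schema)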
The above schema is in third normal form. However, it is not in the higher Boyce-Codd normal form (BCNF), because c -> a is present in relation a_b, but not in terms of a key. This results in some redundancy, but also means that the two representations of c -> a can hold sets of data that aren’t mutually coherent:
db <- schema |>
create() |>
insert(data.frame(c = 1:2, a = 1:2), relations = "c") |>
insert(data.frame(a = 1:2, b = 1L, c = 2:1), relations = "a_b")
knitr::kable(records(db)$a_b)
a | b | c |
---|---|---|
1 | 1 | 2 |
2 | 1 | 1 |
knitr::kable(records(db)$c)
c | a |
---|---|
1 | 1 |
2 | 2 |
The standard way to fix this schema to satisfy BCNF is by splitting
relation a_b
on the violated functional dependency, namely
c -> a
:
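A sketch of what that split might look like, built manually under the same assumptions as above, with relation and attribute names matching the real relations in the mock-up further down:
# assumed reconstruction: split a_b on c -> a, leaving relations c and b_c
bcnf_split_schema <- database_schema(
  relation_schema(
    list(
      c = list(c("a", "c"), list("c")),
      b_c = list(c("b", "c"), list(c("b", "c")))
    ),
    c("a", "b", "c")
  ),
  list(list("b_c", "c", "c", "c"))
)
show(bcnf_split_schema)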
The resulting schema is now in BCNF, but we have a different problem:
the dependency {a, b} -> c
has disappeared! Fixing this
can’t be done with just real relations, but we can solve it by adding a
virtual relation. Here is a mock-up of what the resulting schema might
look like:
if (requireNamespace("DiagrammeR", quietly = TRUE)) DiagrammeR::grViz("digraph {
rankdir = \"LR\"
node [shape=plaintext];
\"c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">c</TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_a\">a</TD><TD PORT=\"FROM_a\"></TD></TR>
</TABLE>>];
\"b_c\" [label = <
<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">b_c</TD></TR>
<TR><TD PORT=\"TO_b\">b</TD><TD PORT=\"FROM_b\" BGCOLOR=\"black\"></TD></TR>
<TR><TD PORT=\"TO_c\">c</TD><TD PORT=\"FROM_c\" BGCOLOR=\"black\"></TD></TR>
</TABLE>>];
\"view\" [label = <
<TABLE BGCOLOR=\"lightgrey\" BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"4\">
<TR><TD PORT=\"TITLE\" COLSPAN=\"2\">view</TD></TR>
<TR><TD>a</TD><TD BGCOLOR=\"black\"></TD></TR>
<TR><TD>b</TD><TD BGCOLOR=\"black\"></TD></TR>
<TR><TD>c</TD><TD></TD></TR>
</TABLE>>];
\"b_c\":FROM_c -> \"c\":TO_c;
\"view\":TITLE -> \"b_c\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
\"view\":TITLE -> \"c\":TITLE [arrowhead=\"empty\" style=\"dashed\"];
}")
Some features to note:

- The key of view is required, since it can’t be inferred from the real relations. This is how we preserve {a, b} -> c when decomposing to BCNF.
- view references both of the real relations, to show that it’s the result of combining both of them. This means that view “references” relation c both directly, and via b_c. For real relations, the direct reference would be removed as being transitive, but in this case both give necessary information. If virtual relations get implemented, virtual-to-real “references” would be displayed differently, to make this clear.
- view contains both a and c, as a_b did before. Unlike for a_b, this is not an extra appearance of c -> a, so is not a violation of BCNF. This is because, since view is virtual, its a and c values are exactly the same as those in relation c, so there is no incoherency possible.

This addresses how to decompose data with NA values. See the “Handling missing values” section for details.
Relational theory is based on data being given as relations, which
cannot have duplicate records. However, it’s expected that an R user
might pass in a data frame with duplicates: for example, R comes with
some datasets, such as iris
, that have duplicates if we
don’t include the row names.
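For example:
# iris has duplicate rows once row names are ignored
any(duplicated(iris))
#> [1] TRUE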
At the moment, these duplicates are kept when searching for dependencies. This can affect results for approximate dependencies, since it changes how many records must be removed for a given dependency to be satisfied. However, the duplicates are still removed when the data frame is decomposed into a database. I’m not yet certain whether these duplicates are best handled by removing them before beginning the search, or by simply returning an error if there are duplicate records.
Strictly speaking, autodb
does not search for functional
dependencies, because these can’t account for missing values. Instead,
it searches for a weaker variant, which has been called a
literal functional dependency (LFD) in the literature.
The functional dependency X -> Y
holds if, for any
two possible records whose values in X
are equal,
their values in Y
are also equal. This requires
attribute values to take the equality operator, ==
in R. Additionally, they must take it under binary logic: whether two
values are equal can only be true or false, not missing. This disallows
the presence of missing values (NA
and NaN
),
since testing for equality with a missing value returns a missing
result.
The literal functional dependency X -> Y
holds, or
the functional dependency X -> Y
holds literally, if,
for any two possible records whose values in X
are
identical, their values in Y
are also
identical, where values are identical if both are non-missing
and equal, or both are missing. In other words, missing values are
treated as not equal to non-missing values, but as equal to each other.
This requires attribute classes to take the identity operator,
identical()
with default optional parameter values in R.
Again, they must take it under binary logic, but the identity operator
cannot return non-binary results anyway: In R, for example,
identical()
can only return TRUE
or
FALSE
, never NA
. There is, therefore, no
constraint on missing values in attribute classes, or on attribute
classes in general.
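The difference is easy to see directly in R (a quick illustration, not package code):
NA == NA          # equality with a missing value is missing
#> [1] NA
identical(NA, NA) # identity treats two missing values as the same
#> [1] TRUE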
LFDs are more generic than standard functional dependencies: since practically every class takes the identity operator, they make essentially no assumptions about the attribute classes present in a data set.
There are other FD variants that handle missing values, but, unlike most of them, LFDs still satisfy Armstrong’s axioms, allowing them to be used in normalisation in the same way as normal FDs. For example, they still respect transitivity: if X -> Y and Y -> Z literally, then X -> Z literally. This allows us to construct a database schema with Bernstein synthesis as usual, with LFDs instead of FDs, and still get something coherent.
However, ignoring the special status of missing values in this way ignores important structural information. In the relational model, and ideally in relational databases, we want to have no missing values at all, partly to avoid awkward questions about what trinary logic means with respect to filtering or joining tables. For example, take the following data frame:
df_nas <- data.frame(
patient = c(1L, 2L, 3L, 4L),
trial_entry_date = as.Date(c("2022/05/02", "2022/06/06", "2022/04/01", "2022/03/19")),
trial_exit_date = as.Date(c(NA, NA, "2022/10/07", NA))
)
knitr::kable(df_nas)
patient | trial_entry_date | trial_exit_date |
---|---|---|
1 | 2022-05-02 | NA |
2 | 2022-06-06 | NA |
3 | 2022-04-01 | 2022-10-07 |
4 | 2022-03-19 | NA |
autodb
currently treats NA
as just another
value, which results in the initial data frame not being split in the
database schema:
show(autodb(df_nas))
In this case, a missing exit date represents no exit date, i.e. that the patient is still in the trial, and we would prefer to move exit information to a separate relation, containing only patients with an exit date:
ideal_db <- decompose(
df_nas,
database_schema(
relation_schema(
list(
patient = list(c("patient", "trial_entry_date"), list("patient")),
patient_exit = list(c("patient", "trial_exit_date"), list("patient"))
),
names(df_nas)
),
list(list("patient_exit", "patient", "patient", "patient"))
)
)
records(ideal_db)$patient_exit <- subset(records(ideal_db)$patient_exit, !is.na(trial_exit_date))
show(ideal_db)
This makes more of the data’s structure apparent, and enforceable, in
the database structure, since it makes it clear that the exit date is
optional. However, removing missing values in this way isn’t handled by
autodb
itself, so it has to be done by manually creating
the schema and inserting/removing records, as shown above.
Sometimes, the structure for some attributes depends on whether other attributes are missing. Here are two possible reasons:
For example, the data below records values, each of which can be either a single point value, or an interval with an associated distribution, which may have some distribution parameters. This format is common when listing model parameters.
df_options <- data.frame(
id = 1:20,
value = c(2.3, 2.3, 5.7, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
lower_bound = c(NA_real_, NA_real_, NA_real_, 2.4, 0, 1, 0, 5.6, 2.4, 5.3, 5.3, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 5.6, 2.4),
upper_bound = c(NA_real_, NA_real_, NA_real_, 7.1, 10, 10, 13.1, 25.8, 10, 13.1, 10, 25.8, 25.8, 25.8, 25.8,13.1, 13.1, 25.8, 25.8, 25.8),
interval_distribution = factor(c(NA, NA, NA, "uniform", "uniform", "uniform", "uniform", "uniform", "Beta", "Beta", "Beta", "Beta", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "PERT", "PERT", "PERT", "PERT")),
param1 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 2, 2, 2.1, 2, 2, 2, 1, 2, 2),
param2 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 2, 2, 2, 1, 1, 1, NA, NA, NA, NA)
)
knitr::kable(df_options)
id | value | lower_bound | upper_bound | interval_distribution | param1 | param2 |
---|---|---|---|---|---|---|
1 | 2.3 | NA | NA | NA | NA | NA |
2 | 2.3 | NA | NA | NA | NA | NA |
3 | 5.7 | NA | NA | NA | NA | NA |
4 | NA | 2.4 | 7.1 | uniform | NA | NA |
5 | NA | 0.0 | 10.0 | uniform | NA | NA |
6 | NA | 1.0 | 10.0 | uniform | NA | NA |
7 | NA | 0.0 | 13.1 | uniform | NA | NA |
8 | NA | 5.6 | 25.8 | uniform | NA | NA |
9 | NA | 2.4 | 10.0 | Beta | 1.0 | 1 |
10 | NA | 5.3 | 13.1 | Beta | 1.0 | 2 |
11 | NA | 5.3 | 10.0 | Beta | 1.0 | 2 |
12 | NA | 2.4 | 25.8 | Beta | 2.0 | 2 |
13 | NA | 2.4 | 25.8 | Kumaraswamy | 2.0 | 2 |
14 | NA | 2.4 | 25.8 | Kumaraswamy | 2.1 | 1 |
15 | NA | 2.4 | 25.8 | Kumaraswamy | 2.0 | 1 |
16 | NA | 2.4 | 13.1 | Kumaraswamy | 2.0 | 1 |
17 | NA | 2.4 | 13.1 | PERT | 2.0 | NA |
18 | NA | 2.4 | 25.8 | PERT | 1.0 | NA |
19 | NA | 5.6 | 25.8 | PERT | 2.0 | NA |
20 | NA | 2.4 | 25.8 | PERT | 2.0 | NA |
autodb
has no way to know about these relationships,
because it doesn’t treat missing values as a special case, as described
above. The resulting schema therefore ignores them:
db_options <- autodb(df_options)
show(db_options)
However, from looking at the data ourselves, we can see a clearly-implied structure:
As a general hack for finding these sorts of relationships, I like taking each attribute with missing values, and adding a companion attribute that states whether the value is present or missing:
df_options_presence <- df_options[vapply(df_options, anyNA, logical(1))]
df_options_presence[] <- lapply(df_options_presence, Negate(is.na))
names(df_options_presence) <- paste0(names(df_options_presence), "_present")
df_options_with_presence <- cbind(df_options, df_options_presence)
This is not always practical, because adding columns rapidly increases the running time for the dependency search. However, when it’s practical, it helps to make the structure more apparent in the database schema:
db_options_with_presence <- autodb(df_options_with_presence)
show(db_options_with_presence)
Some of the new relations just show that an attribute determines its
associated presence attribute, which is always true. Of more interest is
the value_present
relation, which shows that the values,
bounds, and interval distributions inform the presence of each
other:
knitr::kable(records(db_options_with_presence)$value_present)
value_present | lower_bound_present | upper_bound_present | interval_distribution_present | |
---|---|---|---|---|
1 | TRUE | FALSE | FALSE | FALSE |
4 | FALSE | TRUE | TRUE | TRUE |
In the interval_distribution
relation, we can also see
how the interval distribution determines how many parameters are
required:
knitr::kable(records(db_options_with_presence)$interval_distribution)
interval_distribution | value_present | param1_present | param2_present | |
---|---|---|---|---|
1 | NA | TRUE | FALSE | FALSE |
4 | uniform | FALSE | FALSE | FALSE |
9 | Beta | FALSE | TRUE | TRUE |
13 | Kumaraswamy | FALSE | TRUE | TRUE |
17 | PERT | FALSE | TRUE | FALSE |
This shows additional structure, specifically an existence constraint:
if param2
is present, then param1
is also
present. If we notice a situation like this, we can attempt further
investigation by splitting the data, and passing each part into
autodb
separately. Splitting on whether param1
is present, we get the following database for records where it is
absent:
db_options_with_presence_p1absent <- autodb(subset(
df_options_with_presence,
!param1_present
))
show(db_options_with_presence_p1absent)
This only contains records for point estimates, or intervals with a
uniform distribution, so the distribution value is now equivalent to its
presence/absence, or that of the value and bounds. Due to the existence
constraint mentioned above, param2
is constant, since it’s
always absent:
knitr::kable(records(db_options_with_presence_p1absent)$constants)
param1 | param2 | param1_present | param2_present |
---|---|---|---|
NA | NA | FALSE | FALSE |
We get the following database for records where param1
is present:
show(autodb(subset(
df_options_with_presence,
param1_present
)))
In this case, whether param2
is present is determined by
the distribution used, as we’d expect.
Since this data splitting is currently a manual process, there are practical limits to how far we can take this. In the future, I’d like to add an option to automate it, but this could greatly increase the required run time.
Automatic data normalisation, such as done by autodb
, is
useful, but it isn’t magic. This vignette covers some common issues that
this approach can’t address.
Since the process is agnostic to what kind of data is stored in each
attribute, it can’t find any constraints that depend on the actual data
values. A simple case is that, for example, it won’t state whether an
attribute’s values are within a given range, since that would require
making use of the knowledge that the attribute has a number class, such
as integer
or numeric
.
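Such constraints have to be checked separately. For example, a one-line range check for ChickWeight that discover has no way to suggest:
# a domain-specific check: weights are positive, times are non-negative
all(ChickWeight$weight > 0 & ChickWeight$Time >= 0)
#> [1] TRUE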
Another example of the process being agnostic to data values is that it can’t make any inference based on attribute classes. A common example of this is when we have entities of a given type referring to each other.
For example, suppose we have a dataset of publication citations,
where each row contains the citer and citee IDs, plus supplementary
information about both publications. The resulting database given by
autodb
has the following schema, after giving the relations
appropriate names:
show(database_schema(
relation_schema(
list(
citation = list(c("citer_id", "citee_id"), list(c("citer_id", "citee_id"))),
citer = list(c("citer_id", "citer_title", "citer_author", "citer_year"), list("citer_id")),
citee = list(c("citee_id", "citee_title", "citee_author", "citee_year"), list("citee_id"))
),
c(
"citer_id", "citer_title", "citer_author", "citer_year",
"citee_id", "citee_title", "citee_author", "citee_year"
)
),
list(
list("citation", "citer_id", "citer", "citer_id"),
list("citation", "citee_id", "citee", "citee_id")
)
))
Of course, citers and citees are both publications of the same type, so they shouldn’t have separate relations:
show(database_schema(
relation_schema(
list(
citation = list(c("citer", "citee"), list(c("citer", "citee"))),
publication = list(c("id", "title", "author", "year"), list("id"))
),
c("citer", "citee", "id", "title", "author", "year")
),
list(
list("citation", "citer", "publication", "id"),
list("citation", "citee", "publication", "id")
)
))
We make this improvement because citer_id
and
citee_id
values are, semantically, the same class of
object, and the same goes for authors, titles, etc. Semantic types are
not something that can be inferred by just looking at the data
classes.
If we don’t account for this semantic identity, the separate citer and citee information relations in the original schema can both hold information for the same publication. If their information is the same, then there is unwanted redundancy. If their information is different, there is a data incoherency not prevented by the schema. Neither situation is desirable.
Currently, the only way to account for this semantic identity is to create or modify the schema manually, as above.
merge.data.frame
Rejoining databases, and checking relations satisfy foreign key
constraints, is done using merge.data.frame
. This means
that data classes that don’t work properly in merge
aren’t
guaranteed to work properly in autodb
.
Any such issues come from how certain data classes are handled during
merges in the base language, so they are issues with R, rather than with
autodb
, and I have no plans to fix them. If
autodb
seems to have odd failures, check that used data
classes behave correctly in merges.
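One quick way to do such a check is to merge a small data frame of the relevant class with itself and confirm the rows match up as expected (a sketch; the column names are arbitrary):
# a single row merged with itself on its key should give exactly one row
x <- data.frame(
  k = as.POSIXct("2024-01-01 12:00:00", tz = "UTC"),
  v = 1
)
merge(x, x, by = "k")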
For example, in older versions of R, the built-in POSIXct date/time
class didn’t have values merged properly, because the merge didn’t
account for differences in time zone / daylight saving time. This would
result in, for example, the nycflights13::weather
data set
violating the foreign key constraints of its own discovered schema,
since one foreign key used a POSIXct attribute.
A more complex example, that still applies and probably always will,
is a merge where two attributes being merged on have different classes.
In general, this is allowed: since autodb
is written for R,
a dynamically-typed language, it follows SQLite in not constraining the
user much when it comes to data classes in schemas. For primitive
classes, R’s class coercion usually makes things work as you’d
expect.
However, merging a factor attribute with a non-factor, non-character attribute should be handled with caution if the latter contains values not in the factor’s level set.
For example, we can define these data frames:
df_badmerge_int <- cbind(
expand.grid(
a = c(NA, 0L, 1L),
b = c(NA, FALSE, TRUE)
),
row = 1:9
)
df_badmerge_factor <- df_badmerge_int
df_badmerge_factor$a <- as.factor(df_badmerge_factor$a)
knitr::kable(df_badmerge_int)
a | b | row |
---|---|---|
NA | NA | 1 |
0 | NA | 2 |
1 | NA | 3 |
NA | FALSE | 4 |
0 | FALSE | 5 |
1 | FALSE | 6 |
NA | TRUE | 7 |
0 | TRUE | 8 |
1 | TRUE | 9 |
df_badmerge_logical <- df_badmerge_int
df_badmerge_logical$a <- as.logical(df_badmerge_logical$a)
names(df_badmerge_logical)[[3]] <- "row2"
knitr::kable(df_badmerge_logical)
a | b | row2 |
---|---|---|
NA | NA | 1 |
FALSE | NA | 2 |
TRUE | NA | 3 |
NA | FALSE | 4 |
FALSE | FALSE | 5 |
TRUE | FALSE | 6 |
NA | TRUE | 7 |
FALSE | TRUE | 8 |
TRUE | TRUE | 9 |
We can then merge the data frame with logical a
with the
other two, keeping the row
attributes to track which
records were merged.
Whichever other data frame we merge with, the two sets of
a
values have different classes, so R does coercion. When
merging with just a
, this gives the result we’d expect, for
both other data frames and regardless of merge order. For the integer
version, the logical values are coerced to integers:
knitr::kable(merge(
df_badmerge_int[, c("a", "row")],
df_badmerge_logical[, c("a", "row2")]
))
a | row | row2 |
---|---|---|
0 | 2 | 2 |
0 | 2 | 5 |
0 | 2 | 8 |
0 | 5 | 2 |
0 | 5 | 5 |
0 | 5 | 8 |
0 | 8 | 2 |
0 | 8 | 5 |
0 | 8 | 8 |
1 | 6 | 6 |
1 | 6 | 3 |
1 | 6 | 9 |
1 | 3 | 6 |
1 | 3 | 3 |
1 | 3 | 9 |
1 | 9 | 6 |
1 | 9 | 3 |
1 | 9 | 9 |
NA | 1 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 7 | 1 |
NA | 7 | 7 |
NA | 7 | 4 |
NA | 4 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
knitr::kable(merge(
df_badmerge_logical[, c("a", "row2")],
df_badmerge_int[, c("a", "row")]
))
a | row2 | row |
---|---|---|
FALSE | 2 | 2 |
FALSE | 2 | 5 |
FALSE | 2 | 8 |
FALSE | 5 | 2 |
FALSE | 5 | 5 |
FALSE | 5 | 8 |
FALSE | 8 | 2 |
FALSE | 8 | 5 |
FALSE | 8 | 8 |
TRUE | 6 | 6 |
TRUE | 6 | 3 |
TRUE | 6 | 9 |
TRUE | 3 | 6 |
TRUE | 3 | 3 |
TRUE | 3 | 9 |
TRUE | 9 | 6 |
TRUE | 9 | 3 |
TRUE | 9 | 9 |
NA | 1 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 7 | 1 |
NA | 7 | 7 |
NA | 7 | 4 |
NA | 4 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
For the factor version, the logical values are coerced to factors,
but they don’t match any of the given levels, so they all become
NA
:
knitr::kable(merge(
df_badmerge_factor[, c("a", "row")],
df_badmerge_logical[, c("a", "row2")]
))
a | row | row2 |
---|---|---|
NA | 7 | 7 |
NA | 7 | 4 |
NA | 7 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
NA | 4 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 1 | 1 |
knitr::kable(merge(
df_badmerge_logical[, c("a", "row2")],
df_badmerge_factor[, c("a", "row")]
))
a | row2 | row |
---|---|---|
NA | 7 | 7 |
NA | 7 | 4 |
NA | 7 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
NA | 4 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 1 | 1 |
However, we see unexpected behaviour with the factor version, when
also merging on another attribute, b
: the merge result now
depends on the input order. With the factor version first, the result is
similar to before:
knitr::kable(merge(
df_badmerge_factor,
df_badmerge_logical
))
#> Warning in `[<-.factor`(`*tmp*`, ri, value = c(NA, FALSE, TRUE, NA, FALSE, :
#> invalid factor level, NA generated
a | b | row | row2 |
---|---|---|---|
NA | FALSE | 4 | 4 |
NA | FALSE | 4 | 5 |
NA | FALSE | 4 | 6 |
NA | NA | 1 | 1 |
NA | NA | 1 | 2 |
NA | NA | 1 | 3 |
NA | TRUE | 7 | 7 |
NA | TRUE | 7 | 8 |
NA | TRUE | 7 | 9 |
With the logical version first, however, only the logical
a
values that are NA
before coercion are kept,
rather than all of them:
knitr::kable(merge(
df_badmerge_logical,
df_badmerge_factor
))
a | b | row2 | row |
---|---|---|---|
NA | FALSE | 4 | 4 |
NA | NA | 1 | 1 |
NA | TRUE | 7 | 7 |
In autodb
, this could be a problem if two relations in a
foreign key constraint contain the same attribute, but one of them has
it store factors in records, due to separate insertions. Letting an
attribute’s class vary across relations should, therefore, be done with
caution.
Bernstein’s synthesis is guaranteed to minimise the number of relations created for a given set of functional dependencies, and removing avoidable attributes can reduce the number of attributes in those relations. However, there can still be redundant keys. For example, we can take the following set of functional dependencies:
fds_redkey <- functional_dependency(
list(
list("a", "b"),
list("d", "c"),
list(c("b", "d"), "a"),
list("a", "c"),
list(c("b", "c"), "d")
),
letters[1:4]
)
fds_redkey
#> 5 functional dependencies
#> 4 attributes: a, b, c, d
#> a -> b
#> d -> c
#> b, d -> a
#> a -> c
#> b, c -> d
Normalising gives the following relations:
schema_redkey <- normalise(fds_redkey, remove_avoidable = TRUE)
show(schema_redkey)
These relations have some redundancy: relation a
implies
{b, d} -> c
, but relation d
implies that
{d} -> c
. This isn’t resolved by removing avoidable
attributes, because d
still needs to be in relation
a
: we just need to remove {b, d}
as a key.
However, this is resolved if we instead use this set of functional
dependencies, which is equivalent to the previous set:
fds_redkey_fix <- functional_dependency(
list(
list("a", "b"),
list("d", "c"),
list(c("b", "c"), "a"),
list("a", "d")
),
letters[1:4]
)
fds_redkey_fix
#> 4 functional dependencies
#> 4 attributes: a, b, c, d
#> a -> b
#> d -> c
#> b, c -> a
#> a -> d
schema_redkey_fix <- normalise(fds_redkey_fix, remove_avoidable = TRUE)
show(schema_redkey_fix)
Unfortunately, there’s no way in the package to find better sets like this.
Normal forms only refer to one relation at a time: referring to a database as being in a given normal form just means that each of its relations is in that normal form individually. This doesn’t address the addition of foreign key references, but, even if the references are sensible, schemas that are in higher normal forms may not be ones that we’d want to use.
For example, take this database schema, whose relation schemas are in third normal form:
dup_db <- autodb(ChickWeight)
show(dup_db)
If we now create copies of these relations, with the intention that copies always contain the same data, then all relations are still in third normal form, and so we’d say this database is also in third normal form:
show(dup_db[c(1, 1, 2, 2, 2)])
However, no one would claim that this is a good database design,
since there is clearly a large amount of data redundancy. Higher normal
forms would not change this. In fact, since the copies do not reference
each other, it is also trivial to insert data to make this database
incoherent. For example, Chick
and Chick.1
could contain different diet assignments.