Transfer REDCap data to a database and use in R without exceeding available memory. Compatible with all databases but specifically optimized for DuckDB—a fast and portable SQL engine with first-class integration in R/Posit products.
R objects live entirely in memory, which causes problems when the data exceeds available memory and no specialized framework is used.

redquack's solution to this problem is to transfer the data to a database in chunks of record IDs, so only a manageable portion is ever held in memory, and then let you query the database from R.
From CRAN:
# install.packages("pak")
::pak("redquack") pak
Development version:
::pak("dylanpieper/redquack") pak
These packages are used in the examples and are not imported by redquack:
::pak(c("dplyr", "duckdb", "keyring")) pak
Your REDCap API token allows R to interface with REDCap and should be stored securely. I recommend storing it with the keyring package. For example:
keyring::key_set("redcap_token")
Data from REDCap is transferred to a database via a DBI connection in chunks of record IDs:
library(redquack)
duckdb <- DBI::dbConnect(duckdb::duckdb(), "redcap.duckdb")

result <- redcap_to_db(
  conn = duckdb,
  redcap_uri = "https://redcap.example.org/api/",
  token = keyring::key_get("redcap_token"),
  record_id_name = "record_id",
  chunk_size = 1000
  # Increase chunk size for memory-efficient systems (faster)
  # Decrease chunk size for memory-constrained systems (slower)
)
The function returns a list with class redcap_transfer_result:

- success: Logical if the transfer was completed with no failed processing
- error_chunks: Vector of chunk numbers that failed processing
- time_s: Numeric value for total seconds to transfer and optimize data
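As a quick sketch of how you might use this return value (the messages below are my own, not output from redquack):

if (result$success) {
  message("Transfer finished in ", round(result$time_s), " seconds")
} else {
  message("Chunks that failed: ", paste(result$error_chunks, collapse = ", "))
}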
The database created by redcap_to_db() contains two tables:
- data: Contains all exported REDCap records with optimized column types

data <- DBI::dbGetQuery(duckdb, "SELECT * FROM data LIMIT 1000")
- log: Contains timestamped logs of the transfer process for troubleshooting

log <- DBI::dbGetQuery(duckdb, "SELECT * FROM log")
Data is imported as VARCHAR/TEXT for consistent handling across chunks.
For DuckDB, data types are automatically optimized after transfer to improve query performance. You can inspect the optimized column types with a PRAGMA query:
::dbGetQuery(duckdb, "PRAGMA table_info(data)") DBI
You can also automatically convert data types in R using readr:
readr::type_convert(data)
To optimize query performance with other databases, you must alter the data table manually.
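For example, here is a minimal sketch for PostgreSQL; the connection object pg, the column age, and the target type are assumptions for illustration, not part of redquack:

# Hypothetical example: cast a VARCHAR/TEXT column to INTEGER in PostgreSQL
DBI::dbExecute(
  pg,
  "ALTER TABLE data ALTER COLUMN age TYPE INTEGER USING age::INTEGER"
)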
Query and collect the data with dplyr:
library(dplyr)
demographics <- tbl(duckdb, "data") |>
  filter(is.na(redcap_repeat_instrument)) |>
  select(record_id, age, race, sex, gender) |>
  collect()
Collecting your data into memory only in the last step, after the database has done the work, can make a slow process nearly instantaneous. The following example data is 2,825,092 rows x 397 columns:
system.time(
  records <- duckdb |>
    tbl("data") |>
    collect() |>
    group_by(redcap_repeat_instrument) |>
    summarize(count = n()) |>
    arrange(desc(count))
)
#>    user  system elapsed
#>   5.048   5.006   6.077
system.time(
  records <- duckdb |>
    tbl("data") |>
    group_by(redcap_repeat_instrument) |>
    summarize(count = n()) |>
    arrange(desc(count)) |>
    collect()
)
#>    user  system elapsed
#>   0.040   0.015   0.040
You can also write a Parquet file directly from DuckDB and read it with arrow. A Parquet file will be about 5 times smaller than a DuckDB file:
DBI::dbExecute(duckdb, "COPY (SELECT * FROM data) TO 'redcap.parquet' (FORMAT PARQUET)")
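As a sketch of reading that file back with arrow (arrow is not imported by redquack; the column names are carried over from the earlier dplyr example):

library(arrow)
library(dplyr)

demographics <- open_dataset("redcap.parquet") |>
  filter(is.na(redcap_repeat_instrument)) |>
  select(record_id, age, race, sex, gender) |>
  collect()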
Remember to close the connection when finished:
DBI::dbDisconnect(duckdb)
While this package is only optimized for DuckDB, I invite collaborators to help optimize it for other databases. The pathway I suggest right now is to target your edits in R/optimize_data_types.R. Feel free to submit a PR and share any other ideas you may have.