
Transfer REDCap data to a database and use it in R without exceeding available memory. Works with any DBI-compatible database but is specifically optimized for DuckDB, a fast and portable SQL engine with first-class integration in R/Posit products.
R objects live entirely in memory, which causes problems when the data are larger than available RAM unless you use a specialized framework: sessions slow down and large exports can fail outright. redquack's solution is to transfer records from REDCap to a database in chunks of record IDs, so only one chunk is held in memory at a time, and then let you query the database from R.
From CRAN:
# install.packages("pak")
pak::pak("redquack")Development version:
pak::pak("dylanpieper/redquack")These packages are used in the examples and are not imported by redquack:
pak::pak(c("dplyr", "duckdb", "keyring"))Your REDCap API token allows R to interface with REDCap and should be stored securely. I recommend using the keyring package to store your API token. For example:
keyring::key_set("redcap_token")Data from REDCap is transferred to a database via a DBI connection in chunks of record IDs:
library(redquack)
duckdb <- DBI::dbConnect(duckdb::duckdb(), "redcap.duckdb")
result <- redcap_to_db(
conn = duckdb,
redcap_uri = "https://redcap.example.org/api/",
token = keyring::key_get("redcap_token"),
record_id_name = "record_id",
chunk_size = 1000
# Increase chunk size for systems with more memory (faster transfer)
# Decrease chunk size for memory-constrained systems (slower transfer)
)

The function returns a list with class redcap_transfer_result:

- success: Logical if the transfer was completed with no failed processing
- error_chunks: Vector of chunk numbers that failed processing
- time_s: Numeric value for total seconds to transfer and optimize data
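For example, you can check the result before querying the data. This is a minimal sketch that uses only the fields documented above:

# Confirm the transfer completed; report any chunks that failed
if (!result$success) {
  warning("Failed chunks: ", paste(result$error_chunks, collapse = ", "))
}
result$time_s  # total seconds to transfer and optimize the data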
The database created by redcap_to_db() contains two tables:
- data: Contains all exported REDCap records with optimized column types

  data <- DBI::dbGetQuery(duckdb, "SELECT * FROM data LIMIT 1000")

- log: Contains timestamped logs of the transfer process for troubleshooting

  log <- DBI::dbGetQuery(duckdb, "SELECT * FROM log")

Data is imported as VARCHAR/TEXT for consistent handling across chunks.
For DuckDB, data types are automatically optimized after transfer to improve query performance.
In DuckDB, you can inspect the resulting column types of the data table:
DBI::dbGetQuery(duckdb, "PRAGMA table_info(data)")You can also automatically convert data types in R using readr:
readr::type_convert(data)To optimize query performance with other databases, you must alter the data table manually.
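For example, you could cast individual columns yourself through DBI. This is only a sketch, not redquack's own optimization routine: the age column, its target type, and the PostgreSQL-style syntax are assumptions, and the exact ALTER TABLE syntax varies by backend.

# Sketch: manually cast one column after transfer (PostgreSQL-style syntax)
# "conn" is your DBI connection; treating "age" as an integer is an assumption
DBI::dbExecute(
  conn,
  "ALTER TABLE data ALTER COLUMN age TYPE INTEGER USING age::integer"
)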
Query and collect the data with dplyr:
library(dplyr)
demographics <- tbl(duckdb, "data") |>
filter(is.na(redcap_repeat_instrument)) |>
select(record_id, age, race, sex, gender) |>
collect()

Collecting data into memory only as the last step, after the database has done the filtering and aggregation, can turn a slow operation into a nearly instantaneous one. The following example data is 2,825,092 rows x 397 columns:
system.time(
records <- duckdb |>
tbl("data") |>
collect() |>
group_by(redcap_repeat_instrument) |>
summarize(count = n()) |>
arrange(desc(count))
)
#> user system elapsed
#> 5.048 5.006 6.077
system.time(
records <- duckdb |>
tbl("data") |>
group_by(redcap_repeat_instrument) |>
summarize(count = n()) |>
arrange(desc(count)) |>
collect()
)
#> user system elapsed
#> 0.040 0.015 0.040

You can also write a Parquet file directly from DuckDB and read it with the arrow package. A Parquet file will be about 5 times smaller than a DuckDB file:
DBI::dbExecute(duckdb, "COPY (SELECT * FROM data) TO 'redcap.parquet' (FORMAT PARQUET)")Remember to close the connection when finished:
Remember to close the connection when finished:

DBI::dbDisconnect(duckdb)

While this package is only optimized for DuckDB, I invite collaborators to help optimize it for other databases. For now, I suggest targeting your edits at R/optimize_data_types.R. Feel free to submit a PR and share any other ideas you may have.