The Researcher Workbench, available to registered All of Us researchers, provides tools for creating datasets using the All of Us database. Researchers first use the Workbench to create a cohort defined by values of survey variables (e.g., gender, race, age), observations from the electronic health record (e.g., diagnosis codes, lab values), data from wearable devices (e.g., fitbit), genomic data, and/or physical measurements. Data – either predefined or user-specified concept sets – from each of the database tables can then be extracted for the cohort.
The allofus
package takes a programmatic approach to
extracting data from the All of Us database. One advantage to this
approach is that it is easier to document and share the process. This is
especially important for reproducibility and transparency. To
demonstrate, this vignette will walk through some examples of extracting
data from the All of Us database.
Suppose we want a cohort of the female All of Us participants who weren’t born in the US. For this cohort, we want to extract data on their history of diabetes, their A1C lab values, and their use of metformin after joining All of Us.
The first step is extracting gender and country of birth from the
survey data. We can retrieve these variables either by concept code or
concept id. Using the allofus
searchable
codebook, we find that the concept id for gender is 1585838 and for
birthplace is 1586135. We can use the aou_survey
function
to extract these variables from the All of Us database. This function
takes a vector of concept ids or codes, and returns a dataset with one
row per participant and one column per variable. The
question_output
argument allows you to specify the names of
the column in the output dataset. Alternatively, you can specify
“concept_id” or “concept_code” to name the variables according to one of
those options.
library(allofus)
library(tidyverse)
svy_vars <- aou_survey(questions = c(1585838, 1586135), question_output = c("gender", "birthplace"))
#> Warning: No cohort provided.
#> → Pulling survey data for entire All of Us cohort.
Note that the aou_survey
function takes a cohort as an
argument, with the only requirement that it contains a column called
person_id
. If no cohort is provided (e.g., if you are using
survey data to define your cohort), the function will pull data for the
entire All of Us cohort.
Because aou_survey()
defaults to
collect = FALSE
, the query isn’t fully executed yet, and we
can continue to perform analyses, like cross-tabulating the two
variables, on the database.
#> # Source: SQL [?? x 3]
#> # Database: BigQueryConnection
#> gender birthplace n
#> <chr> <chr> <int64>
#> 1 Skip Skip 4449
#> 2 NA NA 97
#> 3 Woman USA 205440
#> 4 Woman Other 40348
#> 5 Man Skip 1484
#> 6 Transgender USA 504
#> 7 PreferNotToAnswer USA 475
#> 8 PreferNotToAnswer Other 112
#> 9 Man USA 131698
#> 10 Man Other 21059
#> # ℹ more rows
To create our cohort, we can subset the data to the participants matching our criteria.
#> # Source: SQL [?? x 5]
#> # Database: BigQueryConnection
#> person_id birthplace gender birthplace_date gender_date
#> <int64> <chr> <chr> <date> <date>
#> 1 xxxxxxx Other Woman 2019-08-xx 2019-08-xx
#> 2 xxxxxxx Other Woman 2018-06-xx 2018-06-xx
#> 3 xxxxxxx Other Woman 2019-05-xx 2019-05-xx
#> 4 xxxxxxx Other Woman 2019-02-xx 2019-02-xx
#> 5 xxxxxxx Other Woman 2018-06-xx 2018-06-xx
#> 6 xxxxxxx Other Woman 2019-03-xx 2019-03-xx
#> 7 xxxxxxx Other Woman 2019-04-xx 2019-04-xx
#> 8 xxxxxxx Other Woman 2019-04-xx 2019-04-xx
#> 9 xxxxxxx Other Woman 2019-05-xx 2019-05-xx
#> 10 xxxxxxx Other Woman 2020-09-xx 2020-09-xx
#> # ℹ more rows
Note that the aou_survey()
function automatically
creates an additional column for every survey variable extracted, which
contains the date the participant answered the question. These are
automatically named by adding “_date” as a suffix to the column
names.
Now that we have our cohort, we want to create a dataset with data from these participants’ medical records. We first want to find the concept ids of the data we want to extract. There are a number of tools we could use to do so:
Athena: a web-based tool for searching the entire OMOP vocabulary. The concept id for Type 2 diabetes mellitus can be found under ID. There are many results for this search, but we want only standard concepts when searching for EHR data.
Data Browser: a tool for exploring data in the All of Us database. Finding the concept id for Type 2 diabetes mellitus requires searching for that concept and then clicking “Sources”:
Researcher Workbench: Creating a concept set can allow you to find a number of concept ids quickly. Add concepts to your concept set, and then review the concept set to see them all.
Here are some concepts we may want to include. Keep in mind that this is an example, and a real analysis would have a more thorough search for relevant concepts.
Concept ID | Concept Name | Domain |
---|---|---|
201826 | Type 2 diabetes mellitus | Condition |
4193704 | Type 2 diabetes mellitus without complication | Condition |
40164929 | metformin hydrochloride 500 MG Oral Tablet | Drug |
40164897 | metformin hydrochloride 1000 MG Oral Tablet | Drug |
3004410 | Hemoglobin A1c/Hemoglobin.total in Blood | Measurement |
3005673 | Hemoglobin A1c/Hemoglobin.total in Blood by HPLC | Measurement |
First, we’ll extract data on diabetes diagnoses. We can use the
aou_concept_set
function to extract data from the All of Us
database. This function takes a cohort, a vector of concepts, and a
domain or set of domains. You can actually extract data from multiple
domains at once – the default is to search all domains. However, it is
helpful to specify the domain(s) when you can, for speed and clarity.
The output =
argument specifies the type of data to
extract. Here, we’ll extract an indicator variable for whether the
participant has a diabetes diagnosis at any point in their medical
record. We’ll call that variable “t2dm”.
t2dm <- aou_concept_set(cohort,
concepts = c(201826, 4193704),
domains = "condition", output = "indicator",
concept_set_name = "t2dm"
)
We can also extract data within a certain time window. For example, if we only want to extract metformin use after joining All of Us, we can specify a start date as the date the participant answered the gender question (on the Basics survey). We’ll count how many times the participant was prescribed metformin after that time point.
metformin <- aou_concept_set(cohort,
concepts = c(40164929, 40164897),
domains = "drug", output = "count",
start_date = "gender_date", concept_set_name = "metformin"
)
For a concept set in the measurement domain, we likely want to set
output = "all"
because we are interested in the values of
the measurements, not whether they had any or how many they had.
a1c <- aou_concept_set(cohort,
concepts = c(3004410, 3005673),
domains = "measurement", output = "all", start_date = "gender_date"
)
The output to this is a dataset with one row per measurement. We can see that the second participant in the output has had 4 A1C measurements since joining All of Us, while the fifth had 1 and the rest had none.
#> # Source: SQL [?? x 9]
#> # Database: BigQueryConnection
#> person_id concept_date concept_id concept_name concept_domain value_as_number
#> <int64> <date> <int64> <chr> <chr> <dbl>
#> 1 xxxxxx1 NA NA NA NA NA
#> 2 xxxxxx2 2022-03-xx 3004410 Hemoglobin … Measurement 5.6
#> 3 xxxxxx2 2021-07-xx 3004410 Hemoglobin … Measurement 5.6
#> 4 xxxxxx2 2021-05-xx 3004410 Hemoglobin … Measurement 5.5
#> 5 xxxxxx2 2020-10-xx 3004410 Hemoglobin … Measurement 6
#> 6 xxxxxx3 NA NA NA NA NA
#> 7 xxxxxx4 NA NA NA NA NA
#> 8 xxxxxx5 2019-04-xx 3005673 Hemoglobin … Measurement 5.9
#> 9 xxxxxx6 NA NA NA NA NA
#> 10 xxxxxx7 NA NA NA NA NA
#> # ℹ more rows
#> # ℹ 3 more variables: value_as_concept_id <int64>, value_source_value <chr>,
#> # gender_date <date>
Suppose we want to compare the EHR data to self-reported survey data
on diabetes diagnoses. Returning again to the aou_survey()
function, we need to first find the concept id for the question asking
about diabetes diagnoses. Wuestions from the health history surveys are
more complex than others like gender and birthplace: there are branching
questions and complex skip patterns for each question, which were also
asked about family members, and groups of diagnoses were asked about
together. The searchable health history codebook is a useful tool for
finding the concept id for these questions. We can search for “diabetes”
and find the concept id for the question asking about type 2 diabetes
diagnoses for the respondent: 43529932. We can also use the Data Browser to find
the same concept id:
Data from the health history surveys is returned as “Yes”, “No”, or “Skip”. An observation will have a value of “No” if the participant answered that someone in their family had the condition, but did not self-identify as having the condition, or if they responded that no one in their family had the condition. If they skipped either part of the question, the value will be “Skip”.
Next, we want to combine these datasets into one. We can do this
using the aou_join
function. This function takes a list of
datasets and joins them together using the participant id. The
aou_join
function also allows you to specify the type of
join (inner, left, right, or full). Here we’ll perform repeated left
joins into the cohort using the reduce()
function from the
purrr
package.
Our combined dataset has columns for the diabetes diagnosis indicator
from the EHR data, the number of metformin prescriptions, the
self-reported diabetes diagnosis from the health history survey and the
date that survey was completed, as well as complete data on the A1C
measurements. Participants with multiple measurements therefore have
multiple rows in the dataset. We may want to do some additional analyses
before bringing the table off the database and into R. For example, we
may want to summarize the A1C measurements as the highest value for each
participant. We can continue to manipulate the data using
dplyr
functions (see
vignette("sql", package = "allofus")
).
final_data <- combined_data %>%
group_by(person_id, birthplace, t2dm, metformin, t2dm_survey) %>%
summarize(max_a1c = max(value_as_number, na.rm = TRUE), .groups = "drop")
This final query may take a few seconds to run – it is the
accumulation of all the queries we’ve run so far (use
show_query()
to see the SQL query that will be run). For
that reason, it is often helpful to print the result of each individual
query to make sure it is what you expect before moving on to the next
step.
#> # Source: SQL [?? x 5]
#> # Database: BigQueryConnection
#> person_id t2dm metformin t2dm_survey max_a1c
#> <int64> <dbl> <dbl> <chr> <dbl>
#> 1 xxxxxxx 0 0 NA NA
#> 2 xxxxxxx 0 0 DontKnow NA
#> 3 xxxxxxx 0 0 NA NA
#> 4 xxxxxxx 0 0 No NA
#> 5 xxxxxxx 0 0 NA NA
#> 6 xxxxxxx 0 0 NA NA
#> 7 xxxxxxx 0 0 NA NA
#> 8 xxxxxxx 0 0 No NA
#> 9 xxxxxxx 0 0 NA NA
#> 10 xxxxxxx 0 0 No 5.4
#> # ℹ more rows