Available diseasystores
To see the available diseasystores
on your system, you
can use the available_diseasystores()
function.
available_diseasystores()
#> [1] "DiseasystoreEcdcRespiratoryViruses" "DiseasystoreGoogleCovid19"
This function looks for diseasystores
on the current
search path. By default, this will show the diseasystores
bundled with the base package. If you have extended diseasystore,
either with your own diseasystores or with ones from an external
package, then attaching the package to your search path will allow
them to show up as available.
Note: diseasystores are found if they are defined within packages
named diseasystore* and are of the class DiseasystoreBase.
Each of these diseasystores may have its own vignette that further
details its content, use and/or tips and tricks. This is, for example,
the case with DiseasystoreGoogleCovid19.
Using a diseasystore
To use a diseasystore
we need to first do some
configuration. The diseasystores
are designed to work with
data bases to store the computed features in. Each
diseasystore
may require individual configuration as listed
in its documentation or accompanying vignette.
For this Quick start, we will configure a DiseasystoreGoogleCovid19
to use a local SQLite data base. Ideally, we want to use a faster,
more capable data base to store the features in. The diseasystores
use SCDB in the back end and can use any data base back end supported
by SCDB.
ds <- DiseasystoreGoogleCovid19$new(
target_conn = DBI::dbConnect(RSQLite::SQLite()),
start_date = as.Date("2020-03-01"),
end_date = as.Date("2020-03-15")
)
When we create our new diseasystore
instance, we also
supply start_date
and end_date
arguments.
These are not strictly required, but make getting features for this time
interval simpler.
Once configured, we can query the available features in the
diseasystore:
ds$available_features
#> [1] "n_population" "age_group" "country_id" "country"
#> [5] "region_id" "region" "subregion_id" "subregion"
#> [9] "n_hospital" "n_deaths" "n_positive" "n_icu"
#> [13] "n_ventilator" "min_temperature" "max_temperature"
These features can be retrieved individually (using the start_date
and end_date we specified during creation of ds):
ds$get_feature("n_hospital")
#> # Source: table<`dbplyr_EZ1U7X59H5`> [?? x 5]
#> # Database: sqlite 3.46.0 []
#> key_location key_age_bin n_hospital valid_from valid_until
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 AR 0 NA 18322 18323
#> 2 AR 0 NA 18330 18331
#> 3 AR 0 0 18324 18325
#> 4 AR 0 1 18323 18324
#> 5 AR 0 1 18327 18328
#> # ℹ more rows
Notice that features have associated “key_*” and “valid_from/until”
columns. These are used for one of the primary selling points of
diseasystore, namely automatic aggregation.
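In the output above, valid_from and valid_until are printed as plain numbers (<dbl>). SQLite has no dedicated date type, so the values shown look like day counts since 1970-01-01, which is how R represents a Date internally. Under that assumption, a minimal base R sketch for reading them as dates could be:

```r
# valid_from / valid_until values copied from the first rows of the output above
valid_from  <- c(18322, 18330, 18324)
valid_until <- c(18323, 18331, 18325)

# Assumption: the numbers are days since the Unix epoch (1970-01-01)
validity <- data.frame(
  valid_from  = as.Date(valid_from,  origin = "1970-01-01"),
  valid_until = as.Date(valid_until, origin = "1970-01-01")
)

validity
# The first row spans 2020-03-01 to 2020-03-02, i.e. a single day
```

Each record in this output therefore appears to cover exactly one day, from valid_from up to (but not including) valid_until.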
To get features for other time intervals, we can manually supply
start_date and/or end_date:
ds$get_feature("n_hospital",
start_date = as.Date("2020-03-01"),
end_date = as.Date("2020-03-02"))
#> # Source: table<`dbplyr_oBF7QxMz0h`> [?? x 5]
#> # Database: sqlite 3.46.0 []
#> key_location key_age_bin n_hospital valid_from valid_until
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 AR 0 NA 18322 18323
#> 2 AR 0 1 18323 18324
#> 3 AR 1 0 18323 18324
#> 4 AR 1 1 18322 18323
#> 5 AR 2 1 18322 18323
#> # ℹ more rows
Dynamically expanded
The diseasystore
automatically expands the computed
features.
Say the feature “n_hospital” has been computed between 2020-03-01 and
2020-03-15. In this case, the call
$get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-20"))
only needs to compute the feature between 2020-03-16 and 2020-03-20.
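The bookkeeping behind this can be illustrated with a small base R sketch. It is not the package's implementation; missing_dates() is a hypothetical helper that only shows which part of a requested interval is not yet covered by earlier computations.

```r
# Hypothetical helper: which of the requested dates have not been computed yet?
missing_dates <- function(start_date, end_date, computed) {
  requested <- seq(start_date, end_date, by = "day")
  requested[!(requested %in% computed)]
}

# Dates already computed by an earlier request (2020-03-01 to 2020-03-15)
computed <- seq(as.Date("2020-03-01"), as.Date("2020-03-15"), by = "day")

# A new request that extends the interval to 2020-03-20
to_compute <- missing_dates(as.Date("2020-03-01"), as.Date("2020-03-20"), computed)

range(to_compute)
# Only 2020-03-16 to 2020-03-20 remain to be computed
```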
Time versioned
Through using SCDB as the back end, the features are stored even as
new data becomes available. This way, we get a time-versioned record
of the features provided by diseasystore.
Which version of the features is computed is controlled through the
slice_ts argument. By default, diseasystores use today’s date for
this argument.
The dynamic expansion of the features described above applies only
within a given slice_ts. That is, if a feature has been computed for
a time interval on one slice_ts, diseasystore will recompute the
feature for any other slice_ts.
This way, feature computation can be implemented into continuous integration (requesting features will preserve a history of computed features). Furthermore, post-hoc analysis can be performed by computing features as they would have looked on previous dates.
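To make the role of slice_ts concrete, the sketch below keeps a record of computed dates per slice_ts in a plain list. This is only an illustration under stated assumptions: the list, the slice_ts values and the helper needs_computation() are all hypothetical and do not reflect how diseasystore or SCDB store this information.

```r
# Hypothetical record of which dates were computed on which slice_ts
computed_by_slice <- list(
  "2024-01-01" = seq(as.Date("2020-03-01"), as.Date("2020-03-15"), by = "day")
)

# Hypothetical helper: dates that still need computing for a given slice_ts
needs_computation <- function(slice_ts, dates, computed_by_slice) {
  done <- computed_by_slice[[slice_ts]]
  if (is.null(done)) {
    return(dates)  # nothing has been computed on this slice_ts yet
  }
  dates[!(dates %in% done)]
}

dates <- seq(as.Date("2020-03-01"), as.Date("2020-03-15"), by = "day")

same_slice  <- needs_computation("2024-01-01", dates, computed_by_slice)
other_slice <- needs_computation("2024-01-02", dates, computed_by_slice)

length(same_slice)   # nothing left to compute on the same slice_ts
length(other_slice)  # the full interval is recomputed on another slice_ts
```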
Automatic aggregation
The real strength of diseasystore
comes from its
built-in automatic aggregation.
We saw above that the features come with additional associated “key_*” and “valid_from/until” columns.
This additional information is used to do automatic aggregation
through the $key_join_features()
method (see extending-diseasystore for more
details).
To use this method, you need to provide the observable
that you want to aggregate and the stratification
you want
to apply to the aggregation.
Let’s start with a simple example where we request no stratification
(NULL):
ds$key_join_features(observable = "n_hospital",
stratification = NULL)
#> # A tibble: 15 × 2
#> date n_hospital
#> <date> <dbl>
#> 1 2020-03-01 3
#> 2 2020-03-02 6
#> 3 2020-03-03 5
#> 4 2020-03-04 12
#> 5 2020-03-05 8
#> # ℹ 10 more rows
This gives us the same feature information as
ds$get_feature("n_hospital")
but simplified to give the
observable per day (in this case, the number of people
hospitalised).
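The idea of turning validity intervals into one value per day can be sketched in base R. This is a simplified illustration, not the method used by $key_join_features(): the toy records are made up, and it assumes that a record counts towards every date d with valid_from <= d < valid_until and that values are summed.

```r
# Toy records (made up): each row is valid on dates in [valid_from, valid_until)
records <- data.frame(
  key_location = c("AR", "AR", "AR"),
  n_hospital   = c(1, 2, 3),
  valid_from   = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  valid_until  = as.Date(c("2020-03-02", "2020-03-03", "2020-03-03"))
)

dates <- seq(as.Date("2020-03-01"), as.Date("2020-03-02"), by = "day")

# For each date, sum the observable over the records valid on that date
daily <- data.frame(
  date = dates,
  n_hospital = vapply(dates, function(d) {
    active <- records$valid_from <= d & d < records$valid_until
    sum(records$n_hospital[active])
  }, numeric(1))
)

daily
# 2020-03-01 sums rows 1 and 2; 2020-03-02 sums rows 2 and 3
```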
To specify a level of stratification, we need to supply a list of
quosures (see help("topic-quosure", package = "rlang")).
ds$key_join_features(observable = "n_hospital",
stratification = rlang::quos(country_id))
#> # A tibble: 15 × 3
#> date country_id n_hospital
#> <date> <chr> <dbl>
#> 1 2020-03-01 AR 3
#> 2 2020-03-02 AR 6
#> 3 2020-03-03 AR 5
#> 4 2020-03-04 AR 12
#> 5 2020-03-05 AR 8
#> # ℹ 10 more rows
The stratification
argument is very flexible, so we can
supply any valid R expression:
ds$key_join_features(observable = "n_hospital",
stratification = rlang::quos(country_id,
old = age_group == "90+"))
#> # A tibble: 30 × 4
#> date country_id old n_hospital
#> <date> <chr> <int> <dbl>
#> 1 2020-03-01 AR 0 27
#> 2 2020-03-02 AR 0 54
#> 3 2020-03-03 AR 0 45
#> 4 2020-03-04 AR 0 108
#> 5 2020-03-05 AR 0 72
#> # ℹ 25 more rows
Dropping computed features
Sometimes, it is necessary to clear the computed features from the
data base. For this purpose, we provide the drop_diseasystore()
function.
By default, this deletes all stored features in the default
diseasystore schema. The function also takes a pattern argument to
match tables by and a schema argument to specify the schema to delete
from [1].
SCDB::get_tables(ds$target_conn)
#> schema table
#> 1 main ds.google_covid_19_age_group
#> 2 main ds.google_covid_19_index
#> 3 main ds.google_covid_19_hospital
#> 4 main ds.locks
#> 5 main ds.logs
#> 6 temp ds_validities_11946
#> 7 temp SCDB_11946_020
#> 8 temp SCDB_11946_024
#> 9 temp dbplyr_kgPY1bPB7z
#> 10 temp ds_study_dates_11946
#> 11 temp dbplyr_VdisgpRrIA
#> 12 temp dbplyr_5bG7gFZ2RI
#> 13 temp ds_google_covid_19_age_group_11946
#> 14 temp ds_all_combinations_11946
#> 15 temp dbplyr_hByrdeDhSC
#> 16 temp dbplyr_T1j3HbFlCu
#> 17 temp dbplyr_bdClHujZ6V
#> 18 temp SCDB_11946_012
#> 19 temp dbplyr_2zDoz7gvmR
#> 20 temp SCDB_11946_009
#> 21 temp SCDB_11946_017
#> 22 temp dbplyr_EZ1U7X59H5
#> 23 temp dbplyr_PzDsCtpAUT
#> 24 temp dbplyr_oFQJxb20Bt
#> 25 temp SCDB_11946_016
#> 26 temp dbplyr_oBF7QxMz0h
#> 27 temp dbplyr_MWTk61j7Q4
#> 28 temp SCDB_11946_001
#> 29 temp SCDB_11946_008
#> 30 temp ds_google_covid_19_hospital_11946
#> 31 temp ds_google_covid_19_index_11946
#> 32 temp dbplyr_uVs7TnBty9
#> 33 temp SCDB_11946_004
#> 34 temp SCDB_11946_025
drop_diseasystore(conn = ds$target_conn)
SCDB::get_tables(ds$target_conn)
#> schema table
#> 1 temp ds_validities_11946
#> 2 temp SCDB_11946_020
#> 3 temp SCDB_11946_024
#> 4 temp dbplyr_kgPY1bPB7z
#> 5 temp ds_study_dates_11946
#> 6 temp dbplyr_VdisgpRrIA
#> 7 temp dbplyr_5bG7gFZ2RI
#> 8 temp ds_google_covid_19_age_group_11946
#> 9 temp ds_all_combinations_11946
#> 10 temp dbplyr_hByrdeDhSC
#> 11 temp dbplyr_T1j3HbFlCu
#> 12 temp dbplyr_bdClHujZ6V
#> 13 temp SCDB_11946_012
#> 14 temp dbplyr_2zDoz7gvmR
#> 15 temp SCDB_11946_009
#> 16 temp SCDB_11946_017
#> 17 temp dbplyr_EZ1U7X59H5
#> 18 temp dbplyr_PzDsCtpAUT
#> 19 temp dbplyr_oFQJxb20Bt
#> 20 temp SCDB_11946_016
#> 21 temp dbplyr_oBF7QxMz0h
#> 22 temp dbplyr_MWTk61j7Q4
#> 23 temp SCDB_11946_001
#> 24 temp SCDB_11946_008
#> 25 temp ds_google_covid_19_hospital_11946
#> 26 temp ds_google_covid_19_index_11946
#> 27 temp dbplyr_uVs7TnBty9
#> 28 temp SCDB_11946_004
#> 29 temp SCDB_11946_025
diseasystore options
diseasystores
have a number of options available to make
configuration easier. These options all start with “diseasystore.”.
options()[purrr::keep(names(options()), ~ startsWith(., "diseasystore"))]
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.pull
#> [1] TRUE
#>
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.remote_conn
#> [1] "https://api.github.com/repos/EU-ECDC/Respiratory_viruses_weekly_data"
#>
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.source_conn
#> [1] "https://api.github.com/repos/EU-ECDC/Respiratory_viruses_weekly_data"
#>
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.target_conn
#> [1] ""
#>
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.target_schema
#> [1] ""
#>
#> $diseasystore.DiseasystoreGoogleCovid19.n_max
#> [1] 1000
#>
#> $diseasystore.DiseasystoreGoogleCovid19.remote_conn
#> [1] "https://storage.googleapis.com/covid19-open-data/v3/"
#>
#> $diseasystore.DiseasystoreGoogleCovid19.source_conn
#> [1] "https://storage.googleapis.com/covid19-open-data/v3/"
#>
#> $diseasystore.DiseasystoreGoogleCovid19.target_conn
#> [1] ""
#>
#> $diseasystore.DiseasystoreGoogleCovid19.target_schema
#> [1] ""
#>
#> $diseasystore.lock_wait_increment
#> [1] 15
#>
#> $diseasystore.lock_wait_max
#> [1] 1800
#>
#> $diseasystore.source_conn
#> [1] ""
#>
#> $diseasystore.target_conn
#> [1] ""
#>
#> $diseasystore.target_schema
#> [1] "ds"
#>
#> $diseasystore.verbose
#> [1] FALSE
Notice that several options are set as empty strings (""). These are
treated as NULL by diseasystore [2].
Importantly, the options are scoped. Consider the above options for
“source_conn”: looking at the list of options, we find
“diseasystore.source_conn” and
“diseasystore.DiseasystoreGoogleCovid19.source_conn”. The former is a
general setting, while the latter is a specific setting for
DiseasystoreGoogleCovid19. The general setting is used as a fallback
if no specific setting is found.
This allows you to set a general configuration and to override it for specific cases.
To get the option related to a scope, we can use the
diseasyoption()
function.
diseasyoption("source_conn", class = "DiseasystoreGoogleCovid19")
#> [1] "https://storage.googleapis.com/covid19-open-data/v3/"
As we saw in the options, a source_conn option was defined
specifically for DiseasystoreGoogleCovid19.
If we try the same for the hypothetical DiseasystoreDiseaseY, we see
that no value is defined, as we have not yet configured the fallback
value.
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
#> NULL
If we change our general setting for source_conn
and
retry, we see that we get the fallback value.
options("diseasystore.source_conn" = file.path("local", "path"))
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
#> [1] "local/path"
Finally, we can use the .default argument as a final fallback value
in case no option is set for either the general or the specific case.
diseasyoption("non_existent", class = "DiseasystoreDiseaseY", .default = "final fallback")
#> [1] "final fallback"
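The lookup order described in this section (specific option, then general option, then .default, with empty strings treated as unset) can be sketched in base R. The function lookup_scoped_option() and the "demo" prefix are hypothetical and only illustrate the fallback logic; they are not the implementation of diseasyoption().

```r
# Hypothetical re-implementation of the scoped lookup, for illustration only
lookup_scoped_option <- function(option, class = NULL, .default = NULL,
                                 prefix = "demo") {
  # Most specific name first, then the general name
  candidates <- c(
    if (!is.null(class)) paste(prefix, class, option, sep = "."),
    paste(prefix, option, sep = ".")
  )

  for (name in candidates) {
    value <- getOption(name)
    # Unset options and empty strings both count as "not configured"
    if (!is.null(value) && !identical(value, "")) {
      return(value)
    }
  }

  .default
}

# A general setting, a class-specific override, and an empty class-specific value
options(
  "demo.source_conn"        = "general/path",
  "demo.ClassA.source_conn" = "specific/path",
  "demo.ClassB.source_conn" = ""
)

specific <- lookup_scoped_option("source_conn", class = "ClassA")
general  <- lookup_scoped_option("source_conn", class = "ClassB")
fallback <- lookup_scoped_option("non_existent", class = "ClassB",
                                 .default = "final fallback")
```

Here specific resolves to the ClassA value, general falls back to the general setting because the ClassB value is an empty string, and fallback uses .default because neither option exists.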