library(SCDB)
#>
#> Attaching package: 'SCDB'
#> The following object is masked from 'package:base':
#>
#> nrow
The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data.
In practice, this is done by labeling each record in the data with three additional fields:
- checksum: An md5 hash of information stored the record
- from_ts: the timestamp at which the record was introduced to the dataset
- until_ts: the timestamp at which the record was removed from the dataset
This strategy of time versioning is often called “type 2” history (Kimball and Ross 2013).
Note that identical records may be removed and introduced more than once; for example, in a table of names and addresses, a person may change their address (or name) back to a previous value.
The SCDB package provides the function update_snapshot
to handle the insertion and deactivation of records using this strategy.
It further includes several functions to improve the Quality of life for
working with database data.
A simple example of usage is shown below.
For this example, we
use a temporary, on-disk SQLite database. Note that
get_connection
tries to establish connection using
DBI::dbConnect
with as few additional arguments as
possible. Different drivers may require authentication which can be read
from a configuration file.1
Our example data is datasets::mtcars
reduced to only two
columns: row names converted to a column car
, and
hp
if (!exists("conn")) conn <- get_connection()
example_data <- dplyr::tbl(conn, DBI::Id(table = "example_data"))
example_data
#> # Source: table<`example_data`> [?? x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Mazda RX4 110
#> 2 Mazda RX4 Wag 110
#> 3 Datsun 710 93
#> 4 Hornet 4 Drive 110
#> 5 Hornet Sportabout 175
#> 6 Valiant 105
#> 7 Duster 360 245
#> 8 Merc 240D 62
#> 9 Merc 230 95
#> 10 Merc 280 123
#> # ℹ more rows
Imagine on Day 1, in this case January 1st, 2020, our currently
available data is the first three records of the example_data. We then
store this data in a table mtcars
:
data <- head(example_data, 3)
update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-01 11:00:00"))
#> NULL
We can then access out data using the get_table
function, and include information on data validity period using
include_slice_info = TRUE
:
get_table(conn, "mtcars")
#> # Source: SQL [3 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 110
#> 3 Mazda RX4 Wag 110
get_table(conn, "mtcars", include_slice_info = TRUE)
#> # Source: SQL [3 x 5]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp checksum from_ts until_ts
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Datsun 710 93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 11:0… NA
#> 2 Mazda RX4 110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 11:0… NA
#> 3 Mazda RX4 Wag 110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 11:0… NA
Note that where e.g. dplyr::tbl
requires a more exact
specification of the table identity
(tbl(conn, DBI::Id(table = "mtcars"))
),
get_table
will parse any character to a
DBI::Id
object input using SCDB::id
.
The following day, the current data is now the first five rows of our example data. We then store this data in the database using update_snapshot:
# Let's say that the next day, our data set is now the first 5 of our example data
data <- head(example_data, 5)
update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-02 12:00:00"))
#> NULL
We can again use the get_table
function to see the
latest available data, including time-keeping with
include_slice_info = TRUE
:
get_table(conn, "mtcars")
#> # Source: SQL [5 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 110
#> 3 Mazda RX4 Wag 110
#> 4 Hornet 4 Drive 110
#> 5 Hornet Sportabout 175
get_table(conn, "mtcars", include_slice_info = TRUE)
#> # Source: SQL [5 x 5]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp checksum from_ts until_ts
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Datsun 710 93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … NA
#> 2 Mazda RX4 110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … NA
#> 3 Mazda RX4 Wag 110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … NA
#> 4 Hornet 4 Drive 110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … NA
#> 5 Hornet Sportabout 175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … NA
Since our data is time-versioned, we can recover the data from the day before
get_table(conn, "mtcars", slice_ts = "2020-01-01 11:00:00")
#> # Source: SQL [3 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 110
#> 3 Mazda RX4 Wag 110
On day 3, we imagine that we have the same 5 records, but one of them is altered
data <- head(example_data, 5) |>
dplyr::mutate(hp = ifelse(car == "Mazda RX4", hp / 2, hp))
update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-03 10:00:00"))
#> NULL
We can again access our data using the get_table
function and see that the currently available data (with the changed hp
value for Mazda RX4)
get_table(conn, "mtcars")
#> # Source: SQL [5 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 Wag 110
#> 3 Hornet 4 Drive 110
#> 4 Hornet Sportabout 175
#> 5 Mazda RX4 55
Finally, using slice_ts = NULL
, the full history (and
time-keeping information) is returned:
get_table(conn, "mtcars", slice_ts = NULL)
#> # Source: table<`main`.`mtcars`> [6 x 5]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp checksum from_ts until_ts
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Datsun 710 93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … NA
#> 2 Mazda RX4 110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … 2020-01…
#> 3 Mazda RX4 Wag 110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … NA
#> 4 Hornet 4 Drive 110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … NA
#> 5 Hornet Sportabout 175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … NA
#> 6 Mazda RX4 55 1232f78f7befb3a765b91176eaacdbb0 2020-01-03 … NA