Skip to contents
library(diseasy)
#> Loading required package: diseasystore
#> 
#> Attaching package: 'diseasy'
#> The following object is masked from 'package:diseasystore':
#> 
#>     diseasyoption

Introduction

Let us begin by considering a general SEIR model with KK and LL consecutive EE and II states respectively, which are governed by the rates rer_e and rir_i respectively.

Let us further assume that we have a incidence signal1, I*(t)I^*(t), which we would like our model to match.

The general approach is to consider the derivatives of the incidence and link these to the states of the model.

Initialising EI states

The EIEI states of the SEIR model should match the most recent developments of the incidence.

For this purpose, we assume that signal occurs when exiting I1I_1 in the model.

That is, we assume I*=riI1I^* = r_i I_1.

If we take the equation for I1I_1 and multiply by rir_i we obtain.

ridI1dt=rireEKri2I1 r_i\frac{d I_1}{d t} = r_i r_e E_K - r_i^2 I_1 \Rightarrow dI*dt=rireEKriI* \frac{d I^*}{d t} = r_i r_e E_K - r_i I^*

If we take the second derivative, we find: d2I*dt2=riredEKdtridIldt \frac{d^2I^*}{d t^2} = r_i r_e \frac{d E_K}{d t} - r_i \frac{d I_l}{d t} From here, we can inject dEKdt\frac{d E_K}{d t} from the SEIR equations which in turn relates to Ek1E_{k-1}. d2I*dt2=rire(reEK1reEK)ridIldt \frac{d^2I^*}{d t^2} = r_i r_e \left(r_e E_{K-1} - r_e E_K\right) - r_i \frac{d I_l}{d t}

This process can be iterated through the derivatives until all EKE_K states are expressed in terms of I*(t)I^*(t) and its derivatives.

In this case, we can relate the EkE_k states to the rates and derivatives of the signal in a simple form:

ri[reEKre2EK1reK1E2reKE1]=M¯¯K[I*dI*dtdKI*dtKdK+1I*dtK+1] r_i \begin{bmatrix} r_e E_K \\ r_e^2 E_{K-1} \\ \dots \\ r_e^{K-1} E_2 \\ r_e^K E_1 \\ \end{bmatrix} = \overline{\overline{M}}_K \cdot \begin{bmatrix} I^* \\ \frac{d I^*}{d t} \\ \dots \\ \frac{d^K I^*}{d t^K} \\ \frac{d^{K+1}I^*}{d t^{K+1}} \end{bmatrix}

The matrix MK¯¯\overline{\overline{M_K}} can be computed via a simple recursion.

To see why, start the equation for the derivative of I*(t)I^*(t): dI*dt=rireEKriI* \frac{d I^*}{d t} = r_i r_e E_K - r_i I^*

And separating the EkE_k and I*I* terms: rireEK=riI*+dI*dt r_i r_e E_K = r_i I^* + \frac{d I^*}{d t}

In the above formulation, this corresponds to the matrix M¯¯1=[ri1]\overline{\overline{M}}_1 = \begin{bmatrix}r_i & 1\end{bmatrix}.

When taking the second derivative, we obtain:

d2I*dt2=riredEKdtridI*dt \frac{d^2I^*}{d t^2} = r_i r_e \frac{d E_K}{d t}- r_i \frac{d I^*}{d t} \Rightarrow d2I*dt2=rire(reEK1reEK)ridI*dt \frac{d^2I^*}{d t^2} = r_i r_e \left(r_e E_{K-1} - r_e E_K\right) - r_i \frac{d I^*}{d t} \Rightarrow d2I*dt2=rire2Ek1re(ridI*dt+I*)ridI*dt \frac{d^2I^*}{d t^2} = r_i r_e^2 E_{k-1} - r_e\left(r_i \frac{d I^*}{d t} + I^*\right)- r_i \frac{d I^*}{d t}

And separating the EkE_k and I*I* terms: rire2Ek1=re(ridI*dt+I*)+ridI*dt+d2I*dt2 r_i r_e^2 E_{k-1} = r_e\left(r_i \frac{d I^*}{d t} + I^*\right) + r_i \frac{d I^*}{d t} + \frac{d^2I^*}{d t^2}

Which, in the matrix formulation corresponds to the sum of rem¯1r_e \overline{m}_1 and the shifted m¯1\overline{m}_1, where m¯1\overline{m}_1 is the row vector of M¯1¯\overline{\overline{M}_1}.

That is, we can express the second derivative as: m¯2=re[m¯10]+[0m¯1] \overline{m}_2 = r_e \begin{bmatrix}\overline{m}_1 & 0 \end{bmatrix}+ \begin{bmatrix}0 & \overline{m}_1\end{bmatrix} =[rirere0]+[0ri1] = \begin{bmatrix}r_i r_e & r_e & 0 \end{bmatrix} + \begin{bmatrix}0 & r_i & 1 \end{bmatrix} =[rirere+ri1] = \begin{bmatrix}r_i r_e & r_e + r_i & 1 \end{bmatrix}

Which, combined with m¯1\overline{m}_1 yields the two level system:

M¯¯2=[m11m120m21m22m23] \overline{\overline{M}}_2 = \begin{bmatrix}m_{11} & m_{12} & 0 \\ m_{21} & m_{22} & m_{23} \end{bmatrix} =[ri10rirere+ri1] = \begin{bmatrix} r_i & 1 & 0 \\ r_i r_e & r_e + r_i & 1 \end{bmatrix}

In general, the rows can be computed by recursion: m¯k=re[m¯k10]+[0m¯k1]m¯¯1=[ri1]. \overline{m}_{k} = r_e \begin{bmatrix} \overline{m}_{k-1} & 0 \end{bmatrix} + \begin{bmatrix} 0 & \overline{m}_{k-1} \end{bmatrix}\quad \overline{\overline{m}}_1 = \begin{bmatrix}r_i & 1\end{bmatrix}.

The algorithmic implementation of the recursion is then:

K <- 4
ri <- 0.9
re <- 0.8

M <- matrix(rep(0, K * (K + 1)), nrow = K) # Pre-allocate
active_row <- c(ri, 1)

for (k in seq(K)) {
  if (k > 1) active_row <- c(0, active_row) + re * c(active_row, 0)

  M[k, seq(k + 1)] <- active_row
}

M
#>        [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.9000 1.00 0.00  0.0    0
#> [2,] 0.7200 1.70 1.00  0.0    0
#> [3,] 0.5760 2.08 2.50  1.0    0
#> [4,] 0.4608 2.24 4.08  3.3    1

Since we assume that the signal I*I^* only relates to I1I_1, then we can determine the Il>1I_{l>1} states by evaluating the signal at I*(t(l1)/ri)I^*(t - (l - 1) / r_i).

Initialising SR states

The SRSR states of the SEIR model should capture both the short and the long term developments of the incidence signal we want to match. If a lot of infections have happened previously, we expect a larger proportion of the population to be in the RR states.

We again assume I*=riI1I^* = r_i I_1.

We can modify the SEIR equation to take this signal as a forcing function with no EE states and one less II state (no I1I_1 state).

The equations are as normal expect for the following changes: I=I*ri+l=2LIl I = \frac{I^*}{r_i} + \sum_{l=2}^L I_l

dI2dt=I*riI2 \frac{d I_2}{d t} = I^* - r_i I_2 If we start this system at a time where there are no new infections, we can initialize Il=0I_l = 0, and run the simulation forward to estimate the SS and RR populations at the point of interest.

Testing the methods

We test the initialisation methods on a data set generated using the same ?DiseasyModelOdeSeir model template that we are using in this vignette.

Simple SEIR example data

For the first example, we use the SEIR model output where we know the parameters of the model used to generate the data.

To begin, we configure the observables module to use this data set and to use all available data.

# Connect to a database
obs <- DiseasyObservables$new(
  diseasystore = DiseasystoreSeirExample,
  conn = DBI::dbConnect(duckdb::duckdb())
)

obs$set_study_period( # Use all available data
  start_date = obs$ds$min_start_date,
  end_date = obs$ds$max_end_date
)

The data set contains different data for the infected to test our initialisation method against.

The data are:

  • “n_infected”: The true number of infected in the model, measured as the number of people transitioning out of the I1 state at any given date.
  • “n_positive_simple”: A realisation of the number of test-positives in the model - using a 65 % probability of testing.
  • “n_positive”: A realisation of the number of test-positives in the model - using a overall 65 % probability of testing in conjunction with a reduced probability of testing during weekends.
model_data <- c("n_infected", "n_positive_simple", "n_positive") |>
  purrr::map(\(observable) {
    obs$get_observation(
      observable = observable,
      stratification = rlang::quos(age_group)
    )
  }) |>
  purrr::reduce(~ dplyr::full_join(.x, .y, by = c("date", "age_group"))) |>
  dplyr::mutate("variant" = "WT", .after = "age_group")

model_data
#> # A tibble: 450 × 6
#>    date       age_group variant n_infected n_positive_simple n_positive
#>    <date>     <chr>     <chr>        <dbl>             <dbl>      <dbl>
#>  1 2020-01-03 30-59     WT            64.2                37         41
#>  2 2020-01-04 30-59     WT           136.                 85         64
#>  3 2020-01-05 30-59     WT           179.                107         79
#>  4 2020-01-06 30-59     WT           203.                136        138
#>  5 2020-01-07 30-59     WT           221.                137        144
#>  6 2020-01-08 30-59     WT           239.                159        169
#>  7 2020-01-09 30-59     WT           257.                170        181
#>  8 2020-01-10 30-59     WT           278.                188        196
#>  9 2020-01-11 30-59     WT           301.                203        159
#> 10 2020-01-12 30-59     WT           326.                199        143
#> # ℹ 440 more rows
# Visualise the example data
ggplot2::ggplot(model_data) +
  ggplot2::geom_line(
    ggplot2::aes(x = date, y = n_infected, color = "Infected"),
    linewidth = 1
  ) +
  ggplot2::geom_point(
    ggplot2::aes(x = date, y = n_positive, color = "Test positive (realistic)")
  ) +
  ggplot2::geom_line(
    ggplot2::aes(
      x = date, y = n_positive_simple,
      color = "Test positive (simple)"
    ),
    linewidth = 1
  ) +
  ggplot2::facet_wrap(~ age_group) +
  ggplot2::ylab("Test positive / Infected") +
  ggplot2::scale_color_manual(
    values = c(
      "Infected"                  = "deepskyblue3",
      "Test positive (simple)"    = "orange",
      "Test positive (realistic)" = "seagreen"
    )
  ) +
  ggplot2::labs(colour = "Model output")

Plots of the example data bundled with diseasy.

These different levels of detail allows us to test the initialisation from incidence data in different cases.

The method relies on having incidence data, so we scale the model outputs by the population size. We do this by creating a “synthetic” observable in the observables module.

The simplest cases is using the “n_infected” signal which directly tracks the I1 state in the model. While the most realistic case is the “n_positive” signal which has some real life inspired noise patterns.

source <- "n_positive"


if (source == "n_infected") {
  mapping <- \(n_infected, n_population) n_infected / (n_population)
} else if (source == "n_positive_simple") {
  mapping <- \(n_positive_simple, n_population) {
    n_positive_simple / (n_population * 0.65)
  }
} else if (source == "n_positive") {
  mapping <- \(n_positive, n_population) n_positive / (n_population * 0.65)
}

obs$define_synthetic_observable("incidence", mapping)

incidence_data <- obs$get_observation(
  observable = "incidence",
  stratification = rlang::quos(age_group)
) |>
  dplyr::mutate("source" = !!source)

Correctly specified model

In any case, we first need to define the model that should initialise using the incidence data. We here use the model configuration used to generate the data to test the best case scenario:

# Set the point in time to initialise from
obs$set_last_queryable_date(obs$start_date + lubridate::days(45))

generate_model <- function(K, L, M, rE = 1 / 2.1, rI = 1 / 4.5) {

  # Define the activity for the scenario
  act <- DiseasyActivity$new()
  act$set_contact_basis(contact_basis = contact_basis$DK)
  act$set_activity_units(dk_activity_units)
  act$change_activity(date = as.Date("1900-01-01"), opening = "baseline")

  # Create a SEIR model to initialise
  m <- DiseasyModelOdeSeir$new(
    observables = obs,
    activity = act,
    parameters = list(
      "compartment_structure" = c("E" = K, "I" = L, "R" = M),
      "age_cuts_lower" = c(0, 30, 60),
      "overall_infection_risk" = 0.025,
      "disease_progression_rates" = c("E" = rE, "I" = rI)
    )
  )

  return(m)
}

m <- generate_model(2L, 1L, 1L) # Use the configuration from example data

This method relies on fitting a polynomial to the latest period, so we here visualise this fitting.

# Extract the most recent signal
poly_fit_data <- incidence_data |>
  dplyr::mutate(
    "t" = as.numeric(.data$date - !!obs$last_queryable_date, units = "days")
  )

poly_fit_projection <- poly_fit_data |>
  dplyr::group_by(.data$age_group) |>
  dplyr::group_modify(
    ~ {
      poly_fit <- lm(
        incidence ~ poly(t, m$parameters$incidence_polynomial_order, raw = TRUE),
        data = dplyr::filter(
          .x,
          .data$t <= 0,
          .data$t >= - m$parameters$incidence_polynomial_training_length
        )
      )

      tibble::tibble(
        "t" = .x$t,
        "incidence" = predict(poly_fit, data.frame("t" = t))
      ) |>
        dplyr::mutate("date" = .x$date)
    }
  )

incidence_data |>
  ggplot2::ggplot(ggplot2::aes(x = date, y = incidence)) +
    ggplot2::geom_point(
      color = switch(
        source,
        "n_infected"        = "deepskyblue3",
        "n_positive_simple" = "orange",
        "n_positive"        = "seagreen"
      )
    ) +
    ggplot2::geom_line(data = poly_fit_projection, color = "red", linewidth = 1) +
    ggplot2::geom_vline(
      xintercept = obs$last_queryable_date,
      linetype = 2, linewidth = 1, color = "red"
    ) +
    ggplot2::geom_vline(
      xintercept = obs$last_queryable_date -
        m$parameters$incidence_polynomial_training_length,
      linetype = 2, linewidth = 1, color = "red"
    ) +
    ggplot2::ylim(
      0,
      incidence_data |>
        dplyr::pull("incidence") |>
        max() * 1.1
    ) +
    ggplot2::facet_wrap(~ age_group) +
    ggplot2::theme_bw()
#> Warning: Removed 169 rows containing missing values or values outside the scale range
#> (`geom_line()`).

Fitting a polynomial to the incidence data to estimate derivatives.

We can now use the $initialise_state_vector() method to infer the initial state vector.

psi <- m$initialise_state_vector(incidence_data)

psi
#> # A tibble: 15 × 4
#>    variant age_group state initial_condition
#>    <chr>   <chr>     <chr>             <dbl>
#>  1 All     00-29     E1             0.00147 
#>  2 All     00-29     E2             0.00141 
#>  3 All     00-29     I1             0.00484 
#>  4 All     00-29     R1             0.0132  
#>  5 All     30-59     E1             0.00123 
#>  6 All     30-59     E2             0.00118 
#>  7 All     30-59     I1             0.00410 
#>  8 All     30-59     R1             0.0112  
#>  9 All     60+       E1             0.000413
#> 10 All     60+       E2             0.000398
#> 11 All     60+       I1             0.00139 
#> 12 All     60+       R1             0.00379 
#> 13 NA      00-29     S              0.337   
#> 14 NA      30-59     S              0.368   
#> 15 NA      60+       S              0.250

And we now test the initial conditions by solving the model using these starting conditions.

get_prediction <- function(
    model,
    psi,
    signal = "incidence"
  ) {

  # Integrate the ODE system with deSolve
  sol <- deSolve::ode(
    y = psi$initial_condition,
    times = seq(0, 50),
    func = model$rhs
  )

  # Improve the names of the output
  colnames(sol) <- c(
    "time",
    psi |>
      tidyr::unite("label", "variant", "age_group", "state", sep = "/") |>
      dplyr::pull("label")
  )

  # Convert to long format
  sol_long <- sol |>
    as.data.frame() |>
    tidyr::pivot_longer(
      !"time",
      names_sep = "/",
      names_to = c("variant", "age_group", "state")
    )

  # Extract the solution
  if (signal == "incidence") {
    out <- sol_long |>
      dplyr::filter(.data$state == "I1") |>
      dplyr::select(!"state") |>
      dplyr::mutate(
        "date" = .data$time + model$observables$last_queryable_date,
        "incidence_model" = model$parameters$disease_progression_rates[["I"]] *
            model$parameters$compartment_structure[["I"]] *
            .data$value
      )

    # Convert to incidence
    proportion <- model$activity$map_population(
      model$parameters$age_cuts_lower
    ) |>
      dplyr::mutate(
        "age_group" = diseasystore::age_labels(
          model$parameters$age_cuts_lower
        )[.data$age_group_out]
      ) |>
      dplyr::summarise(
        "proportion" = sum(.data$proportion),
        .by = "age_group"
      )

    out <- out |>
      dplyr::left_join(proportion, by = "age_group") |>
      dplyr::mutate("incidence_model" = .data$incidence_model / .data$proportion) |>
      dplyr::select(!"proportion")

  } else if (signal == "prevalence") {
    out <- sol_long |>
      dplyr::filter(startsWith(.data$state, "I1")) |>
      dplyr::select(!"state") |>
      dplyr::summarise(
        "date" = dplyr::first(.data$time) + model$observables$last_queryable_date,
        "incidence_model" = sum(.data$value),
        .by = "time"
      )
  }

  # Add the model configuration
  out <- out |>
    dplyr::mutate(
      "model_configuration" = paste0(
        names(model$parameters$compartment_structure),
        model$parameters$compartment_structure,
        collapse = ""
      )
    )

  return(out)
}

prediction <- get_prediction(model = m, psi = psi)

prediction
#> # A tibble: 153 × 7
#>     time variant age_group   value date       incidence_model
#>    <dbl> <chr>   <chr>       <dbl> <date>               <dbl>
#>  1     0 All     00-29     0.00484 2020-02-17         0.00302
#>  2     0 All     30-59     0.00410 2020-02-17         0.00236
#>  3     0 All     60+       0.00139 2020-02-17         0.00120
#>  4     1 All     00-29     0.00512 2020-02-18         0.00319
#>  5     1 All     30-59     0.00433 2020-02-18         0.00249
#>  6     1 All     60+       0.00147 2020-02-18         0.00127
#>  7     2 All     00-29     0.00546 2020-02-19         0.00341
#>  8     2 All     30-59     0.00462 2020-02-19         0.00266
#>  9     2 All     60+       0.00157 2020-02-19         0.00135
#> 10     3 All     00-29     0.00586 2020-02-20         0.00365
#> # ℹ 143 more rows
#> # ℹ 1 more variable: model_configuration <chr>
ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = prediction,
    ggplot2::aes(x = date, y = incidence_model, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = obs$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a correctly specified model to initialise the SEIR model matches the true data near-perfectly.

Misspecified model

Correctly matching the model is the best case scenario. However, we can also the method for a couple of cases where the model is misspecified.

Note that we at this state does not modify the parameters of the model to match the development, we only estimate the initial state vector.

Once we include model fitting, the discrepancy between the data and model predictions may diminish.

Misspecified model in periods of increasing infections
models <- list(
  generate_model(2L, 1L, 1L),
  generate_model(1L, 1L, 1L),
  generate_model(2L, 2L, 2L),
  generate_model(3L, 2L, 5L)
)

predictions <- models |>
  purrr::map(\(m) {
    get_prediction(model = m, psi = m$initialise_state_vector(incidence_data))
  }) |>
  purrr::reduce(rbind)


ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = predictions,
    ggplot2::aes(x = date, y = incidence_model, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = models[[1]]$observables$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a misspecified model to initialise the SEIR model can match the true data well when infections are increasing.

Misspecified model in periods of decreasing infections

models <- list(
  generate_model(2L, 1L, 1L),
  generate_model(1L, 1L, 1L),
  generate_model(2L, 2L, 2L),
  generate_model(3L, 2L, 5L)
)

# Update the last queryable date to later starting point
purrr::walk(models, \(m) {
  m$observables$set_last_queryable_date(
    m$observables$ds$max_end_date - lubridate::days(45)
  )
})

predictions <- models |>
  purrr::map(\(m) {
    get_prediction(model = m, psi = m$initialise_state_vector(incidence_data))
  }) |>
  purrr::reduce(rbind)


ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = predictions,
    ggplot2::aes(x = date, y = incidence_model, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = models[[1]]$observables$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a misspecified model to initialise the SEIR model can match the true data well when infections are decreasing.