SEIR: Initialising from incidence data • diseasy

library(diseasy)
#> Loading required package: diseasystore
#> 
#> Attaching package: 'diseasy'
#> The following object is masked from 'package:diseasystore':
#> 
#>     diseasyoption

Introduction

Let us begin by considering a general SEIR model with $K$ and $L$ consecutive $E$ and $I$ states respectively, which are governed by the rates $r_e$ and $r_i$ respectively.

SEIR model overview with multiple E and I states
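
For reference, the chain of $E$ and $I$ states described above can be written out as follows. This is a sketch consistent with the equations used in the remainder of this vignette. The symbol $\lambda$ for the force of infection acting on $S$ is introduced here for illustration only and is an assumption, not part of the original text.

$\frac{d E_1}{d t} = \lambda S - r_e E_1$

$\frac{d E_k}{d t} = r_e E_{k-1} - r_e E_k, \quad k = 2, \dots, K$

$\frac{d I_1}{d t} = r_e E_K - r_i I_1$

$\frac{d I_l}{d t} = r_i I_{l-1} - r_i I_l, \quad l = 2, \dots, L$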

Let us further assume that we have an incidence signal¹, $I^*(t)$, which we would like our model to match.

The general approach is to consider the derivatives of the incidence and link these to the states of the model.

Initialising EI states

The $EI$ states of the SEIR model should match the most recent developments of the incidence.

For this purpose, we assume that the signal occurs when exiting $I_1$ in the model.

That is, we assume $I^* = r_i I_1$.

If we take the equation for $I_1$ and multiply by $r_i$, we obtain:

$r_i\frac{d I_1}{d t} = r_i r_e E_K - r_i^2 I_1 \Rightarrow$ $\frac{d I^*}{d t} = r_i r_e E_K - r_i I^*$

If we take the second derivative, we find:

$\frac{d^2I^*}{d t^2} = r_i r_e \frac{d E_K}{d t} - r_i \frac{d I^*}{d t}$

From here, we can inject $\frac{d E_K}{d t}$ from the SEIR equations, which in turn relates to $E_{K-1}$:

$\frac{d^2I^*}{d t^2} = r_i r_e \left(r_e E_{K-1} - r_e E_K\right) - r_i \frac{d I^*}{d t}$

This process can be iterated through the derivatives until all $E_k$ states are expressed in terms of $I^*(t)$ and its derivatives.

In this case, we can relate the $E_k$ states to the rates and derivatives of the signal in a simple form:

$r_i \begin{bmatrix} r_e E_K \\ r_e^2 E_{K-1} \\ \dots \\ r_e^{K-1} E_2 \\ r_e^K E_1 \\ \end{bmatrix} = \overline{\overline{M}}_K \cdot \begin{bmatrix} I^* \\ \frac{d I^*}{d t} \\ \dots \\ \frac{d^{K-1} I^*}{d t^{K-1}} \\ \frac{d^{K}I^*}{d t^{K}} \end{bmatrix}$

The matrix $\overline{\overline{M}}_K$ can be computed via a simple recursion.

To see why, start from the equation for the derivative of $I^*(t)$: $\frac{d I^*}{d t} = r_i r_e E_K - r_i I^*$

Separating the $E_k$ and $I^*$ terms yields: $r_i r_e E_K = r_i I^* + \frac{d I^*}{d t}$

In the above formulation, this corresponds to the matrix $\overline{\overline{M}}_1 = \begin{bmatrix}r_i & 1\end{bmatrix}$ .

When taking the second derivative, we obtain:

$\frac{d^2I^*}{d t^2} = r_i r_e \frac{d E_K}{d t}- r_i \frac{d I^*}{d t} \Rightarrow$ $\frac{d^2I^*}{d t^2} = r_i r_e \left(r_e E_{K-1} - r_e E_K\right) - r_i \frac{d I^*}{d t} \Rightarrow$ $\frac{d^2I^*}{d t^2} = r_i r_e^2 E_{K-1} - r_e\left(r_i I^* + \frac{d I^*}{d t}\right)- r_i \frac{d I^*}{d t}$

Separating the $E_k$ and $I^*$ terms yields: $r_i r_e^2 E_{K-1} = r_e\left(r_i I^* + \frac{d I^*}{d t}\right) + r_i \frac{d I^*}{d t} + \frac{d^2I^*}{d t^2}$

In the matrix formulation, this corresponds to the sum of $r_e \overline{m}_1$ and the shifted $\overline{m}_1$, where $\overline{m}_1$ is the row vector of $\overline{\overline{M}}_1$.

That is, we can express the row for the second derivative as: $\overline{m}_2 = r_e \begin{bmatrix}\overline{m}_1 & 0 \end{bmatrix}+ \begin{bmatrix}0 & \overline{m}_1\end{bmatrix}$ $= \begin{bmatrix}r_i r_e & r_e & 0 \end{bmatrix} + \begin{bmatrix}0 & r_i & 1 \end{bmatrix}$ $= \begin{bmatrix}r_i r_e & r_e + r_i & 1 \end{bmatrix}$

Combined with $\overline{m}_1$, this yields the two-level system:

$\overline{\overline{M}}_2 = \begin{bmatrix}m_{11} & m_{12} & 0 \\ m_{21} & m_{22} & m_{23} \end{bmatrix}$ $= \begin{bmatrix} r_i & 1 & 0 \\ r_i r_e & r_e + r_i & 1 \end{bmatrix}$

In general, the rows can be computed by recursion: $\overline{m}_{k} = r_e \begin{bmatrix} \overline{m}_{k-1} & 0 \end{bmatrix} + \begin{bmatrix} 0 & \overline{m}_{k-1} \end{bmatrix}, \quad \overline{m}_1 = \begin{bmatrix}r_i & 1\end{bmatrix}.$

The algorithmic implementation of the recursion is then:

K <- 4
ri <- 0.9
re <- 0.8

M <- matrix(rep(0, K * (K + 1)), nrow = K) # Pre-allocate
active_row <- c(ri, 1)

for (k in seq(K)) {
  if (k > 1) active_row <- c(0, active_row) + re * c(active_row, 0)

  M[k, seq(k + 1)] <- active_row
}

M
#>        [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.9000 1.00 0.00  0.0    0
#> [2,] 0.7200 1.70 1.00  0.0    0
#> [3,] 0.5760 2.08 2.50  1.0    0
#> [4,] 0.4608 2.24 4.08  3.3    1
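
As a sanity check of the recursion, the sketch below applies the matrix to a hypothetical, purely exponential signal $I^*(t) = I_0 e^{g t}$, for which every derivative is known in closed form. The growth rate `g` and the level `I0` are illustrative assumptions, not values from the example data. Since each recursion step corresponds to multiplying by $(r_e + g)$, row $k$ applied to this signal should equal $I_0 (r_i + g)(r_e + g)^{k-1}$.

```r
# Rebuild M with the recursion from the text (same K, ri, re as above)
K <- 4
ri <- 0.9
re <- 0.8

M <- matrix(0, nrow = K, ncol = K + 1)
active_row <- c(ri, 1)
for (k in seq_len(K)) {
  if (k > 1) active_row <- c(0, active_row) + re * c(active_row, 0)
  M[k, seq_len(k + 1)] <- active_row
}

# Hypothetical signal (assumption): I*(t) = I0 * exp(g * t) evaluated at t = 0,
# so the n'th derivative is simply g^n * I0
g <- 0.05
I0 <- 0.001
derivatives <- I0 * g^(0:K)

# Left-hand side of the matrix equation: ri * re^k * E_{K - k + 1}
lhs <- as.vector(M %*% derivatives)

# Solve for the E states, ordered E_K, E_{K-1}, ..., E_1
E <- lhs / (ri * re^seq_len(K))
names(E) <- paste0("E", rev(seq_len(K)))

# Closed-form expectation for the exponential signal
expected <- I0 * (ri + g) * (re + g)^(seq_len(K) - 1)
all.equal(lhs, expected)
```

In practice, the derivatives would come from the incidence data rather than from a closed-form expression.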

Since we assume that the signal $I^*$ only relates to $I_1$, we can determine the $I_{l>1}$ states by evaluating the signal at $I^*(t - (l - 1) / r_i)$.
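
A minimal sketch of this delayed evaluation is given below. It assumes that each later state mirrors $I_1 = I^* / r_i$ at the shifted time, i.e. $I_l(t) \approx I^*(t - (l - 1) / r_i) / r_i$. This reading is an assumption made for illustration, and the `signal` function is hypothetical.

```r
ri <- 0.9  # Rate of leaving each I state
L <- 3     # Number of I states

# Hypothetical incidence signal (assumption): slow exponential growth
signal <- function(t) 0.001 * exp(0.05 * t)

t0 <- 0           # Time of initialisation
l <- seq_len(L)   # Index of the I states

# Evaluate the signal with a delay of (l - 1) / ri and convert to a state size
I_states <- signal(t0 - (l - 1) / ri) / ri
names(I_states) <- paste0("I", l)

I_states
```

For a growing signal, the later $I$ states are smaller than $I_1$, since they reflect the lower incidence earlier in time.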

Initialising SR states

The $SR$ states of the SEIR model should capture both the short and the long term developments of the incidence signal we want to match. If a lot of infections have happened previously, we expect a larger proportion of the population to be in the $R$ states.

We again assume $I^* = r_i I_1$ .

We can modify the SEIR equation to take this signal as a forcing function with no $E$ states and one less $I$ state (no $I_1$ state).

The equations are as normal except for the following changes:

$I = \frac{I^*}{r_i} + \sum_{l=2}^L I_l$

$\frac{d I_2}{d t} = I^* - r_i I_2$

If we start this system at a time where there are no new infections, we can initialise $I_l = 0$ and run the simulation forward to estimate the $S$ and $R$ populations at the point of interest.
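
The forward simulation can be sketched with a simple Euler scheme in base R. The example below is illustrative only: it assumes $L = 2$, a single $R$ state without waning, a hypothetical forcing signal, and a hypothetical infection rate `beta` that has not been calibrated to the signal. It is not the implementation used by the package.

```r
# Hypothetical incidence signal I*(t) (assumption): a single epidemic wave
signal <- function(t) 0.002 * exp(-((t - 60) / 20)^2)

ri <- 1 / 4.5  # Rate of leaving each I state
beta <- 0.3    # Hypothetical overall infection rate (assumption)

dt <- 0.1
times <- seq(0, 150, by = dt)

# Start before the wave: everyone susceptible and no one in I_2
state <- c(S = 1, I2 = 0, R = 0)

for (t in times) {
  # Total infected: I = I* / r_i + sum of the remaining I states
  I_total <- signal(t) / ri + state[["I2"]]

  dS <- -beta * state[["S"]] * I_total
  dI2 <- signal(t) - ri * state[["I2"]]  # dI_2/dt = I* - r_i I_2
  dR <- ri * state[["I2"]]

  state <- state + dt * c(dS, dI2, dR)  # Explicit Euler step
}

state
```

Note that $S + I_2 + R$ is not conserved in this sketch, since the $E$ and $I_1$ states are not tracked explicitly. The sum $I_2 + R$, however, equals the accumulated signal.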

Testing the methods

We test the initialisation methods on a data set generated using the same ?DiseasyModelOdeSeir model template that we are using in this vignette.

Simple SEIR example data

For the first example, we use the SEIR model output where we know the parameters of the model used to generate the data.

To begin, we configure the observables module to use this data set and to use all available data.

# Connect to a database
obs <- DiseasyObservables$new(
  diseasystore = DiseasystoreSeirExample,
  conn = DBI::dbConnect(duckdb::duckdb())
)

obs$set_study_period( # Use all available data
  start_date = obs$ds$min_start_date,
  end_date = obs$ds$max_end_date
)

The data set contains several observables for the number of infected, which we can test our initialisation method against.

The data are:

“n_infected”: The true number of infected in the model, measured as the number of people transitioning out of the I1 state at any given date.
“n_positive_simple”: A realisation of the number of test-positives in the model - using a 65 % probability of testing.
“n_positive”: A realisation of the number of test-positives in the model - using an overall 65 % probability of testing in conjunction with a reduced probability of testing during weekends.

model_data <- c("n_infected", "n_positive_simple", "n_positive") |>
  purrr::map(\(observable) {
    obs$get_observation(
      observable = observable,
      stratification = rlang::quos(age_group)
    )
  }) |>
  purrr::reduce(~ dplyr::full_join(.x, .y, by = c("date", "age_group"))) |>
  dplyr::mutate("variant" = "WT", .after = "age_group")

model_data
#> # A tibble: 696 × 6
#>    date       age_group variant n_infected n_positive_simple n_positive
#>    <date>     <chr>     <chr>        <dbl>             <dbl>      <dbl>
#>  1 2020-01-03 00-29     WT            68.0                38         42
#>  2 2020-01-04 00-29     WT           145.                 97         73
#>  3 2020-01-05 00-29     WT           191.                118         87
#>  4 2020-01-06 00-29     WT           218.                146        152
#>  5 2020-01-07 00-29     WT           239.                152        160
#>  6 2020-01-08 00-29     WT           260.                164        176
#>  7 2020-01-09 00-29     WT           283.                188        202
#>  8 2020-01-10 00-29     WT           308.                188        204
#>  9 2020-01-11 00-29     WT           334.                224        170
#> 10 2020-01-12 00-29     WT           363.                227        169
#> # ℹ 686 more rows

# Visualise the example data
ggplot2::ggplot(model_data) +
  ggplot2::geom_line(
    ggplot2::aes(x = date, y = n_infected, color = "Infected"),
    linewidth = 1
  ) +
  ggplot2::geom_point(
    ggplot2::aes(x = date, y = n_positive, color = "Test positive (realistic)")
  ) +
  ggplot2::geom_line(
    ggplot2::aes(
      x = date, y = n_positive_simple,
      color = "Test positive (simple)"
    ),
    linewidth = 1
  ) +
  ggplot2::facet_wrap(~ age_group) +
  ggplot2::ylab("Test positive / Infected") +
  ggplot2::scale_color_manual(
    values = c(
      "Infected"                  = "deepskyblue3",
      "Test positive (simple)"    = "orange",
      "Test positive (realistic)" = "seagreen"
    )
  ) +
  ggplot2::labs(colour = "Model output")

Plots of the example data bundled with diseasy.

These different levels of detail allow us to test the initialisation from incidence data in different cases.

The method relies on having incidence data, so we scale the model outputs by the population size. We do this by creating a “synthetic” observable in the observables module.

The simplest case is using the “n_infected” signal, which directly tracks the I1 state in the model, while the most realistic case is the “n_positive” signal, which has some real-life-inspired noise patterns.

source <- "n_positive"


if (source == "n_infected") {
  mapping <- \(n_infected, n_population) n_infected / (n_population)
} else if (source == "n_positive_simple") {
  mapping <- \(n_positive_simple, n_population) {
    n_positive_simple / (n_population * 0.65)
  }
} else if (source == "n_positive") {
  mapping <- \(n_positive, n_population) n_positive / (n_population * 0.65)
}

obs$define_synthetic_observable("incidence", mapping)

incidence_data <- obs$get_observation(
  observable = "incidence",
  stratification = rlang::quos(age_group)
) |>
  dplyr::mutate("source" = !!source)

Correctly specified model

In any case, we first need to define the model that should be initialised using the incidence data. Here, we use the model configuration that was used to generate the data, to test the best-case scenario:

# Set the point in time to initialise from
obs$set_last_queryable_date(obs$start_date + lubridate::days(45))

generate_model <- function(K, L, M, rE = 1 / 2.1, rI = 1 / 4.5) {

  # Define the activity for the scenario
  act <- DiseasyActivity$new()
  act$set_contact_basis(contact_basis = contact_basis$DK)
  act$set_activity_units(dk_activity_units)
  act$change_activity(date = as.Date("1900-01-01"), opening = "baseline")

  # Add a waning immunity scenario
  im <- DiseasyImmunity$new()
  im$set_exponential_waning(time_scale = 180)

  # Add a season scenario
  s <- DiseasySeason$new()
  s$set_reference_date(as.Date("2020-01-01"))
  s$use_cosine_season()


  # Create a SEIR model to initialise
  m <- DiseasyModelOdeSeir$new(
    activity = act,
    immunity = im,
    season = s,
    observables = obs,
    parameters = list(
      "compartment_structure" = c("E" = K, "I" = L, "R" = M),
      "age_cuts_lower" = c(0, 30, 60),
      "overall_infection_risk" = 0.025,
      "disease_progression_rates" = c("E" = rE, "I" = rI)
    )
  )

  return(m)
}

m <- generate_model(2L, 1L, 2L) # Use the configuration from example data

This method relies on fitting a polynomial to the latest period, so we here visualise this fitting.

# Extract the most recent signal
poly_fit_data <- incidence_data |>
  dplyr::mutate(
    "t" = as.numeric(.data$date - !!obs$last_queryable_date, units = "days")
  )

poly_fit_projection <- poly_fit_data |>
  dplyr::group_by(.data$age_group) |>
  dplyr::group_modify(
    ~ {
      poly_fit <- lm(
        incidence ~ poly(t, m$parameters$incidence_polynomial_order, raw = TRUE),
        data = dplyr::filter(
          .x,
          .data$t <= 0,
          .data$t >= - m$parameters$incidence_polynomial_training_length
        )
      )

      tibble::tibble(
        "t" = .x$t,
        "incidence" = predict(poly_fit, data.frame("t" = t))
      ) |>
        dplyr::mutate("date" = .x$date)
    }
  )

incidence_data |>
  ggplot2::ggplot(ggplot2::aes(x = date, y = incidence)) +
    ggplot2::geom_point(
      color = switch(
        source,
        "n_infected"        = "deepskyblue3",
        "n_positive_simple" = "orange",
        "n_positive"        = "seagreen"
      )
    ) +
    ggplot2::geom_line(data = poly_fit_projection, color = "red", linewidth = 1) +
    ggplot2::geom_vline(
      xintercept = obs$last_queryable_date,
      linetype = 2, linewidth = 1, color = "red"
    ) +
    ggplot2::geom_vline(
      xintercept = obs$last_queryable_date -
        m$parameters$incidence_polynomial_training_length,
      linetype = 2, linewidth = 1, color = "red"
    ) +
    ggplot2::ylim(
      0,
      incidence_data |>
        dplyr::pull("incidence") |>
        max() * 1.1
    ) +
    ggplot2::facet_wrap(~ age_group) +
    ggplot2::theme_bw()
#> Warning: Removed 81 rows containing missing values or values outside the scale range
#> (`geom_line()`).

Fitting a polynomial to the incidence data to estimate derivatives.
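
To illustrate how derivatives can be read off such a fit, the sketch below fits a raw polynomial to a hypothetical, noise-free signal. With `raw = TRUE`, the $n$'th coefficient multiplies $t^n$, so the $n$'th derivative at $t = 0$ is $n!$ times that coefficient. The signal and its coefficients are assumptions made for illustration, and this is not necessarily how the package computes the derivatives internally.

```r
# Hypothetical noise-free signal on a training window t in [-20, 0]
t <- seq(-20, 0)
incidence <- 1e-3 + 2e-4 * t + 3e-6 * t^2

# Fit a second order polynomial with raw (non-orthogonal) terms
fit <- lm(incidence ~ poly(t, 2, raw = TRUE))

# Coefficient n multiplies t^n, so d^n I*/dt^n at t = 0 is n! * coefficient n
coefs <- unname(coef(fit))
derivatives <- factorial(seq_along(coefs) - 1) * coefs

derivatives
```

Centring time on the last queryable date, as done above, is what makes the evaluation at $t = 0$ correspond to the point of initialisation.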

We can now use the $initialise_state_vector() method to infer the initial state vector.

psi <- m$initialise_state_vector(incidence_data)

psi
#> # A tibble: 18 × 4
#>    variant age_group state initial_condition
#>    <chr>   <chr>     <chr>             <dbl>
#>  1 All     00-29     E1             0.00156 
#>  2 All     00-29     E2             0.00149 
#>  3 All     00-29     I1             0.00505 
#>  4 All     00-29     R1             0.0124  
#>  5 All     00-29     R2             0.000756
#>  6 All     30-59     E1             0.00128 
#>  7 All     30-59     E2             0.00123 
#>  8 All     30-59     I1             0.00424 
#>  9 All     30-59     R1             0.0106  
#> 10 All     30-59     R2             0.000655
#> 11 All     60+       E1             0.000463
#> 12 All     60+       E2             0.000440
#> 13 All     60+       I1             0.00148 
#> 14 All     60+       R1             0.00361 
#> 15 All     60+       R2             0.000222
#> 16 NA      00-29     S              0.337   
#> 17 NA      30-59     S              0.368   
#> 18 NA      60+       S              0.250

And we now test the initial conditions by solving the model using these starting conditions.

get_prediction <- function(
  model,
  psi,
  signal = "incidence"
) {

  # Integrate the ODE system with deSolve
  sol <- deSolve::ode(
    y = psi$initial_condition,
    times = seq(0, 100),
    func = model$rhs
  )

  # Improve the names of the output
  colnames(sol) <- c(
    "time",
    psi |>
      tidyr::unite("label", "variant", "age_group", "state", sep = "/") |>
      dplyr::pull("label")
  )

  # Convert to long format
  sol_long <- sol |>
    as.data.frame() |>
    tidyr::pivot_longer(
      !"time",
      names_sep = "/",
      names_to = c("variant", "age_group", "state")
    )

  # Extract the solution
  if (signal == "incidence") {
    out <- sol_long |>
      dplyr::filter(.data$state == "I1") |>
      dplyr::select(!"state") |>
      dplyr::mutate(
        "date" = .data$time + model$observables$last_queryable_date,
        "incidence_model" = model$parameters$disease_progression_rates[["I"]] *
            model$parameters$compartment_structure[["I"]] *
            .data$value
      )

    # Convert to incidence
    proportion <- model$activity$map_population(
      model$parameters$age_cuts_lower
    ) |>
      dplyr::mutate(
        "age_group" = diseasystore::age_labels(
          model$parameters$age_cuts_lower
        )[.data$age_group_out]
      ) |>
      dplyr::summarise(
        "proportion" = sum(.data$proportion),
        .by = "age_group"
      )

    out <- out |>
      dplyr::left_join(proportion, by = "age_group") |>
      dplyr::mutate("incidence_model" = .data$incidence_model / .data$proportion) |>
      dplyr::select(!"proportion")

  } else if (signal == "prevalence") {
    out <- sol_long |>
      dplyr::filter(startsWith(.data$state, "I")) |> # All I states count
      dplyr::select(!"state") |>
      dplyr::summarise(
        "date" = dplyr::first(.data$time) + model$observables$last_queryable_date,
        "incidence_model" = sum(.data$value),
        .by = "time"
      )
  }

  # Add the model configuration
  out <- out |>
    dplyr::mutate(
      "model_configuration" = paste0(
        names(model$parameters$compartment_structure),
        model$parameters$compartment_structure,
        collapse = ""
      )
    )

  return(out)
}

prediction <- get_prediction(model = m, psi = psi)

prediction
#> # A tibble: 303 × 7
#>     time variant age_group   value date       incidence_model
#>    <dbl> <chr>   <chr>       <dbl> <date>               <dbl>
#>  1     0 All     00-29     0.00505 2020-02-17         0.00315
#>  2     0 All     30-59     0.00424 2020-02-17         0.00244
#>  3     0 All     60+       0.00148 2020-02-17         0.00128
#>  4     1 All     00-29     0.00536 2020-02-18         0.00334
#>  5     1 All     30-59     0.00449 2020-02-18         0.00258
#>  6     1 All     60+       0.00158 2020-02-18         0.00136
#>  7     2 All     00-29     0.00571 2020-02-19         0.00357
#>  8     2 All     30-59     0.00479 2020-02-19         0.00276
#>  9     2 All     60+       0.00168 2020-02-19         0.00145
#> 10     3 All     00-29     0.00611 2020-02-20         0.00381
#> # ℹ 293 more rows
#> # ℹ 1 more variable: model_configuration <chr>

ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = prediction,
    ggplot2::aes(x = date, y = incidence_model, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = obs$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a correctly specified model to initialise the SEIR model matches the true data near-perfectly.

Misspecified model

Correctly matching the model is the best-case scenario. However, we can also test the method for a couple of cases where the model is misspecified.

Note that at this stage we do not modify the parameters of the model to match the development; we only estimate the initial state vector.

Once we include model fitting, the discrepancy between the data and model predictions may diminish.

Misspecified model in periods of increasing infections

models <- list(
  generate_model(2L, 1L, 1L),
  generate_model(1L, 1L, 2L),
  generate_model(2L, 2L, 2L),
  generate_model(3L, 2L, 5L)
)

predictions <- models |>
  purrr::map(\(m) {
    get_prediction(model = m, psi = m$initialise_state_vector(incidence_data))
  }) |>
  purrr::reduce(rbind)


ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = predictions,
    ggplot2::aes(x = date, y = incidence_model, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = models[[1]]$observables$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a misspecified model to initialise the SEIR model can match the true data well when infections are increasing.

Misspecified model in periods of decreasing infections


models <- list(
  generate_model(2L, 1L, 1L),
  generate_model(1L, 1L, 2L),
  generate_model(2L, 2L, 2L),
  generate_model(3L, 2L, 5L)
)

# Update the last queryable date to later starting point
purrr::walk(models, \(m) {
  m$observables$set_last_queryable_date(
    m$observables$start_date + lubridate::days(150)
  )
})

predictions <- models |>
  purrr::map(\(m) {
    get_prediction(model = m, psi = m$initialise_state_vector(incidence_data))
  }) |>
  purrr::reduce(rbind)


ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = predictions,
    ggplot2::aes(x = date, y = incidence_model, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = models[[1]]$observables$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a misspecified model to initialise the SEIR model can match the true data well when infections are decreasing.