Skip to contents
library(diseasy)
#> Loading required package: diseasystore
#> 
#> Attaching package: 'diseasy'
#> The following object is masked from 'package:diseasystore':
#> 
#>     diseasyoption

Introduction

Let us begin by considering a general SEIR model with KK and LL consecutive EE and II states respectively, which are governed by the rates rer_e and rir_i respectively.

SEIR model overview with multiple E and I states

SEIR model overview with multiple E and I states

Let us further assume that we have a incidence signal1, Inc*(t)Inc^*(t), which we would like our model to match.

The general approach is to consider the derivatives of the incidence and link these to the states of the model.

Initialising EI states

The EIEI states of the SEIR model should match the most recent developments of the incidence.

For this purpose, we assume that signal occurs when exiting I1I_1 in the model.

That is, we assume Inc*=riI1Inc^* = r_i I_1.

If we take the equation for I1I_1 and multiply by rir_i we obtain.

ridI1dt=rireEKri2I1 r_i\frac{d I_1}{d t} = r_i r_e E_K - r_i^2 I_1 \Rightarrow dInc*dt=rireEKriInc* \frac{d Inc^*}{d t} = r_i r_e E_K - r_i Inc^*

If we take the second derivative, we find: d2Inc*dt2=riredEKdtridIldt \frac{d^2Inc^*}{d t^2} = r_i r_e \frac{d E_K}{d t} - r_i \frac{d I_l}{d t} From here, we can inject dEKdt\frac{d E_K}{d t} from the SEIR equations which in turn relates to Ek1E_{k-1}. d2Inc*dt2=rire(reEK1reEK)ridIldt \frac{d^2Inc^*}{d t^2} = r_i r_e \left(r_e E_{K-1} - r_e E_K\right) - r_i \frac{d I_l}{d t}

This process can be iterated through the derivatives until all EKE_K states are expressed in terms of Inc*(t)Inc^*(t) and its derivatives.

In this case, we can relate the EkE_k states to the rates and derivatives of the signal in a simple form:

ri[reEKre2EK1reK1E2reKE1]=M¯¯K[Inc*dInc*dtdKInc*dtKdK+1Inc*dtK+1] r_i \begin{bmatrix} r_e E_K \\ r_e^2 E_{K-1} \\ \dots \\ r_e^{K-1} E_2 \\ r_e^K E_1 \\ \end{bmatrix} = \overline{\overline{M}}_K \cdot \begin{bmatrix} Inc^* \\ \frac{d Inc^*}{d t} \\ \dots \\ \frac{d^K Inc^*}{d t^K} \\ \frac{d^{K+1}Inc^*}{d t^{K+1}} \end{bmatrix}

The matrix MK¯¯\overline{\overline{M_K}} can be computed via a simple recursion.

To see why, start the equation for the derivative of Inc*(t)Inc^*(t): dInc*dt=rireEKriInc* \frac{d Inc^*}{d t} = r_i r_e E_K - r_i Inc^*

And separating the EkE_k and Inc*Inc^* terms: rireEK=riInc*+dInc*dt r_i r_e E_K = r_i Inc^* + \frac{d Inc^*}{d t}

In the above formulation, this corresponds to the matrix M¯¯1=[ri1]\overline{\overline{M}}_1 = \begin{bmatrix}r_i & 1\end{bmatrix}.

When taking the second derivative, we obtain:

d2Inc*dt2=riredEKdtridInc*dt \frac{d^2Inc^*}{d t^2} = r_i r_e \frac{d E_K}{d t}- r_i \frac{d Inc^*}{d t} \Rightarrow d2Inc*dt2=rire(reEK1reEK)ridInc*dt \frac{d^2Inc^*}{d t^2} = r_i r_e \left(r_e E_{K-1} - r_e E_K\right) - r_i \frac{d Inc^*}{d t} \Rightarrow d2Inc*dt2=rire2Ek1re(ridInc*dt+Inc*)ridInc*dt \frac{d^2Inc^*}{d t^2} = r_i r_e^2 E_{k-1} - r_e\left(r_i \frac{d Inc^*}{d t} + Inc^*\right)- r_i \frac{d Inc^*}{d t}

And separating the EkE_k and Inc*Inc^* terms: rire2Ek1=re(ridInc*dt+Inc*)+ridInc*dt+d2Inc*dt2 r_i r_e^2 E_{k-1} = r_e\left(r_i \frac{d Inc^*}{d t} + Inc^*\right) + r_i \frac{d Inc^*}{d t} + \frac{d^2Inc^*}{d t^2}

Which, in the matrix formulation corresponds to the sum of rem¯1r_e \overline{m}_1 and the shifted m¯1\overline{m}_1, where m¯1\overline{m}_1 is the row vector of M¯1¯\overline{\overline{M}_1}.

That is, we can express the second derivative as: m¯2=re[m¯10]+[0m¯1] \overline{m}_2 = r_e \begin{bmatrix}\overline{m}_1 & 0 \end{bmatrix}+ \begin{bmatrix}0 & \overline{m}_1\end{bmatrix} =[rirere0]+[0ri1] = \begin{bmatrix}r_i r_e & r_e & 0 \end{bmatrix} + \begin{bmatrix}0 & r_i & 1 \end{bmatrix} =[rirere+ri1] = \begin{bmatrix}r_i r_e & r_e + r_i & 1 \end{bmatrix}

Which, combined with m¯1\overline{m}_1 yields the two level system:

M¯¯2=[m11m120m21m22m23] \overline{\overline{M}}_2 = \begin{bmatrix}m_{11} & m_{12} & 0 \\ m_{21} & m_{22} & m_{23} \end{bmatrix} =[ri10rirere+ri1] = \begin{bmatrix} r_i & 1 & 0 \\ r_i r_e & r_e + r_i & 1 \end{bmatrix}

In general, the rows can be computed by recursion: m¯k=re[m¯k10]+[0m¯k1]m¯¯1=[ri1]. \overline{m}_{k} = r_e \begin{bmatrix} \overline{m}_{k-1} & 0 \end{bmatrix} + \begin{bmatrix} 0 & \overline{m}_{k-1} \end{bmatrix}\quad \overline{\overline{m}}_1 = \begin{bmatrix}r_i & 1\end{bmatrix}.

The algorithmic implementation of the recursion is then:

K <- 4
ri <- 0.9
re <- 0.8

M <- matrix(rep(0, K * (K + 1)), nrow = K) # Pre-allocate
active_row <- c(ri, 1)

for (k in seq(K)) {
  if (k > 1) active_row <- c(0, active_row) + re * c(active_row, 0)

  M[k, seq(k + 1)] <- active_row
}

M
#>        [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.9000 1.00 0.00  0.0    0
#> [2,] 0.7200 1.70 1.00  0.0    0
#> [3,] 0.5760 2.08 2.50  1.0    0
#> [4,] 0.4608 2.24 4.08  3.3    1

Since we assume that the signal Inc*Inc^* only relates to I1I_1, then we can determine the Il>1I_{l>1} states by evaluating the signal at Inc*(t(l1)/ri)Inc^*(t - (l - 1) / r_i).

Initialising SR states

The SRSR states of the SEIR model should capture both the short and the long term developments of the incidence signal we want to match. If a lot of infections have happened previously, we expect a larger proportion of the population to be in the RR states.

We again assume Inc*=riI1Inc^* = r_i I_1.

We can modify the SEIR equation to take this signal as a forcing function with one less II state (no I1I_1 state).

The equations are as normal except for the following changes: I=Inc*ri+l=2LIl I = \frac{Inc^*}{r_i} + \sum_{l=2}^L I_l

dI2dt=Inc*riI2 \frac{d I_2}{d t} = Inc^* - r_i I_2

If we create such a submodel and start this system at a time where there are no new infections, we can initialize Il=0I_l = 0, and run the simulation forward to estimate the SS and RR populations at the point of interest.

Initialising custom model output states

When the model is asked to deliver a prediction for an observable, these observables may be shifted in time (i.e. delayed) relative to the point of infection.

For example, hospitalisation might occur on average 7 days after infection. If the models starts at time = 0, then any model output for hospitalisation will therefore have a transient period of 7 days before that signal begins being exported by the model.

Each model output is derived either from the I1I_1 state or any of the CC states added when the user defines a custom model output (see the “DiseasyModelOde” article).

To eliminate the transient behaviour of the model output when delays are included, we can determine the values of the I1I_1 state and the CC states for time < 0.

This process is two-fold, 1) To determine I1I_1 for time < 0 we can simply use the input signal given to the initialisation method by re-using the assumption Inc*=riI1Inc^* = r_i I_1.

  1. To determine CC_\cdot for time < 0 we use the same submodel used to infer the R and S states and configure the custom model outputs here. The forcing function in the submodel thus generates the time evolution of the CC_\cdot states for time < 0.

Testing the methods

We test the initialisation methods on a data set generated using the same ?DiseasyModelOdeSeir model template that we are using in this vignette.

Simple SEIR example data

For the first example, we use the SEIR model output where we know the parameters of the model used to generate the data.

To begin, we configure the observables module to use this data set and to use all available data.

# Connect to a database
obs <- DiseasyObservables$new(
  diseasystore = DiseasystoreSeirExample,
  conn = DBI::dbConnect(duckdb::duckdb())
)

obs$set_study_period( # Use all available data
  start_date = obs$ds$min_start_date,
  end_date = obs$ds$max_end_date
)

The data set contains different data for the infected to test our initialisation method against.

The data are:

  • “n_infected”: The true number of infected in the model, measured as the number of people transitioning out of the I1 state at any given date.
  • “n_positive_simple”: A realisation of the number of test-positives in the model - using a 65 % probability of testing.
  • “n_positive”: A realisation of the number of test-positives in the model - using a overall 65 % probability of testing in conjunction with a reduced probability of testing during weekends.
model_data <- c("n_infected", "n_positive_simple", "n_positive") |>
  purrr::map(\(observable) {
    obs$get_observation(
      observable = observable,
      stratification = rlang::quos(age_group)
    )
  }) |>
  purrr::reduce(~ dplyr::full_join(.x, .y, by = c("date", "age_group"))) |>
  dplyr::mutate("variant" = "WT", .after = "age_group")

model_data
#> # A tibble: 1,500 × 6
#>    date       age_group variant n_infected n_positive_simple n_positive
#>    <date>     <chr>     <chr>        <dbl>             <dbl>      <dbl>
#>  1 2020-01-03 00-29     WT            68.0                42         42
#>  2 2020-01-04 00-29     WT           145.                 94         69
#>  3 2020-01-05 00-29     WT           191.                122         93
#>  4 2020-01-06 00-29     WT           220.                142        149
#>  5 2020-01-07 00-29     WT           242.                152        163
#>  6 2020-01-08 00-29     WT           265.                169        180
#>  7 2020-01-09 00-29     WT           289.                189        203
#>  8 2020-01-10 00-29     WT           315.                191        207
#>  9 2020-01-11 00-29     WT           343.                231        176
#> 10 2020-01-12 00-29     WT           374.                236        174
#> # ℹ 1,490 more rows
# Visualise the example data
ggplot2::ggplot(model_data) +
  ggplot2::geom_line(
    ggplot2::aes(x = date, y = n_infected, color = "Infected"),
    linewidth = 1
  ) +
  ggplot2::geom_point(
    ggplot2::aes(x = date, y = n_positive, color = "Test positive (realistic)")
  ) +
  ggplot2::geom_line(
    ggplot2::aes(
      x = date, y = n_positive_simple,
      color = "Test positive (simple)"
    ),
    linewidth = 1
  ) +
  ggplot2::facet_wrap(~ age_group) +
  ggplot2::ylab("Test positive / Infected") +
  ggplot2::scale_color_manual(
    values = c(
      "Infected"                  = "deepskyblue3",
      "Test positive (simple)"    = "orange",
      "Test positive (realistic)" = "seagreen"
    )
  ) +
  ggplot2::labs(colour = "Model output")

Plots of the example data bundled with diseasy.

These different levels of detail allows us to test the initialisation from incidence data in different cases.

The method relies on having incidence data, so we scale the model outputs by the population size. We do this by creating a “synthetic” observable in the observables module.

The simplest cases is using the “n_infected” signal which directly tracks the I1 state in the model. While the most realistic case is the “n_positive” signal which has some real life inspired noise patterns.

source <- "n_positive"


if (source == "n_infected") {
  mapping <- \(n_infected, n_population) n_infected / (n_population)
} else if (source == "n_positive_simple") {
  mapping <- \(n_positive_simple, n_population) {
    n_positive_simple / (n_population * 0.65)
  }
} else if (source == "n_positive") {
  mapping <- \(n_positive, n_population) n_positive / (n_population * 0.65)
}

obs$define_synthetic_observable("incidence", mapping)

incidence_data <- obs$get_observation(
  observable = "incidence",
  stratification = rlang::quos(age_group)
) |>
  dplyr::mutate("source" = !!source)

Correctly specified model

In any case, we first need to define the model that should initialise using the incidence data. We here use the model configuration used to generate the data to test the best case scenario:

# Set the point in time to initialise from
obs$set_last_queryable_date(obs$start_date + lubridate::days(50))

generate_model <- function(K, L, M, rE = 1 / 2.1, rI = 1 / 4.5) {

  # Define the population for the model
  population <- DiseasyPopulation$new(age_cuts_lower = c(0, 30, 60))

  # Define the activity for the scenario
  activity <- DiseasyActivity$new()
  activity$set_contact_basis(contact_basis = contact_basis_nordic$DK)
  activity$set_activity_units(dk_activity_units)
  activity$change_activity(date = as.Date("1900-01-01"), opening = "baseline")

  # Add a waning immunity scenario
  immunity <- DiseasyImmunity$new()
  immunity$set_exponential_waning(time_scale = 180)

  # Add a season scenario
  season <- DiseasySeason$new()
  season$set_reference_date(as.Date("2020-01-20"))
  season$use_cosine_season()


  # Create a SEIR model to initialise
  m <- DiseasyModelOdeSeir$new(
    population = population,
    activity = activity,
    immunity = immunity,
    season = season,
    observables = obs,
    parameters = list(
      "compartment_structure" = c("E" = K, "I" = L, "R" = M),
      "overall_infection_risk" = 0.025,
      "disease_progression_rates" = c("E" = rE, "I" = rI)
    )
  )

  return(m)
}

m <- generate_model(2L, 1L, 2L) # Use the configuration from example data

This method relies on fitting a polynomial to the latest period, so we here visualise this fitting.

# Extract the most recent signal
poly_fit_data <- incidence_data |>
  dplyr::mutate(
    "t" = as.numeric(.data$date - !!obs$last_queryable_date, units = "days")
  )

poly_fit_projection <- poly_fit_data |>
  dplyr::group_by(.data$age_group) |>
  dplyr::group_modify(
    ~ {
      poly_fit <- lm(
        incidence ~ poly(t, m$parameters$incidence_polynomial_order, raw = TRUE),
        data = dplyr::filter(
          .x,
          .data$t <= 0,
          .data$t >= - m$parameters$incidence_polynomial_training_length
        )
      )

      tibble::tibble(
        "t" = .x$t,
        "incidence" = predict(poly_fit, data.frame("t" = t))
      ) |>
        dplyr::mutate("date" = .x$date)
    }
  )

incidence_data |>
  ggplot2::ggplot(ggplot2::aes(x = date, y = incidence)) +
    ggplot2::geom_point(
      color = switch(
        source,
        "n_infected"        = "deepskyblue3",
        "n_positive_simple" = "orange",
        "n_positive"        = "seagreen"
      )
    ) +
    ggplot2::geom_line(data = poly_fit_projection, color = "red", linewidth = 1) +
    ggplot2::geom_vline(
      xintercept = obs$last_queryable_date,
      linetype = 2, linewidth = 1, color = "red"
    ) +
    ggplot2::geom_vline(
      xintercept = obs$last_queryable_date -
        m$parameters$incidence_polynomial_training_length,
      linetype = 2, linewidth = 1, color = "red"
    ) +
    ggplot2::ylim(
      0,
      incidence_data |>
        dplyr::pull("incidence") |>
        max() * 1.1
    ) +
    ggplot2::facet_wrap(~ age_group) +
    ggplot2::theme_bw()
#> Warning: Removed 1241 rows containing missing values or values outside the scale range
#> (`geom_line()`).

Fitting a polynomial to the incidence data to estimate derivatives.

We can now use the $initialise_state_vector() method to infer the initial state vector.

psi <- m$initialise_state_vector(incidence_data)

psi
#> # A tibble: 168 × 5
#>     time variant age_group state   value
#>    <dbl> <chr>   <chr>     <chr>   <dbl>
#>  1     0 All     00-29     E1    0.00141
#>  2     0 All     00-29     E2    0.00139
#>  3     0 All     00-29     I1    0.00528
#>  4     0 All     00-29     R1    0.00131
#>  5     0 All     00-29     R2    0.00123
#>  6     0 All     30-59     E1    0.00121
#>  7     0 All     30-59     E2    0.00120
#>  8     0 All     30-59     I1    0.00457
#>  9     0 All     30-59     R1    0.00114
#> 10     0 All     30-59     R2    0.00107
#> # ℹ 158 more rows

And we now test the initial conditions by solving the model using these starting conditions.

prediction <- m$get_results(
  "incidence",
  stratification = rlang::quos(variant, age_group),
  prediction_length = 100
) |>
  dplyr::mutate(
    model_configuration = paste0(
      names(m$parameters$compartment_structure),
      m$parameters$compartment_structure,
      collapse = ""
    )
  )

prediction
#> # A tibble: 300 × 7
#>    date       variant age_group incidence realisation_id weight
#>    <date>     <chr>   <chr>         <dbl>          <dbl>  <dbl>
#>  1 2020-02-23 All     00-29       0.00343              1      1
#>  2 2020-02-24 All     00-29       0.00362              1      1
#>  3 2020-02-25 All     00-29       0.00385              1      1
#>  4 2020-02-26 All     00-29       0.00412              1      1
#>  5 2020-02-27 All     00-29       0.00440              1      1
#>  6 2020-02-28 All     00-29       0.00469              1      1
#>  7 2020-02-29 All     00-29       0.00500              1      1
#>  8 2020-03-01 All     00-29       0.00532              1      1
#>  9 2020-03-02 All     00-29       0.00564              1      1
#> 10 2020-03-03 All     00-29       0.00597              1      1
#> # ℹ 290 more rows
#> # ℹ 1 more variable: model_configuration <chr>
ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = prediction,
    ggplot2::aes(x = date, y = incidence, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = obs$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a correctly specified model to initialise the SEIR model matches the true data near-perfectly.

Misspecified model

Correctly matching the model is the best case scenario. However, we can also the method for a couple of cases where the model is misspecified.

Note that we at this state does not modify the parameters of the model to match the development, we only estimate the initial state vector.

Once we include model fitting, the discrepancy between the data and model predictions may diminish.

Misspecified model in periods of increasing infections
models <- list(
  generate_model(2L, 1L, 1L),
  generate_model(1L, 1L, 2L),
  generate_model(2L, 2L, 2L),
  generate_model(3L, 2L, 5L)
)

predictions <- models |>
  purrr::map(\(m) {
    m$get_results(
      "incidence",
      stratification = rlang::quos(variant, age_group),
      prediction_length = 100
    ) |>
      dplyr::mutate(
        model_configuration = paste0(
          names(m$parameters$compartment_structure),
          m$parameters$compartment_structure,
          collapse = ""
        )
  )
  }) |>
  purrr::reduce(rbind)


ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = predictions,
    ggplot2::aes(x = date, y = incidence, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = models[[1]]$observables$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a misspecified model to initialise the SEIR model can match the true data well when infections are increasing.

Misspecified model in periods of decreasing infections

models <- list(
  generate_model(2L, 1L, 1L),
  generate_model(1L, 1L, 2L),
  generate_model(2L, 2L, 2L),
  generate_model(3L, 2L, 5L)
)

# Update the last queryable date to later starting point
purrr::walk(models, \(m) {
  m$observables$set_last_queryable_date(
    m$observables$start_date + lubridate::days(150)
  )
})

predictions <- models |>
  purrr::map(\(m) {
    m$get_results(
      "incidence",
      stratification = rlang::quos(variant, age_group),
      prediction_length = 100
    ) |>
      dplyr::mutate(
        model_configuration = paste0(
          names(m$parameters$compartment_structure),
          m$parameters$compartment_structure,
          collapse = ""
        )
      )
  }) |>
  purrr::reduce(rbind)


ggplot2::ggplot() +
  ggplot2::geom_point(
    data = incidence_data,
    ggplot2::aes(x = date, y = incidence),
    color = switch(
      source,
      "n_infected"        = "deepskyblue3",
      "n_positive_simple" = "orange",
      "n_positive"        = "seagreen"
    )
  ) +
  ggplot2::geom_line(
    data = predictions,
    ggplot2::aes(x = date, y = incidence, color = model_configuration),
    linewidth = 1.5
  ) +
  ggplot2::geom_vline(
    xintercept = models[[1]]$observables$last_queryable_date,
    linetype = 2,
    color = "black"
  ) +
  ggplot2::facet_grid(source ~ age_group, scales = "free") +
  ggplot2::labs(y = "Model output", color = "Model Configuration")

Using a misspecified model to initialise the SEIR model can match the true data well when infections are decreasing.