---
title: "Creating target denominator populations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{a03_Creating_target_denominator_populations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, warning=FALSE, message=FALSE}
library(IncidencePrevalence)
library(CDMConnector)
library(IncidencePrevalence)
library(dbplyr)
library(dplyr)
library(tidyr)
library(ggplot2)
```

# Using generateDenominatorCohortSet() with a target cohort
As seen in the previous vignette, `generateDenominatorCohortSet()` can be used to generate denominator populations based on all individuals in the database with individuals included once they satisfy criteria. However, in some case we might want to define a denominator population within a specific population of interest, for example people diagnosed with a condition of interest. The function `generateTargetDenominatorCohortSet()` provides functionality for this.

To provide an example its use, let´s generate 5 example patients.

Here we generate a simulated target cohort table with 5 individuals and 2 different target cohorts to illustrate the following examples. Note, some of the individuals in the database are in an acute asthma cohort.

```{r, message=FALSE, warning=FALSE}
personTable <- tibble(
  person_id = c("1", "2", "3", "4", "5"),
  gender_concept_id = c(rep("8507", 2), rep("8532", 3)),
  year_of_birth = 2000,
  month_of_birth = 06,
  day_of_birth = 01
)
observationPeriodTable <- tibble(
  observation_period_id = "1",
  person_id = c("1", "2", "3", "4", "5"),
  observation_period_start_date = c(
    as.Date("2010-12-19"),
    as.Date("2005-04-01"),
    as.Date("2009-04-10"),
    as.Date("2010-08-20"),
    as.Date("2010-01-01")
  ),
  observation_period_end_date = c(
    as.Date("2011-06-19"),
    as.Date("2005-11-29"),
    as.Date("2016-01-02"),
    as.Date("2011-12-11"),
    as.Date("2015-06-01")
  )
)

acute_asthma <- tibble(
  cohort_definition_id = rep("1", 5),
  subject_id = c("3", "3", "5", "5", "2"),
  cohort_start_date = c(
    as.Date("2012-01-01"),
    as.Date("2015-06-01"),
    as.Date("2014-10-01"),
    as.Date("2010-06-01"),
    as.Date("2005-08-20")
  ),
  cohort_end_date = c(
    as.Date("2013-01-01"),
    as.Date("2015-12-31"),
    as.Date("2015-04-01"),
    as.Date("2010-06-01"),
    as.Date("2005-09-20")
  )
)

# mock database
cdm <- mockIncidencePrevalence(
  personTable = personTable,
  observationPeriodTable = observationPeriodTable,
  targetCohortTable = acute_asthma
)
```

As we´ve already seen, we can get a denominator population without including any particular subset like so
```{r, message=FALSE, warning=FALSE}
cdm <- generateDenominatorCohortSet(
  cdm = cdm, 
  name = "denominator"
)
cdm$denominator
```

As we did not specify any study start and end date, the cohort start and end date of our 5 patients correspond to the same registered as observation period. 

```{r, message=FALSE, warning=FALSE}
cdm$denominator %>%
  filter(subject_id %in% c("1", "2", "3", "4", "5")) %>%
  collect() %>%
  pivot_longer(cols = c(
    "cohort_start_date",
    "cohort_end_date"
  )) %>%
  ggplot() +
  geom_point(aes(x = value, y = subject_id)) +
  geom_line(aes(x = value, y = subject_id)) +
  theme_minimal() +
  xlab("Year")
```

But if we use `generateTargetDenominatorCohortSet()` to create a denominator cohort among the individuals in the acute asthma cohort.

```{r, message=FALSE, warning=FALSE}
cdm <- generateTargetDenominatorCohortSet(
  cdm = cdm, 
  name = "denominator_acute_asthma",
  targetCohortTable = "target"
)
```

We can see that persons "3" and "5" experienced this condition in two different occasions and contribute time to the denominator population twice, while person "2" contributes one period of time at risk. 

```{r, message=FALSE, warning=FALSE}
cdm$denominator_acute_asthma %>%
  collect() %>%
  mutate(row = row_number()) %>% 
  pivot_longer(cols = c(
    "cohort_start_date",
    "cohort_end_date"
  )) %>%
  ggplot(aes(group = row)) +
  geom_point(aes(x = value, y = subject_id)) +
  geom_line(aes(x = value, y = subject_id)) +
  theme_minimal() +
  xlab("Year")
```

Restrictions based on age, sex, and prior observation can again be applied. Prior observation requirements are applied relative to an individual´s observation period start date. Importantly, when using a target cohort for the denominator individuals must satisfy the age and prior observation requirements on their target cohort entry date. (If you wish to stratify on a time-invariant feature, you would create a general population denominator and stratify at the time of estimating incidence and prevalence - see subsequent vignettes).

In the case below we can see that persons "3" and "5" satisfy the sex and age requirements on their target cohort start dates, although one of the cohort entries is excluded for patient "5" as they were below the minimum age at that time.

```{r, message=FALSE, warning=FALSE}
cdm <- generateTargetDenominatorCohortSet(
  cdm = cdm, 
  name = "denominator_acute_asthma_2",
  ageGroup = list(c(11, 15)),
  sex = "Female",
  daysPriorObservation = 0,
  targetCohortTable = "target"
)

cdm$denominator_acute_asthma_2 %>%
  collect() %>%
  mutate(row = row_number()) %>% 
  pivot_longer(cols = c(
    "cohort_start_date",
    "cohort_end_date"
  )) %>%
  ggplot(aes(group = row)) +
  geom_point(aes(x = value, y = subject_id)) +
  geom_line(aes(x = value, y = subject_id)) +
  theme_minimal() +
  xlab("Year")
```
We can also enforce a time at risk for people to contribute to the denominator. For instance, we might only want to take into account events for people in the target denominator cohort on their first month after cohort entry. This can be achieved by adding the input parameter `timeAtRisk = c(0,30)` when computing the denominator, before any incidence or prevalence calculations. 

```{r, message=FALSE, warning=FALSE}
cdm <- generateTargetDenominatorCohortSet(
  cdm = cdm, 
  name = "denominator_acute_asthma_2",
  ageGroup = list(c(11, 15)),
  sex = "Female",
  daysPriorObservation = 0,
  targetCohortTable = "target",
  timeAtRisk = c(0,30)
)

cdm$denominator_acute_asthma_2 %>%
  collect() %>%
  mutate(row = row_number()) %>% 
  pivot_longer(cols = c(
    "cohort_start_date",
    "cohort_end_date"
  )) %>%
  ggplot(aes(group = row)) +
  geom_point(aes(x = value, y = subject_id)) +
  geom_line(aes(x = value, y = subject_id)) +
  theme_minimal() +
  xlab("Year")
```
Note that this parameter allows the user to input different time at risk values in the same function call. Therefore, if we ask for `timeAtRisk = list(c(0, 30), c(31, 60))`, we will get a denominator cohort of people contributing time up to 30 days following cohort entry, and another one with time from 31 days following cohort entry to 60 days. 

```{r, message=FALSE, warning=FALSE}
cdm <- generateTargetDenominatorCohortSet(
  cdm = cdm, 
  name = "denominator_acute_asthma_3",
  ageGroup = list(c(11, 15)),
  sex = "Female",
  daysPriorObservation = 0,
  targetCohortTable = "target",
  timeAtRisk = list(c(0,30), c(31,60))
)

cdm$denominator_acute_asthma_3 %>%
  collect() %>%
  dplyr::left_join(attr(cdm$denominator_acute_asthma_3, "cohort_set") %>%
                     dplyr::select(c("cohort_definition_id",
                                     "time_at_risk")),
                   by = "cohort_definition_id",
                   copy = TRUE) %>%
  mutate(row = row_number()) %>% 
  pivot_longer(cols = c(
    "cohort_start_date",
    "cohort_end_date"
  )) %>%
  ggplot(aes(group = row, colour = time_at_risk)) +
  geom_point(aes(x = value, y = subject_id)) +
  geom_line(aes(x = value, y = subject_id)) +
  theme_minimal() +
  xlab("Year")
```

Additionally, observe that this parameter allows us to  control the follow-up of these individuals after cohort entry for the denominator cohort, but this is inherently linked to their cohort exit and the end of their observation period. Hence, if we define the target cohort so that individuals are only followed for one month, and we now require `timeAtRisk = c(0,90)`, we will get the same denominator cohort as in the previous example.

```{r, message=FALSE, warning=FALSE}
cdm$target_2 <- cdm$target |>
  dplyr::mutate(dif = cohort_end_date - cohort_start_date) |>
  dplyr::mutate(cohort_end_date = dplyr::if_else(
    dif > 30,
    clock::add_days(cohort_start_date, 30),
    cohort_end_date
  )) |>
  dplyr::select(-"dif") |>
  dplyr::compute()

cdm <- generateTargetDenominatorCohortSet(
  cdm = cdm, 
  name = "denominator_acute_asthma_4",
  ageGroup = list(c(11, 15)),
  sex = "Female",
  daysPriorObservation = 0,
  targetCohortTable = "target_2",
  timeAtRisk = c(0,90)
)

cdm$denominator_acute_asthma_4 %>%
  collect() %>%
  mutate(row = row_number()) %>% 
  pivot_longer(cols = c(
    "cohort_start_date",
    "cohort_end_date"
  )) %>%
  ggplot(aes(group = row)) +
  geom_point(aes(x = value, y = subject_id)) +
  geom_line(aes(x = value, y = subject_id)) +
  theme_minimal() +
  xlab("Year")
```