--- title: "DQAstats2SHIPdataset" vignette: > %\VignetteIndexEntry{DQAstats2SHIPdataset} %\VignetteEncoding{UTF-8} %\VignetteEngine{quarto::html} editor_options: chunk_output_type: console execute: echo: true message: false warning: false collapse: true comment: "#>" --- ```{r, include = FALSE} # nolint start library(magrittr) ``` The intention of this vignette is to provide a tutorial, how to prepare the metadata information of a dataset in order to use the `DQAstats` R package for data quality assessment. The R package `dataquieR` provides both a dataset and corresponding metadata information. For demonstration purposes, the SHIP-dataset from `dataquieR` will be analyzed with `DQAstats` in the following. The necessary steps to convert `dataquieR`'s metadata representation into the format understood by `DQAstats` are provided along with some background information. # Load Data from `dataquieR` R package First of all, load the data files from the `dataquieR` R package. These files include the actual dataset to be analyzed `ship.RDS` and the corresponding metadata representation `ship_meta.RDS`. For further information on `dataquieR`, please refer to [https://dataquality.ship-med.uni-greifswald.de/](https://dataquality.ship-med.uni-greifswald.de/). ```{r} # the next lines are commentend, since we avoided to have 'dataquieR' as yet # another dependency to our package; however, if you have installed it, you # can uncomment these lines to load the required files from the dataquieR-package #list.files(system.file("extdata", package = "dataquieR")) # load the ship dataset # ship_data <- readRDS( # system.file("extdata/ship.RDS", package = "dataquieR") # ) %>% # data.table::data.table() ship_url <- paste0( "https://raw.githubusercontent.com/cran/dataquieR/", "master/inst/extdata/ship.RDS" ) tmpf <- tempfile() download.file(ship_url, tmpf, mode = "wb") ship_data <- readRDS(tmpf) %>% data.table::data.table() # export the dataset to a CSV-file (one of the formats supported by DQAstats) ship_data_export_fn <- "ship_data.csv" data.table::fwrite( x = ship_data, file = file.path(tempdir(), ship_data_export_fn) ) ``` ```{r} # load the ship metadata # ship_meta <- readRDS( # system.file("extdata/ship_meta.RDS", package = "dataquieR") # ) %>% # data.table::data.table() ship_meta_url <- paste0( "https://raw.githubusercontent.com/cran/dataquieR/", "master/inst/extdata/ship_meta.RDS" ) tmpf <- tempfile() download.file(ship_meta_url, tmpf, mode = "wb") ship_meta <- readRDS(tmpf) %>% data.table::data.table() ``` Have a look at `dataquieR`'s metadata representation: ```{r results='asis'} ship_meta %>% DT::datatable(options = list( scrollX = TRUE, pageLength = 4 )) ``` # Prepare `DQAstats` MDR The most important part for using `DQAstats` for data quality assessment is to represent the metadata of the dataset(s) to be analyzed in the correct format. The MDR format expected by `DQAstats` is described in the [DQAstats-wiki](https://github.com/miracum/dqa-dqastats/wiki/MDR). An example MDR is also provided with the `DQAstats` R package [here](https://github.com/miracum/dqa-dqastats/tree/master/inst/demo_data/utilities/MDR). Furthermore, `DQAstats` comes with an exemplary toy dataset, which can be analyzed by executing the commands from the [README.md](https://github.com/miracum/dqa-dqastats#example). 
## Read empty `DQAstats` MDR as template

In order to get an empty template of the `DQAstats` MDR, which will then be filled with the metadata information from `dataquieR`, we simply read the column names from the example MDR:

```{r}
mdr <- data.table::fread(
  file = system.file(
    "demo_data/utilities/MDR/mdr_example_data.csv",
    package = "DQAstats"
  ),
  header = TRUE,
  nrows = 0,
  colClasses = "character"
)
dim(mdr)
colnames(mdr)
```

For a detailed description of the MDR fields, please refer to the [DQAstats-wiki](https://github.com/miracum/dqa-dqastats/wiki/MDR#mdr-overview).

## Transform SHIP-MDR to `DQAstats` representation

Next, we need to fill the rows of the `DQAstats` MDR. Since we are analyzing only one dataset (as opposed to comparing datasets from two different databases, where the data have been transferred from one database to the other, e.g., via an extract-transform-load (ETL) process), we can set the MDR fields "designation", "variable_name", and "key" to the same values.

```{r warning=FALSE}
# get the names of the dataelements from the ship dataset
ship_var_names <- intersect(colnames(ship_data), ship_meta[, get("VAR_NAMES")])

# loop over the dataelements
for (var in ship_var_names) {
  # get the variable type from the ship metadata
  data_type <- ship_meta[get("VAR_NAMES") == var, get("DATA_TYPE")]

  # get the variable description from the ship metadata
  labels <- ship_meta[get("VAR_NAMES") == var, get("VALUE_LABELS")]
  split_list <- strsplit(
    x = labels,
    split = " | ",
    fixed = TRUE
  )

  # convert the variable description into a human-readable format
  if (length(split_list[[1]]) == 1 && is.na(split_list[[1]])) {
    descr <- paste0("Description for dataelement '", var, "'")
  } else {
    descr <- paste0(
      "Description for dataelement '", var, "': \n\n",
      paste0(unlist(split_list), collapse = "; ")
    )
  }

  # for this dataelement, fill a row of the DQAstats-MDR with the
  # extracted information
  mdr <- data.table::rbindlist(
    l = list(
      mdr,
      data.table::as.data.table(
        x = cbind(
          "designation" = var,
          "definition" = descr,
          "source_variable_name" = var,
          "variable_name" = var,
          "key" = var,
          "source_table_name" = ship_data_export_fn,
          "source_system_type" = "csv", # indicates that the dataset's format is CSV
          "source_system_name" = "ship", # arbitrary name for this dataset
          "dqa_assessment" = 1, # default value for dataelements to be analyzed
          "variable_type" = data_type
        )
      )
    ),
    fill = TRUE
  )
}
```

## Change Variable Type for Categorical Variables

In `dataquieR`, categorical variables are defined with "DATA_TYPE=integer" in the MDR. To work well with `DQAstats`, this needs to be changed to "enumerated" (see also the [DQAstats-wiki](https://github.com/miracum/dqa-dqastats/wiki/MDR#mdr-variable-types-and-constraints) for further details).

```{r}
# store the names of all categorical variables in a vector
cat_vars <- c(
  "id", "sex", "obs_bp", "dev_bp", "obs_soma", "dev_length", "dev_weight",
  "obs_int", "school", "family", "smoking", "stroke", "myocard",
  "diab_known", "contraception", "income"
)

# change the variable type from "integer" to "enumerated" (to get meaningful
# results from DQAstats)
mdr[get("designation") %in% cat_vars, ("variable_type") := "enumerated"]
```

## Define Constraints

Next, the constraints defined in `dataquieR`'s MDR need to be transformed into the format required by `DQAstats`. Please find further details on the definition of constraints in the [DQAstats-wiki](https://github.com/miracum/dqa-dqastats/wiki/MDR#mdr-variable-types-and-constraints).
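As an orientation for the following subsections: `DQAstats` stores constraints as JSON strings in the MDR column `constraints`, using a `value_set` for categorical variables, a `range` for continuous variables, and a `regex` for string variables. The following minimal sketch (with made-up example values, not taken from the SHIP metadata) illustrates these three representations:

```{r}
# illustrative, made-up values only; the actual constraints are derived from
# the SHIP metadata in the following subsections

# categorical ("enumerated") variables: a set of allowed values
jsonlite::toJSON(list("value_set" = c("0", "1")), pretty = TRUE)

# continuous ("integer"/"float") variables: an allowed value range
jsonlite::toJSON(
  list("range" = list("min" = 0, "max" = 120)),
  pretty = TRUE,
  auto_unbox = TRUE
)

# string variables: a regular expression the values must match
jsonlite::toJSON(
  list("regex" = "^[[:digit:]]+$"),
  pretty = TRUE,
  auto_unbox = TRUE
)
```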
### Constraints for Categorical Variables

```{r warning=FALSE}
# loop over the categorical variables
for (var in cat_vars) {
  # get the definitions of the category levels
  labels <- ship_meta[get("VAR_NAMES") == var, get("VALUE_LABELS")]
  split_list <- strsplit(
    x = labels,
    split = " | ",
    fixed = TRUE
  )

  # extract the allowed values for this categorical dataelement
  allowed_values <- strsplit(
    x = split_list[[1]],
    split = " =",
    fixed = TRUE
  )

  # transform the allowed values into a "value_set" array, which is stored
  # as a JSON object (required by DQAstats)
  constraint_json <- jsonlite::toJSON(
    list(
      "value_set" = sapply(
        X = allowed_values,
        FUN = function(x) {
          return(x[[1]])
        }
      )
    ),
    pretty = TRUE
  )

  # add the constraints for this dataelement to the DQAstats MDR
  mdr[get("source_variable_name") == var, ("constraints") := constraint_json]
}
```

```{r}
# for comparison, show the metadata definition from dataquieR ...
# dataquieR: a string, split by pipes '|'
ship_meta[get("VAR_NAMES") == "smoking", get("VALUE_LABELS")]
```

```{r}
# ... and from DQAstats: JSON
mdr[get("source_variable_name") == "smoking", get("constraints")]

mdr[get("source_variable_name") == "smoking", get("constraints")] %>%
  jsonlite::fromJSON()
```

We can see here that the mapping from the data codes to their respective values is lost during the transformation of the constraints from `dataquieR` to `DQAstats`. This is because `DQAstats` is meant to be applied directly to the raw dataset and checks the conformance of the observed data values with the respective constraints of each variable.

Furthermore, note that `DQAstats` **does not require constraints** to be defined for every data element. Data quality assessment with `DQAstats` can thus also be performed without defining any constraints.

### Constraints for Continuous Variables

```{r}
# get the names of the continuous variables (float/integer)
cont_vars <- mdr[
  get("variable_type") %in% c("integer", "float"),
  unique(get("source_variable_name"))
]

# loop over the continuous dataelements
for (var in cont_vars) {
  # extract the constraints of this continuous dataelement
  labels <- ship_meta[get("VAR_NAMES") == var, get("HARD_LIMITS")]
  split_list <- strsplit(
    x = labels,
    split = ";",
    fixed = TRUE
  )

  # transform these constraints into a list with the entries "min" and "max"
  constraints <- lapply(
    X = split_list,
    FUN = function(x) {
      if (!is.na(x[[1]])) {
        return(
          list(
            "min" = as.numeric(gsub("[[:punct:]]", "", x[[1]])),
            "max" = as.numeric(gsub("[[:punct:]]", "", x[[2]]))
          )
        )
      } else {
        return("error")
      }
    }
  )

  # save the list as a nested JSON object within the key "range"
  # (required by DQAstats)
  if (length(constraints[[1]]) > 1) {
    constraint_json <- jsonlite::toJSON(
      list("range" = constraints[[1]]),
      pretty = TRUE,
      auto_unbox = TRUE
    )

    # write the constraint for this dataelement to the DQAstats MDR
    mdr[get("source_variable_name") == var, ("constraints") := constraint_json]
  }
}
```

### Constraints for String Variables

When setting the variable type of "id" to `"string"`, constraints can be tested using regular expressions. This is especially useful when investigating a categorical variable with a large number of categories. Here, a simple regular expression is used; however, regular expressions can get arbitrarily complex.

```{r}
# The `id` variable should actually not be treated as a categorical variable,
# since the number of potential categories is (theoretically) unlimited and
# an analysis of the categories is seldom meaningful in this case.
# Nevertheless, we could perform quality checks on the `id` variable, e.g. to
# check for a required formatting of the values.
# This can be achieved with DQAstats by changing the variable type to "string"
# and using a regular expression that checks for the expected format.

# change the variable type to "string"
mdr[get("source_variable_name") == "id", ("variable_type") := "string"]

# add a regular expression for the dataelement "id"
mdr[get("source_variable_name") == "id", ("constraints") := jsonlite::toJSON(
  list("regex" = "^[[:digit:]]{1,5}$"),
  pretty = TRUE,
  auto_unbox = TRUE
)]
```

## Add Plausibility Checks

Currently, there is no mechanism to automatically transform the plausibility checks (or "contradictions", as they are called in `dataquieR`) from the `dataquieR` format to `DQAstats`. The following checks were manually taken from the file [`ship_contradiction_checks.csv`](https://gitlab.com/libreumg/dataquier/-/blob/master/inst/extdata/ship_contradiction_checks.csv) and transformed into the representation required by `DQAstats`. Please find further details on the definition of plausibility checks in the [DQAstats-wiki](https://github.com/miracum/dqa-dqastats/wiki/MDR#mdr-plausibility-relations).

### Contraception in males (atemporal plausibility)

```{r}
# add plausibility checks
# sex and contraception
p <- jsonlite::toJSON(list(
  "atemporal" = list(
    "contraception" = list(
      "name" = "A_present_and_B_vv",
      "description" = "Contracept in males.",
      "join_crit" = "id",
      "filter" = list(
        "ship" = "^1$"
      ),
      "constraints" = list(
        "value_set" = list(
          "ship" = "1"
        )
      )
    )
  )
),
pretty = TRUE,
auto_unbox = TRUE
)

# write the plausibility check to the MDR
mdr[
  source_variable_name == "sex" & dqa_assessment == 1,
  plausibility_relation := p
]
```

### Diabetes age but no Diabetes (atemporal plausibility)

```{r}
p <- jsonlite::toJSON(list(
  "atemporal" = list(
    "diab_age" = list(
      "name" = "A_present_and_B_levels_vl",
      "description" = "Diab age but no diab.",
      "join_crit" = "id",
      "filter" = list(
        "ship" = "^(?!.*(99900|99901|99801)).*$"
      ),
      "constraints" = list(
        "value_set" = list(
          "ship" = "1"
        )
      )
    )
  )
),
pretty = TRUE,
auto_unbox = TRUE
)

mdr[
  source_variable_name == "diab_known" & dqa_assessment == 1,
  plausibility_relation := p
]
```

### Every ID is associated with one Sex-value (uniqueness plausibility)

```{r}
p <- jsonlite::toJSON(list(
  "uniqueness" = list(
    "sex" = list(
      "name" = "Every ID is associated with one Sex",
      "description" = paste0(
        "With each distinct value of 'id', ",
        "only one value of 'sex' may be associated."
      )
    )
  )
),
pretty = TRUE,
auto_unbox = TRUE
)

mdr[
  source_variable_name == "id" & dqa_assessment == 1,
  plausibility_relation := p
]
```

## Display the prepared MDR

Now, have a look at the MDR prepared for `DQAstats`:

```{r results='asis'}
mdr %>%
  DT::datatable(options = list(
    scrollX = TRUE,
    pageLength = 4
  ))
```

# Create `utils`-folder

`DQAstats` requires a so-called `utilities`-folder that contains at least two sub-directories, named `MDR` and `RMD`. Please find further details on this folder in the [DQAstats-wiki](https://github.com/miracum/dqa-dqastats/wiki/Add_Datasets). Once this folder structure has been created, the MDR prepared for `DQAstats` can be saved as a CSV file to the `MDR` sub-directory. This file is then used as input for the function `DQAstats::dqa()`.

```{r message=FALSE}
utils_path <- file.path(tempdir(), "utilities")
dir.create(file.path(utils_path, "MDR"), recursive = TRUE)

data.table::fwrite(
  x = mdr,
  file = file.path(utils_path, "MDR/mdr.csv")
)
```

The `RMD` sub-directory of the `utilities`-folder contains the template files for the R Markdown report.
For demonstration purposes, the templates provided with `DQAstats` can be used directly.

```{r results='hide'}
file.copy(
  from = system.file("demo_data/utilities/RMD", package = "DQAstats"),
  to = utils_path,
  overwrite = TRUE,
  recursive = TRUE
)
```

# Run `DQAstats::dqa()`

Since we do not aim to compare two datasets here, we set `target_system_name` to the same value as `source_system_name`. The analysis thus compares the database with itself.

Furthermore, the path to the `utilities`-folder, the logging directory, the MDR filename, as well as the directory to store the resulting PDF report (`output_dir`) need to be defined and passed to the function `DQAstats::dqa()`. Additionally, the directory that contains the dataset(s) to be analyzed must be specified using an environment variable. This variable is composed of the name of the source system (as specified in the `mdr.csv` field `source_system_name`, in upper case) and the suffix `_PATH`, here `SHIP_PATH`.

```{r eval=FALSE, warning=FALSE, message=FALSE}
# ship dataset
source_system_name <- "ship"
target_system_name <- source_system_name
mdr_filename <- "mdr.csv"
output_dir <- file.path(tempdir(), "output")
logfile_dir <- tempdir()

# this works because "ship_data.csv" was exported to `tempdir()` above;
# the environment variable must point to the directory containing the CSV file
Sys.setenv("SHIP_PATH" = tempdir())

# provide all arguments to the main function
all_results <- DQAstats::dqa(
  source_system_name = source_system_name,
  target_system_name = target_system_name,
  utils_path = utils_path,
  mdr_filename = mdr_filename,
  output_dir = output_dir,
  logfile_dir = logfile_dir,
  parallel = FALSE
)
```

# Launch `DQAgui` as GUI-frontend to `DQAstats`

When executing the following commands, the GUI is launched on port `3838` by default and can be accessed from any web browser via `http://localhost:3838`. A tutorial for `DQAgui` can be found [here](https://github.com/miracum/dqa-dqastats/wiki/DQAgui_intro).

```{r eval=FALSE}
library(DQAgui)

# set the base path for the file browser in the GUI configuration
Sys.setenv("CSV_SOURCE_BASEPATH" = tempdir())

## launch the GUI
DQAgui::launch_app(
  utils_path = utils_path,
  mdr_filename = "mdr.csv",
  parallel = FALSE
)
# nolint end
```
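Finally, coming back to the non-interactive run from the previous section: once `DQAstats::dqa()` has finished, the rendered report can be found in the directory passed as `output_dir`. A minimal check (reusing the `output_dir` variable defined above; the exact file names depend on the `DQAstats` version) could look like this:

```{r eval=FALSE}
# list the files written by DQAstats::dqa() to the report output directory
list.files(output_dir, recursive = TRUE)
```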