---
title: "opendataformat"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{opendataformat}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introducing the opendataformat package
Tired of struggling to convert data in the Open Data Format into R dataframes? Look no further: meet the opendataformat package.
With just a few lines of code, you can read a data package specified in the Open Data Format into an R dataframe (`read_odf()`) or write an R dataframe to a data package in the Open Data Format (`write_odf()`).
But wait, there's more! Our package goes beyond parsing. Accessing metadata has never been easier. Dive into the treasure trove of information stored in your R dataframes. Explore dataset labels, descriptions, and valuable details about variables such as labels and value labels with the `docu_odf()` function and the `getmetadata_odf()` function.
This vignette will guide you through a series of examples that demonstrate the possibilities of these functions. Let's get started!
## Installation
You can install the latest released version of the package from
Zenodo.
Alternatively, you can install the development version from GitHub using the
`install_git()` function from the devtools package.
```{r setup, eval=FALSE}
# Download and install the latest version of the
# opendataformat package from Zenodo:
install.packages(
"https://zenodo.org/records/13683314/files/opendataformat_1.2.1.tar.gz",
repos = NULL, method = "libcurl")
# Alternatively you can install the development version from GitHub:
devtools::install_git(
"https://github.com/opendataformat/r-package-opendataformat.git")
```
# Functions in the opendataformat Package
## Read the Open Data Format: `read_odf()`
The opendataformat package provides example data specified in the Open Data Format. A dataset in the Open Data Format is a ZIP file containing a CSV file and an XML file. The example data consists of the data file (`data.csv`) with 20 rows and 7 columns and the metadata file (`metadata.xml`), which describes the dataset and its variables. If you are interested in what these files look like, you can find an example in the Open Data Format git repository: [https://git.soep.de/opendata/specification/-/tree/main/external/example](https://git.soep.de/opendata/specification/-/tree/main/external/example)
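To check this layout yourself, you can list the contents of the example archive without extracting it. This is only a small sketch: it assumes the opendataformat package is installed, and it relies on `unzip()` from the base utils package rather than on any function of the package itself.

```r
# Locate the example archive shipped with the package
# (assumption: opendataformat is installed).
path <- system.file("extdata", "data.zip", package = "opendataformat")

# List the archive contents without extracting anything.
zip_contents <- utils::unzip(path, list = TRUE)
zip_contents$Name
```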
To load the data, we need to specify the path to the ZIP file.
Here the example data shipped with the package is loaded using `read_odf()`.
Alternatively, you can set `file` to the name of a ZIP file in the working directory.
```{r read_odf}
library(opendataformat)
path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)
```
The output of the `read_odf()` function is an R tibble with the additional class `odf`. The metadata is stored in the attributes of the tibble and of its variables (columns). It includes the languages of the metadata, labels, descriptions, URLs, variable types, and value labels.
```{r view df, comment = ""}
df
```
If you load the haven package, you see the variable labels in the [active language](#set-the-active-language-of-a-data-frame-setlanguage_odf).
```{r view df haven, eval=FALSE}
library(haven)
View(df)
```
If you want to import a dataset with metadata in only one or a few languages, you can use the `languages` argument. To load the example data with English labels and descriptions only, set `languages = "en"`:
```{r read_odf language, eval=FALSE}
df_en <- read_odf(file = path, languages = "en")
```
By default, `languages = "all"`:
```{r read_odf language all, eval=FALSE}
df_all <- read_odf(file = path, languages = "all")
```
You can also pass a vector of languages:
```{r read_odf language list, eval=FALSE}
df_en_de <- read_odf(file = path, languages = c("en", "de"))
```
You can set further arguments for the `read_odf()` function. With the `nrows` argument you define how many rows to read, excluding the header. With the `skip` argument you set how many rows to skip (excluding the header). With the `select` argument you determine which columns/variables to load, given as a vector of indices or variable/column names.
```{r read_odf all inputs, eval = FALSE}
df <- read_odf(
file = path, languages = "all", nrows = Inf, skip = 0, select = NULL
)
```
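As a concrete illustration of these arguments, the following sketch reads only part of the example data. It assumes that the package is installed and that `skip` and `nrows` combine as described above (skip first, then read); the variable names `bap87` and `bap9001` are taken from the example data used throughout this vignette.

```r
library(opendataformat)
path <- system.file("extdata", "data.zip", package = "opendataformat")

# Intended effect (assumption): skip the first 5 data rows, read the
# next 10 rows, and keep only the two selected variables.
df_part <- read_odf(
  file = path,
  nrows = 10,
  skip = 5,
  select = c("bap87", "bap9001")
)
dim(df_part)
```

Checking `dim()` afterwards is a quick way to confirm that the subset has the size you expect.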
## Explore Dataset Information: `docu_odf()`
### Display Metadata
You can explore dataset information in two ways. Firstly, you can browse metadata at the dataset level, which gives you an overview of the dataset. Alternatively, you can examine the details of specific variables.
By default, the `docu_odf()` function presents dataset-level information both in the console and on an HTML page. If you are using RStudio, this HTML page is displayed in the RStudio viewer.
```{r docu_odf, eval = FALSE}
docu_odf(df)
```
To display the metadata only in the console, set the `style` argument to `"print"` (or `"console"`), as we do in the rest of this vignette. To display the metadata only in the viewer, set `style = "viewer"` or `style = "html"`. By default, `style = "both"`.
```{r docu_odf print}
docu_odf(df, style = "print")
```
To obtain a comprehensive overview of all variables within the dataset, simply set the argument `variables="yes"`.
```{r docu_odf variables, comment=""}
docu_odf(df, variables = "yes", style = "print")
```
If you are interested in just one specific variable, pass that variable to `docu_odf()`:
```{r docu_odf selected variable default}
docu_odf(df$bap9001, style = "print")
```
### Display Metadata in a Specific Language
Some datasets offer metadata such as labels, descriptions, or value labels in multiple languages. To display the metadata in all languages available in your dataset, set the `languages` argument to `"all"`. This also shows you which languages are available.
```{r docu_odf languages all, eval = FALSE}
docu_odf(df$bap9001, style = "print", variables = "yes", languages = "all")
```
If you are interested in a specific language, pass the corresponding language code to the `languages` argument. This gives you the variable labels and value labels in that language. In this example, we display the German version:
```{r docu_odf languages code print}
docu_odf(df$bap9001, style = "print", languages = "de")
```
You can apply this function to the entire dataset, allowing you to access the desired information across all variables.
```{r docu_odf languages code all data, eval = FALSE}
docu_odf(df, style = "print", variables = "yes", languages = "de")
```
If you prefer another display style, you can use the dataset's metadata directly from the attributes and write your own code:
```{r docu own style, eval = TRUE, comment = ""}
for (i in names(df)) {
cat(
paste0(attributes(df[[i]])$name, ": ", attributes(df[[i]])$label_de, "\n")
)
}
```
You can also use the `getmetadata_odf()` function to retrieve labels and other metadata for the variables:
```{r getmetadata_odf1, comment = ""}
getmetadata_odf(df, type = "label")
```
or the value labels:
```{r getmetadata_odf2, comment = ""}
getmetadata_odf(df$bap87, type = "valuelabels")
```
## Set the Active Language of a Data Frame: `setlanguage_odf()`
Alternatively, you can set the current (active) language of a dataset object. (This function mimics the `label language` command in Stata.)
```{r docu_odf setLanguage2, eval = FALSE}
df <- setlanguage_odf(df, language = "de")
docu_odf(df$bap9001, style = "print")
```
To display which languages are available for the dataset metadata, display the `languages` attribute:
```{r display languages}
attributes(df)$languages
attr(df, "languages")
```
## Access Metadata: `getmetadata_odf()` and `attributes()`
Browsing through a dataset's metadata provides a valuable initial overview. For the actual analysis, however, you need to know where the metadata is stored and how to access it. Let's explore both.
An easy way to retrieve metadata is the `getmetadata_odf()` function.
### Retrieve the Attributes with `attributes()` and `attr()`
Another way is to retrieve metadata directly from the attributes. The metadata imported from the Open Data Format file into an R tibble (dataframe) is stored as R attributes. By using the base R functions `attributes()` and `attr()`, you can easily access this metadata. When providing the entire dataset to the function, R will display all the metadata describing the dataset as a whole in your console.
```{r dataset attributes, comment = ""}
attributes(df)
```
If you provide a specific variable to the function, only the corresponding metadata for that variable will be printed.
```{r variable attributes, comment = ""}
attributes(df$bap87)
```
If you're interested in a particular attribute, you can access it using the dollar sign followed by the attribute name. For instance, let's consider accessing a variable label in German (language code: de) as an example.
```{r variable label attributes, comment = ""}
attributes(df$bap87)$label_de
```
Alternatively, you can use the `attr()` function to get the same result:
```{r variable label attributes2, eval = FALSE, comment = ""}
attr(df$bap87, "label_de")
```
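One detail worth knowing is that both `$` on the attribute list and `attr()` match names partially by default. The sketch below uses a toy vector with ODF-style attributes (the values are made up, the labels are borrowed from the example data) and a hypothetical helper, `label_in()`, that looks up a label by language code with exact matching and a fallback.

```r
# Toy variable with ODF-style label attributes (values are made up).
health <- c(1, 3, -1, 2)
attr(health, "label_en") <- "Current Health"
attr(health, "label_de") <- "Gesundheitszustand gegenwärtig"

# Hypothetical helper (not part of the package): return the label in
# `lang`, or `fallback` if no such attribute exists.
label_in <- function(x, lang, fallback = NA_character_) {
  # exact = TRUE switches off the partial matching that attr() does by default
  value <- attr(x, paste0("label_", lang), exact = TRUE)
  if (is.null(value)) fallback else value
}

label_in(health, "de")
label_in(health, "fr", fallback = "(no label)")
```

Exact matching avoids silently picking up a similarly named attribute when the requested language is missing.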
Moreover, you have the flexibility to copy, remove, and modify these attributes to suit your needs.
```{r attributes deletion, comment = ""}
attributes(df$bap87)$description_de <- NULL
attributes(df$bap87)$description_de
```
### Retrieve Metadata with `getmetadata_odf()`
You can also use the `getmetadata_odf()` function to retrieve labels and other metadata for the variables.
By default, the function will return the variable labels for a dataset:
```{r getmetadata_odf3, comment = ""}
getmetadata_odf(df, type = "labels")
```
or for a specific variable:
```{r getmetadata_odf4, comment = ""}
getmetadata_odf(df$bap96, type = "labels")
```
To retrieve metadata in a specific language, use the `language` argument:
```{r getmetadata_odf5, eval = FALSE, comment = ""}
getmetadata_odf(df, type = "labels", language = "en")
```
Or set the active language of the dataset using the `setlanguage_odf()` function:
```{r getmetadata_odf6, eval = FALSE, comment = ""}
df <- setlanguage_odf(df, language = "en")
getmetadata_odf(df, type = "labels")
```
You can also use the `getmetadata_odf()` function to retrieve value labels for a specific variable by setting the argument `type="valuelabels"`:
```{r getmetadata_odf valuelabels, comment = ""}
getmetadata_odf(df$bap9001, type = "valuelabels")
```
The labels themselves are stored as the names of the returned vector:
```{r getmetadata_odf valuelabels names, comment = ""}
names(getmetadata_odf(df$bap9001, type = "valuelabels"))
```
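Because the labels are the names and the codes are the values, translating codes into labels is a simple lookup. The following sketch uses a toy vector in the same shape (the three labels are taken from the example metadata, the codes to translate are made up) and only base R.

```r
# Toy value labels: names are the labels, values are the codes.
value_labels <- c("Does not apply" = -2, "No Answer" = -1, "Very good" = 1)

# Some codes as they might appear in the data (made up).
codes <- c(1, -1, 1, -2)

# Translate each code into its label by matching against the values.
labels_for_codes <- names(value_labels)[match(codes, value_labels)]
labels_for_codes
```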
You can use the `getmetadata_odf()` function to return descriptions, URLs, variable types, and metadata languages as well.
To retrieve variable description(s), set the argument `type="description"`:
```{r getmetadata_odf descriptions, comment = ""}
getmetadata_odf(df, type = "description")
```
To retrieve variable url(s), set the argument `type="url"`:
```{r getmetadata_odf url, eval = FALSE, comment = ""}
getmetadata_odf(df, type = "url")
```
To retrieve variable type(s), set the argument `type="type"`:
```{r getmetadata_odf type, eval = FALSE, comment = ""}
getmetadata_odf(df, type = "type")
```
## Export to the Open Data Format: `write_odf()`
To save a dataset as an ODF file, we can use the `write_odf()` function.
Let's assume we want to save the first four columns of our dataset as a new ODF file.
We call `write_odf()` with the R dataframe and the file name (including the path, if the file should not be written to the working directory).
```{r write_odf, comment = "", eval = FALSE}
write_odf(
x = df[, 1:4],
file = "../df_1_4.zip"
)
# or:
df_14 <- df[, 1:4]
write_odf(
x = df_14,
file = "df_1_4.zip"
)
```
The XML file `metadata.xml` and the CSV file `data.csv` are saved within the ZIP file `df_1_4.zip`. The metadata looks the same as before, just with fewer variables:
```xml
bap
The data were collected as part of the SOEP-Core study using the questionnaire "Living in Germany - Survey 2010 on the social situation - Personal questionnaire for all. This questionnaire is addressed to the individual persons in the household. A view of the survey instrument can be found here: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf
Die Daten wurden im Rahmen der Studie SOEP-Core mittels des Fragebogens „Leben in Deutschland – Befragung 2010 zur sozialen Lage - Personenfragebogen für alle“ erhoben. Dieser Fragebogen richtet sich an die einzelnen Personen im Haushalt. Eine Ansicht des Erhebungsinstrumentes finden Sie hier: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf
Data from individual questionnaires 2010
Daten vom Personenfragebogen 2010
Current Health
Gesundheitszustand gegenwärtig
Question: How would you describe your current health?
Frage: Wie würden Sie Ihren gegenwärtigen Gesundheitszustand beschreiben?
-2
Does not apply
trifft nicht zu
-1
No Answer
keine Angabe
1
Very good
Sehr gut
...
```
The data.csv file now includes just four columns:
```csv
"bap87","bap9201","bap9001","bap9002"
4,-2,1,-1
3,5,-2,1
,-1,-1,2
1,9,-2,2
-1,4,2,3
3,4,-1,4
1,9,2,-1
...
```
If you wish to export only the metadata, for example for documentation or archiving purposes, set the argument `export_data = FALSE`. The resulting ZIP file then contains only the metadata XML file and no data CSV file.
```{r write_odf metadata, comment = "", eval = FALSE}
write_odf(
x = df,
file = "../df_metadata.zip",
export_data = FALSE
)
```
If you wish to export the dataset with metadata in only one or a few languages, set the `languages` argument, for example `languages = "en"`. The default is `languages = "all"`.
```{r write_odf english, comment = "", eval = FALSE}
write_odf(
x = df,
file = "../df_en.zip",
languages = "en"
)
```
You can also pass a vector of languages to be exported:
```{r write_odf english german, comment = "", eval = FALSE}
write_odf(
x = df,
file = "../df_en_de.zip",
languages = c("en", "de")
)
```
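If you just want to try the export without creating files next to your project, you can write to a temporary file and read the result back. This is a sketch under the assumption that the package is installed; it only combines the calls shown above with `tempfile()` from base R.

```r
library(opendataformat)
path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)

# Write an English-only copy to a temporary file instead of "../".
tmp_file <- tempfile(fileext = ".zip")
write_odf(x = df, file = tmp_file, languages = "en")

# Read the copy back and inspect which metadata languages it contains.
df_en <- read_odf(file = tmp_file)
attr(df_en, "languages")
```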
# Let's Analyse and Make Use of the Metadata
Now let's see how we can use the metadata to better understand the data and make more informative plots.
### Frequency Table
```{r table, comment = ""}
table(df$bap87, useNA = "ifany")
```
As expected, the frequency table shows how often each value of the variable occurs. Now, let's make the table easier to read by using the value labels of the variable. As explained in the preceding section, you can access the value labels with the base R function `attributes()`. Let's examine them:
```{r table attributes labels, comment = ""}
attributes(df$bap87)$labels_en
```
```{r table attributes labels_de, comment = ""}
attributes(df$bap87)$labels_de
```
```{r table factor, comment = ""}
table(factor(df$bap87, labels = names(attributes(df$bap87)$labels_en)))
```
Alternatively, you can use the `getmetadata_odf()` function to get the value labels:
```{r table factor getmetadata_odf, eval = FALSE, comment = ""}
table(factor(df$bap87,
labels = names(getmetadata_odf(df$bap87, type = "valuelabels"))))
```
To display the table in a language other than English, append the respective language code to the attribute name. For example, use `$labels_de` to access the German value labels:
```{r table factor german, comment = ""}
table(
factor(
df$bap87,
labels = names(attributes(df$bap87)$labels_de)
)
)
```
Or, using the `getmetadata_odf()` function:
```{r table factor german getmetadata_odf, comment = ""}
table(
factor(
df$bap87,
labels = names(getmetadata_odf(df$bap87, type = "valuelabels",
language = "de"))
)
)
```
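The frequency tables above pass only `labels` to `factor()`, which assigns the labels to the sorted unique values found in the data. That works when every labelled code actually occurs in the variable. A more defensive sketch, shown here on made-up toy data with base R only, also passes the codes as `levels`, so that labels and codes stay aligned even if a code is absent. The labels "Good" and "Satisfactory" are illustrative assumptions.

```r
# Toy data: code 3 is labelled but never occurs (values are made up).
codes <- c(1, 2, 2, -1, 1, 2)
value_labels <- c(
  "No Answer" = -1, "Very good" = 1, "Good" = 2, "Satisfactory" = 3
)

# Map codes to labels explicitly: levels are the codes, labels their names.
health_factor <- factor(
  codes,
  levels = unname(value_labels),
  labels = names(value_labels)
)
table(health_factor)
```

With this form, categories that do not occur in the data still appear in the table with a count of zero.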
## Merge ODF-Datasets
To merge ODF datasets, use `left_join()`, `right_join()`, `full_join()`,
and `inner_join()` from the dplyr package instead of the base R `merge()` function, so that
the metadata attributes of the merged datasets are kept.
```{r merge with join, eval = FALSE}
library(dplyr)
# similar to
# merge(df[, c(1:3, 6)], df[, c(4:6)], by = "name", all.x = TRUE, all.y = FALSE)
merged_df <- left_join(df[, c(1:3, 6)], df[, c(4:6)], by = "name")
# or, letting dplyr pick the common columns:
merged_df <- left_join(df[, c(1:3, 6)], df[, c(4:6)])
```
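To see why `merge()` is problematic, consider this base R sketch on made-up toy data. It attaches an ODF-style attribute to a column, merges, and checks whether the attribute survived. The repair step at the end is only one possible workaround and assumes that the merge changed neither the number nor the meaning of the rows.

```r
# Toy data: one column carries an ODF-style label attribute (made up).
left <- data.frame(id = 1:3, x = c(10, 20, 30))
attr(left$x, "label_en") <- "Toy label"
right <- data.frame(id = 1:3, y = c("a", "b", "c"))

merged <- merge(left, right, by = "id")

# merge() rebuilds the columns by subsetting, which drops the attribute.
label_lost <- is.null(attr(merged$x, "label_en"))
label_lost

# One possible repair: copy the attributes back from the source column.
attributes(merged$x) <- attributes(left$x)
attr(merged$x, "label_en")
```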
### Recoding a Variable
We want to display the table with only valid answers. Therefore, we set the values `-2` and `-1` to `NA`. Because we do not want to overwrite the original variable, we generate a new one:
```{r copy data, comment = ""}
bap87_rec <- df$bap87
```
We check the attributes and notice that the metadata was copied from the original variable to the new one:
```{r check metadata, comment = ""}
attributes(bap87_rec)
```
Now we can set the negative values to NA:
```{r NA, comment = ""}
for (row in seq_along(bap87_rec)) {
if (!is.na(bap87_rec[row]) && bap87_rec[row] <= -1) {
bap87_rec[row] <- NA
}
}
table(bap87_rec, useNA = "ifany")
```
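The loop above can also be written as a single vectorised assignment. The sketch below demonstrates this on a toy numeric vector (values made up) and shows that a plain vector keeps its attributes under this kind of assignment; for vectors with a special class, the behaviour depends on that class.

```r
# Toy variable with an ODF-style attribute (values are made up).
health <- c(4, 3, NA, 1, -1, -2)
attr(health, "label_en") <- "Current Health"

health_rec <- health
# Set all negative (missing) codes to NA in one step.
health_rec[!is.na(health_rec) & health_rec <= -1] <- NA

table(health_rec, useNA = "ifany")
attr(health_rec, "label_en")
```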
We notice that the copied values and value labels do not fit anymore:
```{r check metadata again, comment = ""}
attributes(bap87_rec)$labels_en
```
To change that, we'll copy positions `3` to `7`, retaining the desired range of values and their respective value labels.
```{r copy metadata to recoded var, comment = ""}
attributes(bap87_rec)$labels_en <-
unname(attributes(df$bap87)$labels_en)[3:7] # values
names(attributes(bap87_rec)$labels_en) <-
names(attributes(df$bap87)$labels_en)[3:7] # labels
attributes(bap87_rec)$labels_en
```
Do the same for the other language versions of the new recoded variable:
```{r remove language versions of labels, comment = ""}
attributes(bap87_rec)$labels_de <-
unname(attributes(df$bap87)$labels_de)[3:7] # values
names(attributes(bap87_rec)$labels_de) <-
names(attributes(df$bap87)$labels_de)[3:7] # labels
```
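Selecting positions `3` to `7` works for this variable, but it depends on where the missing codes sit in the label vector. A sketch of an alternative, using made-up toy labels and base R only, filters the labels by their value instead. The label "Good" is an illustrative assumption.

```r
# Toy value labels: names are the labels, values are the codes.
labels_en <- c(
  "Does not apply" = -2, "No Answer" = -1, "Very good" = 1, "Good" = 2
)

# Keep only the labels of valid (non-negative) codes; names are preserved.
valid_labels <- labels_en[labels_en >= 0]
valid_labels
```

The same expression can then be reused for every language version of the value labels.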
We also notice that the variable name is no longer adequate. We replace the name copied from the original variable with the new name `bap87_rec`.
```{r change variable name of recoded var, comment = ""}
attributes(bap87_rec)$name <- "bap87_rec"
attributes(bap87_rec)$name
```
Now we generate the frequency table by using the variable as a factor variable.
```{r frequency table of recoded var, comment = ""}
table(
factor(
bap87_rec,
labels = names(attributes(bap87_rec)$labels_en)
)
)
```
### Barplot
To create a barplot, we use the recoded variable from the previous section. This example demonstrates how the metadata can make a graph more informative: the value labels name the bars, and the description, variable label, and URL become the title, axis label, and subtitle.
```{r barplot, comment = "", fig.height = 3, fig.width = 5, fig.align = "center"}
barplot(
table(
factor(
bap87_rec,
labels = names(attributes(bap87_rec)$labels_en)
)
),
main = attributes(bap87_rec)$description_en, # title
xlab = paste0(
attributes(bap87_rec)$name, ": ", attributes(bap87_rec)$label), # label
sub = attributes(bap87_rec)$url, # subtitle
cex.main = 0.9, cex.names = 0.7, cex.sub = 0.8, cex.axis = 0.6,
cex.lab = 0.7 # font sizes
)
```
Drawing a barplot with the German description is just as easy when the data has multiple language versions of labels and descriptions. Simply append the language code to the attribute names to generate the barplot with German labels and description:
```{r barplot german, comment = "", fig.height = 3, fig.width = 5, fig.align = "center"}
barplot(
table(
factor(
bap87_rec,
labels = names(attributes(bap87_rec)$labels_de)
)
),
main = attributes(bap87_rec)$description_de, # title
xlab = paste0(
attributes(bap87_rec)$name, ": ", attributes(bap87_rec)$label_de), # label
sub = attributes(bap87_rec)$url, # subtitle
cex.main = 0.7, cex.names = 0.5, cex.sub = 0.8, cex.axis = 0.7,
cex.lab = 0.7 # font sizes
)
```