data.checker is a package for helping with boilerplate data checks. It enables you to automate fundamental data checks which, while simple, can be time-consuming to implement.

data.checker

Checks data against a user-supplied schema that defines the expected columns and data types
Enables users to add custom data checks based on multiple columns
Creates exports of the results for QA

Getting Started

Installation

Software requirements

To use this package, you’ll need the following software on your computer:

RStudio 2024.04.2 or later and R 4.5.0 or later
Git 2.35.3 or later

To install this R package, first clone the repository to your local machine by running:

git clone https://github.com/ONSdigital/data.checker.git

Open the project in RStudio and in the console run:

devtools::install()

The package will be installed in your R library.

Setup and Usage

data.checker requires an input dataframe and a data schema to validate against. A full list of the checks performed by the data checker, along with instructions for including custom checks, can be found in the Pre-Defined and Adding Custom Checks section below.

The schema can either be defined within the R script itself or saved to a JSON or YAML file to be loaded by the data checker. We recommend saving schemas as JSON or YAML, because this simplifies adding further checks and column information. Once the schema is defined, pass the dataframe and schema to the check_and_export function, together with an output file path and format for the report and the hard_check option.

library(data.checker)

df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)

my_schema <- list(
  check_duplicates = TRUE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

check_and_export(
  data = df,
  schema = my_schema,
  file = "report.csv",
  format = "csv",
  hard_check = TRUE
)

This produces a report.csv file containing the status of each validation check. With hard_check set to TRUE, the code stops running if any validation check fails. The report is still produced before execution stops, so you can view it and investigate the cause of the failure.
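
The recommendation to keep the schema in a JSON file can be sketched as follows. This is a minimal illustration that only writes the example schema to disk and reads it back. It assumes the jsonlite package (not named in this README) and a hypothetical file location. How data.checker itself expects to receive the file is not shown here, so check the package documentation for that step.

```r
# Sketch only: jsonlite is an assumed choice of JSON library, not something
# this README prescribes.
library(jsonlite)

my_schema <- list(
  check_duplicates = TRUE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

# Hypothetical file location; a temporary directory keeps the example tidy.
schema_path <- file.path(tempdir(), "my_schema.json")

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays.
write_json(my_schema, schema_path, auto_unbox = TRUE, pretty = TRUE)

# Read the file back as a nested list to confirm the round trip.
loaded_schema <- read_json(schema_path)
str(loaded_schema)
```

Keeping the schema in its own file means new columns or checks can be added without editing the analysis script.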

Pre-Defined and Adding Custom Checks

Pre-Defined Checks

These checks can be included in the lists for individual columns in your schema, depending on the data type.

Data Type	Check Name	Parameter	Check Definition
integer / double	Minimum value	min_val	Checks that all values are above or equal to the minimum value
integer / double	Maximum value	max_val	Checks that all values are below or equal to the maximum value
integer / double	Interquartile range (IQR) outlier check	iqr_check	Checks that all values fall between $Q1 - (IQR \times multiplier)$ and $Q3 + (IQR \times multiplier)$, where the $multiplier$ is given by `iqr_check`
integer / double	Maximum absolute z score	max_z_score	Checks that the absolute values of all z scores are below or equal to the maximum z score
double	Minimum decimal places	min_decimal	Checks that all values have at least the minimum number of decimal places
double	Maximum decimal places	max_decimal	Checks that all values have at most the maximum number of decimal places
character	Minimum length	min_length	Checks that all strings have a length above or equal to the minimum length
character	Maximum length	max_length	Checks that all strings have a length below or equal to the maximum length
character	Allowed strings	allowed_strings	Validates that entries match a set of permitted values, given as a list or a regex. Optional; forbidden_strings can be used instead
character	Forbidden strings	forbidden_strings	Validates that entries do not contain any of a set of forbidden values, given as a list or a regex. Optional; allowed_strings can be used instead
date / datetime	Minimum Date	min_date	Checks that all dates are after the minimum date using the format “YYYY-MM-DD”
date / datetime	Maximum Date	max_date	Checks that all dates are before the maximum date using the format “YYYY-MM-DD”
date / datetime	Minimum Datetime	min_datetime	Checks that all dates are after the minimum datetime. Accepted formats: Y, YM, YMD, YMDH, YMDHM and YMDHMS
date / datetime	Maximum Datetime	max_datetime	Checks that all dates are before the maximum datetime. Accepted formats: Y, YM, YMD, YMDH, YMDHM and YMDHMS
any	Missing values check	allow_na	Checks for missing or NA values in the column.
any	Class	class	Checks that the class of the column data matches the specified class
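
To make the IQR and z score definitions in the table concrete, the sketch below reproduces the arithmetic in base R on the age values from the earlier example. It is an illustration of the formulas only, not the package's implementation. The multiplier of 1.5, the limit of 3, the inclusive bounds and R's default quantile method are all assumptions.

```r
# Age values taken from the Setup and Usage example.
age <- c(10, 11, 13, 15, 22, 34, 80)

# IQR outlier check with a multiplier of 1.5 (hypothetical iqr_check value).
multiplier <- 1.5
q <- quantile(age, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower <- q[1] - iqr * multiplier
upper <- q[2] + iqr * multiplier
iqr_pass <- all(age >= lower & age <= upper)

# Maximum absolute z score check with a hypothetical limit of 3.
max_z_score <- 3
z <- (age - mean(age)) / sd(age)
z_pass <- all(abs(z) <= max_z_score)

print(c(lower = lower, upper = upper))  # -12 and 52 with R's default quantiles
print(iqr_pass)                         # FALSE: 80 lies above the upper bound
print(z_pass)                           # TRUE: the largest |z| is below 3
```

The same value (80) fails the IQR check but passes the z score check here, which shows why the two parameters are worth considering separately for skewed data.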

Adding Custom Checks

Additionally, you can write your own checks and add them to the validator object using the add_check function. This is particularly useful for checks involving more than one column, which cannot be configured using the standard template. Each check is evaluated in the context of the original data, so you can reference columns as if they were variables in the environment (similar to tidy evaluation). This approach is recommended because it guarantees that the check runs on the correct data. Alternatively, you can use standard evaluation (see example below).

The example below demonstrates how to incorporate both pre-defined and custom checks into your validation.

df <- data.frame(
  id = 1:10,
  age = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
)

schema <- list(
  check_duplicates = TRUE,
  check_completeness = FALSE,
  columns = list(
    id = list(type = "double", optional = FALSE),
    age = list(type = "double", optional = FALSE, min_val = 0),
    sex = list(type = "character", optional = FALSE, allowed_strings = c("M", "F"))
  )
)

data_check_results <- data.checker::new_validator(df, schema) |>
  data.checker::check() |>
  data.checker::add_check(
    description = "There are no males over 90 (tidy evaluation)",
    condition = !(sex == "M" & age > 90)
  ) |>
  data.checker::add_check(
    description = "There are no males over 90 (standard evaluation)",
    condition = !(df$sex == "M" & df$age > 90)
  )

print(data_check_results)
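
To see what the two condition styles evaluate to, the base R sketch below computes the same logical vector both ways, outside the package. Here with() is only a stand-in for evaluating an expression in the context of the data. It is an assumption for illustration and not a description of how data.checker evaluates conditions internally.

```r
df <- data.frame(
  id = 1:10,
  age = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
)

# Columns referenced as bare names, evaluated inside the data frame.
in_data_context <- with(df, !(sex == "M" & age > 90))

# Columns referenced explicitly through the data frame (standard evaluation).
standard_eval <- !(df$sex == "M" & df$age > 90)

# Both styles produce one TRUE/FALSE value per row.
print(identical(in_data_context, standard_eval))  # TRUE
print(all(in_data_context))                       # TRUE: no male is older than 90
```

The standard evaluation form depends on df still referring to the same data that was passed to the validator, which is why referencing columns by name is the safer option.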

Contributing

We always welcome contributions and suggestions to improve the functionality of our products. Feel free to open an issue using the issue tab. If you wish to make a direct contribution, please fork the repository, make your changes and raise a pull request. We will then review and merge your changes.