Schema for Tabular Dataset Specification

Stadas (Schema for Tabular Dataset Specification) is a schema for specifying a tabular dataset in a format that is easy for humans to read and write. The specification is typically composed of tables, such as CSV or Excel files, but may also be represented as, for example, a JSON document. Stadas works as a documentation of a dataset but is also used for validating data for the dataset. Stadas helps in reducing the risk that data does not match documentation.

This document includes the following sections:

Tabular data: A brief specification of tabular data
Schema: The schema for tabular dataset specification (Stadas)
Simple example: A simple Stadas example as HTML tables
Full example: A full Stadas example in various formats
Process example: Surveys: An example of a survey process using Stadas
Tools: Programming tools in different languages for using Stadas in applications

Tabular data

Tabular data is composed of values organized in rows and columns. Specifically, you should always be able to revert tabular data into a "tidy" format (emphasis added):

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.

Each observation must have its own row.

Each value must have its own cell.

A variable has three components: A name (unique within a dataset), a list of values, and a mapping that ties each value to a certain observation in a certain dataset. An observation is, for example, a state of a system or a set of responses from a survey participant at a certain time point. The responses or entries are mapped to the variables as values in a consistent manner. In a "tidy" dataset, each row represents an observation. Each value in a row belongs to a different variable (i.e. a column in a table). The values are typically of the same type (for example, integers or decimal numbers) or belong to a set of predefined categories, but they vary within a dataset (hence, "variable").

Schema

The specification is typically defined with two or more tables:

setup: exactly one
preprocess: zero or one
variables: one or more
categories: zero or more
colors: zero or more
rules: zero or more

`setup`

The specification consists exactly one setup table
Description:
- The first table to be read
- Sets up the specification
Columns:
- tabletype: the type of specification table
  - Constraints:
    - Only rows with a value are considered
    - One of:
      - "variable"
      - "rule"
- tablename: the name of a specification table
  - Constraints:
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)

`preprocess`

The specification consists zero or one preprocess table
Description:
- The table containing code for creating preprocess variables
Columns:
- mappingkey: a key for selecting the right data
  - Constraints:
    - Must be unique within the dataset
- newvarname: the name of the new variable to be created
  - Constraints:
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- operation: the operation to perform to create the new variable
  - Constraints:
    - Must be a valid epxression
- description: a human-readable description for the operation

`variables`

The specification consists one or more variables tables
Description:
- Defines the variables of a dataset
Columns:
- varname: the name of the variable
  - Constraints:
    - Only rows with a value are considered
    - Must be unique within the dataset
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- vargroup: the name of the group of the variable
  - Optional
  - Constraints:
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- datatype: the data type of the variable
  - Constraints:
    - Must have a value if varname has a value
    - One of:
      - "integer"
      - "decimal"
      - "categorical"
      - "text"
- unique: whether the values of the variable should be unique within the dataset
  - Constraints:
    - One of:
      - NA
      - "unique"
- nona: whether the variable must have a value in every observation within the dataset
  - Constraints:
    - One of:
      - NA
      - "nona"
- categorytable: the name of a categories table
  - Constraints:
    - If has a value, it must refer to a categories table in the specification
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- categoryset: the name of the set of categories of the variable
  - Constraints:
    - Must have a value if categorytable has a value
    - Must be found in the referred categories table
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- label{_LANG}: the label of the variable in the language LANG; if no LANG key defined, the given label will be used in all situations; if no label column defined, varname used instead
  - Optional
- mapping{_MAPPING_KEY}: a mapping between the variable and a column in the data MAPPING_KEY; if no MAPPING_KEY defined, the given mapping will be used for all sub datasets; if no mapping column defined, varname used instead
  - Optional

`categories`

The specification consists zero or more categories tables
Description:
- Defines the categories of the categorical variables
Columns:
- categoryset: the name of the set of the category
  - Constraints:
    - Only rows with a value are considered
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- name: the name of the category
  - Constraints:
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- label{_LANG}: the label of the category in the language LANG; if no LANG key defined, the given label will be used is all situations; if no label column defined, varname used instead
  - Optional
- mapping{_MAPPING_KEY}: a mapping between the category and a value in data MAPPING_KEY; if no MAPPING_KEY defined, the given mapping will be used for all sub datasets; if no mapping column defined, name used instead
  - Optional
- colortable: the name of a colors table
  - Optional
  - Constraints:
    - Must refer to a colors table in the specification
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- colorname: the name of the color for the category to be used, for example, in plots
  - Optional (must exist if colortable exists)
  - Constraints:
    - Must have a value if colortable is not empty
    - Must be found in the referred color table
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)

`colors`

The specification consists zero or more colors tables
Description:
- Defines colours for the categories to be used, for example, in plots
Columns:
- colorname: the name of the colour
  - Constraints:
    - Must start with a lower case ASCII letter (a...z)
    - Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- colorhex: the hexadecimal code of the colour
  - constraints:
    - must be a valid hexadecimal code

`rules`

The specification consists zero or more rules tables
Description:
- Defines additional constraints and rules for the variables and the dataset
Columns:
- ruleset: the name of the set of the rule
- rule: the rule
  - Constraints:
    - must be a valid rule
- label: a label for the rule
- description: a description for the rule

Simple example

Here is a simple example of a Stadas specification as HTML tables. The setup table specifies two tables: vars (type: "variable") and rules (type: "rule"). The vars table defines the categories table, which in turn defines the sequential table.

Specification

`setup`

tabletype	tablename
variable	vars
rule	rules

`vars`

varname	datatype	unique	nona	categorytable	categoryset	label_en
id	integer	unique	nona			ID
age	integer					Age in years
height	integer					Height in centimeters
weight	decimal					Weight in kilograms
health	categorical			categories	health	Self rated health

`categories`

categoryset	name	label_en	mapping	colortable	colorname
health	excellent	Excellent	5	sequential	seq5
health	verygood	Very good	4	sequential	seq4
health	good	Good	3	sequential	seq3
health	fair	Fair	2	sequential	seq2
health	poor	Poor	1	sequential	seq1

`sequential`

colorname	colorhex
seq1	#f1eef6
seq2	#bdc9e1
seq3	#74a9cf
seq4	#2b8cbe
seq5	#045a8d

`rules`

rulename	rule	label	description
age_valid	age => 18 & age < 100	Age is valid	Age must be at least 18 and less than 100 year
height_valid	height => 140 & height < 210	Height is valid	Height must be at least 140 and less than 210 centimeters
weight_valid	weight => 40 & weight < 150	Weight is valid	Weight must be at least 40 and less than 150 kilograms

Example data

id	age	height	weight	health
1	32	164	65	4
2	23	183	85	5
3	63	171	69	3
4	33	174	71	3
5	19	191	88	4

Full example

A full example is created with the National Health Interview Survey (NHIS). Here are specifications for a subset of the 2021 NHIS dataset in different formats:

The example has seven tables. The setup table names the variables and rules tables. The preprocess table defines operations for calculating some preprocessing variables. Two variables tables define the variables and two categories tables the categories. Finally, a rules table defines rules for integer and decimal variables.

Process example: Surveys

Here is an example of a survey process using Stadas.

Define a dataset with Stadas specification
Build surveys with items corresponding to the variables defined in the dataset (different surveys typically contain also other items)
Gather data with the different surveys in different contexts
Validate data against the Stadas specification
- Ensure that the columns are labelled with valid variable names
- Ensure that the data contains only valid values
- Ensure that the values of categorical variables are labelled with valid category names (instead of, for example, integer codes)
- Drop unspecified variables if needed
Start doing analyses and plots with high confidence that the data from all the different surveys match the specification

Tools

R: rstadas