Version 0.2.0 (GitHub)

Schema for Tabular Dataset Specification

Stadas (Schema for Tabular Dataset Specification) is a schema for specifying a tabular dataset in a format that is easy for humans to read and write. The specification is typically composed of tables, such as CSV or Excel files, but may also be represented as, for example, a JSON document. Stadas works as a documentation of a dataset but is also used for validating data for the dataset. Stadas helps in reducing the risk that data does not match documentation.

This document includes the following sections:

  1. Tabular data: A brief specification of tabular data
  2. Schema: The schema for tabular dataset specification (Stadas)
  3. Simple example: A simple Stadas example as HTML tables
  4. Full example: A full Stadas example in various formats
  5. Process example: Surveys: An example of a survey process using Stadas
  6. Tools: Programming tools in different languages for using Stadas in applications

Tabular data

Tabular data is composed of values organized in rows and columns. Specifically, you should always be able to revert tabular data into a "tidy" format (emphasis added):

There are three interrelated rules which make a dataset tidy:
  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

A variable has three components: A name (unique within a dataset), a list of values, and a mapping that ties each value to a certain observation in a certain dataset. An observation is, for example, a state of a system or a set of responses from a survey participant at a certain time point. The responses or entries are mapped to the variables as values in a consistent manner. In a "tidy" dataset, each row represents an observation. Each value in a row belongs to a different variable (i.e. a column in a table). The values are typically of the same type (for example, integers or decimal numbers) or belong to a set of predefined categories, but they vary within a dataset (hence, "variable").

Schema

The specification is typically defined with two or more tables:

setup

preprocess

variables

categories

colors

rules

Simple example

Here is a simple example of a Stadas specification as HTML tables. The setup table specifies two tables: vars (type: "variable") and rules (type: "rule"). The vars table defines the categories table, which in turn defines the sequential table.

Specification

setup

tabletype tablename
variable vars
rule rules

vars

varname datatype unique nona categorytable categoryset label_en
id integer unique nona ID
age integer Age in years
height integer Height in centimeters
weight decimal Weight in kilograms
health categorical categories health Self rated health

categories

categoryset name label_en mapping colortable colorname
health excellent Excellent 5 sequential seq5
health verygood Very good 4 sequential seq4
health good Good 3 sequential seq3
health fair Fair 2 sequential seq2
health poor Poor 1 sequential seq1

sequential

colorname colorhex
seq1 #f1eef6
seq2 #bdc9e1
seq3 #74a9cf
seq4 #2b8cbe
seq5 #045a8d

rules

rulename rule label description
age_valid age => 18 & age < 100 Age is valid Age must be at least 18 and less than 100 year
height_valid height => 140 & height < 210 Height is valid Height must be at least 140 and less than 210 centimeters
weight_valid weight => 40 & weight < 150 Weight is valid Weight must be at least 40 and less than 150 kilograms

Example data

id age height weight health
1 32 164 65 4
2 23 183 85 5
3 63 171 69 3
4 33 174 71 3
5 19 191 88 4

Full example

A full example is created with the National Health Interview Survey (NHIS). Here are specifications for a subset of the 2021 NHIS dataset in different formats:

The example has seven tables. The setup table names the variables and rules tables. The preprocess table defines operations for calculating some preprocessing variables. Two variables tables define the variables and two categories tables the categories. Finally, a rules table defines rules for integer and decimal variables.

Process example: Surveys

Here is an example of a survey process using Stadas.

  1. Define a dataset with Stadas specification
  2. Build surveys with items corresponding to the variables defined in the dataset (different surveys typically contain also other items)
  3. Gather data with the different surveys in different contexts
  4. Validate data against the Stadas specification
    • Ensure that the columns are labelled with valid variable names
    • Ensure that the data contains only valid values
    • Ensure that the values of categorical variables are labelled with valid category names (instead of, for example, integer codes)
    • Drop unspecified variables if needed
  5. Start doing analyses and plots with high confidence that the data from all the different surveys match the specification

Tools