Schema for Tabular Dataset Specification
Stadas (Schema for Tabular Dataset Specification) is a schema for specifying a tabular dataset in a format that is easy for humans to read and write. The specification is typically composed of tables, such as CSV or Excel files, but may also be represented as, for example, a JSON document. Stadas works as a documentation of a dataset but is also used for validating data for the dataset. Stadas helps in reducing the risk that data does not match documentation.
This document includes the following sections:
- Tabular data: A brief specification of tabular data
- Schema: The schema for tabular dataset specification (Stadas)
- Simple example: A simple Stadas example as HTML tables
- Full example: A full Stadas example in various formats
- Process example: Surveys: An example of a survey process using Stadas
- Tools: Programming tools in different languages for using Stadas in applications
Tabular data
Tabular data is composed of values organized in rows and columns. Specifically, you should always be able to revert tabular data into a "tidy" format (emphasis added):
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
A variable has three components: A name (unique within a dataset), a list of values, and a mapping that ties each value to a certain observation in a certain dataset. An observation is, for example, a state of a system or a set of responses from a survey participant at a certain time point. The responses or entries are mapped to the variables as values in a consistent manner. In a "tidy" dataset, each row represents an observation. Each value in a row belongs to a different variable (i.e. a column in a table). The values are typically of the same type (for example, integers or decimal numbers) or belong to a set of predefined categories, but they vary within a dataset (hence, "variable").
Schema
The specification is typically defined with two or more tables:setup: exactly onepreprocess: zero or onevariables: one or morecategories: zero or morecolors: zero or morerules: zero or more
setup
- The specification consists exactly one
setuptable - Description:
- The first table to be read
- Sets up the specification
- Columns:
tabletype: the type of specification table- Constraints:
- Only rows with a value are considered
- One of:
"variable""rule"
- Constraints:
tablename: the name of a specification table- Constraints:
- Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Constraints:
preprocess
- The specification consists zero or one
preprocesstable - Description:
- The table containing code for creating preprocess variables
- Columns:
mappingkey: a key for selecting the right data- Constraints:
- Must be unique within the dataset
- Constraints:
newvarname: the name of the new variable to be created- Constraints:
- Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Constraints:
operation: the operation to perform to create the new variable- Constraints:
- Must be a valid epxression
- Constraints:
description: a human-readable description for the operation
variables
- The specification consists one or more
variablestables - Description:
- Defines the variables of a dataset
- Columns:
varname: the name of the variable- Constraints:
- Only rows with a value are considered
- Must be unique within the dataset
- Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Constraints:
vargroup: the name of the group of the variable- Optional
- Constraints:
- Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
datatype: the data type of the variable- Constraints:
- Must have a value if
varnamehas a value - One of:
"integer""decimal""categorical""text"
- Must have a value if
- Constraints:
unique: whether the values of the variable should be unique within the dataset- Constraints:
- One of:
NA"unique"
- One of:
- Constraints:
nona: whether the variable must have a value in every observation within the dataset- Constraints:
- One of:
NA"nona"
- One of:
- Constraints:
categorytable: the name of acategoriestable- Constraints:
- If has a value, it must refer to a
categoriestable in the specification - Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- If has a value, it must refer to a
- Constraints:
categoryset: the name of the set of categories of the variable- Constraints:
- Must have a value if
categorytablehas a value - Must be found in the referred
categoriestable - Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Must have a value if
- Constraints:
label{_LANG}: the label of the variable in the languageLANG; if noLANGkey defined, the given label will be used in all situations; if nolabelcolumn defined,varnameused instead- Optional
mapping{_MAPPING_KEY}: a mapping between the variable and a column in the dataMAPPING_KEY; if noMAPPING_KEYdefined, the given mapping will be used for all sub datasets; if nomappingcolumn defined,varnameused instead- Optional
categories
- The specification consists zero or more
categoriestables - Description:
- Defines the categories of the categorical variables
- Columns:
categoryset: the name of the set of the category- Constraints:
- Only rows with a value are considered
- Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Constraints:
name: the name of the category- Constraints:
- Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Constraints:
label{_LANG}: the label of the category in the languageLANG; if noLANGkey defined, the given label will be used is all situations; if nolabelcolumn defined,varnameused instead- Optional
mapping{_MAPPING_KEY}: a mapping between the category and a value in dataMAPPING_KEY; if noMAPPING_KEYdefined, the given mapping will be used for all sub datasets; if nomappingcolumn defined,nameused instead- Optional
colortable: the name of acolorstable- Optional
- Constraints:
- Must refer to a
colorstable in the specification - Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Must refer to a
colorname: the name of the color for the category to be used, for example, in plots- Optional (must exist if
colortableexists) - Constraints:
- Must have a value if
colortableis not empty - Must be found in the referred
colortable - Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Must have a value if
- Optional (must exist if
colors
- The specification consists zero or more
colorstables - Description:
- Defines colours for the categories to be used, for example, in plots
- Columns:
colorname: the name of the colour- Constraints:
- Must start with a lower case ASCII letter (a...z)
- Must contain only lower case alpha-numeric ASCII characters and underscores (a...z, 0...9 and _)
- Constraints:
colorhex: the hexadecimal code of the colour- constraints:
- must be a valid hexadecimal code
- constraints:
rules
- The specification consists zero or more
rulestables - Description:
- Defines additional constraints and rules for the variables and the dataset
- Columns:
ruleset: the name of the set of the rulerule: the rule- Constraints:
- must be a valid rule
- Constraints:
label: a label for the ruledescription: a description for the rule
Simple example
Here is a simple example of a Stadas specification as HTML tables.
The setup table specifies two tables: vars
(type: "variable") and rules (type: "rule").
The vars table defines the categories table, which in turn
defines the sequential table.
Specification
setup
| tabletype | tablename |
|---|---|
| variable | vars |
| rule | rules |
vars
| varname | datatype | unique | nona | categorytable | categoryset | label_en |
|---|---|---|---|---|---|---|
| id | integer | unique | nona | ID | ||
| age | integer | Age in years | ||||
| height | integer | Height in centimeters | ||||
| weight | decimal | Weight in kilograms | ||||
| health | categorical | categories | health | Self rated health |
categories
| categoryset | name | label_en | mapping | colortable | colorname |
|---|---|---|---|---|---|
| health | excellent | Excellent | 5 | sequential | seq5 |
| health | verygood | Very good | 4 | sequential | seq4 |
| health | good | Good | 3 | sequential | seq3 |
| health | fair | Fair | 2 | sequential | seq2 |
| health | poor | Poor | 1 | sequential | seq1 |
sequential
| colorname | colorhex |
|---|---|
| seq1 | #f1eef6 |
| seq2 | #bdc9e1 |
| seq3 | #74a9cf |
| seq4 | #2b8cbe |
| seq5 | #045a8d |
rules
| rulename | rule | label | description |
|---|---|---|---|
| age_valid | age => 18 & age < 100 | Age is valid | Age must be at least 18 and less than 100 year |
| height_valid | height => 140 & height < 210 | Height is valid | Height must be at least 140 and less than 210 centimeters |
| weight_valid | weight => 40 & weight < 150 | Weight is valid | Weight must be at least 40 and less than 150 kilograms |
Example data
| id | age | height | weight | health |
|---|---|---|---|---|
| 1 | 32 | 164 | 65 | 4 |
| 2 | 23 | 183 | 85 | 5 |
| 3 | 63 | 171 | 69 | 3 |
| 4 | 33 | 174 | 71 | 3 |
| 5 | 19 | 191 | 88 | 4 |
Full example
A full example is created with the National Health Interview Survey (NHIS). Here are specifications for a subset of the 2021 NHIS dataset in different formats:
The example has seven tables. The setup table names the variables and
rules tables. The preprocess table defines operations for calculating
some preprocessing variables. Two variables tables define the variables and
two categories tables the categories. Finally, a rules table defines
rules for integer and decimal variables.
Process example: Surveys
Here is an example of a survey process using Stadas.
- Define a dataset with Stadas specification
- Build surveys with items corresponding to the variables defined in the dataset (different surveys typically contain also other items)
- Gather data with the different surveys in different contexts
- Validate data against the Stadas specification
- Ensure that the columns are labelled with valid variable names
- Ensure that the data contains only valid values
- Ensure that the values of categorical variables are labelled with valid category names (instead of, for example, integer codes)
- Drop unspecified variables if needed
- Start doing analyses and plots with high confidence that the data from all the different surveys match the specification
Tools
- R: rstadas