Skip to content

Configuration

Koality uses YAML configuration files to define data quality checks. This page describes the configuration schema and available options.

Configuration Schema

name: string                    # Name of the configuration
database_setup: string          # SQL to set up database connections
database_accessor: string       # Database accessor prefix for tables

defaults:                # Global default settings
  monitor_only: bool            # If true, checks don't fail the run
  result_table: string          # Table to store results (optional)
  log_path: string              # Path to write failed checks log (optional)
  date_filter_column: string    # Default date filter column
  date_filter_value: string     # Default date filter value
  filter_column: string         # Default filter column
  filter_value: string          # Default filter value

check_bundles:                  # List of check bundles
  - name: string                # Bundle name
    defaults:               # Default arguments for checks in this bundle
      table: string
      check_column: string
      # ... other defaults
    checks:                     # List of checks
      - check_type: string      # Type of check (required)
        # ... check-specific arguments

Global Defaults

Global defaults are applied to all checks unless overridden at the bundle or check level.

Field Type Description
monitor_only bool If true, check failures don't fail the overall run
result_table string Table name for persisting check results
log_path string File path to write failed checks log
date_filter_column string Column name for date filtering
date_filter_value string Value for date filtering
filter_column string Column name for generic filtering
filter_value string Value for generic filtering

Check Bundles

Check bundles group related checks together and can define shared default arguments.

check_bundles:
  - name: orders_quality
    defaults:
      table: orders
      date_filter_column: order_date
    checks:
      - check_type: NullRatioCheck
        check_column: customer_id
        upper_threshold: 0.01

Check Types

NullRatioCheck

Validates the ratio of null values in a column.

- check_type: NullRatioCheck
  table: my_table
  check_column: my_column
  lower_threshold: 0.0    # Minimum allowed null ratio
  upper_threshold: 0.05   # Maximum allowed null ratio

RegexMatchCheck

Checks if column values match a regex pattern.

- check_type: RegexMatchCheck
  table: my_table
  check_column: email
  regex_to_match: "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"
  lower_threshold: 0.95
  upper_threshold: 1.0

ValuesInSetCheck

Validates that column values are within an allowed set.

- check_type: ValuesInSetCheck
  table: my_table
  check_column: status
  value_set: ["pending", "active", "completed"]
  lower_threshold: 1.0
  upper_threshold: 1.0

RollingValuesInSetCheck

Rolling window version of ValuesInSetCheck (14-day window).

- check_type: RollingValuesInSetCheck
  table: my_table
  check_column: category
  value_set: ["A", "B", "C"]
  date_filter_column: created_at
  date_filter_value: "2024-01-01"
  lower_threshold: 0.95

DuplicateCheck

Detects duplicate values in a column.

- check_type: DuplicateCheck
  table: my_table
  check_column: unique_id
  lower_threshold: 0.0
  upper_threshold: 0.0    # 0 means no duplicates allowed

CountCheck

Validates row counts or distinct counts.

- check_type: CountCheck
  table: my_table
  check_column: "*"       # Use "*" for row count
  distinct: false         # Set to true for distinct count
  lower_threshold: 1000
  upper_threshold: 100000

OccurrenceCheck

Checks minimum or maximum occurrence of any value.

- check_type: OccurrenceCheck
  table: my_table
  check_column: product_id
  max_or_min: "max"       # "max" or "min"
  upper_threshold: 100    # No value should occur more than 100 times

MatchRateCheck

Validates join match rates between two tables.

- check_type: MatchRateCheck
  left_table: orders
  right_table: products
  check_column: product_id
  join_columns: ["product_id"]
  # Or use different column names:
  # join_columns_left: ["prod_id"]
  # join_columns_right: ["product_id"]
  lower_threshold: 0.95

RelCountChangeCheck

Detects relative count changes over a rolling window.

- check_type: RelCountChangeCheck
  table: my_table
  check_column: "*"
  rolling_days: 7
  date_filter_column: date
  date_filter_value: "2024-01-01"
  lower_threshold: 0.8    # At least 80% of average
  upper_threshold: 1.2    # At most 120% of average

IqrOutlierCheck

Detects outliers using interquartile range.

- check_type: IqrOutlierCheck
  table: my_table
  check_column: amount
  date_filter_column: date
  date_filter_value: "2024-01-01"
  interval_days: 30
  how: "both"             # "both", "upper", or "lower"
  iqr_factor: 1.5
  lower_threshold: 0.0
  upper_threshold: 0.05   # Max 5% outliers

Filtering

Checks support filtering using column/value pairs:

- check_type: CountCheck
  table: orders
  check_column: "*"
  shop_id_filter_column: shop_code
  shop_id_filter_value: "SHOP01"
  date_filter_column: order_date
  date_filter_value: "2024-01-01"

The filter pattern is {name}_filter_column paired with {name}_filter_value.

Example Configuration

name: ecommerce_dqm
database_setup: |
  ATTACH 'warehouse.duckdb' AS warehouse;
database_accessor: warehouse

defaults:
  monitor_only: false
  result_table: dqm_results
  log_path: /var/log/dqm/failed_checks.log
  date_filter_column: date
  date_filter_value: "2024-01-01"

check_bundles:
  - name: orders
    defaults:
      table: orders
    checks:
      - check_type: NullRatioCheck
        check_column: order_id
        upper_threshold: 0.0

      - check_type: CountCheck
        check_column: "*"
        lower_threshold: 100

  - name: products
    defaults:
      table: products
    checks:
      - check_type: DuplicateCheck
        check_column: sku
        upper_threshold: 0.0

      - check_type: ValuesInSetCheck
        check_column: status
        value_set: ["active", "inactive", "discontinued"]
        lower_threshold: 1.0