Skip to content

Configuration

Koality uses YAML configuration files to define data quality checks. This page describes the configuration schema and available options.

Supported Databases

Koality uses DuckDB as its query engine. Currently supported databases:

Database Status
DuckDB (in-memory) ✅ Fully supported
Google Cloud BigQuery ✅ Fully supported

External databases are accessed through DuckDB extensions. For BigQuery, queries are executed using the BigQuery Jobs API for read operations and bigquery_execute for write operations. Other databases may need custom handling in execute_query!

Configuration Schema

name: string                    # Name of the configuration
database_setup: string          # SQL to set up database connections
database_accessor: string       # Database accessor prefix for tables

defaults:                       # Global default settings
  monitor_only: bool            # If true, checks don't fail the run
  result_table: string          # Table to store results (optional)
  log_path: string              # File path to write failed checks log (optional)
  identifier_format: string     # How to format identifier in results (optional)
  filters:                      # Default filters applied to all checks
    <filter_name>:
      column: string            # Column to filter on
      value: any                # Filter value (use null for IS NULL)
      type: date | identifier | other  # Filter type
      operator: string          # Comparison operator (default: "=")
      parse_as_date: bool       # Parse value as date (default: false)

check_bundles:                  # List of check bundles
  - name: string                # Bundle name
    defaults:                   # Default arguments for checks in this bundle
      table: string
      check_column: string
      filters: {...}            # Bundle-level default filters
      # ... other defaults
    checks:                     # List of checks
      - check_type: string      # Type of check (required)
        filters: {...}          # Check-level filters (merged with defaults)
        # ... check-specific arguments

Database Connection

The database_setup and database_accessor fields work together to configure how Koality connects to your data:

database_setup

A SQL string that is executed when the CheckExecutor is initialized (if no custom DuckDB client is provided). Use this to:

  • Install and load DuckDB extensions
  • Attach external databases
  • Configure connection settings

database_accessor

The name/alias of the attached database. This is used to:

  • Identify the database provider type (e.g., "bigquery")
  • Prefix table references in queries
  • Route queries through the appropriate execution method

Examples

DuckDB (in-memory with local file):

name: local_checks
database_setup: |
  ATTACH 'warehouse.duckdb' AS warehouse;
database_accessor: warehouse

Google Cloud BigQuery:

name: bigquery_checks
database_setup: |
  INSTALL bigquery;
  LOAD bigquery;
  ATTACH 'project=my-gcp-project' AS bq (TYPE bigquery, READ_ONLY);
database_accessor: bq

When using BigQuery, Koality automatically:

  • Uses bigquery_query() for SELECT operations (supports views)
  • Uses bigquery_execute() for write operations (INSERT, CREATE, etc.)
  • Maps DuckDB types to BigQuery types (e.g., VARCHAR → STRING, NUMERIC → FLOAT64)

Database Setup Variables

The database_setup field supports dynamic variable substitution using ${VAR_NAME} syntax. This is useful for:

  • Using different GCP projects per environment (dev/staging/prod)
  • Keeping sensitive configuration out of YAML files
  • Enabling reusable configurations across different contexts

Using CLI option (-dsv):

# Single variable
koality run --config_path checks.yaml -dsv PROJECT_ID=my-gcp-project

# Multiple variables
koality run --config_path checks.yaml -dsv PROJECT_ID=prod-project -dsv DATASET=analytics

Using environment variable:

# Comma-separated VAR=value pairs
DATABASE_SETUP_VARIABLES="PROJECT_ID=my-project,DATASET=prod" koality run --config_path checks.yaml

Example configuration:

name: bigquery_checks
database_setup: |
  INSTALL bigquery;
  LOAD bigquery;
  ATTACH 'project=${PROJECT_ID}' AS bq (TYPE bigquery, READ_ONLY);
database_accessor: bq

Variable resolution order:

  1. CLI options (-dsv) take highest priority
  2. Environment variable (DATABASE_SETUP_VARIABLES) is used as fallback
  3. If a variable is referenced but not provided, an error is shown with a helpful hint

Preview with print:

Use the print command to verify variable substitution before running checks:

koality print --config_path checks.yaml -dsv PROJECT_ID=my-project --format yaml

Global Defaults

Global defaults are applied to all checks unless overridden at the bundle or check level.

Field Type Description
monitor_only bool If true, check failures don't fail the overall run
result_table string Table name for persisting check results
log_path string File path to write failed checks log
identifier_format string How to format identifier in output (see below)
filters object Default filters applied to all checks (see Filtering section)

Identifier Format

The identifier_format option controls how the identifier filter value appears in check results and messages. Three options are available:

Format Result Column Value Format Example
identifier IDENTIFIER column=value shop_code=SHOP001
filter_name Filter name Value only Column: SHOP_ID, Value: SHOP001
column_name Column name Value only Column: SHOP_CODE, Value: SHOP001

Default: identifier

Consistency requirement: When using filter_name or column_name format, all identifier filters across all checks must use the same filter name or column name respectively. This ensures consistent column headers in results. The identifier format has no such restriction since it always uses IDENTIFIER as the column header.

defaults:
  identifier_format: identifier  # or "filter_name" or "column_name"
  filters:
    shop_id:
      column: shop_code
      value: SHOP001
      type: identifier

Check Bundles

Check bundles group related checks together and can define shared default arguments.

check_bundles:
  - name: orders_quality
    defaults:
      table: orders
      filters:
        partition_date:
          column: order_date
          value: yesterday
          type: date
    checks:
      - check_type: NullRatioCheck
        check_column: customer_id
        upper_threshold: 0.01

Check Types

NullRatioCheck

Validates the ratio of null values in a column.

- check_type: NullRatioCheck
  table: my_table
  check_column: my_column
  lower_threshold: 0.0    # Minimum allowed null ratio
  upper_threshold: 0.05   # Maximum allowed null ratio

RegexMatchCheck

Checks if column values match a regex pattern.

- check_type: RegexMatchCheck
  table: my_table
  check_column: email
  regex_to_match: "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"
  lower_threshold: 0.95
  upper_threshold: 1.0

ValuesInSetCheck

Validates that column values are within an allowed set.

- check_type: ValuesInSetCheck
  table: my_table
  check_column: status
  value_set: ["pending", "active", "completed"]
  lower_threshold: 1.0
  upper_threshold: 1.0

RollingValuesInSetCheck

Rolling window version of ValuesInSetCheck (14-day window). Requires a filter with type: date.

- check_type: RollingValuesInSetCheck
  table: my_table
  check_column: category
  value_set: ["A", "B", "C"]
  filters:
    partition_date:
      column: created_at
      value: yesterday
      type: date
  lower_threshold: 0.95

DuplicateCheck

Detects duplicate values in a column.

- check_type: DuplicateCheck
  table: my_table
  check_column: unique_id
  lower_threshold: 0.0
  upper_threshold: 0.0    # 0 means no duplicates allowed

CountCheck

Validates row counts or distinct counts.

- check_type: CountCheck
  table: my_table
  check_column: "*"       # Use "*" for row count
  distinct: false         # Set to true for distinct count
  lower_threshold: 1000
  upper_threshold: 100000

OccurrenceCheck

Checks minimum or maximum occurrence of any value.

- check_type: OccurrenceCheck
  table: my_table
  check_column: product_id
  max_or_min: "max"       # "max" or "min"
  upper_threshold: 100    # No value should occur more than 100 times

MatchRateCheck

Validates join match rates between two tables.

- check_type: MatchRateCheck
  left_table: orders
  right_table: products
  check_column: product_id
  join_columns: ["product_id"]
  # Or use different column names:
  # join_columns_left: ["prod_id"]
  # join_columns_right: ["product_id"]
  lower_threshold: 0.95

RelCountChangeCheck

Detects relative count changes over a rolling window. Requires a filter with type: date.

- check_type: RelCountChangeCheck
  table: my_table
  check_column: "*"
  rolling_days: 7
  filters:
    partition_date:
      column: date
      value: yesterday
      type: date
  lower_threshold: -0.2   # At least 80% of average
  upper_threshold: 0.2    # At most 120% of average

IqrOutlierCheck

Detects outliers using interquartile range. Requires a filter with type: date.

Note: IqrOutlierCheck is a transformation-based check and does not accept lower_threshold/upper_threshold in its constructor; the executor will ignore those fields when instantiating the check. To run IqrOutlierCheck via configuration, specify its parameters (interval_days, how, iqr_factor) and provide a date filter in the filters section.

- check_type: IqrOutlierCheck
  table: my_table
  check_column: amount
  filters:
    partition_date:
      column: date
      value: yesterday
      type: date
  interval_days: 30
  how: "both"             # "both", "upper", or "lower"
  iqr_factor: 1.5

Filtering

Checks support filtering using a structured filters configuration:

- check_type: CountCheck
  table: orders
  check_column: "*"
  filters:
    partition_date:
      column: order_date
      value: yesterday
      type: date
    shop_id:
      column: shop_code
      value: "SHOP01"
      type: identifier

Filter Configuration

Each filter has the following properties:

Property Type Required Description
column string Yes* The database column name to filter on
value any Yes* The filter value (string, number, list for IN operators, or null for IS NULL). For dates, supports inline offsets like yesterday-2
operator string No SQL operator: =, !=, >, >=, <, <=, IN, NOT IN, LIKE. Default: =
type string No Filter type: date, identifier, or other. Default: other
parse_as_date bool No If true, parse the value as a date (supports relative dates). Default: false

*column and value are optional in defaults (global or bundle level) but must be set after merging. This allows defining partial filters in defaults that are completed at the check level.

Filter Types

Koality supports three filter types:

Type Purpose Limit
date Date filter for rolling checks One per check
identifier Identifier for grouping results (e.g., shop) One per check
other Regular filters Unlimited

Identifier Filters

Use type: identifier to mark the filter that identifies your data partition (e.g., shop, tenant, region):

filters:
  shop_id:
    column: shop_code
    value: SHOP001
    type: identifier

The identifier value appears in check results and failure messages. How it's formatted depends on the identifier_format global setting.

For more details about naming-only identifier filters and the identifier_placeholder option, see the guide: Identifier filters and naming.

Date Filters

When type: date is set, the value is automatically parsed as a date. Supported formats:

  • ISO dates: 2024-01-15, 20240115
  • Relative dates: today, yesterday, tomorrow
  • With inline offset: today-2, yesterday+1, tomorrow-3 (add or subtract days directly in the value)
filters:
  partition_date:
    column: BQ_PARTITIONTIME
    value: yesterday-1    # 2 days ago (yesterday minus 1 day)
    type: date

Important: Only one filter with type: date is allowed per check. This is the filter used by rolling checks (RollingValuesInSetCheck, RelCountChangeCheck, IqrOutlierCheck) for their date-based calculations.

If you need to parse multiple date values, use parse_as_date: true for additional filters:

filters:
  partition_date:
    column: BQ_PARTITIONTIME
    value: yesterday
    type: date              # THE date filter for rolling checks
  created_after:
    column: created_at
    value: today-7          # 7 days ago
    parse_as_date: true     # Also parses date, but isn't "the" date filter
    operator: ">="

Operators

Use the operator property for different comparison types:

filters:
  # Equality (default)
  status:
    column: order_status
    value: completed

  # Greater than
  revenue:
    column: total_revenue
    value: 1000
    operator: ">="

  # IN operator (list of values)
  category:
    column: category
    value: ["toys", "electronics", "clothing"]
    operator: "IN"

  # NOT IN operator
  excluded:
    column: region
    value: ["test", "staging"]
    operator: "NOT IN"

  # LIKE operator (pattern matching)
  email_domain:
    column: email
    value: "%@example.com"
    operator: "LIKE"

  # IS NULL (filter for missing values)
  not_deleted:
    column: deleted_at
    value: null           # or ~ or empty
    operator: "="         # generates: deleted_at IS NULL

  # IS NOT NULL (filter for existing values)
  has_email:
    column: email
    value: null
    operator: "!="        # generates: email IS NOT NULL

Default Filters

Filters can be defined at any level (global defaults, bundle defaults, or individual checks) and are merged with inheritance:

defaults:
  filters:
    partition_date:
      column: DATE
      value: yesterday
      type: date

check_bundles:
  - name: orders
    defaults:
      filters:
        shop_id:
          column: shop_code
          value: "SHOP01"
    checks:
      - check_type: CountCheck
        table: orders
        check_column: "*"
        # Inherits both partition_date and shop_id filters

Partial Filters in Defaults

You can define partial filters in defaults (omitting column or value) and complete them at a lower level:

defaults:
  filters:
    shop_id:
      column: shopId       # Define column once
      type: identifier     # Define type once
      # value not set - must be set per check

check_bundles:
  - name: shop1_checks
    defaults:
      table: orders
      filters:
        shop_id:
          value: SHOP001   # Set value, inherits column and type
    checks:
      - check_type: CountCheck
        check_column: "*"

  - name: shop2_checks
    defaults:
      table: orders
    checks:
      - check_type: CountCheck
        check_column: "*"
        filters:
          shop_id:
            value: SHOP002   # Set value at check level

Example Configuration

name: ecommerce_dqm
database_setup: |
  ATTACH 'warehouse.duckdb' AS warehouse;
database_accessor: warehouse

defaults:
  monitor_only: false
  result_table: dqm_results
  log_path: /var/log/dqm/failed_checks.log
  identifier_format: identifier  # Results show "shop_code=SHOP01"
  filters:
    partition_date:
      column: date
      value: yesterday
      type: date

check_bundles:
  - name: orders
    defaults:
      table: orders
      filters:
        shop_id:
          column: shop_code
          value: "SHOP01"
          type: identifier
    checks:
      - check_type: NullRatioCheck
        check_column: order_id
        upper_threshold: 0.0

      - check_type: CountCheck
        check_column: "*"
        lower_threshold: 100

  - name: products
    defaults:
      table: products
    checks:
      - check_type: DuplicateCheck
        check_column: sku
        upper_threshold: 0.0

      - check_type: ValuesInSetCheck
        check_column: status
        value_set: ["active", "inactive", "discontinued"]
        lower_threshold: 1.0

CLI Overwrites

The --overwrites (-o) option allows overriding configuration values at runtime without modifying the YAML file. This is useful for:

  • Running checks for a specific date instead of "yesterday"
  • Testing with different filter values
  • Temporarily changing settings like monitor_only

Overwrite Syntax

Overwrites use a dot-notation path to target specific values, mirroring the YAML structure:

# Override filter value (targets the "value" field by default)
koality run --config_path checks.yaml -o defaults.filters.partition_date=2023-06-15

# Override filter field (column, operator, type, etc.)
koality run --config_path checks.yaml -o defaults.filters.partition_date.column=OTHER_DATE_COL
koality run --config_path checks.yaml -o defaults.filters.amount.operator=">="

# Override other defaults
koality run --config_path checks.yaml -o defaults.identifier_format=column_name
koality run --config_path checks.yaml -o defaults.monitor_only=true

# Multiple overwrites
koality run --config_path checks.yaml -o defaults.filters.partition_date=2023-06-15 -o defaults.filters.shop_id=SHOP02

Overwrite Levels

Overwrites can target different levels of the configuration hierarchy:

# Global defaults (propagates to all checks)
-o defaults.filters.partition_date=2023-06-15
-o defaults.identifier_format=column_name

# Bundle-level defaults (only affects that bundle's checks)
-o check_bundles.orders.filters.partition_date=2023-06-15
-o check_bundles.orders.identifier_format=filter_name

# Check-level (only affects a specific check by index)
-o check_bundles.orders.0.table=orders_archive
-o check_bundles.orders.0.filters.partition_date=2023-06-15

Propagation

Overwrites applied to global defaults automatically propagate to all checks through the normal default inheritance mechanism. This means:

# This single overwrite...
koality run --config_path checks.yaml -o partition_date=2023-06-15

# ...affects ALL checks that inherit the partition_date filter from defaults

Preview with Print

Use the print command with overwrites to verify your overrides before running checks:

# Preview the resolved configuration with overwrites applied
koality print --config_path checks.yaml -o partition_date=2023-06-15 --format yaml