adc_toolkit.data.validators

Data validation framework for the adc-toolkit.

This module provides a flexible data validation system that supports multiple validation backends through a protocol-based architecture. Validators implement the DataValidator protocol, enabling seamless integration with ValidatedDataCatalog and other toolkit components regardless of the underlying validation framework.

The validators module includes three main implementations:

  1. Great Expectations (GX): Enterprise-grade validation with rich features including expectation suites, checkpoints, data profiling, and data documentation. Supports multiple storage backends (filesystem, AWS S3, GCP, Azure).

  2. Pandera: Lightweight, pandas-native validation with Python-based schemas, automatic schema generation, and tight integration with type hints. Ideal for rapid prototyping and teams comfortable with code-based configuration.

  3. NoValidator: Pass-through validator that bypasses validation for development, testing, or scenarios with trusted data sources.

The protocol-based design enables dependency injection and the strategy pattern, allowing users to swap validator implementations without changing downstream code. All validators follow a consistent interface: an in_directory() factory method for configuration-based instantiation and a validate() method for data validation.

Modules

gx: Great Expectations validator implementation with batch managers, data context implementations, and expectation management strategies.
pandera: Pandera validator implementation with automatic schema generation, compilation, and execution.
no_validator: No-operation validator that passes data through unchanged without validation.

See Also

adc_toolkit.data.abs.DataValidator: Protocol defining the validator interface.
adc_toolkit.data.ValidatedDataCatalog: Data catalog with integrated validation.
adc_toolkit.data.catalogs: Data catalog implementations.

Notes

Choosing a Validator

Select a validator based on your project requirements:

Use Great Expectations (GX) when:

  • You need enterprise features (data docs, profiling, cloud backends)
  • You want declarative YAML/JSON configuration
  • You require comprehensive data documentation websites
  • Your organization has existing Great Expectations infrastructure
  • You need advanced features like data quality dashboards and monitoring

Use Pandera when:

  • You prefer lightweight, pandas-native validation
  • You want Python-based schema definitions for better IDE support
  • You need tight integration with type hints and static analysis
  • Your team prefers code-based configuration over YAML/JSON
  • You're building prototypes or smaller-scale projects

Use NoValidator when:

  • Developing or debugging without validation overhead
  • Testing with mocked data where validation is not relevant
  • Working with trusted data sources that have external validation guarantees
  • Temporarily bypassing validation for performance profiling

Protocol-Based Architecture

All validators implement the DataValidator protocol, ensuring consistent interfaces:

class DataValidator(Protocol):
    def validate(self, name: str, data: Data) -> Data: ...
    @classmethod
    def in_directory(cls, path: str | Path) -> "DataValidator": ...

This design enables:

  • Dependency injection: Pass validators as constructor parameters
  • Strategy pattern: Swap validators without changing application code
  • Type safety: Static type checking with protocol-based type hints
  • Testability: Easy to mock validators in unit tests
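
As a rough illustration of the points above, the sketch below defines a stand-in protocol with the same two methods and two toy implementations. `SketchValidator`, `PassThrough`, `RejectEmpty`, and `process` are hypothetical names invented for this example, and `Any` stands in for the toolkit's `Data` type; none of this is the toolkit's actual code.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Protocol


class SketchValidator(Protocol):
    """Stand-in for DataValidator; Any replaces the toolkit's Data type."""

    def validate(self, name: str, data: Any) -> Any: ...

    @classmethod
    def in_directory(cls, path: str | Path) -> "SketchValidator": ...


class PassThrough:
    """Returns data unchanged, similar in spirit to NoValidator."""

    def validate(self, name: str, data: Any) -> Any:
        return data

    @classmethod
    def in_directory(cls, path: str | Path) -> "PassThrough":
        return cls()  # this toy validator has no configuration to read


class RejectEmpty:
    """Toy rule (an assumption for this sketch): a dataset needs at least one record."""

    def validate(self, name: str, data: Any) -> Any:
        if len(data) == 0:
            raise ValueError(f"dataset {name!r} is empty")
        return data

    @classmethod
    def in_directory(cls, path: str | Path) -> "RejectEmpty":
        return cls()  # the path is ignored in this sketch


def process(validator: SketchValidator, name: str, data: Any) -> Any:
    """Downstream code depends only on the protocol, not on a concrete class."""
    return validator.validate(name, data)


rows = [{"id": 1}, {"id": 2}]
same_rows = process(PassThrough(), "customers", rows)
checked_rows = process(RejectEmpty.in_directory("config/validators"), "customers", rows)
```

Because `process` is typed against the protocol, either class (or a test double) can be passed in without inheriting from a common base class.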

Integration with ValidatedDataCatalog

Validators are typically used through ValidatedDataCatalog, which automatically validates data on load and save operations:

from adc_toolkit.data import ValidatedDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator

catalog = ValidatedDataCatalog.in_directory(
    path="config",
    validator=PanderaValidator.in_directory("config/validators"),
)
# Validation happens automatically
df = catalog.load("dataset_name")  # Validates after loading
catalog.save("output_name", df)  # Validates before saving
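
To make the load/save behaviour concrete, here is a minimal sketch of a catalog wrapper that validates after loading and before saving. `InMemoryCatalog`, `ValidatingCatalog`, and `RequireRows` are hypothetical stand-ins written for this example; the real `ValidatedDataCatalog` may be structured quite differently.

```python
from typing import Any, Dict, List


class InMemoryCatalog:
    """Hypothetical stand-in for a data catalog, backed by a dict."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._store[name]

    def save(self, name: str, data: Any) -> None:
        self._store[name] = data


class RequireRows:
    """Toy validator: rejects empty datasets and records what it checked."""

    def __init__(self) -> None:
        self.checked: List[str] = []

    def validate(self, name: str, data: Any) -> Any:
        if not data:
            raise ValueError(f"dataset {name!r} has no rows")
        self.checked.append(name)
        return data


class ValidatingCatalog:
    """Wraps a catalog so that every load and save passes through a validator."""

    def __init__(self, catalog: InMemoryCatalog, validator: RequireRows) -> None:
        self._catalog = catalog
        self._validator = validator

    def load(self, name: str) -> Any:
        data = self._catalog.load(name)
        return self._validator.validate(name, data)  # validate after loading

    def save(self, name: str, data: Any) -> None:
        validated = self._validator.validate(name, data)  # validate before saving
        self._catalog.save(name, validated)


validator = RequireRows()
catalog = ValidatingCatalog(InMemoryCatalog(), validator)
catalog.save("output_name", [{"id": 1}])
loaded = catalog.load("output_name")
```

In this sketch a failed validation raises before anything is written, so invalid data never reaches the underlying store.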

Version Control Practices

For version-controlled validation rules:

Great Expectations:

  • Commit: expectations/, checkpoints/, great_expectations.yml, plugins/
  • Ignore: uncommitted/ (validation results and data docs)

Pandera:

  • Commit: All schema scripts in pandera_schemas/
  • Ignore: None (all schema files should be version controlled)

Performance Considerations

Validation adds overhead to data pipelines. Consider these optimization strategies:

  • Caching: Reuse validator instances across multiple validations
  • Sampling: Validate representative samples of very large datasets
  • Lazy validation: Use Pandera's lazy mode (lazy=True) to collect all errors in a single pass instead of stopping at the first failure
  • Selective validation: Validate only critical datasets in production
  • Async validation: Run validations in parallel for independent datasets
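
Two of the strategies above, reusing validator instances and validating independent datasets in parallel, can be sketched with the standard library alone. `SlowValidator`, `get_validator`, and `validate_all` are hypothetical helpers for this example, and whether threads actually help depends on the validation backend; treat this as a sketch rather than a tuned implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Any, Dict


class SlowValidator:
    """Hypothetical validator whose construction is assumed to be expensive."""

    instances_created = 0

    def __init__(self, config_dir: str) -> None:
        type(self).instances_created += 1
        self.config_dir = config_dir

    def validate(self, name: str, data: Any) -> Any:
        if not data:
            raise ValueError(f"dataset {name!r} is empty")
        return data


@lru_cache(maxsize=None)
def get_validator(config_dir: str) -> SlowValidator:
    """Caching: build one validator per config directory and reuse it."""
    return SlowValidator(config_dir)


def validate_all(datasets: Dict[str, Any], config_dir: str) -> Dict[str, Any]:
    """Parallel validation: submit each independent dataset to a thread pool."""
    validator = get_validator(config_dir)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            name: pool.submit(validator.validate, name, data)
            for name, data in datasets.items()
        }
        # result() re-raises any validation error from the worker thread
        return {name: future.result() for name, future in futures.items()}


datasets = {"raw_data": [1, 2, 3], "features": [0.1, 0.2]}
results = validate_all(datasets, "config/validators")
results_again = validate_all(datasets, "config/validators")
```

The second call reuses the cached validator, so only one instance is constructed across both runs.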

Examples

Using Great Expectations validator:

>>> from adc_toolkit.data.validators.gx import GXValidator
>>> import pandas as pd
>>> validator = GXValidator.in_directory("config/gx")
>>> df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
>>> validated = validator.validate("my_dataset", df)

Using Pandera validator:

>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
>>> validated = validator.validate("customers", df)

Using NoValidator for testing:

>>> from adc_toolkit.data.validators.no_validator import NoValidator
>>> validator = NoValidator()
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> validated = validator.validate("test_data", df)  # No validation performed

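Selecting a validator at runtime and injecting it into a pipeline (DataPipeline is not imported here and is assumed to be defined by the application):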

>>> def create_pipeline(validator_type: str):
...     if validator_type == "gx":
...         validator = GXValidator.in_directory("config/gx")
...     elif validator_type == "pandera":
...         validator = PanderaValidator.in_directory("config/pandera")
...     else:
...         validator = NoValidator()
...     return DataPipeline(validator=validator)

Integration with ValidatedDataCatalog:

>>> from adc_toolkit.data import ValidatedDataCatalog
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> df = catalog.load("dataset")  # Automatically validated
>>> catalog.save("output", df)  # Automatically validated

Validation in a data pipeline:

>>> def etl_pipeline(validator):
...     raw = load_raw_data()
...     validated_raw = validator.validate("raw_data", raw)
...     cleaned = clean_data(validated_raw)
...     validated_clean = validator.validate("cleaned_data", cleaned)
...     features = engineer_features(validated_clean)
...     validated_features = validator.validate("features", features)
...     return validated_features