adc_toolkit.data.validators
Data validation framework for the adc-toolkit.
This module provides a flexible data validation system that supports multiple validation backends through a protocol-based architecture. Validators implement the DataValidator protocol, enabling seamless integration with ValidatedDataCatalog and other toolkit components regardless of the underlying validation framework.
The validators module includes three main implementations:
1. Great Expectations (GX): Enterprise-grade validation with rich features including expectation suites, checkpoints, data profiling, and data documentation. Supports multiple storage backends (filesystem, AWS S3, GCP, Azure).
2. Pandera: Lightweight, pandas-native validation with Python-based schemas, automatic schema generation, and tight integration with type hints. Ideal for rapid prototyping and teams comfortable with code-based configuration.
3. NoValidator: Pass-through validator that bypasses validation for development, testing, or scenarios with trusted data sources.
The protocol-based design enables dependency injection and the strategy pattern,
so users can swap validator implementations without changing downstream code.
All validators share one interface: an in_directory() factory method for
configuration-based instantiation and a validate() method for data validation.
Modules
gx
    Great Expectations validator implementation with batch managers, data context implementations, and expectation management strategies.
pandera
    Pandera validator implementation with automatic schema generation, compilation, and execution.
no_validator
    No-operation validator that passes data through unchanged without validation.
See Also
adc_toolkit.data.abs.DataValidator: Protocol defining the validator interface.
adc_toolkit.data.ValidatedDataCatalog: Data catalog with integrated validation.
adc_toolkit.data.catalogs: Data catalog implementations.
Notes
Choosing a Validator
Select a validator based on your project requirements:
Use Great Expectations (GX) when:
- You need enterprise features (data docs, profiling, cloud backends)
- You want declarative YAML/JSON configuration
- You require comprehensive data documentation websites
- Your organization has existing Great Expectations infrastructure
- You need advanced features like data quality dashboards and monitoring
Use Pandera when:
- You prefer lightweight, pandas-native validation
- You want Python-based schema definitions for better IDE support
- You need tight integration with type hints and static analysis
- Your team prefers code-based configuration over YAML/JSON
- You're building prototypes or smaller-scale projects
Use NoValidator when:
- Developing or debugging without validation overhead
- Testing with mocked data where validation is not relevant
- Working with trusted data sources that have external validation guarantees
- Temporarily bypassing validation for performance profiling
Protocol-Based Architecture
All validators implement the DataValidator protocol, ensuring consistent
interfaces:
class DataValidator(Protocol):
    def validate(self, name: str, data: Data) -> Data: ...

    @classmethod
    def in_directory(cls, path: str | Path) -> "DataValidator": ...
This design enables:
- Dependency injection: Pass validators as constructor parameters
- Strategy pattern: Swap validators without changing application code
- Type safety: Static type checking with protocol-based type hints
- Testability: Easy to mock validators in unit tests
Integration with ValidatedDataCatalog
Validators are typically used through ValidatedDataCatalog, which automatically
validates data on load and save operations:
from adc_toolkit.data import ValidatedDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator

catalog = ValidatedDataCatalog.in_directory(
    path="config",
    validator=PanderaValidator.in_directory("config/validators"),
)

# Validation happens automatically
df = catalog.load("dataset_name")  # Validates after loading
catalog.save("output_name", df)  # Validates before saving
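The load/save behaviour described above can be pictured with a toy wrapper. This is only a sketch of the pattern, not the actual `ValidatedDataCatalog` implementation: `InMemoryCatalog`, `ToyValidatedCatalog`, and `RecordingValidator` are hypothetical stand-ins, and plain lists replace DataFrames so the snippet runs without third-party dependencies.

```python
from typing import Any


class InMemoryCatalog:
    """Hypothetical catalog that keeps datasets in a dict."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._store[name]

    def save(self, name: str, data: Any) -> None:
        self._store[name] = data


class RecordingValidator:
    """Hypothetical validator that records each call and returns data unchanged."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def validate(self, name: str, data: Any) -> Any:
        self.calls.append(name)
        return data


class ToyValidatedCatalog:
    """Wraps a catalog so every load and save passes through the validator."""

    def __init__(self, catalog: InMemoryCatalog, validator: RecordingValidator) -> None:
        self._catalog = catalog
        self._validator = validator

    def load(self, name: str) -> Any:
        data = self._catalog.load(name)
        return self._validator.validate(name, data)  # validate after loading

    def save(self, name: str, data: Any) -> None:
        validated = self._validator.validate(name, data)  # validate before saving
        self._catalog.save(name, validated)


validator = RecordingValidator()
catalog = ToyValidatedCatalog(InMemoryCatalog(), validator)
catalog.save("output_name", [1, 2, 3])
loaded = catalog.load("output_name")
```

Because validation runs before the underlying save, a validator that raises would keep invalid data from ever reaching storage in this sketch.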
Version Control Practices
For version-controlled validation rules:
Great Expectations:
- Commit: expectations/, checkpoints/, great_expectations.yml, plugins/
- Ignore: uncommitted/ (validation results and data docs)
Pandera:
- Commit: All schema scripts in pandera_schemas/
- Ignore: None (all schema files should be version controlled)
Performance Considerations
Validation adds overhead to data pipelines. Consider these optimization strategies:
- Caching: Reuse validator instances across multiple validations
- Sampling: Validate representative samples of very large datasets
- Lazy validation: Use Pandera's lazy mode to report all schema errors in one pass instead of stopping at the first, which avoids repeated fix-and-rerun cycles
- Selective validation: Validate only critical datasets in production
- Parallel validation: Run validations concurrently for independent datasets
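As a sketch of the last strategy, independent datasets can be validated concurrently with a thread pool. `PassThroughValidator` is a hypothetical stand-in for any object with a `validate(name, data)` method, and `validate_all` is an illustrative helper, not a toolkit function. Whether threads actually help depends on the backend: CPU-bound validation may gain little under the GIL, and the validator must be safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any


class PassThroughValidator:
    """Hypothetical stand-in: returns data unchanged, like a no-op validator."""

    def validate(self, name: str, data: Any) -> Any:
        return data


def validate_all(validator: Any, datasets: dict[str, Any], max_workers: int = 4) -> dict[str, Any]:
    """Validate each named dataset in its own worker thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one validation task per dataset, keyed by dataset name.
        futures = {
            name: pool.submit(validator.validate, name, data)
            for name, data in datasets.items()
        }
        # .result() re-raises any exception raised inside the worker.
        return {name: future.result() for name, future in futures.items()}


datasets = {"raw_data": [1, 2, 3], "features": [0.1, 0.2]}
validated = validate_all(PassThroughValidator(), datasets)
```

Reusing a single validator instance across all tasks also covers the caching strategy, provided the instance holds no per-call mutable state.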
Examples
Using Great Expectations validator:
>>> from adc_toolkit.data.validators.gx import GXValidator
>>> import pandas as pd
>>> validator = GXValidator.in_directory("config/gx")
>>> df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
>>> validated = validator.validate("my_dataset", df)
Using Pandera validator:
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
>>> validated = validator.validate("customers", df)
Using NoValidator for testing:
>>> from adc_toolkit.data.validators.no_validator import NoValidator
>>> validator = NoValidator()
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> validated = validator.validate("test_data", df) # No validation performed
Swapping validators with dependency injection:
>>> def create_pipeline(validator_type: str):
...     if validator_type == "gx":
...         validator = GXValidator.in_directory("config/gx")
...     elif validator_type == "pandera":
...         validator = PanderaValidator.in_directory("config/pandera")
...     else:
...         validator = NoValidator()
...     # DataPipeline is a placeholder for application code that accepts a validator
...     return DataPipeline(validator=validator)
Integration with ValidatedDataCatalog:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> catalog = ValidatedDataCatalog.in_directory(
... path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> df = catalog.load("dataset") # Automatically validated
>>> catalog.save("output", df) # Automatically validated
Validation in a data pipeline:
>>> def etl_pipeline(validator):
...     raw = load_raw_data()
...     validated_raw = validator.validate("raw_data", raw)
...     cleaned = clean_data(validated_raw)
...     validated_clean = validator.validate("cleaned_data", cleaned)
...     features = engineer_features(validated_clean)
...     validated_features = validator.validate("features", features)
...     return validated_features