adc_toolkit.data.validators.pandera
Pandera-based data validation framework for the adc-toolkit.
This module provides a comprehensive Pandera-based validation implementation that combines automatic schema generation with manual customization capabilities. It enables rapid prototyping with zero-setup validation while supporting iterative refinement of validation rules as data requirements evolve.
The validation workflow is optimized for real-world data engineering scenarios where schemas need to be established quickly but refined over time. Schema scripts are stored as editable Python files, enabling version control, code review, and team collaboration on data quality standards.
Classes
PanderaValidator: Main validator class implementing the DataValidator protocol. Orchestrates automatic schema generation, schema loading, and validation execution using Pandera's DataFrameSchema.validate() method. Integrates with ValidatedDataCatalog for automatic validation on load/save operations.
PanderaParameters: Immutable configuration dataclass controlling validation behavior. The primary setting is the 'lazy' parameter, which determines the error collection strategy (collect all errors vs. fail fast on the first error).
Exceptions
PanderaValidationError: Custom exception raised when validation fails. Wraps Pandera's SchemaError or SchemaErrors with additional context, including the table name and schema file path, to make debugging and error handling easier.
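The wrapping pattern described for the exception can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the toolkit's implementation: only the attribute names table_name, schema_path, and original_error come from the examples in this module, while the class name, constructor signature, and the use of ValueError in place of Pandera's exceptions are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable


class WrappedValidationError(Exception):
    """Hypothetical stand-in for PanderaValidationError (constructor is assumed)."""

    def __init__(self, table_name: str, schema_path: Path, original_error: Exception) -> None:
        self.table_name = table_name
        self.schema_path = schema_path
        self.original_error = original_error
        super().__init__(
            f"Validation failed for '{table_name}' (schema: {schema_path}): {original_error}"
        )


def validate_or_wrap(table_name: str, schema_path: Path, check: Callable[[], None]) -> None:
    """Run ``check`` and re-raise any failure with dataset context attached."""
    try:
        check()
    except ValueError as err:  # stand-in for pandera's SchemaError / SchemaErrors
        # "from err" keeps the original traceback reachable via __cause__.
        raise WrappedValidationError(table_name, schema_path, err) from err
```

Callers can then catch one exception type and still reach the underlying error through original_error.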
Notes
Key Features
- Zero-setup validation: Auto-generates schemas on first use by introspecting data
- Incremental refinement: Generated schemas serve as customizable templates
- Version control friendly: Schemas are plain Python files suitable for git
- Comprehensive error reporting: Lazy validation collects all errors in one pass
- Type safety: Full integration with Python type hints and static analysis
- Framework support: Works with pandas and PySpark DataFrames
Validation Workflow
The validation process follows a two-phase approach:
1. Schema Management (auto-generation on first use)
- Check whether a schema file exists at {config_path}/pandera_schemas/{category}/{dataset}.py
- If it is missing, introspect the data structure and generate a schema script
- Save the generated schema as an editable Python file
2. Validation Execution (every use)
- Load the schema script as a Python module
- Extract the DataFrameSchema object from the module
- Execute validation: schema.validate(data, lazy=parameters.lazy)
- Return the validated data or raise PanderaValidationError
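The "load schema script as Python module" and "extract schema object" steps can be approximated with importlib. This is a minimal sketch under stated assumptions, not the toolkit's loader: the helper names load_schema_module and extract_schema are hypothetical, and the module-level variable name "schema" is taken from the generated-schema example shown in this documentation.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_schema_module(schema_path: Path) -> ModuleType:
    """Import a Python file from an arbitrary path as a fresh module object."""
    spec = importlib.util.spec_from_file_location(schema_path.stem, schema_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load schema script: {schema_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script's top-level code
    return module


def extract_schema(schema_path: Path, attribute: str = "schema"):
    """Return the module-level object named ``attribute`` (assumed name: ``schema``)."""
    module = load_schema_module(schema_path)
    try:
        return getattr(module, attribute)
    except AttributeError as err:
        raise AttributeError(f"{schema_path} defines no '{attribute}' object") from err
```

Because the file is executed on every call in this sketch, edits to the schema script take effect on the next validation without restarting the process.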
Schema Organization
Schemas are organized hierarchically based on validation names following the "category.dataset" convention:
config/validators/pandera_schemas/
├── raw/
│ ├── customers.py
│ └── orders.py
├── processed/
│ ├── customers.py
│ └── sales_summary.py
└── gold/
└── analytics.py
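The mapping from a validation name to a schema file in the tree above can be written down in a few lines. The function name schema_path_for is hypothetical, and rejecting names that do not contain exactly one dot is an assumption; the toolkit may handle such names differently.

```python
from pathlib import Path
from typing import Union


def schema_path_for(config_path: Union[str, Path], name: str) -> Path:
    """Map a "category.dataset" name to {config_path}/pandera_schemas/{category}/{dataset}.py."""
    category, sep, dataset = name.partition(".")
    # Assumption: exactly one dot separating two non-empty parts.
    if not sep or not category or not dataset or "." in dataset:
        raise ValueError(f"Expected a name of the form 'category.dataset', got {name!r}")
    return Path(config_path) / "pandera_schemas" / category / f"{dataset}.py"
```

For example, schema_path_for("config/validators", "raw.customers") yields config/validators/pandera_schemas/raw/customers.py, matching the first entry in the tree.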
Error Collection Strategies
The lazy parameter controls error reporting:
- lazy=True (default): Collects all validation errors across the entire dataset before raising an exception, so a single validation run reports every violation. Recommended for production.
- lazy=False: Raises an exception immediately on the first validation failure. Useful for debugging when you want to fix errors one at a time.
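The difference between the two strategies can be illustrated without Pandera. The sketch below is a toy model of the control flow only: run_checks, CheckFailure, and CheckFailures are hypothetical names that loosely mirror the roles of a single-error and an aggregated exception, and nothing here reflects how Pandera implements lazy validation internally.

```python
from typing import Callable, Dict, List


class CheckFailure(Exception):
    """One failed check (plays the role of a single-error exception)."""


class CheckFailures(Exception):
    """All failed checks from one run (plays the role of an aggregated exception)."""

    def __init__(self, failures: List[str]) -> None:
        self.failures = failures
        super().__init__(f"{len(failures)} check(s) failed: " + "; ".join(failures))


def run_checks(value: int, checks: Dict[str, Callable[[int], bool]], lazy: bool = True) -> int:
    """Apply named checks to ``value``; collect failures if lazy, else stop at the first."""
    failures: List[str] = []
    for name, check in checks.items():
        if check(value):
            continue
        if not lazy:
            raise CheckFailure(name)  # fail fast: later checks never run
        failures.append(name)  # lazy: remember the failure and keep going
    if failures:
        raise CheckFailures(failures)
    return value
```

With two failing checks, the lazy run reports both names in one exception, while the fail-fast run stops at the first one.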
Comparison with Great Expectations
Use PanderaValidator when:
- You need lightweight, pandas-native validation
- You prefer Python-based schema definitions over YAML/JSON
- You want tight integration with type hints and static analysis
- Your team is comfortable with code-based configuration
Use GXValidator when:
- You need profiling and automatic expectation generation
- You want data documentation websites (Data Docs)
- You need enterprise features (cloud backends, data quality dashboards)
- You prefer declarative YAML/JSON configuration
See Also
adc_toolkit.data.ValidatedDataCatalog: Data catalog with integrated validation.
adc_toolkit.data.validators.gx.GXValidator: Alternative validator using Great Expectations.
adc_toolkit.data.validators.no_validator.NoValidator: No-op validator for testing.
adc_toolkit.data.abs.DataValidator: Protocol defining the validator interface.
pandera.DataFrameSchema: Underlying Pandera schema class used for validation.
Examples
Basic validator setup and usage:
>>> from pathlib import Path
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> import pandas as pd
>>>
>>> # Create validator
>>> validator = PanderaValidator.in_directory("config/validators")
>>>
>>> # Create sample data
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>>
>>> # First validation: auto-generates schema
>>> validated = validator.validate("raw.customers", df)
>>> # Schema created at: config/validators/pandera_schemas/raw/customers.py
Customizing auto-generated schemas:
>>> # After first validation, edit the generated schema file:
>>> # File: config/validators/pandera_schemas/raw/customers.py
>>> #
>>> # import pandera.pandas as pa
>>> #
>>> # schema = pa.DataFrameSchema({
>>> #     "id": pa.Column(
>>> #         "int64",
>>> #         checks=[
>>> #             pa.Check.greater_than(0),  # Add: IDs must be positive
>>> #             pa.Check(lambda s: s.is_unique, element_wise=False),  # Add: unique
>>> #         ],
>>> #     ),
>>> #     "name": pa.Column("object"),
>>> #     "age": pa.Column(
>>> #         "int64",
>>> #         checks=[pa.Check.in_range(0, 120)],  # Add: realistic age range
>>> #     ),
>>> # })
>>>
>>> # Subsequent validations use customized schema
>>> validated = validator.validate("raw.customers", df)
Using custom parameters for fail-fast validation:
>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>>
>>> # Create validator with fail-fast mode
>>> params = PanderaParameters(lazy=False)
>>> validator_debug = PanderaValidator(config_path="config/validators", parameters=params)
>>>
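>>> # invalid_df is any DataFrame that violates the schema (one is built in a later example)
>>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
>>>
>>> try:
...     validator_debug.validate("raw.customers", invalid_df)
... except PanderaValidationError as e:
...     print(f"First error: {e.original_error}")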
Integration with ValidatedDataCatalog:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>>
>>> # Create catalog with Pandera validator
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>>
>>> # Load data (automatically validated)
>>> df = catalog.load("raw.customers")
>>>
>>> # Process data
>>> processed_df = transform(df)
>>>
>>> # Save data (automatically validated before saving)
>>> catalog.save("processed.customers", processed_df)
Handling validation errors with comprehensive reporting:
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
>>>
>>> validator = PanderaValidator.in_directory("config/validators")
>>>
>>> # Invalid data
>>> invalid_df = pd.DataFrame(
...     {
...         "id": [1, -2, 3],  # Invalid: negative ID
...         "name": ["Alice", "Bob", None],  # Invalid: null name
...         "age": [25, 30, 150],  # Invalid: unrealistic age
...     }
... )
>>>
>>> try:
...     validator.validate("raw.customers", invalid_df)
... except PanderaValidationError as e:
...     print(f"Validation failed for: {e.table_name}")
...     print(f"Schema file: {e.schema_path}")
...     print(f"All errors: {e.original_error}")
...     # With lazy=True, all validation errors are included
Data pipeline with validation at multiple stages:
>>> def data_pipeline():
...     validator = PanderaValidator.in_directory("config/validators")
...
...     # Load and validate raw data
...     raw_data = load_raw_data()
...     validated_raw = validator.validate("raw.input", raw_data)
...
...     # Transform with confidence
...     transformed = transform(validated_raw)
...
...     # Validate intermediate results
...     validated_intermediate = validator.validate("intermediate.transformed", transformed)
...
...     # Final processing
...     final = aggregate(validated_intermediate)
...
...     # Validate output before downstream consumption
...     validated_output = validator.validate("gold.output", final)
...
...     return validated_output
Using with PySpark DataFrames:
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
>>>
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_spark = validator.validate("raw.spark_customers", spark_df)
>>> # Auto-generates PySpark-compatible schema with pyspark.sql.types imports
from adc_toolkit.data.validators.pandera.parameters import PanderaParameters
from adc_toolkit.data.validators.pandera.validator import PanderaValidator


__all__ = ["PanderaParameters", "PanderaValidator"]
@dataclass(frozen=True, slots=True, kw_only=True)
class PanderaParameters:
    """Configuration parameters for Pandera data validation."""

    lazy: bool = True
Configuration parameters for Pandera data validation.
This immutable dataclass encapsulates configuration options that control how Pandera validates DataFrames within the adc-toolkit validation workflow. It is designed to be passed to PanderaValidator instances to customize validation behavior.
The primary configuration option controls error collection strategy: whether to fail fast on the first validation error or to collect all validation errors before raising an exception. Lazy validation is recommended for production workflows as it provides comprehensive error reporting, making it easier to fix all issues in a single pass.
This class is immutable (frozen) to ensure validation parameters remain consistent throughout the validation lifecycle and to enable safe sharing across multiple validation operations.
Attributes
lazy (bool, default=True): Controls the validation error collection strategy.
- If True (default): Collects all validation errors across all rows and columns before raising a SchemaErrors exception. This provides comprehensive error reporting, showing all violations in a single validation run.
- If False: Raises a SchemaError immediately upon encountering the first validation failure. This "fail-fast" mode is useful for debugging or when you want to fix errors incrementally.
The lazy parameter is passed directly to Pandera's DataFrameSchema.validate() method.
See Also
PanderaValidator: Validator that uses these parameters for data validation.
pandera.DataFrameSchema.validate: Underlying Pandera validation method that receives the lazy parameter.
Notes
This dataclass is configured with the following features:
- frozen=True: Makes instances immutable after creation. Attempting to modify attributes after instantiation raises FrozenInstanceError.
- slots=True: Uses __slots__ for memory efficiency and faster attribute access by preventing dynamic attribute creation.
- kw_only=True: Requires all parameters to be specified as keyword arguments, improving code clarity and preventing positional argument errors.
The immutability design ensures that validation parameters cannot be accidentally modified during the validation process, promoting predictable and reproducible validation behavior.
Examples
Create a PanderaParameters instance with default lazy validation:
>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters()
>>> params.lazy
True
Create parameters for fail-fast validation:
>>> params_strict = PanderaParameters(lazy=False)
>>> params_strict.lazy
False
Use with PanderaValidator for comprehensive error reporting:
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator(config_path="config/validators", parameters=PanderaParameters(lazy=True))
Use with PanderaValidator for fail-fast debugging:
>>> validator_debug = PanderaValidator(config_path="config/validators", parameters=PanderaParameters(lazy=False))
Demonstrate immutability (frozen dataclass):
>>> params = PanderaParameters()
>>> params.lazy = False # doctest: +SKIP
Traceback (most recent call last):
...
dataclasses.FrozenInstanceError: cannot assign to field 'lazy'
Compare parameter instances:
>>> params1 = PanderaParameters(lazy=True)
>>> params2 = PanderaParameters(lazy=True)
>>> params1 == params2
True
>>> params3 = PanderaParameters(lazy=False)
>>> params1 == params3
False
class PanderaValidator:
    """
    Pandera-based data validator with automatic schema generation.

    PanderaValidator is a concrete implementation of the DataValidator protocol
    that uses Pandera (https://pandera.readthedocs.io/) for schema-based data
    validation. It provides a seamless validation workflow that combines automatic
    schema generation with manual customization capabilities.

    The validator orchestrates a two-phase approach to data validation:

    **Phase 1: Schema Management** (Auto-generation)
        On first validation of a dataset, the validator automatically generates
        a Pandera schema script by introspecting the data structure (column names,
        data types). The generated schema is saved as an editable Python file at
        ``{config_path}/pandera_schemas/{category}/{dataset}.py``, where the
        category and dataset name are derived from the validation name (e.g.,
        "raw.customers" creates ``raw/customers.py``).

    **Phase 2: Validation Execution** (Rule Enforcement)
        On all validations (including first use), the validator loads the schema
        script and executes validation against the data using Pandera's
        ``DataFrameSchema.validate()`` method. If validation fails, it raises a
        detailed ``PanderaValidationError`` with comprehensive error information.

    This design enables an iterative workflow:

    1. Run validation immediately without manual schema creation
    2. Review auto-generated schemas and add custom validation rules
    3. Commit schemas to version control for team collaboration
    4. Evolve schemas as data requirements change over time

    The validator integrates seamlessly with ``ValidatedDataCatalog`` to provide
    automatic validation on all data load and save operations, ensuring data
    quality throughout the entire data pipeline.

    Attributes
    ----------
    config_path : Path
        The directory path where Pandera schema scripts are stored and loaded from.
        Schema files are organized in a hierarchical structure under this path,
        specifically at ``{config_path}/pandera_schemas/``. For example, if
        config_path is ``Path("config/validators")``, schemas are stored at
        ``config/validators/pandera_schemas/{category}/{dataset}.py``.
    parameters : PanderaParameters
        Configuration parameters controlling validation behavior. The primary
        parameter is ``lazy``, which determines error collection strategy:

        - ``lazy=True`` (default): Collects all validation errors across the
          entire dataset before raising an exception, providing comprehensive
          error reporting.
        - ``lazy=False``: Raises an exception immediately upon encountering the
          first validation failure, useful for debugging.

        If None is provided during instantiation, defaults to
        ``PanderaParameters()`` with default settings (``lazy=True``).

    Parameters
    ----------
    config_path : str or Path
        Path to the root configuration directory where Pandera schema scripts are
        stored. The validator will create a ``pandera_schemas`` subdirectory under
        this path to organize schema files. Can be provided as either a string or
        pathlib.Path object.
    parameters : PanderaParameters or None, optional
        Configuration parameters for validation behavior. If None (default), uses
        ``PanderaParameters()`` with default settings (``lazy=True`` for
        comprehensive error reporting).

    Raises
    ------
    TypeError
        If config_path cannot be converted to a Path object.
    OSError
        If the config_path directory does not exist and cannot be created during
        schema generation.

    See Also
    --------
    PanderaParameters : Configuration parameters for validation behavior.
    validate_data : Core validation function used internally by this validator.
    adc_toolkit.data.abs.DataValidator : Protocol that this class implements.
    adc_toolkit.data.validators.gx.GXValidator : Alternative validator using Great Expectations.
    adc_toolkit.data.ValidatedDataCatalog : Data catalog with integrated validation.

    Notes
    -----
    **Schema Script Organization**

    Schema scripts are organized hierarchically based on validation names. For
    a validation name like "raw.customers", the schema script is created at:

    .. code-block:: text

        {config_path}/pandera_schemas/raw/customers.py

    This structure mirrors typical data lake or data warehouse naming conventions
    (e.g., database.table) and supports large projects with many datasets.

    **Schema Customization Workflow**

    After first validation, edit the generated schema file to add custom checks:

    .. code-block:: python

        # {config_path}/pandera_schemas/raw/customers.py
        import pandas as pd
        import pandera.pandas as pa

        schema = pa.DataFrameSchema(
            {
                "customer_id": pa.Column(
                    "int64",
                    checks=[
                        pa.Check.greater_than(0),  # Must be positive
                        pa.Check(lambda s: s.is_unique, element_wise=False),  # Must be unique
                    ],
                ),
                "email": pa.Column(
                    "object",
                    checks=[
                        # Vectorized check: the lambda receives the whole Series
                        pa.Check(lambda s: s.str.contains("@")),
                    ],
                ),
                "signup_date": pa.Column(
                    "datetime64[ns]",
                    checks=[
                        pa.Check.less_than_or_equal_to(pd.Timestamp.now()),
                    ],
                ),
            }
        )

    **Supported Data Types**

    The validator supports:

    - **pandas DataFrames**: Primary use case with full feature support
    - **PySpark DataFrames**: Generates PySpark-compatible schemas (requires pyspark)

    **Thread Safety**

    This class is thread-safe for validation operations on existing schemas.
    However, the initial schema auto-generation is not thread-safe. If multiple
    threads validate the same dataset for the first time concurrently, race
    conditions may occur. For concurrent scenarios, pre-generate schemas or
    implement external locking.

    **Performance Considerations**

    - Schema scripts are dynamically imported on each validation call. For
      high-frequency scenarios, consider caching validator instances.
    - The ``lazy=True`` mode has slightly more overhead as it collects all errors,
      but provides significantly better developer experience for fixing issues.
    - Auto-generation only occurs once per dataset, so performance impact is
      negligible after initial schema creation.

    **Comparison with Great Expectations**

    Use PanderaValidator when:

    - You need lightweight, pandas-native validation
    - You prefer Python-based schema definitions over YAML/JSON
    - You want tight integration with type hints and static analysis
    - Your team is comfortable with code-based configuration

    Use GXValidator when:

    - You need profiling and automatic expectation generation
    - You want data documentation websites (Data Docs)
    - You need enterprise features (cloud backends, data quality dashboards)
    - You prefer declarative YAML/JSON configuration

    Examples
    --------
    Create a validator using the factory method:

    >>> from adc_toolkit.data.validators.pandera import PanderaValidator
    >>> validator = PanderaValidator.in_directory("config/validators")

    Create a validator using the constructor:

    >>> from pathlib import Path
    >>> validator = PanderaValidator(config_path=Path("config/validators"))

    Create a validator with custom parameters for fail-fast mode:

    >>> from adc_toolkit.data.validators.pandera import PanderaParameters
    >>> params = PanderaParameters(lazy=False)
    >>> validator = PanderaValidator(config_path="config/validators", parameters=params)

    Basic validation workflow:

    >>> import pandas as pd
    >>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
    >>> validator = PanderaValidator.in_directory("config/validators")
    >>> validated_df = validator.validate("raw.customers", df)
    >>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
    >>> # Subsequent runs: uses existing schema

    Handle validation failures with comprehensive error reporting:

    >>> df_invalid = pd.DataFrame(
    ...     {
    ...         "id": [1, -2, 3],  # Invalid: negative ID
    ...         "name": ["Alice", "Bob", None],  # Invalid: null name
    ...         "age": [25, 30, 150],  # Invalid: unrealistic age
    ...     }
    ... )
    >>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
    >>> try:
    ...     validator.validate("raw.customers", df_invalid)
    ... except PanderaValidationError as e:
    ...     print(f"Validation failed for: {e.table_name}")
    ...     print(f"Schema file: {e.schema_path}")
    ...     print(f"Errors: {e.original_error}")
    ...     # All validation errors are included (lazy=True)

    Integration with ValidatedDataCatalog:

    >>> from adc_toolkit.data import ValidatedDataCatalog
    >>> catalog = ValidatedDataCatalog.in_directory(
    ...     path="config", validator=PanderaValidator.in_directory("config/validators")
    ... )
    >>> df = catalog.load("raw.customers")  # Validates after loading
    >>> catalog.save("processed.customers", df)  # Validates before saving

    Iterative schema customization workflow:

    >>> # Step 1: First validation auto-generates schema
    >>> validator = PanderaValidator.in_directory("config/validators")
    >>> validator.validate("raw.customers", df)
    >>>
    >>> # Step 2: Edit generated schema to add custom checks
    >>> # File: config/validators/pandera_schemas/raw/customers.py
    >>> # Add: pa.Check.greater_than(0) to "id" column
    >>>
    >>> # Step 3: Subsequent validations use customized schema
    >>> validator.validate("raw.customers", df)  # Now enforces custom rules

    Validate PySpark DataFrame:

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
    >>> validator = PanderaValidator.in_directory("config/validators")
    >>> validated_spark = validator.validate("raw.spark_customers", spark_df)
    >>> # Generates PySpark-compatible schema with pyspark.sql.types imports

    Use in a data quality pipeline:

    >>> def quality_check_pipeline(input_df):
    ...     validator = PanderaValidator.in_directory("config/validators")
    ...
    ...     # Validate raw input
    ...     validated_input = validator.validate("raw.data", input_df)
    ...
    ...     # Transform data
    ...     transformed = transform(validated_input)
    ...
    ...     # Validate transformed output
    ...     validated_output = validator.validate("processed.data", transformed)
    ...
    ...     return validated_output

    Multiple validators for different environments:

    >>> dev_validator = PanderaValidator.in_directory("config/validators/dev")
    >>> prod_validator = PanderaValidator.in_directory("config/validators/prod")
    >>> # Use different validation rules for dev vs.
production 337 """ 338 339 def __init__(self, config_path: str | Path, parameters: PanderaParameters | None = None) -> None: 340 """ 341 Initialize a PanderaValidator instance. 342 343 Constructs a new validator configured to use schema scripts from the 344 specified directory. The constructor sets up the schema storage location 345 and validation parameters, but does not perform any I/O operations or 346 validation at initialization time. 347 348 The validator creates a logical schema directory at 349 ``{config_path}/pandera_schemas/`` where all Pandera schema scripts will 350 be stored and loaded from. This subdirectory organization keeps Pandera 351 schemas separate from other configuration files and validation frameworks 352 (e.g., Great Expectations configurations). 353 354 Parameters 355 ---------- 356 config_path : str or Path 357 Path to the root configuration directory. The validator will use a 358 ``pandera_schemas`` subdirectory under this path for storing and 359 loading schema scripts. Can be provided as either a string (which will 360 be converted to a Path) or a pathlib.Path object. The path can be 361 absolute or relative to the current working directory. 362 363 For example, if config_path is ``"config/validators"``, schema scripts 364 will be stored at ``config/validators/pandera_schemas/{category}/{dataset}.py``. 365 parameters : PanderaParameters or None, optional 366 Configuration parameters controlling validation behavior. If None 367 (default), uses ``PanderaParameters()`` with default settings 368 (``lazy=True`` for comprehensive error reporting). Provide a custom 369 ``PanderaParameters`` instance to configure validation strategy: 370 371 - ``PanderaParameters(lazy=True)``: Collect all errors (recommended) 372 - ``PanderaParameters(lazy=False)``: Fail-fast on first error 373 374 Returns 375 ------- 376 None 377 Constructor does not return a value. 
378 379 Raises 380 ------ 381 TypeError 382 If config_path cannot be converted to a Path object (e.g., if an 383 invalid type is provided like int or dict). 384 385 See Also 386 -------- 387 in_directory : Alternative factory method for creating validators. 388 validate : Perform validation on a dataset. 389 PanderaParameters : Configuration parameters for validation behavior. 390 391 Notes 392 ----- 393 **Lazy Initialization** 394 395 The constructor does not create the ``pandera_schemas`` directory at 396 initialization time. The directory is created only when the first schema 397 script is generated during validation. This lazy approach avoids 398 unnecessary file system operations if the validator is created but never 399 used. 400 401 **Path Handling** 402 403 The constructor automatically converts string paths to pathlib.Path objects 404 and appends the ``pandera_schemas`` subdirectory. This means: 405 406 .. code-block:: python 407 408 validator = PanderaValidator(config_path="config") 409 # validator.config_path is Path("config/pandera_schemas") 410 411 **Immutability** 412 413 While the validator instance itself is mutable (standard Python object), 414 the ``parameters`` attribute uses a frozen dataclass (PanderaParameters), 415 ensuring validation behavior remains consistent throughout the validator's 416 lifecycle. 417 418 **No Validation at Initialization** 419 420 This constructor only sets up the validator configuration. No validation 421 occurs until the ``validate()`` method is called. This design allows 422 validators to be created cheaply and reused across multiple validation 423 operations. 
424 425 Examples 426 -------- 427 Create a validator with default parameters: 428 429 >>> from adc_toolkit.data.validators.pandera import PanderaValidator 430 >>> validator = PanderaValidator(config_path="config/validators") 431 >>> validator.config_path 432 PosixPath('config/validators/pandera_schemas') 433 >>> validator.parameters.lazy 434 True 435 436 Create a validator with custom parameters for fail-fast mode: 437 438 >>> from adc_toolkit.data.validators.pandera import PanderaParameters 439 >>> params = PanderaParameters(lazy=False) 440 >>> validator = PanderaValidator(config_path="config/validators", parameters=params) 441 >>> validator.parameters.lazy 442 False 443 444 Using pathlib.Path for config_path: 445 446 >>> from pathlib import Path 447 >>> config_dir = Path("config") / "validators" 448 >>> validator = PanderaValidator(config_path=config_dir) 449 >>> validator.config_path 450 PosixPath('config/validators/pandera_schemas') 451 452 Create multiple validators for different schema directories: 453 454 >>> dev_validator = PanderaValidator(config_path="config/dev/validators") 455 >>> prod_validator = PanderaValidator(config_path="config/prod/validators") 456 >>> # Each validator uses a separate schema directory 457 458 Reuse a validator for multiple validations: 459 460 >>> validator = PanderaValidator(config_path="config/validators") 461 >>> validated_df1 = validator.validate("dataset1", df1) 462 >>> validated_df2 = validator.validate("dataset2", df2) 463 >>> # Same validator instance, different datasets 464 """ 465 self.config_path = Path(config_path) / "pandera_schemas" 466 self.parameters = parameters or PanderaParameters() 467 468 @classmethod 469 def in_directory(cls, path: str | Path, parameters: PanderaParameters | None = None) -> "PanderaValidator": 470 """ 471 Create a PanderaValidator from a configuration directory (factory method). 472 473 This is the recommended factory method for creating PanderaValidator 474 instances. 
It provides a consistent interface with other toolkit components 475 (e.g., ``ValidatedDataCatalog.in_directory()``, ``KedroDataCatalog.in_directory()``) 476 and follows the factory pattern for object construction from configuration. 477 478 The method is functionally equivalent to calling the constructor directly, 479 but provides better semantic clarity in code that uses multiple toolkit 480 components with directory-based configuration. 481 482 Parameters 483 ---------- 484 path : str or Path 485 Path to the root configuration directory where Pandera schema scripts 486 are stored. The validator will use a ``pandera_schemas`` subdirectory 487 under this path. Can be provided as either a string or pathlib.Path 488 object. The path can be absolute or relative to the current working 489 directory. 490 491 For example, if path is ``"config/validators"``, schema scripts will 492 be stored at ``config/validators/pandera_schemas/{category}/{dataset}.py``. 493 parameters : PanderaParameters or None, optional 494 Configuration parameters controlling validation behavior. If None 495 (default), uses ``PanderaParameters()`` with default settings 496 (``lazy=True`` for comprehensive error reporting). Provide a custom 497 ``PanderaParameters`` instance to configure validation strategy: 498 499 - ``PanderaParameters(lazy=True)``: Collect all errors (recommended) 500 - ``PanderaParameters(lazy=False)``: Fail-fast on first error 501 502 Returns 503 ------- 504 PanderaValidator 505 A new validator instance configured to use schema scripts from the 506 specified directory. The returned validator is ready to use for 507 validation operations via the ``validate()`` method. 508 509 Raises 510 ------ 511 TypeError 512 If path cannot be converted to a Path object (e.g., if an invalid 513 type is provided like int or dict). 514 515 See Also 516 -------- 517 __init__ : Alternative constructor for creating validators. 518 validate : Perform validation on a dataset. 
519 PanderaParameters : Configuration parameters for validation behavior. 520 adc_toolkit.data.ValidatedDataCatalog.in_directory : Similar factory method 521 for creating validated data catalogs. 522 523 Notes 524 ----- 525 **Factory Pattern** 526 527 This method implements the factory pattern, providing a standard interface 528 for creating validators from directory-based configuration. The pattern is 529 used consistently across the toolkit: 530 531 .. code-block:: python 532 533 # Similar patterns across toolkit components 534 catalog = KedroDataCatalog.in_directory("config/") 535 validator = PanderaValidator.in_directory("config/validators") 536 gx_validator = GXValidator.in_directory("config/gx") 537 538 **Semantic Clarity** 539 540 Using ``in_directory()`` instead of the constructor makes code more 541 readable and self-documenting: 542 543 .. code-block:: python 544 545 # Clear intent: validator configured from this directory 546 validator = PanderaValidator.in_directory("config/validators") 547 548 # vs. less clear constructor call 549 validator = PanderaValidator("config/validators") 550 551 **Design Rationale** 552 553 The factory method pattern is preferred over direct constructor calls in 554 the toolkit because: 555 556 1. Provides consistent API across all components 557 2. Makes the configuration-from-directory pattern explicit 558 3. Allows future extension with additional factory methods 559 4. Improves code readability and maintainability 560 561 **Usage in ValidatedDataCatalog** 562 563 This method is commonly used when configuring ``ValidatedDataCatalog`` 564 with a custom validator: 565 566 .. 
code-block:: python 567 568 from adc_toolkit.data import ValidatedDataCatalog 569 from adc_toolkit.data.validators.pandera import PanderaValidator 570 571 catalog = ValidatedDataCatalog.in_directory( 572 path="config", validator=PanderaValidator.in_directory("config/validators") 573 ) 574 575 Examples 576 -------- 577 Create a validator using the factory method (recommended): 578 579 >>> from adc_toolkit.data.validators.pandera import PanderaValidator 580 >>> validator = PanderaValidator.in_directory("config/validators") 581 >>> validator.config_path 582 PosixPath('config/validators/pandera_schemas') 583 584 Create a validator with custom parameters: 585 586 >>> from adc_toolkit.data.validators.pandera import PanderaParameters 587 >>> params = PanderaParameters(lazy=False) 588 >>> validator = PanderaValidator.in_directory(path="config/validators", parameters=params) 589 >>> validator.parameters.lazy 590 False 591 592 Using pathlib.Path: 593 594 >>> from pathlib import Path 595 >>> config_dir = Path("config") / "validators" 596 >>> validator = PanderaValidator.in_directory(path=config_dir) 597 598 Integration with ValidatedDataCatalog: 599 600 >>> from adc_toolkit.data import ValidatedDataCatalog 601 >>> catalog = ValidatedDataCatalog.in_directory( 602 ... path="config", validator=PanderaValidator.in_directory("config/validators") 603 ... 
) 604 >>> # ValidatedDataCatalog uses the PanderaValidator for all load/save ops 605 606 Consistent API across different validators: 607 608 >>> from adc_toolkit.data.validators.pandera import PanderaValidator 609 >>> from adc_toolkit.data.validators.gx import GXValidator 610 >>> pandera_val = PanderaValidator.in_directory("config/pandera") 611 >>> gx_val = GXValidator.in_directory("config/gx") 612 >>> # Both use the same factory method pattern 613 614 Multiple validators for different environments: 615 616 >>> dev_validator = PanderaValidator.in_directory("config/dev/validators") 617 >>> staging_validator = PanderaValidator.in_directory("config/staging/validators") 618 >>> prod_validator = PanderaValidator.in_directory("config/prod/validators") 619 >>> # Each environment can have different validation rules 620 621 Dependency injection pattern: 622 623 >>> def create_pipeline(validator_path: str): 624 ... validator = PanderaValidator.in_directory(validator_path) 625 ... return DataPipeline(validator=validator) 626 >>> # Easy to swap validator configurations 627 >>> dev_pipeline = create_pipeline("config/dev/validators") 628 >>> prod_pipeline = create_pipeline("config/prod/validators") 629 """ 630 return cls(path, parameters=parameters) 631 632 def validate(self, name: str, data: Data) -> Data: 633 """ 634 Validate a dataset against its Pandera schema. 635 636 This is the primary validation method that implements the DataValidator 637 protocol. It orchestrates the complete validation workflow, from automatic 638 schema generation (if needed) to validation execution, providing seamless 639 data quality checking with minimal setup. 640 641 The method delegates to the ``validate_data`` function, which implements 642 a two-phase validation approach: 643 644 **Phase 1: Schema Preparation** (First Use Only) 645 If no schema script exists for the dataset name, automatically generate 646 one by introspecting the data structure. 
The generated schema is saved 647 at ``{self.config_path}/{category}/{dataset}.py`` and serves as an 648 editable template for adding custom validation rules. 649 650 **Phase 2: Validation Execution** (Every Use) 651 Load the schema script as a Python module, extract the 652 ``DataFrameSchema`` object, and execute validation using Pandera's 653 ``schema.validate(data, lazy=self.parameters.lazy)`` method. Return 654 validated data if all checks pass, or raise ``PanderaValidationError`` 655 with comprehensive error details if validation fails. 656 657 This design enables rapid prototyping (no upfront schema creation required) 658 while supporting iterative refinement (schemas can be customized after 659 auto-generation). Schema scripts are version-controlled Python files, 660 facilitating team collaboration and schema evolution tracking. 661 662 Parameters 663 ---------- 664 name : str 665 The dataset name/identifier that determines which schema script to use. 666 Should follow the convention ``"category.dataset_name"`` (e.g., 667 ``"raw.customers"``, ``"processed.sales"``). The name serves multiple 668 purposes: 669 670 - Determines schema file location: ``{config_path}/{category}/{dataset}.py`` 671 - Provides context in validation error messages 672 - Enables logical organization of schemas by data pipeline stage 673 674 Names with a single dot separator create a two-level directory 675 structure. For example, ``"raw.customers"`` creates a schema at 676 ``{self.config_path}/raw/customers.py``. 677 data : Data 678 The data object to validate. Must be a protocol-compliant Data object 679 (pandas DataFrame, PySpark DataFrame, etc.) with ``columns`` and 680 ``dtypes`` attributes. The data structure is validated against the 681 schema defined in (or auto-generated for) the corresponding schema 682 script. 
683 684 Supported types: 685 - ``pandas.DataFrame``: Primary use case with full feature support 686 - ``pyspark.sql.DataFrame``: Requires pyspark installation 687 688 Returns 689 ------- 690 Data 691 The validated data object. If validation passes all checks defined in 692 the schema, returns the original data object (potentially with 693 Pandera-applied type coercions if configured in the schema). The return 694 type matches the input data type (pandas in, pandas out; PySpark in, 695 PySpark out). 696 697 The returned data can be used immediately in downstream processing with 698 confidence that it meets all defined quality requirements. 699 700 Raises 701 ------ 702 PanderaValidationError 703 Raised when data validation fails against the schema. This custom 704 exception wraps Pandera's underlying SchemaError or SchemaErrors and 705 enriches it with additional context: 706 707 Attributes of PanderaValidationError: 708 - ``table_name``: The dataset name that failed validation 709 - ``schema_path``: Full filesystem path to the schema script file 710 - ``original_error``: The underlying Pandera error with detailed 711 validation failure information (row indices, column names, observed 712 values, expected constraints) 713 714 With ``lazy=True`` (default), the exception includes all validation 715 errors across the entire dataset. With ``lazy=False``, it includes only 716 the first error encountered. 717 ValueError 718 Raised if the dataframe type is not supported (neither pandas nor 719 pyspark), originating from the schema compiler during auto-generation. 720 ModuleNotFoundError 721 Raised if the schema script cannot be imported, typically indicating 722 a Python syntax error in a manually edited schema file. Check the 723 schema file for syntax errors or import statement issues. 724 AttributeError 725 Raised if the schema script module does not define a ``schema`` 726 attribute, indicating the schema file structure is invalid. 
The schema 727 file must contain ``schema = pa.DataFrameSchema(...)`` at module level. 728 OSError 729 Raised if there are file system permissions issues preventing schema 730 file creation (during auto-generation) or reading (during validation). 731 732 See Also 733 -------- 734 validate_data : The underlying validation function called by this method. 735 PanderaParameters : Configuration parameters controlling validation behavior. 736 PanderaValidationError : Custom exception raised on validation failure. 737 in_directory : Factory method for creating validator instances. 738 adc_toolkit.data.ValidatedDataCatalog : Data catalog with integrated validation. 739 740 Notes 741 ----- 742 **Validation Workflow** 743 744 The complete sequence executed by this method: 745 746 1. Check if schema file exists at ``{self.config_path}/{category}/{dataset}.py`` 747 2. If missing, auto-generate schema by introspecting data structure 748 3. Load schema file as a Python module using dynamic import 749 4. Extract the ``schema`` DataFrameSchema object from the module 750 5. Call ``schema.validate(data, lazy=self.parameters.lazy)`` 751 6. Return validated data if all checks pass 752 7. Raise PanderaValidationError with context if validation fails 753 754 **Schema Customization After Auto-Generation** 755 756 After first validation, edit the generated schema file to add domain-specific 757 validation rules: 758 759 .. 
code-block:: python 760 761 # {self.config_path}/raw/customers.py (auto-generated) 762 import pandera.pandas as pa 763 764 # Insert your additional checks to `checks` list parameter 765 schema = pa.DataFrameSchema( 766 { 767 "customer_id": pa.Column( 768 "int64", 769 checks=[ 770 pa.Check.greater_than(0), # Add: IDs must be positive 771 pa.Check(lambda s: s.is_unique, element_wise=False), # Add: unique 772 ], 773 ), 774 "email": pa.Column( 775 "object", 776 checks=[ 777 pa.Check(lambda s: s.str.contains("@")), # Add: valid email 778 ], 779 ), 780 "age": pa.Column( 781 "int64", 782 checks=[ 783 pa.Check.in_range(0, 120), # Add: realistic age range 784 ], 785 ), 786 } 787 ) 788 789 **Error Reporting: Lazy vs. Fail-Fast** 790 791 The ``parameters.lazy`` setting significantly affects error reporting: 792 793 **Lazy Mode (lazy=True, default)**: Recommended for production 794 - Collects all validation errors across entire dataset 795 - Provides comprehensive error report in single validation run 796 - Higher overhead but better developer experience 797 - Example: "Found 47 validation errors in columns 'age', 'email'" 798 799 **Fail-Fast Mode (lazy=False)**: Useful for debugging 800 - Raises exception on first validation failure 801 - Lower overhead, faster failure detection 802 - Requires multiple validation runs to find all errors 803 - Example: "Row 23: age value 150 exceeds maximum 120" 804 805 **Performance Considerations** 806 807 - Schema scripts are dynamically imported on each validation call. For 808 high-frequency validation scenarios (e.g., streaming data), consider 809 caching the validator instance and reusing it across validations. 810 - Auto-generation only occurs once per dataset. After initial schema 811 creation, there's no performance penalty for the auto-generation feature. 812 - Large dataset validation can be expensive. Consider sampling strategies 813 for very large datasets if full validation is not required. 
814 815 **Thread Safety** 816 817 This method is thread-safe for validation operations on existing schemas 818 (multiple threads can call ``validate()`` concurrently on different datasets). 819 However, the initial schema auto-generation is not thread-safe. If multiple 820 threads validate the same dataset for the first time concurrently, race 821 conditions may occur. For concurrent first-time validation, implement 822 external locking or pre-generate schemas. 823 824 **Integration with Data Pipelines** 825 826 This method integrates seamlessly with data pipeline workflows: 827 828 .. code-block:: python 829 830 def pipeline_stage(validator, input_data): 831 # Validate input from previous stage 832 validated_input = validator.validate("stage_input", input_data) 833 834 # Process with confidence that data meets requirements 835 processed = transform(validated_input) 836 837 # Validate output before passing to next stage 838 validated_output = validator.validate("stage_output", processed) 839 840 return validated_output 841 842 Examples 843 -------- 844 Basic validation with auto-generated schema: 845 846 >>> import pandas as pd 847 >>> from adc_toolkit.data.validators.pandera import PanderaValidator 848 >>> validator = PanderaValidator.in_directory("config/validators") 849 >>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]}) 850 >>> validated = validator.validate("raw.customers", df) 851 >>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py 852 >>> # Subsequent runs: uses existing schema 853 >>> print(validated) 854 id name age 855 0 1 Alice 25 856 1 2 Bob 30 857 2 3 Charlie 35 858 859 Validation failure with comprehensive error reporting (lazy=True): 860 861 >>> df_invalid = pd.DataFrame( 862 ... { 863 ... "id": [1, -2, 3], # Invalid: negative ID (if custom check added) 864 ... "name": ["Alice", "Bob", None], # Invalid: null name 865 ... 
"age": [25, 30, 150], # Invalid: unrealistic age (if custom check added) 866 ... } 867 ... ) 868 >>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError 869 >>> try: 870 ... validator.validate("raw.customers", df_invalid) 871 ... except PanderaValidationError as e: 872 ... print(f"Validation failed for table: {e.table_name}") 873 ... print(f"Schema location: {e.schema_path}") 874 ... print(f"Error details: {e.original_error}") 875 ... # All validation errors are reported together (lazy=True) 876 877 Fail-fast validation for debugging: 878 879 >>> from adc_toolkit.data.validators.pandera import PanderaParameters 880 >>> validator_debug = PanderaValidator.in_directory( 881 ... path="config/validators", parameters=PanderaParameters(lazy=False) 882 ... ) 883 >>> try: 884 ... validator_debug.validate("raw.customers", df_invalid) 885 ... except Exception as e: 886 ... print(f"First error encountered: {e.original_error}") 887 ... # Only the first validation error is reported (lazy=False) 888 889 Validate multiple datasets with same validator: 890 891 >>> validator = PanderaValidator.in_directory("config/validators") 892 >>> customers = validator.validate("raw.customers", customers_df) 893 >>> orders = validator.validate("raw.orders", orders_df) 894 >>> products = validator.validate("raw.products", products_df) 895 >>> # Reuse same validator instance for efficiency 896 897 Validation in a data processing pipeline: 898 899 >>> def process_customer_data(): 900 ... validator = PanderaValidator.in_directory("config/validators") 901 ... 902 ... # Load raw data 903 ... raw_df = load_raw_customers() 904 ... 905 ... # Validate input 906 ... validated_input = validator.validate("raw.customers", raw_df) 907 ... 908 ... # Process with confidence 909 ... processed_df = transform_customers(validated_input) 910 ... 911 ... # Validate output 912 ... validated_output = validator.validate("processed.customers", processed_df) 913 ... 914 ... 
return validated_output 915 916 Using with PySpark DataFrame: 917 918 >>> from pyspark.sql import SparkSession 919 >>> spark = SparkSession.builder.getOrCreate() 920 >>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"]) 921 >>> validator = PanderaValidator.in_directory("config/validators") 922 >>> validated_spark = validator.validate("raw.spark_customers", spark_df) 923 >>> # Returns validated PySpark DataFrame 924 925 Integration with ValidatedDataCatalog (automatic validation): 926 927 >>> from adc_toolkit.data import ValidatedDataCatalog 928 >>> catalog = ValidatedDataCatalog.in_directory( 929 ... path="config", validator=PanderaValidator.in_directory("config/validators") 930 ... ) 931 >>> # ValidatedDataCatalog internally calls validator.validate() 932 >>> df = catalog.load("raw.customers") # Validates after loading 933 >>> catalog.save("processed.customers", df) # Validates before saving 934 """ 935 return validate_data(name, data, self.config_path, self.parameters)
Pandera-based data validator with automatic schema generation.
PanderaValidator is a concrete implementation of the DataValidator protocol that uses Pandera (https://pandera.readthedocs.io/) for schema-based data validation. It provides a seamless validation workflow that combines automatic schema generation with manual customization capabilities.
The validator orchestrates a two-phase approach to data validation:
Phase 1: Schema Management (Auto-generation)
On first validation of a dataset, the validator automatically generates
a Pandera schema script by introspecting the data structure (column names,
data types). The generated schema is saved as an editable Python file at
{config_path}/pandera_schemas/{category}/{dataset}.py, where the
category and dataset name are derived from the validation name (e.g.,
"raw.customers" creates raw/customers.py).
Phase 2: Validation Execution (Rule Enforcement)
On all validations (including first use), the validator loads the schema
script and executes validation against the data using Pandera's
DataFrameSchema.validate() method. If validation fails, it raises a
detailed PanderaValidationError with comprehensive error information.
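The loading step in Phase 2 can be pictured with the standard library alone. The following is a minimal sketch, not the toolkit's actual loader: the helper name load_schema is hypothetical, and a plain dict stands in for the DataFrameSchema so the example runs without Pandera installed.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_schema(script_path: Path):
    """Import a schema script by file path and return its module-level ``schema``.

    Hypothetical helper for illustration; the toolkit's loader may differ.
    """
    spec = importlib.util.spec_from_file_location(script_path.stem, script_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load schema script: {script_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the script; a SyntaxError propagates
    return module.schema  # AttributeError if the script defines no `schema`


# Demonstration with a stand-in script (a dict instead of a DataFrameSchema).
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "customers.py"
    script.write_text('schema = {"id": "int64", "name": "object"}\n')
    loaded = load_schema(script)
```

Because the script is executed on import, whatever object it binds to schema at module level is what validation runs against, which is why hand edits to the file take effect on the next call.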
This design enables an iterative workflow:
- Run validation immediately without manual schema creation
- Review auto-generated schemas and add custom validation rules
- Commit schemas to version control for team collaboration
- Evolve schemas as data requirements change over time
The validator integrates seamlessly with ValidatedDataCatalog to provide
automatic validation on all data load and save operations, ensuring data
quality throughout the entire data pipeline.
Attributes
- config_path (Path):
The directory where Pandera schema scripts are stored and loaded from. This attribute holds the resolved path, that is, the config_path constructor argument with pandera_schemas appended. For example, constructing the validator with config_path=Path("config/validators") sets this attribute to config/validators/pandera_schemas, and schemas are stored at config/validators/pandera_schemas/{category}/{dataset}.py.
- parameters (PanderaParameters):
Configuration parameters controlling validation behavior. The primary parameter is lazy, which determines the error collection strategy:
  - lazy=True (default): Collects all validation errors across the entire dataset before raising an exception, providing comprehensive error reporting.
  - lazy=False: Raises an exception immediately upon encountering the first validation failure, useful for debugging.
If None is provided during instantiation, defaults to PanderaParameters() with default settings (lazy=True).
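The difference between the two strategies can be illustrated without Pandera. This is a toy sketch of the collect-all versus fail-fast idea only; run_checks, ToyValidationError, and collect are hypothetical names, and Pandera's own implementation and error objects are different.

```python
class ToyValidationError(Exception):
    """Stand-in error that carries a list of failure messages."""

    def __init__(self, failures):
        super().__init__("; ".join(failures))
        self.failures = failures


def run_checks(rows, checks, lazy=True):
    """Apply every (label, predicate) check to every row."""
    failures = []
    for index, row in enumerate(rows):
        for label, predicate in checks:
            if predicate(row):
                continue
            message = f"row {index}: {label}"
            if not lazy:
                # Fail-fast: stop at the first failing check.
                raise ToyValidationError([message])
            failures.append(message)  # Lazy: keep going and collect.
    if failures:
        raise ToyValidationError(failures)
    return rows


rows = [{"id": 1, "age": 25}, {"id": -2, "age": 30}, {"id": 3, "age": 150}]
checks = [
    ("id must be positive", lambda r: r["id"] > 0),
    ("age must be between 0 and 120", lambda r: 0 <= r["age"] <= 120),
]


def collect(lazy):
    """Return the failure messages reported for the sample rows."""
    try:
        run_checks(rows, checks, lazy=lazy)
    except ToyValidationError as error:
        return error.failures
    return []
```

With the sample rows, collect(True) reports both the negative id and the out-of-range age, while collect(False) reports only the negative id, so a second run would be needed to discover the age problem.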
Parameters
- config_path (str or Path):
Path to the root configuration directory where Pandera schema scripts are stored. The validator will create a pandera_schemas subdirectory under this path to organize schema files. Can be provided as either a string or pathlib.Path object.
- parameters (PanderaParameters or None, optional):
Configuration parameters for validation behavior. If None (default), uses PanderaParameters() with default settings (lazy=True for comprehensive error reporting).
Raises
- TypeError: If config_path cannot be converted to a Path object.
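- OSError: If the config_path directory does not exist and cannot be created during schema generation. Because the constructor performs no file system operations, this surfaces on the first validate() call for a dataset rather than at instantiation.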
See Also
PanderaParameters: Configuration parameters for validation behavior.
validate_data: Core validation function used internally by this validator.
adc_toolkit.data.abs.DataValidator: Protocol that this class implements.
adc_toolkit.data.validators.gx.GXValidator: Alternative validator using Great Expectations.
adc_toolkit.data.ValidatedDataCatalog: Data catalog with integrated validation.
Notes
Schema Script Organization
Schema scripts are organized hierarchically based on validation names. For a validation name like "raw.customers", the schema script is created at:
{config_path}/pandera_schemas/raw/customers.py
This structure mirrors typical data lake or data warehouse naming conventions (e.g., database.table) and supports large projects with many datasets.
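The mapping from validation name to schema file can be sketched as a small path computation. This is an illustration under stated assumptions: schema_script_path is a hypothetical helper, and rejecting names without a dot is a choice made for this sketch, not documented toolkit behavior.

```python
from pathlib import Path


def schema_script_path(config_path, name):
    """Return the schema script location for a "category.dataset" name.

    Hypothetical helper mirroring the documented layout.
    """
    category, _, dataset = name.partition(".")
    if not dataset:
        # Assumption for this sketch: names must contain a dot separator.
        raise ValueError(f"Expected 'category.dataset', got {name!r}")
    return Path(config_path) / "pandera_schemas" / category / f"{dataset}.py"


customers_path = schema_script_path("config/validators", "raw.customers")
```

Here customers_path is config/validators/pandera_schemas/raw/customers.py, matching the layout described above.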
Schema Customization Workflow
After first validation, edit the generated schema file to add custom checks:
# {config_path}/pandera_schemas/raw/customers.py
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(
            "int64",
            checks=[
                pa.Check.greater_than(0),  # Must be positive
                pa.Check(lambda s: s.is_unique, element_wise=False),  # Must be unique
            ],
        ),
        "email": pa.Column(
            "object",
            checks=[
                # Vectorized check: the lambda receives the whole column as a Series
                pa.Check(lambda s: s.str.contains("@")),
            ],
        ),
        "signup_date": pa.Column(
            "datetime64[ns]",
            checks=[
                # The timestamp is evaluated once, when the schema script is imported
                pa.Check.less_than_or_equal_to(pd.Timestamp.now()),
            ],
        ),
    }
)
Supported Data Types
The validator supports:
- pandas DataFrames: Primary use case with full feature support
- PySpark DataFrames: Generates PySpark-compatible schemas (requires pyspark)
Thread Safety
This class is thread-safe for validation operations on existing schemas. However, the initial schema auto-generation is not thread-safe. If multiple threads validate the same dataset for the first time concurrently, race conditions may occur. For concurrent scenarios, pre-generate schemas or implement external locking.
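One way to implement the external locking mentioned above is a lock per dataset name. This is a minimal sketch, assuming the validator is used through a wrapper; `validate_with_lock` is a hypothetical helper, not part of adc-toolkit. Note that it serializes every call for a given name, not only the first one, which trades some parallelism for simplicity.

```python
import threading
from collections import defaultdict
from typing import Any, Callable

# One lock per dataset name; the registry itself is guarded by its own lock.
_registry_lock = threading.Lock()
_name_locks = defaultdict(threading.Lock)


def validate_with_lock(validate: Callable[[str, Any], Any], name: str, data: Any) -> Any:
    """Serialize calls for the same dataset name; different names run in parallel."""
    with _registry_lock:
        lock = _name_locks[name]
    with lock:
        return validate(name, data)
```

A call would then look like `validate_with_lock(validator.validate, "raw.customers", df)`.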
Performance Considerations
- Schema scripts are dynamically imported on each validation call. For high-frequency scenarios, consider caching validator instances.
- The lazy=True mode has slightly more overhead as it collects all errors, but provides significantly better developer experience for fixing issues.
- Auto-generation only occurs once per dataset, so performance impact is negligible after initial schema creation.
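To make the cost of dynamic import concrete, here is a minimal sketch of loading a Python file by path with the standard library. Whether adc-toolkit uses exactly this mechanism is an assumption; `load_script` is a hypothetical helper, and a plain dict stands in for the DataFrameSchema object.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_script(path: Path) -> ModuleType:
    """Import a Python file by path and return the resulting module object."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script top to bottom
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "customers.py"
    # A plain dict stands in for the schema object defined by a real script.
    script.write_text('schema = {"customer_id": "int64"}\n')
    print(load_script(script).schema)  # {'customer_id': 'int64'}
```

Because the script is executed on every load, any edit to the file takes effect on the next validation call without restarting the process.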
Comparison with Great Expectations
Use PanderaValidator when:
- You need lightweight, pandas-native validation
- You prefer Python-based schema definitions over YAML/JSON
- You want tight integration with type hints and static analysis
- Your team is comfortable with code-based configuration
Use GXValidator when:
- You need profiling and automatic expectation generation
- You want data documentation websites (Data Docs)
- You need enterprise features (cloud backends, data quality dashboards)
- You prefer declarative YAML/JSON configuration
Examples
Create a validator using the factory method:
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")
Create a validator using the constructor:
>>> from pathlib import Path
>>> validator = PanderaValidator(config_path=Path("config/validators"))
Create a validator with custom parameters for fail-fast mode:
>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters(lazy=False)
>>> validator = PanderaValidator(config_path="config/validators", parameters=params)
Basic validation workflow:
>>> import pandas as pd
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_df = validator.validate("raw.customers", df)
>>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
>>> # Subsequent runs: uses existing schema
Handle validation failures with comprehensive error reporting:
>>> df_invalid = pd.DataFrame(
...     {
...         "id": [1, -2, 3],  # Invalid: negative ID
...         "name": ["Alice", "Bob", None],  # Invalid: null name
...         "age": [25, 30, 150],  # Invalid: unrealistic age
...     }
... )
>>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
>>> try:
...     validator.validate("raw.customers", df_invalid)
... except PanderaValidationError as e:
...     print(f"Validation failed for: {e.table_name}")
...     print(f"Schema file: {e.schema_path}")
...     print(f"Errors: {e.original_error}")
...     # All validation errors are included (lazy=True)
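The wrapping behaviour behind such an exception can be sketched with exception chaining. This is an illustration only: `WrappedValidationError` and `validate_or_wrap` are hypothetical stand-ins, the constructor signature is an assumption, and ValueError stands in for Pandera's SchemaError or SchemaErrors. Only the attribute names table_name, schema_path, and original_error come from the documentation above.

```python
from typing import Callable


class WrappedValidationError(Exception):
    """Hypothetical stand-in for PanderaValidationError."""

    def __init__(self, table_name: str, schema_path: str, original_error: Exception) -> None:
        super().__init__(
            f"validation failed for {table_name!r} (schema: {schema_path}): {original_error}"
        )
        self.table_name = table_name
        self.schema_path = schema_path
        self.original_error = original_error


def validate_or_wrap(table_name: str, schema_path: str, check: Callable[[], None]) -> None:
    """Run a check and re-raise any failure with table and schema context."""
    try:
        check()
    except ValueError as error:  # stands in for the underlying Pandera error
        # "from error" keeps the original traceback reachable via __cause__.
        raise WrappedValidationError(table_name, schema_path, error) from error
```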
Integration with ValidatedDataCatalog:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> df = catalog.load("raw.customers") # Validates after loading
>>> catalog.save("processed.customers", df) # Validates before saving
Iterative schema customization workflow:
>>> # Step 1: First validation auto-generates schema
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validator.validate("raw.customers", df)
>>>
>>> # Step 2: Edit generated schema to add custom checks
>>> # File: config/validators/pandera_schemas/raw/customers.py
>>> # Add: pa.Check.greater_than(0) to "id" column
>>>
>>> # Step 3: Subsequent validations use customized schema
>>> validator.validate("raw.customers", df) # Now enforces custom rules
Validate PySpark DataFrame:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_spark = validator.validate("raw.spark_customers", spark_df)
>>> # Generates PySpark-compatible schema with pyspark.sql.types imports
Use in a data quality pipeline:
>>> def quality_check_pipeline(input_df):
...     validator = PanderaValidator.in_directory("config/validators")
...
...     # Validate raw input
...     validated_input = validator.validate("raw.data", input_df)
...
...     # Transform data (transform is a user-defined function)
...     transformed = transform(validated_input)
...
...     # Validate transformed output
...     validated_output = validator.validate("processed.data", transformed)
...
...     return validated_output
Multiple validators for different environments:
>>> dev_validator = PanderaValidator.in_directory("config/validators/dev")
>>> prod_validator = PanderaValidator.in_directory("config/validators/prod")
>>> # Use different validation rules for dev vs. production
def __init__(self, config_path: str | Path, parameters: PanderaParameters | None = None) -> None:
    self.config_path = Path(config_path) / "pandera_schemas"
    self.parameters = parameters or PanderaParameters()
Initialize a PanderaValidator instance.
Constructs a new validator configured to use schema scripts from the specified directory. The constructor sets up the schema storage location and validation parameters, but does not perform any I/O operations or validation at initialization time.
The validator creates a logical schema directory at
{config_path}/pandera_schemas/ where all Pandera schema scripts will
be stored and loaded from. This subdirectory organization keeps Pandera
schemas separate from other configuration files and validation frameworks
(e.g., Great Expectations configurations).
Parameters
- config_path (str or Path): Path to the root configuration directory. The validator will use a pandera_schemas subdirectory under this path for storing and loading schema scripts. Can be provided as either a string (which will be converted to a Path) or a pathlib.Path object. The path can be absolute or relative to the current working directory. For example, if config_path is "config/validators", schema scripts will be stored at config/validators/pandera_schemas/{category}/{dataset}.py.
- parameters (PanderaParameters or None, optional): Configuration parameters controlling validation behavior. If None (default), uses PanderaParameters() with default settings (lazy=True for comprehensive error reporting). Provide a custom PanderaParameters instance to configure validation strategy:
  - PanderaParameters(lazy=True): Collect all errors (recommended)
  - PanderaParameters(lazy=False): Fail-fast on first error
Returns
- None: Constructor does not return a value.
Raises
- TypeError: If config_path cannot be converted to a Path object (e.g., if an invalid type is provided like int or dict).
See Also
in_directory: Alternative factory method for creating validators.
validate: Perform validation on a dataset.
PanderaParameters: Configuration parameters for validation behavior.
Notes
Lazy Initialization
The constructor does not create the pandera_schemas directory at
initialization time. The directory is created only when the first schema
script is generated during validation. This lazy approach avoids
unnecessary file system operations if the validator is created but never
used.
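Deferred directory creation of this kind is commonly done by creating parent directories at write time. The following is a minimal sketch under that assumption; `write_schema_script` is a hypothetical helper, not the toolkit's own function.

```python
import tempfile
from pathlib import Path


def write_schema_script(path: Path, source: str) -> None:
    """Create missing parent directories only when a script is actually written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(source)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "config" / "pandera_schemas"
    print(root.exists())  # False: nothing is created until the first write
    write_schema_script(root / "raw" / "customers.py", "schema = None\n")
    print((root / "raw" / "customers.py").is_file())  # True
```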
Path Handling
The constructor automatically converts string paths to pathlib.Path objects
and appends the pandera_schemas subdirectory. This means:
validator = PanderaValidator(config_path="config")
# validator.config_path is Path("config/pandera_schemas")
Immutability
While the validator instance itself is mutable (standard Python object),
the parameters attribute uses a frozen dataclass (PanderaParameters),
ensuring validation behavior remains consistent throughout the validator's
lifecycle.
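The distinction between a frozen parameters object and a mutable holder can be shown with the standard library alone. `Params` and `Holder` below are hypothetical stand-ins for PanderaParameters and the validator; the `parameters or Params()` default mirrors the constructor shown earlier.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for PanderaParameters."""

    lazy: bool = True


class Holder:
    """Hypothetical stand-in for the validator's parameter handling."""

    def __init__(self, parameters: Optional[Params] = None) -> None:
        self.parameters = parameters or Params()


holder = Holder()
try:
    holder.parameters.lazy = False  # fields of a frozen dataclass are read-only
except FrozenInstanceError:
    print("parameters cannot be modified in place")

# The attribute itself can still be rebound to a new instance.
holder.parameters = Params(lazy=False)
```

The frozen dataclass therefore protects the fields of a given parameters object, while rebinding the attribute on the holder remains possible.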
No Validation at Initialization
This constructor only sets up the validator configuration. No validation
occurs until the validate() method is called. This design allows
validators to be created cheaply and reused across multiple validation
operations.
Examples
Create a validator with default parameters:
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator(config_path="config/validators")
>>> validator.config_path
PosixPath('config/validators/pandera_schemas')
>>> validator.parameters.lazy
True
Create a validator with custom parameters for fail-fast mode:
>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters(lazy=False)
>>> validator = PanderaValidator(config_path="config/validators", parameters=params)
>>> validator.parameters.lazy
False
Using pathlib.Path for config_path:
>>> from pathlib import Path
>>> config_dir = Path("config") / "validators"
>>> validator = PanderaValidator(config_path=config_dir)
>>> validator.config_path
PosixPath('config/validators/pandera_schemas')
Create multiple validators for different schema directories:
>>> dev_validator = PanderaValidator(config_path="config/dev/validators")
>>> prod_validator = PanderaValidator(config_path="config/prod/validators")
>>> # Each validator uses a separate schema directory
Reuse a validator for multiple validations:
>>> validator = PanderaValidator(config_path="config/validators")
>>> validated_df1 = validator.validate("dataset1", df1)
>>> validated_df2 = validator.validate("dataset2", df2)
>>> # Same validator instance, different datasets
@classmethod
def in_directory(cls, path: str | Path, parameters: PanderaParameters | None = None) -> "PanderaValidator":
    return cls(path, parameters=parameters)
Create a PanderaValidator from a configuration directory (factory method).
This is the recommended factory method for creating PanderaValidator
instances. It provides a consistent interface with other toolkit components
(e.g., ValidatedDataCatalog.in_directory(), KedroDataCatalog.in_directory())
and follows the factory pattern for object construction from configuration.
The method is functionally equivalent to calling the constructor directly, but provides better semantic clarity in code that uses multiple toolkit components with directory-based configuration.
Parameters
- path (str or Path): Path to the root configuration directory where Pandera schema scripts are stored. The validator will use a pandera_schemas subdirectory under this path. Can be provided as either a string or pathlib.Path object. The path can be absolute or relative to the current working directory. For example, if path is "config/validators", schema scripts will be stored at config/validators/pandera_schemas/{category}/{dataset}.py.
- parameters (PanderaParameters or None, optional): Configuration parameters controlling validation behavior. If None (default), uses PanderaParameters() with default settings (lazy=True for comprehensive error reporting). Provide a custom PanderaParameters instance to configure validation strategy:
  - PanderaParameters(lazy=True): Collect all errors (recommended)
  - PanderaParameters(lazy=False): Fail-fast on first error
Returns
- PanderaValidator: A new validator instance configured to use schema scripts from the specified directory. The returned validator is ready to use for validation operations via the validate() method.
Raises
- TypeError: If path cannot be converted to a Path object (e.g., if an invalid type is provided like int or dict).
See Also
__init__: Alternative constructor for creating validators.
validate: Perform validation on a dataset.
PanderaParameters: Configuration parameters for validation behavior.
adc_toolkit.data.ValidatedDataCatalog.in_directory: Similar factory method
for creating validated data catalogs.
Notes
Factory Pattern
This method implements the factory pattern, providing a standard interface for creating validators from directory-based configuration. The pattern is used consistently across the toolkit:
# Similar patterns across toolkit components
catalog = KedroDataCatalog.in_directory("config/")
validator = PanderaValidator.in_directory("config/validators")
gx_validator = GXValidator.in_directory("config/gx")
Semantic Clarity
Using in_directory() instead of the constructor makes code more
readable and self-documenting:
# Clear intent: validator configured from this directory
validator = PanderaValidator.in_directory("config/validators")
# vs. less clear constructor call
validator = PanderaValidator("config/validators")
Design Rationale
The factory method pattern is preferred over direct constructor calls in the toolkit because it:
- Provides a consistent API across all components
- Makes the configuration-from-directory pattern explicit
- Allows future extension with additional factory methods
- Improves code readability and maintainability
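A classmethod factory that simply forwards to the constructor can be sketched as follows. `DirectoryConfigured` and `Specialized` are hypothetical classes used for illustration; only the forwarding through `cls` mirrors the in_directory source shown above.

```python
from pathlib import Path
from typing import Union


class DirectoryConfigured:
    """Hypothetical stand-in for a component configured from a directory."""

    def __init__(self, config_path: Union[str, Path]) -> None:
        self.config_path = Path(config_path) / "pandera_schemas"

    @classmethod
    def in_directory(cls, path: Union[str, Path]) -> "DirectoryConfigured":
        # Using cls (not the class name) lets subclasses inherit the factory.
        return cls(path)


class Specialized(DirectoryConfigured):
    """Subclass that gets the factory for free."""


from_factory = DirectoryConfigured.in_directory("config/validators")
from_constructor = DirectoryConfigured("config/validators")
print(from_factory.config_path == from_constructor.config_path)  # True
print(type(Specialized.in_directory("config")).__name__)  # Specialized
```

Both construction routes yield equivalent objects, so the choice between them is about readability and API consistency rather than behaviour.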
Usage in ValidatedDataCatalog
This method is commonly used when configuring ValidatedDataCatalog
with a custom validator:
from adc_toolkit.data import ValidatedDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator
catalog = ValidatedDataCatalog.in_directory(
    path="config", validator=PanderaValidator.in_directory("config/validators")
)
Examples
Create a validator using the factory method (recommended):
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validator.config_path
PosixPath('config/validators/pandera_schemas')
Create a validator with custom parameters:
>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters(lazy=False)
>>> validator = PanderaValidator.in_directory(path="config/validators", parameters=params)
>>> validator.parameters.lazy
False
Using pathlib.Path:
>>> from pathlib import Path
>>> config_dir = Path("config") / "validators"
>>> validator = PanderaValidator.in_directory(path=config_dir)
Integration with ValidatedDataCatalog:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> # ValidatedDataCatalog uses the PanderaValidator for all load/save ops
Consistent API across different validators:
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> from adc_toolkit.data.validators.gx import GXValidator
>>> pandera_val = PanderaValidator.in_directory("config/pandera")
>>> gx_val = GXValidator.in_directory("config/gx")
>>> # Both use the same factory method pattern
Multiple validators for different environments:
>>> dev_validator = PanderaValidator.in_directory("config/dev/validators")
>>> staging_validator = PanderaValidator.in_directory("config/staging/validators")
>>> prod_validator = PanderaValidator.in_directory("config/prod/validators")
>>> # Each environment can have different validation rules
Dependency injection pattern:
>>> def create_pipeline(validator_path: str):
...     validator = PanderaValidator.in_directory(validator_path)
...     return DataPipeline(validator=validator)
>>> # Easy to swap validator configurations
>>> dev_pipeline = create_pipeline("config/dev/validators")
>>> prod_pipeline = create_pipeline("config/prod/validators")
632 def validate(self, name: str, data: Data) -> Data: 633 """ 634 Validate a dataset against its Pandera schema. 635 636 This is the primary validation method that implements the DataValidator 637 protocol. It orchestrates the complete validation workflow, from automatic 638 schema generation (if needed) to validation execution, providing seamless 639 data quality checking with minimal setup. 640 641 The method delegates to the ``validate_data`` function, which implements 642 a two-phase validation approach: 643 644 **Phase 1: Schema Preparation** (First Use Only) 645 If no schema script exists for the dataset name, automatically generate 646 one by introspecting the data structure. The generated schema is saved 647 at ``{self.config_path}/{category}/{dataset}.py`` and serves as an 648 editable template for adding custom validation rules. 649 650 **Phase 2: Validation Execution** (Every Use) 651 Load the schema script as a Python module, extract the 652 ``DataFrameSchema`` object, and execute validation using Pandera's 653 ``schema.validate(data, lazy=self.parameters.lazy)`` method. Return 654 validated data if all checks pass, or raise ``PanderaValidationError`` 655 with comprehensive error details if validation fails. 656 657 This design enables rapid prototyping (no upfront schema creation required) 658 while supporting iterative refinement (schemas can be customized after 659 auto-generation). Schema scripts are version-controlled Python files, 660 facilitating team collaboration and schema evolution tracking. 661 662 Parameters 663 ---------- 664 name : str 665 The dataset name/identifier that determines which schema script to use. 666 Should follow the convention ``"category.dataset_name"`` (e.g., 667 ``"raw.customers"``, ``"processed.sales"``). 
Validate a dataset against its Pandera schema.
This is the primary validation method that implements the DataValidator protocol. It orchestrates the complete validation workflow, from automatic schema generation (if needed) to validation execution, providing seamless data quality checking with minimal setup.
The method delegates to the validate_data function, which implements
a two-phase validation approach:
Phase 1: Schema Preparation (First Use Only)
If no schema script exists for the dataset name, automatically generate
one by introspecting the data structure. The generated schema is saved
at {self.config_path}/{category}/{dataset}.py and serves as an
editable template for adding custom validation rules.
Phase 2: Validation Execution (Every Use)
Load the schema script as a Python module, extract the
DataFrameSchema object, and execute validation using Pandera's
schema.validate(data, lazy=self.parameters.lazy) method. Return
validated data if all checks pass, or raise PanderaValidationError
with comprehensive error details if validation fails.
This design enables rapid prototyping (no upfront schema creation required) while supporting iterative refinement (schemas can be customized after auto-generation). Schema scripts are version-controlled Python files, facilitating team collaboration and schema evolution tracking.
Parameters
name (str): The dataset name/identifier that determines which schema script to use. Should follow the convention "category.dataset_name" (e.g., "raw.customers", "processed.sales"). The name serves multiple purposes:
- Determines schema file location: {config_path}/{category}/{dataset}.py
- Provides context in validation error messages
- Enables logical organization of schemas by data pipeline stage
Names with a single dot separator create a two-level directory structure. For example, "raw.customers" creates a schema at {self.config_path}/raw/customers.py.
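The naming convention above can be sketched as a small path-mapping function. This is an illustration only: `schema_path_for` is a hypothetical helper, not part of adc-toolkit, and rejecting names that lack exactly one dot is an assumption rather than documented behavior.

```python
from pathlib import Path


def schema_path_for(config_path: str, name: str) -> Path:
    """Map a dotted dataset name to a schema script path (hypothetical helper)."""
    category, sep, dataset = name.partition(".")
    # Assumption: only the documented "category.dataset" form is accepted here.
    if not sep or not category or not dataset or "." in dataset:
        raise ValueError(f"expected 'category.dataset', got {name!r}")
    return Path(config_path) / category / f"{dataset}.py"


print(schema_path_for("config/validators", "raw.customers").as_posix())
```

The category becomes the directory and the dataset becomes the file name, which is why names sharing a category end up side by side on disk.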
data (Data): The data object to validate. Must be a protocol-compliant Data object (pandas DataFrame, PySpark DataFrame, etc.) with columns and dtypes attributes. The data structure is validated against the schema defined in (or auto-generated for) the corresponding schema script.
Supported types:
- pandas.DataFrame: Primary use case with full feature support
- pyspark.sql.DataFrame: Requires pyspark installation
Returns
- Data: The validated data object. If validation passes all checks defined in the schema, returns the original data object (potentially with Pandera-applied type coercions if configured in the schema). The return type matches the input data type (pandas in, pandas out; PySpark in, PySpark out).
The returned data can be used immediately in downstream processing with confidence that it meets all defined quality requirements.
Raises
- PanderaValidationError: Raised when data validation fails against the schema. This custom exception wraps Pandera's underlying SchemaError or SchemaErrors and enriches it with additional context:
Attributes of PanderaValidationError:
- table_name: The dataset name that failed validation
- schema_path: Full filesystem path to the schema script file
- original_error: The underlying Pandera error with detailed validation failure information (row indices, column names, observed values, expected constraints)
With lazy=True (default), the exception includes all validation
errors across the entire dataset. With lazy=False, it includes only
the first error encountered.
- ValueError: Raised if the dataframe type is not supported (neither pandas nor pyspark), originating from the schema compiler during auto-generation.
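- ModuleNotFoundError: Raised if the schema script cannot be imported, for example because the module path cannot be resolved or an import statement inside a manually edited schema file refers to a missing module. Note that a Python syntax error in the schema file normally surfaces as SyntaxError rather than ModuleNotFoundError, so check the schema file for both syntax errors and import statement issues.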
- AttributeError: Raised if the schema script module does not define a schema attribute, indicating the schema file structure is invalid. The schema file must contain schema = pa.DataFrameSchema(...) at module level.
- OSError: Raised if there are file system permissions issues preventing schema file creation (during auto-generation) or reading (during validation).
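The three context attributes listed for PanderaValidationError can be pictured with a stand-in exception class. This is a minimal sketch, not the toolkit's implementation: `ValidationErrorSketch` and `wrap_failure` are hypothetical names, and `ValueError` stands in for Pandera's SchemaError / SchemaErrors so the example needs only the standard library.

```python
from pathlib import Path
from typing import Callable


class ValidationErrorSketch(Exception):
    """Hypothetical stand-in mirroring the three documented attributes."""

    def __init__(self, table_name: str, schema_path: Path, original_error: Exception) -> None:
        super().__init__(f"Validation failed for {table_name!r} (schema: {schema_path})")
        self.table_name = table_name
        self.schema_path = schema_path
        self.original_error = original_error


def wrap_failure(table_name: str, schema_path: Path, run_check: Callable[[], None]) -> None:
    """Run a check and re-raise any failure with dataset context attached."""
    try:
        run_check()
    except ValueError as exc:  # stand-in for Pandera's SchemaError / SchemaErrors
        raise ValidationErrorSketch(table_name, schema_path, exc) from exc
```

Raising with `from exc` keeps the underlying error reachable both through `original_error` and through the standard `__cause__` chain in tracebacks.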
See Also
validate_data: The underlying validation function called by this method.
PanderaParameters: Configuration parameters controlling validation behavior.
PanderaValidationError: Custom exception raised on validation failure.
in_directory: Factory method for creating validator instances.
adc_toolkit.data.ValidatedDataCatalog: Data catalog with integrated validation.
Notes
Validation Workflow
The complete sequence executed by this method:
1. Check if schema file exists at {self.config_path}/{category}/{dataset}.py
2. If missing, auto-generate schema by introspecting data structure
3. Load schema file as a Python module using dynamic import
4. Extract the schema DataFrameSchema object from the module
5. Call schema.validate(data, lazy=self.parameters.lazy)
6. Return validated data if all checks pass
7. Raise PanderaValidationError with context if validation fails
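Steps 3 and 4 (dynamic import and schema extraction) can be sketched with the standard library's importlib. The helper names `load_schema_module` and `extract_schema` are hypothetical, and the toolkit may load schema scripts differently; the sketch only shows one common way to import a Python file from an explicit path.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import Any


def load_schema_module(schema_file: Path) -> ModuleType:
    """Import a schema script from an explicit file path."""
    spec = importlib.util.spec_from_file_location(schema_file.stem, schema_file)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {schema_file}")
    module = importlib.util.module_from_spec(spec)
    # Executes the script body; a SyntaxError in the file would surface here.
    spec.loader.exec_module(module)
    return module


def extract_schema(schema_file: Path) -> Any:
    """Return the module-level ``schema`` object, or raise AttributeError."""
    module = load_schema_module(schema_file)
    if not hasattr(module, "schema"):
        raise AttributeError(f"{schema_file} does not define a module-level 'schema'")
    return module.schema
```

Because the script is executed on import, any top-level statement in a schema file runs every time the file is loaded this way.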
Schema Customization After Auto-Generation
After first validation, edit the generated schema file to add domain-specific validation rules:
# {self.config_path}/raw/customers.py (auto-generated)
import pandera.pandas as pa

# Insert your additional checks to `checks` list parameter
schema = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(
            "int64",
            checks=[
                pa.Check.greater_than(0),  # Add: IDs must be positive
                pa.Check(lambda s: s.is_unique, element_wise=False),  # Add: unique
            ],
        ),
        "email": pa.Column(
            "object",
            checks=[
                pa.Check(lambda s: s.str.contains("@")),  # Add: valid email
            ],
        ),
        "age": pa.Column(
            "int64",
            checks=[
                pa.Check.in_range(0, 120),  # Add: realistic age range
            ],
        ),
    }
)
Error Reporting: Lazy vs. Fail-Fast
The parameters.lazy setting significantly affects error reporting:
Lazy Mode (lazy=True, default): Recommended for production
- Collects all validation errors across entire dataset
- Provides comprehensive error report in single validation run
- Higher overhead but better developer experience
- Example: "Found 47 validation errors in columns 'age', 'email'"
Fail-Fast Mode (lazy=False): Useful for debugging
- Raises exception on first validation failure
- Lower overhead, faster failure detection
- Requires multiple validation runs to find all errors
- Example: "Row 23: age value 150 exceeds maximum 120"
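The difference between the two strategies can be shown with a small standard-library sketch. It is not how Pandera is implemented (Pandera works on whole columns, not on row dictionaries); `CheckFailed` and `run_checks` are hypothetical names that only mimic the "collect everything" versus "stop at the first failure" behavior.

```python
from typing import Callable, Iterable

Check = tuple[str, Callable[[dict], bool]]  # (description, predicate over one row)


class CheckFailed(Exception):
    """Hypothetical error type carrying one or more failure messages."""

    def __init__(self, failures: list[str]) -> None:
        super().__init__("; ".join(failures))
        self.failures = failures


def run_checks(rows: Iterable[dict], checks: list[Check], lazy: bool = True) -> None:
    failures: list[str] = []
    for index, row in enumerate(rows):
        for description, predicate in checks:
            if predicate(row):
                continue
            message = f"row {index}: {description}"
            if not lazy:
                raise CheckFailed([message])  # fail-fast: stop at the first failure
            failures.append(message)  # lazy: record it and keep going
    if failures:
        raise CheckFailed(failures)


rows = [{"id": 1, "age": 25}, {"id": -2, "age": 30}, {"id": 3, "age": 150}]
checks = [
    ("id must be positive", lambda r: r["id"] > 0),
    ("age must be in [0, 120]", lambda r: 0 <= r["age"] <= 120),
]
```

With `lazy=True` the raised error lists both the negative ID and the out-of-range age; with `lazy=False` only the negative ID is reported, and the age problem appears on a later run once the first issue is fixed.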
Performance Considerations
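- Schema scripts are dynamically imported on each validation call. Reusing one validator instance avoids repeated construction, but it does not by itself remove the per-call import. For high-frequency validation scenarios (e.g., streaming data), consider reusing the validator instance and, if the import cost matters, caching the loaded schema as well.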
- Auto-generation only occurs once per dataset. After initial schema creation, there's no performance penalty for the auto-generation feature.
- Large dataset validation can be expensive. Consider sampling strategies for very large datasets if full validation is not required.
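One way to avoid re-importing a schema script on every call is to cache the loaded schema and re-import only when the file changes. The sketch below is an assumption about how such a cache could look, not a feature of adc-toolkit; `load_schema_cached` and `_schema_cache` are hypothetical names.

```python
import importlib.util
from pathlib import Path
from typing import Any

# Maps a resolved schema path to (modification time in ns, loaded schema object).
_schema_cache: dict[Path, tuple[int, Any]] = {}


def load_schema_cached(schema_file: Path) -> Any:
    """Return the module-level ``schema``, re-importing only when the file changes."""
    path = schema_file.resolve()
    mtime = path.stat().st_mtime_ns
    cached = _schema_cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]  # cache hit: file unchanged since the last import

    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    schema = module.schema
    _schema_cache[path] = (mtime, schema)
    return schema
```

Keying the cache on the modification time means manual edits to a schema file are picked up on the next call, at the cost of one `stat` per validation.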
Thread Safety
This method is thread-safe for validation operations on existing schemas
(multiple threads can call validate() concurrently on different datasets).
However, the initial schema auto-generation is not thread-safe. If multiple
threads validate the same dataset for the first time concurrently, race
conditions may occur. For concurrent first-time validation, implement
external locking or pre-generate schemas.
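External locking for concurrent first-time validation could look like the wrapper below. It is a sketch under stated assumptions: `LockedValidator` is a hypothetical class, it accepts any object with a `validate(name, data)` method, and it serializes every call for a given dataset name (not only the first one), which is simpler but stricter than necessary.

```python
import threading
from collections import defaultdict
from typing import Any


class LockedValidator:
    """Hypothetical wrapper that serializes validate() calls per dataset name."""

    def __init__(self, validator: Any) -> None:
        self._validator = validator
        self._registry_lock = threading.Lock()  # guards the lock table itself
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)

    def validate(self, name: str, data: Any) -> Any:
        with self._registry_lock:
            lock = self._locks[name]  # create or fetch the per-dataset lock
        with lock:  # only one thread at a time may validate (or generate) `name`
            return self._validator.validate(name, data)
```

Calls for different dataset names still run in parallel, since each name has its own lock. Pre-generating schemas before starting worker threads avoids the need for any locking.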
Integration with Data Pipelines
This method integrates seamlessly with data pipeline workflows:
def pipeline_stage(validator, input_data):
    # Validate input from previous stage
    validated_input = validator.validate("stage_input", input_data)

    # Process with confidence that data meets requirements
    processed = transform(validated_input)

    # Validate output before passing to next stage
    validated_output = validator.validate("stage_output", processed)

    return validated_output
Examples
Basic validation with auto-generated schema:
>>> import pandas as pd
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>> validated = validator.validate("raw.customers", df)
>>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
>>> # Subsequent runs: uses existing schema
>>> print(validated)
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Validation failure with comprehensive error reporting (lazy=True):
>>> df_invalid = pd.DataFrame(
...     {
...         "id": [1, -2, 3],  # Invalid: negative ID (if custom check added)
...         "name": ["Alice", "Bob", None],  # Invalid: null name
...         "age": [25, 30, 150],  # Invalid: unrealistic age (if custom check added)
...     }
... )
>>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
>>> try:
...     validator.validate("raw.customers", df_invalid)
... except PanderaValidationError as e:
...     print(f"Validation failed for table: {e.table_name}")
...     print(f"Schema location: {e.schema_path}")
...     print(f"Error details: {e.original_error}")
...     # All validation errors are reported together (lazy=True)
Fail-fast validation for debugging:
>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> validator_debug = PanderaValidator.in_directory(
...     path="config/validators", parameters=PanderaParameters(lazy=False)
... )
>>> try:
...     validator_debug.validate("raw.customers", df_invalid)
... except PanderaValidationError as e:  # only this exception carries original_error
...     print(f"First error encountered: {e.original_error}")
...     # Only the first validation error is reported (lazy=False)
Validate multiple datasets with same validator:
>>> validator = PanderaValidator.in_directory("config/validators")
>>> customers = validator.validate("raw.customers", customers_df)
>>> orders = validator.validate("raw.orders", orders_df)
>>> products = validator.validate("raw.products", products_df)
>>> # Reuse same validator instance for efficiency
Validation in a data processing pipeline:
>>> def process_customer_data():
...     validator = PanderaValidator.in_directory("config/validators")
...
...     # Load raw data
...     raw_df = load_raw_customers()
...
...     # Validate input
...     validated_input = validator.validate("raw.customers", raw_df)
...
...     # Process with confidence
...     processed_df = transform_customers(validated_input)
...
...     # Validate output
...     validated_output = validator.validate("processed.customers", processed_df)
...
...     return validated_output
Using with PySpark DataFrame:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_spark = validator.validate("raw.spark_customers", spark_df)
>>> # Returns validated PySpark DataFrame
Integration with ValidatedDataCatalog (automatic validation):
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> # ValidatedDataCatalog internally calls validator.validate()
>>> df = catalog.load("raw.customers") # Validates after loading
>>> catalog.save("processed.customers", df) # Validates before saving