adc_toolkit.data.validators.pandera

Pandera-based data validation framework for the adc-toolkit.

This module provides a comprehensive Pandera-based validation implementation that combines automatic schema generation with manual customization capabilities. It enables rapid prototyping with zero-setup validation while supporting iterative refinement of validation rules as data requirements evolve.

The validation workflow is optimized for real-world data engineering scenarios where schemas need to be established quickly but refined over time. Schema scripts are stored as editable Python files, enabling version control, code review, and team collaboration on data quality standards.

Classes

PanderaValidator: Main validator class implementing the DataValidator protocol. Orchestrates automatic schema generation, schema loading, and validation execution using Pandera's DataFrameSchema.validate() method. Integrates seamlessly with ValidatedDataCatalog for automatic validation on load/save operations.
PanderaParameters: Immutable configuration dataclass controlling validation behavior. Primary setting is the 'lazy' parameter which determines error collection strategy (collect all errors vs fail-fast on first error).

Exceptions

PanderaValidationError: Custom exception raised when validation fails. Wraps Pandera's SchemaError or SchemaErrors with additional context including table name and schema file path for enhanced debugging and error handling.
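
The wrapping described above can be pictured with a small sketch. The attribute names (table_name, schema_path, original_error) are the ones the examples on this page read from the exception; the class body, the helper name wrap_error, and the message format are assumptions for illustration, not the toolkit's actual implementation.

```python
from pathlib import Path


class ValidationErrorSketch(Exception):
    """Hypothetical stand-in for PanderaValidationError (not the real class)."""

    def __init__(self, table_name: str, schema_path: Path, original_error: Exception) -> None:
        self.table_name = table_name
        self.schema_path = schema_path
        self.original_error = original_error
        super().__init__(
            f"Validation failed for table '{table_name}' (schema: {schema_path}): {original_error}"
        )


def wrap_error(table_name: str, schema_path: Path, error: Exception) -> ValidationErrorSketch:
    """Build the wrapper; callers raise it with `raise ... from error`."""
    return ValidationErrorSketch(table_name, schema_path, error)


try:
    try:
        # A ValueError stands in for Pandera's SchemaError / SchemaErrors here.
        raise ValueError("column 'id' contains negative values")
    except ValueError as exc:
        raise wrap_error("raw.customers", Path("pandera_schemas/raw/customers.py"), exc) from exc
except ValidationErrorSketch as err:
    caught = err
```

Raising with `from exc` keeps the underlying error reachable both through the explicit attribute and through the standard exception chain.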

Notes

Key Features

- Zero-setup validation: Auto-generates schemas on first use by introspecting data
- Incremental refinement: Generated schemas serve as customizable templates
- Version control friendly: Schemas are plain Python files suitable for git
- Comprehensive error reporting: Lazy validation collects all errors in one pass
- Type safety: Full integration with Python type hints and static analysis
- Framework support: Works with pandas and PySpark DataFrames

Validation Workflow

The validation process follows a two-phase approach:

1. Schema Management (auto-generation on first use)
   - Check if schema file exists at {config_path}/pandera_schemas/{category}/{dataset}.py
   - If missing, introspect data structure and generate schema script
   - Save generated schema as editable Python file
2. Validation Execution (every use)
   - Load schema script as Python module
   - Extract DataFrameSchema object from module
   - Execute validation: schema.validate(data, lazy=parameters.lazy)
   - Return validated data or raise PanderaValidationError
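
The generate-once, load-every-time pattern in the two phases above can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the toolkit's code: the function names ensure_schema_script and load_schema are hypothetical, and the generated script stores a plain dict in place of a Pandera DataFrameSchema so that the sketch has no third-party imports.

```python
import importlib.util
import tempfile
from pathlib import Path


def ensure_schema_script(path: Path, dtypes: dict[str, str]) -> bool:
    """Phase 1: write a schema script if none exists. Returns True if generated."""
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: a real generator would emit pa.DataFrameSchema / pa.Column code.
    # A plain dict literal keeps this sketch dependency-free.
    path.write_text(f"schema = {dtypes!r}\n", encoding="utf-8")
    return True


def load_schema(path: Path):
    """Phase 2: import the script as a module and return its `schema` object."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.schema


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "pandera_schemas" / "raw" / "customers.py"
    dtypes = {"id": "int64", "name": "object", "age": "int64"}
    first_call = ensure_schema_script(script, dtypes)   # generates the file
    second_call = ensure_schema_script(script, dtypes)  # file exists, left untouched
    loaded = load_schema(script)
```

Because the second call leaves the existing file alone, manual edits to a generated script survive later validation runs, which is what makes the refinement workflow possible.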

Schema Organization

Schemas are organized hierarchically based on validation names following the "category.dataset" convention:

config/validators/pandera_schemas/
├── raw/
│   ├── customers.py
│   └── orders.py
├── processed/
│   ├── customers.py
│   └── sales_summary.py
└── gold/
    └── analytics.py

Error Collection Strategies

The lazy parameter controls error reporting:

- lazy=True (default): Collects all validation errors across the entire dataset before raising an exception. Provides comprehensive error reporting, showing all violations in a single validation run. Recommended for production.
- lazy=False: Raises an exception immediately on the first validation failure. Useful for debugging when you want to fix errors incrementally.
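
The difference between the two strategies can be shown with a toy checker written in plain Python. It does not use Pandera and is not how Pandera is implemented; the names run_checks and ToyValidationError are hypothetical and only illustrate "collect everything" versus "stop at the first failure".

```python
from typing import Callable

Check = tuple[str, Callable[[dict], bool]]


class ToyValidationError(Exception):
    """Carries the names of the failed checks (hypothetical, not Pandera's class)."""

    def __init__(self, failures: list[str]) -> None:
        super().__init__("; ".join(failures))
        self.failures = failures


def run_checks(row: dict, checks: list[Check], lazy: bool = True) -> dict:
    failures: list[str] = []
    for name, predicate in checks:
        if predicate(row):
            continue
        if not lazy:
            # Fail-fast: stop at the first violation.
            raise ToyValidationError([name])
        failures.append(name)
    if failures:
        # Lazy: report every violation together.
        raise ToyValidationError(failures)
    return row


checks: list[Check] = [
    ("id is positive", lambda r: r["id"] > 0),
    ("name is present", lambda r: r["name"] is not None),
    ("age within 0-120", lambda r: 0 <= r["age"] <= 120),
]
bad_row = {"id": -2, "name": None, "age": 150}


def failures_for(lazy: bool) -> list[str]:
    try:
        run_checks(bad_row, checks, lazy=lazy)
    except ToyValidationError as err:
        return err.failures
    return []


lazy_failures = failures_for(lazy=True)   # all three violations
fast_failures = failures_for(lazy=False)  # only the first violation
```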

Comparison with Great Expectations

Use PanderaValidator when:

- You need lightweight, pandas-native validation
- You prefer Python-based schema definitions over YAML/JSON
- You want tight integration with type hints and static analysis
- Your team is comfortable with code-based configuration

Use GXValidator when:

- You need profiling and automatic expectation generation
- You want data documentation websites (Data Docs)
- You need enterprise features (cloud backends, data quality dashboards)
- You prefer declarative YAML/JSON configuration

Examples

Basic validator setup and usage:

>>> from pathlib import Path
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> import pandas as pd
>>>
>>> # Create validator
>>> validator = PanderaValidator.in_directory("config/validators")
>>>
>>> # Create sample data
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>>
>>> # First validation: auto-generates schema
>>> validated = validator.validate("raw.customers", df)
>>> # Schema created at: config/validators/pandera_schemas/raw/customers.py

Customizing auto-generated schemas:

>>> # After first validation, edit the generated schema file:
>>> # File: config/validators/pandera_schemas/raw/customers.py
>>> #
>>> # import pandera.pandas as pa
>>> #
>>> # schema = pa.DataFrameSchema({
>>> #     "id": pa.Column(
>>> #         "int64",
>>> #         checks=[
>>> #             pa.Check.greater_than(0),  # Add: IDs must be positive
>>> #             pa.Check(lambda s: s.is_unique, element_wise=False),  # Add: unique
>>> #         ],
>>> #     ),
>>> #     "name": pa.Column("object"),
>>> #     "age": pa.Column(
>>> #         "int64",
>>> #         checks=[pa.Check.in_range(0, 120)],  # Add: realistic age range
>>> #     ),
>>> # })
>>>
>>> # Subsequent validations use customized schema
>>> validated = validator.validate("raw.customers", df)

Using custom parameters for fail-fast validation:

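>>> from adc_toolkit.data.validators.pandera import PanderaParameters, PanderaValidationError
>>>
>>> # Create validator with fail-fast mode
>>> params = PanderaParameters(lazy=False)
>>> validator_debug = PanderaValidator(config_path="config/validators", parameters=params)
>>>
>>> # Data that violates the customized schema above (negative ID)
>>> invalid_df = pd.DataFrame({"id": [1, -2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>>
>>> try:
...     validator_debug.validate("raw.customers", invalid_df)
... except PanderaValidationError as e:
...     print(f"First error: {e.original_error}")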

Integration with ValidatedDataCatalog:

>>> from adc_toolkit.data import ValidatedDataCatalog
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>>
>>> # Create catalog with Pandera validator
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>>
>>> # Load data (automatically validated)
>>> df = catalog.load("raw.customers")
>>>
>>> # Process data
>>> processed_df = transform(df)
>>>
>>> # Save data (automatically validated before saving)
>>> catalog.save("processed.customers", processed_df)

Handling validation errors with comprehensive reporting:

>>> from adc_toolkit.data.validators.pandera import PanderaValidator, PanderaValidationError
>>>
>>> validator = PanderaValidator.in_directory("config/validators")
>>>
>>> # Invalid data
>>> invalid_df = pd.DataFrame(
...     {
...         "id": [1, -2, 3],  # Invalid: negative ID
...         "name": ["Alice", "Bob", None],  # Invalid: null name
...         "age": [25, 30, 150],  # Invalid: unrealistic age
...     }
... )
>>>
>>> try:
...     validator.validate("raw.customers", invalid_df)
... except PanderaValidationError as e:
...     print(f"Validation failed for: {e.table_name}")
...     print(f"Schema file: {e.schema_path}")
...     print(f"All errors: {e.original_error}")
...     # With lazy=True, all validation errors are included

Data pipeline with validation at multiple stages:

>>> def data_pipeline():
...     validator = PanderaValidator.in_directory("config/validators")
...
...     # Load and validate raw data
...     raw_data = load_raw_data()
...     validated_raw = validator.validate("raw.input", raw_data)
...
...     # Transform with confidence
...     transformed = transform(validated_raw)
...
...     # Validate intermediate results
...     validated_intermediate = validator.validate("intermediate.transformed", transformed)
...
...     # Final processing
...     final = aggregate(validated_intermediate)
...
...     # Validate output before downstream consumption
...     validated_output = validator.validate("gold.output", final)
...
...     return validated_output

Using with PySpark DataFrames:

>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
>>>
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_spark = validator.validate("raw.spark_customers", spark_df)
>>> # Auto-generates PySpark-compatible schema with pyspark.sql.types imports

See Also

adc_toolkit.data.ValidatedDataCatalog: Data catalog with integrated validation.
adc_toolkit.data.validators.gx.GXValidator: Alternative validator using Great Expectations.
adc_toolkit.data.validators.no_validator.NoValidator: No-op validator for testing.
adc_toolkit.data.abs.DataValidator: Protocol defining the validator interface.
pandera.DataFrameSchema: Underlying Pandera schema class used for validation.

Source

from adc_toolkit.data.validators.pandera.parameters import PanderaParameters
from adc_toolkit.data.validators.pandera.validator import PanderaValidator


__all__ = ["PanderaParameters", "PanderaValidator"]

@dataclass(frozen=True, slots=True, kw_only=True)

class PanderaParameters:

Configuration parameters for Pandera data validation.

This immutable dataclass encapsulates configuration options that control how Pandera validates DataFrames within the adc-toolkit validation workflow. It is designed to be passed to PanderaValidator instances to customize validation behavior.

The primary configuration option controls error collection strategy: whether to fail fast on the first validation error or to collect all validation errors before raising an exception. Lazy validation is recommended for production workflows as it provides comprehensive error reporting, making it easier to fix all issues in a single pass.

This class is immutable (frozen) to ensure validation parameters remain consistent throughout the validation lifecycle and to enable safe sharing across multiple validation operations.

Attributes

lazy (bool, default=True): Controls the validation error collection strategy.
- If True (default): Collects all validation errors across all rows and columns before raising a SchemaErrors exception. This provides comprehensive error reporting, showing all violations in a single validation run.
- If False: Raises a SchemaError immediately upon encountering the first validation failure. This "fail-fast" mode is useful for debugging or when you want to fix errors incrementally.
The lazy parameter is passed directly to Pandera's DataFrameSchema.validate() method.

Notes

This dataclass is configured with the following features:

- frozen=True: Makes instances immutable after creation. Attempting to modify attributes after instantiation raises FrozenInstanceError.
- slots=True: Uses __slots__ for memory efficiency and faster attribute access by preventing dynamic attribute creation.
- kw_only=True: Requires all parameters to be specified as keyword arguments, improving code clarity and preventing positional argument errors.

The immutability design ensures that validation parameters cannot be accidentally modified during the validation process, promoting predictable and reproducible validation behavior.

Examples

Create a PanderaParameters instance with default lazy validation:

>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters()
>>> params.lazy
True

Create parameters for fail-fast validation:

>>> params_strict = PanderaParameters(lazy=False)
>>> params_strict.lazy
False

Use with PanderaValidator for comprehensive error reporting:

>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator(config_path="config/validators", parameters=PanderaParameters(lazy=True))

Use with PanderaValidator for fail-fast debugging:

>>> validator_debug = PanderaValidator(config_path="config/validators", parameters=PanderaParameters(lazy=False))

Demonstrate immutability (frozen dataclass):

>>> params = PanderaParameters()
>>> params.lazy = False  # doctest: +SKIP
Traceback (most recent call last):
    ...
dataclasses.FrozenInstanceError: cannot assign to field 'lazy'

Compare parameter instances:

>>> params1 = PanderaParameters(lazy=True)
>>> params2 = PanderaParameters(lazy=True)
>>> params1 == params2
True
>>> params3 = PanderaParameters(lazy=False)
>>> params1 == params3
False

PanderaParameters(*, lazy: bool = True)

lazy: bool

class PanderaValidator: View Source

 68class PanderaValidator:
 69    """
 70    Pandera-based data validator with automatic schema generation.
 71
 72    PanderaValidator is a concrete implementation of the DataValidator protocol
 73    that uses Pandera (https://pandera.readthedocs.io/) for schema-based data
 74    validation. It provides a seamless validation workflow that combines automatic
 75    schema generation with manual customization capabilities.
 76
 77    The validator orchestrates a two-phase approach to data validation:
 78
 79    **Phase 1: Schema Management** (Auto-generation)
 80        On first validation of a dataset, the validator automatically generates
 81        a Pandera schema script by introspecting the data structure (column names,
 82        data types). The generated schema is saved as an editable Python file at
 83        ``{config_path}/pandera_schemas/{category}/{dataset}.py``, where the
 84        category and dataset name are derived from the validation name (e.g.,
 85        "raw.customers" creates ``raw/customers.py``).
 86
 87    **Phase 2: Validation Execution** (Rule Enforcement)
 88        On all validations (including first use), the validator loads the schema
 89        script and executes validation against the data using Pandera's
 90        ``DataFrameSchema.validate()`` method. If validation fails, it raises a
 91        detailed ``PanderaValidationError`` with comprehensive error information.
 92
 93    This design enables an iterative workflow:
 94
 95    1. Run validation immediately without manual schema creation
 96    2. Review auto-generated schemas and add custom validation rules
 97    3. Commit schemas to version control for team collaboration
 98    4. Evolve schemas as data requirements change over time
 99
100    The validator integrates seamlessly with ``ValidatedDataCatalog`` to provide
101    automatic validation on all data load and save operations, ensuring data
102    quality throughout the entire data pipeline.
103
104    Attributes
105    ----------
106    config_path : Path
107        The directory path where Pandera schema scripts are stored and loaded from.
108        Schema files are organized in a hierarchical structure under this path,
109        specifically at ``{config_path}/pandera_schemas/``. For example, if
110        config_path is ``Path("config/validators")``, schemas are stored at
111        ``config/validators/pandera_schemas/{category}/{dataset}.py``.
112    parameters : PanderaParameters
113        Configuration parameters controlling validation behavior. The primary
114        parameter is ``lazy``, which determines error collection strategy:
115
116        - ``lazy=True`` (default): Collects all validation errors across the
117          entire dataset before raising an exception, providing comprehensive
118          error reporting.
119        - ``lazy=False``: Raises an exception immediately upon encountering the
120          first validation failure, useful for debugging.
121
122        If None is provided during instantiation, defaults to
123        ``PanderaParameters()`` with default settings (``lazy=True``).
124
125    Parameters
126    ----------
127    config_path : str or Path
128        Path to the root configuration directory where Pandera schema scripts are
129        stored. The validator will create a ``pandera_schemas`` subdirectory under
130        this path to organize schema files. Can be provided as either a string or
131        pathlib.Path object.
132    parameters : PanderaParameters or None, optional
133        Configuration parameters for validation behavior. If None (default), uses
134        ``PanderaParameters()`` with default settings (``lazy=True`` for
135        comprehensive error reporting).
136
137    Raises
138    ------
139    TypeError
140        If config_path cannot be converted to a Path object.
141    OSError
142        If the config_path directory does not exist and cannot be created during
143        schema generation.
144
145    See Also
146    --------
147    PanderaParameters : Configuration parameters for validation behavior.
148    validate_data : Core validation function used internally by this validator.
149    adc_toolkit.data.abs.DataValidator : Protocol that this class implements.
150    adc_toolkit.data.validators.gx.GXValidator : Alternative validator using Great Expectations.
151    adc_toolkit.data.ValidatedDataCatalog : Data catalog with integrated validation.
152
153    Notes
154    -----
155    **Schema Script Organization**
156
157    Schema scripts are organized hierarchically based on validation names. For
158    a validation name like "raw.customers", the schema script is created at:
159
160    .. code-block:: text
161
162        {config_path}/pandera_schemas/raw/customers.py
163
164    This structure mirrors typical data lake or data warehouse naming conventions
165    (e.g., database.table) and supports large projects with many datasets.

    **Schema Customization Workflow**

    After the first validation, edit the generated schema file to add custom checks:

    .. code-block:: python

        # {config_path}/pandera_schemas/raw/customers.py
        import pandas as pd
        import pandera.pandas as pa

        schema = pa.DataFrameSchema(
            {
                "customer_id": pa.Column(
                    "int64",
                    checks=[
                        pa.Check.greater_than(0),  # Must be positive
                        pa.Check(lambda s: s.is_unique, element_wise=False),  # Must be unique
                    ],
                ),
                "email": pa.Column(
                    "object",
                    checks=[
                        pa.Check.str_contains("@"),  # Must contain "@"
                    ],
                ),
                "signup_date": pa.Column(
                    "datetime64[ns]",
                    checks=[
                        pa.Check.less_than_or_equal_to(pd.Timestamp.now()),  # Not in the future
                    ],
                ),
            }
        )

    **Supported Data Types**

    The validator supports:

    - **pandas DataFrames**: Primary use case with full feature support
    - **PySpark DataFrames**: Generates PySpark-compatible schemas (requires pyspark)

    **Thread Safety**

    This class is thread-safe for validation operations on existing schemas.
    However, the initial schema auto-generation is not thread-safe. If multiple
    threads validate the same dataset for the first time concurrently, race
    conditions may occur. For concurrent scenarios, pre-generate schemas or
    implement external locking.
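    One way to implement the external locking mentioned above is a per-dataset
    lock held around each ``validate()`` call. The sketch below is an assumption
    about how a caller might do this; ``validate_with_lock`` is a hypothetical
    wrapper, not part of the toolkit, and it only protects threads within a
    single process.

```python
import threading
from collections import defaultdict

# One lock per dataset name; the registry itself is guarded by its own lock.
_registry_guard = threading.Lock()
_dataset_locks = defaultdict(threading.Lock)


def validate_with_lock(validator, name, data):
    # Hypothetical wrapper: serialize validate() calls per dataset name.
    with _registry_guard:
        lock = _dataset_locks[name]
    # Only one thread at a time validates (and possibly auto-generates the
    # schema for) a given dataset name; other names are not blocked.
    with lock:
        return validator.validate(name, data)
```

    Once a schema file exists for every dataset, the wrapper can be dropped,
    since only first-time generation is described as unsafe.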

    **Performance Considerations**

    - Schema scripts are dynamically imported on each validation call. Reusing a
      validator instance saves only the (cheap) construction step, so budget for
      the import cost in high-frequency scenarios.
    - The ``lazy=True`` mode has slightly more overhead because it runs every check
      and collects all errors instead of stopping at the first failure, but it
      reports every problem in a single run.
    - Auto-generation only occurs once per dataset, so performance impact is
      negligible after initial schema creation.

    **Comparison with Great Expectations**

    Use PanderaValidator when:

    - You need lightweight, pandas-native validation
    - You prefer Python-based schema definitions over YAML/JSON
    - You want tight integration with type hints and static analysis
    - Your team is comfortable with code-based configuration

    Use GXValidator when:

    - You need profiling and automatic expectation generation
    - You want data documentation websites (Data Docs)
    - You need enterprise features (cloud backends, data quality dashboards)
    - You prefer declarative YAML/JSON configuration

    Examples
    --------
    Create a validator using the factory method:

    >>> from adc_toolkit.data.validators.pandera import PanderaValidator
    >>> validator = PanderaValidator.in_directory("config/validators")

    Create a validator using the constructor:

    >>> from pathlib import Path
    >>> validator = PanderaValidator(config_path=Path("config/validators"))

    Create a validator with custom parameters for fail-fast mode:

    >>> from adc_toolkit.data.validators.pandera import PanderaParameters
    >>> params = PanderaParameters(lazy=False)
    >>> validator = PanderaValidator(config_path="config/validators", parameters=params)

    Basic validation workflow:

    >>> import pandas as pd
    >>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
    >>> validator = PanderaValidator.in_directory("config/validators")
    >>> validated_df = validator.validate("raw.customers", df)
    >>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
    >>> # Subsequent runs: uses existing schema

    Handle validation failures with comprehensive error reporting:

    >>> df_invalid = pd.DataFrame(
    ...     {
    ...         "id": [1, -2, 3],  # Invalid: negative ID
    ...         "name": ["Alice", "Bob", None],  # Invalid: null name
    ...         "age": [25, 30, 150],  # Invalid: unrealistic age
    ...     }
    ... )
    >>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
    >>> try:
    ...     validator.validate("raw.customers", df_invalid)
    ... except PanderaValidationError as e:
    ...     print(f"Validation failed for: {e.table_name}")
    ...     print(f"Schema file: {e.schema_path}")
    ...     print(f"Errors: {e.original_error}")
    ...     # All validation errors are included (lazy=True)

    Integration with ValidatedDataCatalog:

    >>> from adc_toolkit.data import ValidatedDataCatalog
    >>> catalog = ValidatedDataCatalog.in_directory(
    ...     path="config", validator=PanderaValidator.in_directory("config/validators")
    ... )
    >>> df = catalog.load("raw.customers")  # Validates after loading
    >>> catalog.save("processed.customers", df)  # Validates before saving

    Iterative schema customization workflow:

    >>> # Step 1: First validation auto-generates schema
    >>> validator = PanderaValidator.in_directory("config/validators")
    >>> validator.validate("raw.customers", df)
    >>>
    >>> # Step 2: Edit generated schema to add custom checks
    >>> # File: config/validators/pandera_schemas/raw/customers.py
    >>> # Add: pa.Check.greater_than(0) to "id" column
    >>>
    >>> # Step 3: Subsequent validations use customized schema
    >>> validator.validate("raw.customers", df)  # Now enforces custom rules

    Validate PySpark DataFrame:

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
    >>> validator = PanderaValidator.in_directory("config/validators")
    >>> validated_spark = validator.validate("raw.spark_customers", spark_df)
    >>> # Generates PySpark-compatible schema with pyspark.sql.types imports

    Use in a data quality pipeline:

    >>> def quality_check_pipeline(input_df):
    ...     validator = PanderaValidator.in_directory("config/validators")
    ...
    ...     # Validate raw input
    ...     validated_input = validator.validate("raw.data", input_df)
    ...
    ...     # Transform data
    ...     transformed = transform(validated_input)
    ...
    ...     # Validate transformed output
    ...     validated_output = validator.validate("processed.data", transformed)
    ...
    ...     return validated_output

    Multiple validators for different environments:

    >>> dev_validator = PanderaValidator.in_directory("config/validators/dev")
    >>> prod_validator = PanderaValidator.in_directory("config/validators/prod")
    >>> # Use different validation rules for dev vs. production
    """

    def __init__(self, config_path: str | Path, parameters: PanderaParameters | None = None) -> None:
        """
        Initialize a PanderaValidator instance.

        Constructs a new validator configured to use schema scripts from the
        specified directory. The constructor sets up the schema storage location
        and validation parameters, but does not perform any I/O operations or
        validation at initialization time.

        The validator creates a logical schema directory at
        ``{config_path}/pandera_schemas/`` where all Pandera schema scripts will
        be stored and loaded from. This subdirectory organization keeps Pandera
        schemas separate from other configuration files and validation frameworks
        (e.g., Great Expectations configurations).

        Parameters
        ----------
        config_path : str or Path
            Path to the root configuration directory. The validator will use a
            ``pandera_schemas`` subdirectory under this path for storing and
            loading schema scripts. Can be provided as either a string (which will
            be converted to a Path) or a pathlib.Path object. The path can be
            absolute or relative to the current working directory.

            For example, if config_path is ``"config/validators"``, schema scripts
            will be stored at ``config/validators/pandera_schemas/{category}/{dataset}.py``.
        parameters : PanderaParameters or None, optional
            Configuration parameters controlling validation behavior. If None
            (default), uses ``PanderaParameters()`` with default settings
            (``lazy=True`` for comprehensive error reporting). Provide a custom
            ``PanderaParameters`` instance to configure validation strategy:

            - ``PanderaParameters(lazy=True)``: Collect all errors (recommended)
            - ``PanderaParameters(lazy=False)``: Fail-fast on first error

        Returns
        -------
        None
            Constructor does not return a value.

        Raises
        ------
        TypeError
            If config_path cannot be converted to a Path object (e.g., if an
            invalid type is provided like int or dict).

        See Also
        --------
        in_directory : Alternative factory method for creating validators.
        validate : Perform validation on a dataset.
        PanderaParameters : Configuration parameters for validation behavior.

        Notes
        -----
        **Lazy Initialization**

        The constructor does not create the ``pandera_schemas`` directory at
        initialization time. The directory is created only when the first schema
        script is generated during validation. This lazy approach avoids
        unnecessary file system operations if the validator is created but never
        used.

        **Path Handling**

        The constructor automatically converts string paths to pathlib.Path objects
        and appends the ``pandera_schemas`` subdirectory. This means:

        .. code-block:: python

            validator = PanderaValidator(config_path="config")
            # validator.config_path is Path("config/pandera_schemas")

        **Immutability**

        While the validator instance itself is mutable (standard Python object),
        the ``parameters`` attribute holds a frozen dataclass (PanderaParameters).
        Its fields cannot be changed in place, so validation behavior stays
        consistent throughout the validator's lifecycle unless the attribute
        itself is reassigned.
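        The effect of a frozen dataclass can be seen with a small stand-in. The
        ``Params`` class below is hypothetical and only mirrors the ``lazy`` field
        described here; it is not the toolkit's ``PanderaParameters``.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Params:
    # Stand-in for PanderaParameters (assumed to be a frozen dataclass).
    lazy: bool = True


params = Params()
try:
    params.lazy = False  # Assigning to a field of a frozen dataclass fails
except FrozenInstanceError:
    print("parameters cannot be mutated in place")

# To change behavior, build a new instance instead of editing the old one.
fail_fast = Params(lazy=False)
```

        Passing a fresh parameters object to a new validator keeps each
        validator's behavior fixed for its whole lifetime.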

        **No Validation at Initialization**

        This constructor only sets up the validator configuration. No validation
        occurs until the ``validate()`` method is called. This design allows
        validators to be created cheaply and reused across multiple validation
        operations.

        Examples
        --------
        Create a validator with default parameters:

        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
        >>> validator = PanderaValidator(config_path="config/validators")
        >>> validator.config_path
        PosixPath('config/validators/pandera_schemas')
        >>> validator.parameters.lazy
        True

        Create a validator with custom parameters for fail-fast mode:

        >>> from adc_toolkit.data.validators.pandera import PanderaParameters
        >>> params = PanderaParameters(lazy=False)
        >>> validator = PanderaValidator(config_path="config/validators", parameters=params)
        >>> validator.parameters.lazy
        False

        Using pathlib.Path for config_path:

        >>> from pathlib import Path
        >>> config_dir = Path("config") / "validators"
        >>> validator = PanderaValidator(config_path=config_dir)
        >>> validator.config_path
        PosixPath('config/validators/pandera_schemas')

        Create multiple validators for different schema directories:

        >>> dev_validator = PanderaValidator(config_path="config/dev/validators")
        >>> prod_validator = PanderaValidator(config_path="config/prod/validators")
        >>> # Each validator uses a separate schema directory

        Reuse a validator for multiple validations:

        >>> validator = PanderaValidator(config_path="config/validators")
        >>> validated_df1 = validator.validate("dataset1", df1)
        >>> validated_df2 = validator.validate("dataset2", df2)
        >>> # Same validator instance, different datasets
        """
        self.config_path = Path(config_path) / "pandera_schemas"
        self.parameters = parameters or PanderaParameters()

    @classmethod
    def in_directory(cls, path: str | Path, parameters: PanderaParameters | None = None) -> "PanderaValidator":
        """
        Create a PanderaValidator from a configuration directory (factory method).

        This is the recommended factory method for creating PanderaValidator
        instances. It provides a consistent interface with other toolkit components
        (e.g., ``ValidatedDataCatalog.in_directory()``, ``KedroDataCatalog.in_directory()``)
        and follows the factory pattern for object construction from configuration.

        The method is functionally equivalent to calling the constructor directly,
        but provides better semantic clarity in code that uses multiple toolkit
        components with directory-based configuration.

        Parameters
        ----------
        path : str or Path
            Path to the root configuration directory where Pandera schema scripts
            are stored. The validator will use a ``pandera_schemas`` subdirectory
            under this path. Can be provided as either a string or pathlib.Path
            object. The path can be absolute or relative to the current working
            directory.

            For example, if path is ``"config/validators"``, schema scripts will
            be stored at ``config/validators/pandera_schemas/{category}/{dataset}.py``.
        parameters : PanderaParameters or None, optional
            Configuration parameters controlling validation behavior. If None
            (default), uses ``PanderaParameters()`` with default settings
            (``lazy=True`` for comprehensive error reporting). Provide a custom
            ``PanderaParameters`` instance to configure validation strategy:

            - ``PanderaParameters(lazy=True)``: Collect all errors (recommended)
            - ``PanderaParameters(lazy=False)``: Fail-fast on first error

        Returns
        -------
        PanderaValidator
            A new validator instance configured to use schema scripts from the
            specified directory. The returned validator is ready to use for
            validation operations via the ``validate()`` method.

        Raises
        ------
        TypeError
            If path cannot be converted to a Path object (e.g., if an invalid
            type is provided like int or dict).

        See Also
        --------
        __init__ : Alternative constructor for creating validators.
        validate : Perform validation on a dataset.
        PanderaParameters : Configuration parameters for validation behavior.
        adc_toolkit.data.ValidatedDataCatalog.in_directory : Similar factory method
            for creating validated data catalogs.

        Notes
        -----
        **Factory Pattern**

        This method implements the factory pattern, providing a standard interface
        for creating validators from directory-based configuration. The pattern is
        used consistently across the toolkit:

        .. code-block:: python

            # Similar patterns across toolkit components
            catalog = KedroDataCatalog.in_directory("config/")
            validator = PanderaValidator.in_directory("config/validators")
            gx_validator = GXValidator.in_directory("config/gx")

        **Semantic Clarity**

        Using ``in_directory()`` instead of the constructor makes code more
        readable and self-documenting:

        .. code-block:: python

            # Clear intent: validator configured from this directory
            validator = PanderaValidator.in_directory("config/validators")

            # vs. less clear constructor call
            validator = PanderaValidator("config/validators")

        **Design Rationale**

        The factory method pattern is preferred over direct constructor calls in
        the toolkit because:

        1. Provides consistent API across all components
        2. Makes the configuration-from-directory pattern explicit
        3. Allows future extension with additional factory methods
        4. Improves code readability and maintainability

        **Usage in ValidatedDataCatalog**

        This method is commonly used when configuring ``ValidatedDataCatalog``
        with a custom validator:

        .. code-block:: python

            from adc_toolkit.data import ValidatedDataCatalog
            from adc_toolkit.data.validators.pandera import PanderaValidator

            catalog = ValidatedDataCatalog.in_directory(
                path="config", validator=PanderaValidator.in_directory("config/validators")
            )

        Examples
        --------
        Create a validator using the factory method (recommended):

        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
        >>> validator = PanderaValidator.in_directory("config/validators")
        >>> validator.config_path
        PosixPath('config/validators/pandera_schemas')

        Create a validator with custom parameters:

        >>> from adc_toolkit.data.validators.pandera import PanderaParameters
        >>> params = PanderaParameters(lazy=False)
        >>> validator = PanderaValidator.in_directory(path="config/validators", parameters=params)
        >>> validator.parameters.lazy
        False

        Using pathlib.Path:

        >>> from pathlib import Path
        >>> config_dir = Path("config") / "validators"
        >>> validator = PanderaValidator.in_directory(path=config_dir)

        Integration with ValidatedDataCatalog:

        >>> from adc_toolkit.data import ValidatedDataCatalog
        >>> catalog = ValidatedDataCatalog.in_directory(
        ...     path="config", validator=PanderaValidator.in_directory("config/validators")
        ... )
        >>> # ValidatedDataCatalog uses the PanderaValidator for all load/save ops

        Consistent API across different validators:

        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
        >>> from adc_toolkit.data.validators.gx import GXValidator
        >>> pandera_val = PanderaValidator.in_directory("config/pandera")
        >>> gx_val = GXValidator.in_directory("config/gx")
        >>> # Both use the same factory method pattern

        Multiple validators for different environments:

        >>> dev_validator = PanderaValidator.in_directory("config/dev/validators")
        >>> staging_validator = PanderaValidator.in_directory("config/staging/validators")
        >>> prod_validator = PanderaValidator.in_directory("config/prod/validators")
        >>> # Each environment can have different validation rules

        Dependency injection pattern:

        >>> def create_pipeline(validator_path: str):
        ...     validator = PanderaValidator.in_directory(validator_path)
        ...     return DataPipeline(validator=validator)
        >>> # Easy to swap validator configurations
        >>> dev_pipeline = create_pipeline("config/dev/validators")
        >>> prod_pipeline = create_pipeline("config/prod/validators")
        """
        return cls(path, parameters=parameters)

    def validate(self, name: str, data: Data) -> Data:
        """
        Validate a dataset against its Pandera schema.

        This is the primary validation method that implements the DataValidator
        protocol. It orchestrates the complete validation workflow, from automatic
        schema generation (if needed) to validation execution, providing seamless
        data quality checking with minimal setup.

        The method delegates to the ``validate_data`` function, which implements
        a two-phase validation approach:

        **Phase 1: Schema Preparation** (First Use Only)
            If no schema script exists for the dataset name, automatically generate
            one by introspecting the data structure. The generated schema is saved
            at ``{self.config_path}/{category}/{dataset}.py`` and serves as an
            editable template for adding custom validation rules.

        **Phase 2: Validation Execution** (Every Use)
            Load the schema script as a Python module, extract the
            ``DataFrameSchema`` object, and execute validation using Pandera's
            ``schema.validate(data, lazy=self.parameters.lazy)`` method. Return
            validated data if all checks pass, or raise ``PanderaValidationError``
            with comprehensive error details if validation fails.

        This design enables rapid prototyping (no upfront schema creation required)
        while supporting iterative refinement (schemas can be customized after
        auto-generation). Schema scripts are version-controlled Python files,
        facilitating team collaboration and schema evolution tracking.

        Parameters
        ----------
        name : str
            The dataset name/identifier that determines which schema script to use.
            Should follow the convention ``"category.dataset_name"`` (e.g.,
            ``"raw.customers"``, ``"processed.sales"``). The name serves multiple
            purposes:

            - Determines schema file location: ``{self.config_path}/{category}/{dataset}.py``
            - Provides context in validation error messages
            - Enables logical organization of schemas by data pipeline stage

            Names with a single dot separator create a two-level directory
            structure. For example, ``"raw.customers"`` creates a schema at
            ``{self.config_path}/raw/customers.py``.
        data : Data
            The data object to validate. Must be a protocol-compliant Data object
            (pandas DataFrame, PySpark DataFrame, etc.) with ``columns`` and
            ``dtypes`` attributes. The data structure is validated against the
            schema defined in (or auto-generated for) the corresponding schema
            script.

            Supported types:

            - ``pandas.DataFrame``: Primary use case with full feature support
            - ``pyspark.sql.DataFrame``: Requires pyspark installation

        Returns
        -------
        Data
            The validated data object. If validation passes all checks defined in
            the schema, returns the original data object (potentially with
            Pandera-applied type coercions if configured in the schema). The return
            type matches the input data type (pandas in, pandas out; PySpark in,
            PySpark out).

            The returned data can be used immediately in downstream processing with
            confidence that it meets all defined quality requirements.

        Raises
        ------
        PanderaValidationError
            Raised when data validation fails against the schema. This custom
            exception wraps Pandera's underlying SchemaError or SchemaErrors and
            enriches it with additional context.

            Attributes of PanderaValidationError:

            - ``table_name``: The dataset name that failed validation
            - ``schema_path``: Full filesystem path to the schema script file
            - ``original_error``: The underlying Pandera error with detailed
              validation failure information (row indices, column names, observed
              values, expected constraints)

            With ``lazy=True`` (default), the exception includes all validation
            errors across the entire dataset. With ``lazy=False``, it includes only
            the first error encountered.
        ValueError
            Raised if the dataframe type is not supported (neither pandas nor
            pyspark), originating from the schema compiler during auto-generation.
        ModuleNotFoundError
            Raised if the schema script cannot be imported, for example because
            it imports a module that is not installed. Note that a Python syntax
            error in a manually edited schema file normally surfaces as
            ``SyntaxError`` rather than ``ModuleNotFoundError``. Check the schema
            file for syntax errors or import statement issues.
        AttributeError
            Raised if the schema script module does not define a ``schema``
            attribute, indicating the schema file structure is invalid. The schema
            file must contain ``schema = pa.DataFrameSchema(...)`` at module level.
        OSError
            Raised if there are file system permissions issues preventing schema
            file creation (during auto-generation) or reading (during validation).

        See Also
        --------
        validate_data : The underlying validation function called by this method.
        PanderaParameters : Configuration parameters controlling validation behavior.
        PanderaValidationError : Custom exception raised on validation failure.
        in_directory : Factory method for creating validator instances.
        adc_toolkit.data.ValidatedDataCatalog : Data catalog with integrated validation.

        Notes
        -----
        **Validation Workflow**

        The complete sequence executed by this method:

        1. Check if schema file exists at ``{self.config_path}/{category}/{dataset}.py``
        2. If missing, auto-generate schema by introspecting data structure
        3. Load schema file as a Python module using dynamic import
        4. Extract the ``schema`` DataFrameSchema object from the module
        5. Call ``schema.validate(data, lazy=self.parameters.lazy)``
        6. Return validated data if all checks pass
        7. Raise PanderaValidationError with context if validation fails
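        Steps 3 and 4 above can be sketched with the standard library's
        ``importlib``. This is an illustration under stated assumptions, not the
        toolkit's loader: ``load_schema`` is a hypothetical name, and the only
        convention taken from this docstring is the module-level ``schema``
        attribute.

```python
import importlib.util
from pathlib import Path


def load_schema(schema_file):
    # Hypothetical loader: import a schema script and return its `schema` object.
    path = Path(schema_file)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    # Executes the script; a syntax error in the file raises SyntaxError here.
    spec.loader.exec_module(module)
    # Raises AttributeError if the script defines no module-level `schema`.
    return module.schema
```

        Because the script is executed on every load, edits to the schema file
        take effect on the next call without restarting the process.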

        **Schema Customization After Auto-Generation**

        After the first validation, edit the generated schema file to add
        domain-specific validation rules:

        .. code-block:: python

            # {self.config_path}/raw/customers.py (auto-generated)
            import pandera.pandas as pa

            # Insert additional checks into each column's `checks` list
            schema = pa.DataFrameSchema(
                {
                    "customer_id": pa.Column(
                        "int64",
                        checks=[
                            pa.Check.greater_than(0),  # Add: IDs must be positive
                            pa.Check(lambda s: s.is_unique, element_wise=False),  # Add: unique
                        ],
                    ),
                    "email": pa.Column(
                        "object",
                        checks=[
                            pa.Check(lambda s: s.str.contains("@")),  # Add: must contain "@"
                        ],
                    ),
                    "age": pa.Column(
                        "int64",
                        checks=[
                            pa.Check.in_range(0, 120),  # Add: realistic age range
                        ],
                    ),
                }
            )

        **Error Reporting: Lazy vs. Fail-Fast**

        The ``parameters.lazy`` setting significantly affects error reporting:

        **Lazy Mode (lazy=True, default)**: Recommended for production
            - Collects all validation errors across entire dataset
            - Provides comprehensive error report in single validation run
            - Higher overhead but better developer experience
            - Example: "Found 47 validation errors in columns 'age', 'email'"

        **Fail-Fast Mode (lazy=False)**: Useful for debugging
            - Raises exception on first validation failure
            - Lower overhead, faster failure detection
            - Requires multiple validation runs to find all errors
            - Example: "Row 23: age value 150 exceeds maximum 120"
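        The difference between the two modes can be illustrated without Pandera.
        The sketch below is a simplified stand-in: ``run_checks`` and its error
        messages are hypothetical and do not reproduce Pandera's API or output.
        It only shows the control flow of collecting every failure versus
        stopping at the first one.

```python
def run_checks(rows, checks, lazy=True):
    # Illustrative only: mimic lazy vs. fail-fast error collection.
    errors = []
    for index, row in enumerate(rows):
        for label, check in checks.items():
            if not check(row):
                message = f"row {index}: failed {label}"
                if not lazy:
                    raise ValueError(message)  # Fail-fast: stop at first failure
                errors.append(message)  # Lazy: remember it and keep going
    if errors:
        raise ValueError("; ".join(errors))
    return rows


rows = [{"id": 1, "age": 25}, {"id": -2, "age": 150}]
checks = {"id > 0": lambda r: r["id"] > 0, "age <= 120": lambda r: r["age"] <= 120}

for lazy in (True, False):
    try:
        run_checks(rows, checks, lazy=lazy)
    except ValueError as exc:
        print(f"lazy={lazy}: {exc}")
```

        With ``lazy=True`` both failures in the second row are reported together;
        with ``lazy=False`` only the first one is.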

        **Performance Considerations**

        - Schema scripts are dynamically imported on each validation call. Reusing
          one validator instance across validations avoids repeated construction,
          but the import cost still recurs in high-frequency scenarios (e.g.,
          streaming data).
        - Auto-generation only occurs once per dataset. After initial schema
          creation, there's no performance penalty for the auto-generation feature.
        - Large dataset validation can be expensive. Consider sampling strategies
          for very large datasets if full validation is not required.

        **Thread Safety**

        This method is thread-safe for validation operations on existing schemas
        (multiple threads can call ``validate()`` concurrently on different datasets).
        However, the initial schema auto-generation is not thread-safe. If multiple
        threads validate the same dataset for the first time concurrently, race
        conditions may occur. For concurrent first-time validation, implement
        external locking or pre-generate schemas.

        **Integration with Data Pipelines**

        This method integrates seamlessly with data pipeline workflows:

        .. code-block:: python

            def pipeline_stage(validator, input_data):
                # Validate input from previous stage
                validated_input = validator.validate("stage_input", input_data)

                # Process with confidence that data meets requirements
                processed = transform(validated_input)

                # Validate output before passing to next stage
                validated_output = validator.validate("stage_output", processed)

                return validated_output

        Examples
        --------
        Basic validation with auto-generated schema:

        >>> import pandas as pd
        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
        >>> validator = PanderaValidator.in_directory("config/validators")
        >>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
        >>> validated = validator.validate("raw.customers", df)
        >>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
        >>> # Subsequent runs: uses existing schema
        >>> print(validated)
           id     name  age
        0   1    Alice   25
        1   2      Bob   30
        2   3  Charlie   35

        Validation failure with comprehensive error reporting (lazy=True):

        >>> df_invalid = pd.DataFrame(
        ...     {
        ...         "id": [1, -2, 3],  # Invalid: negative ID (if custom check added)
        ...         "name": ["Alice", "Bob", None],  # Invalid: null name
        ...         "age": [25, 30, 150],  # Invalid: unrealistic age (if custom check added)
        ...     }
        ... )
        >>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
        >>> try:
        ...     validator.validate("raw.customers", df_invalid)
        ... except PanderaValidationError as e:
        ...     print(f"Validation failed for table: {e.table_name}")
        ...     print(f"Schema location: {e.schema_path}")
        ...     print(f"Error details: {e.original_error}")
        ...     # All validation errors are reported together (lazy=True)

        Fail-fast validation for debugging:

        >>> from adc_toolkit.data.validators.pandera import PanderaParameters
        >>> validator_debug = PanderaValidator.in_directory(
        ...     path="config/validators", parameters=PanderaParameters(lazy=False)
        ... )
883        >>> try:
884        ...     validator_debug.validate("raw.customers", df_invalid)
885        ... except Exception as e:
886        ...     print(f"First error encountered: {e.original_error}")
887        ...     # Only the first validation error is reported (lazy=False)
888
889        Validate multiple datasets with same validator:
890
891        >>> validator = PanderaValidator.in_directory("config/validators")
892        >>> customers = validator.validate("raw.customers", customers_df)
893        >>> orders = validator.validate("raw.orders", orders_df)
894        >>> products = validator.validate("raw.products", products_df)
895        >>> # Reuse same validator instance for efficiency
896
897        Validation in a data processing pipeline:
898
899        >>> def process_customer_data():
900        ...     validator = PanderaValidator.in_directory("config/validators")
901        ...
902        ...     # Load raw data
903        ...     raw_df = load_raw_customers()
904        ...
905        ...     # Validate input
906        ...     validated_input = validator.validate("raw.customers", raw_df)
907        ...
908        ...     # Process with confidence
909        ...     processed_df = transform_customers(validated_input)
910        ...
911        ...     # Validate output
912        ...     validated_output = validator.validate("processed.customers", processed_df)
913        ...
914        ...     return validated_output
915
916        Using with PySpark DataFrame:
917
918        >>> from pyspark.sql import SparkSession
919        >>> spark = SparkSession.builder.getOrCreate()
920        >>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
921        >>> validator = PanderaValidator.in_directory("config/validators")
922        >>> validated_spark = validator.validate("raw.spark_customers", spark_df)
923        >>> # Returns validated PySpark DataFrame
924
925        Integration with ValidatedDataCatalog (automatic validation):
926
927        >>> from adc_toolkit.data import ValidatedDataCatalog
928        >>> catalog = ValidatedDataCatalog.in_directory(
929        ...     path="config", validator=PanderaValidator.in_directory("config/validators")
930        ... )
931        >>> # ValidatedDataCatalog internally calls validator.validate()
932        >>> df = catalog.load("raw.customers")  # Validates after loading
933        >>> catalog.save("processed.customers", df)  # Validates before saving
934        """
935        return validate_data(name, data, self.config_path, self.parameters)

Pandera-based data validator with automatic schema generation.

PanderaValidator is a concrete implementation of the DataValidator protocol that uses Pandera (https://pandera.readthedocs.io/) for schema-based data validation. It provides a seamless validation workflow that combines automatic schema generation with manual customization capabilities.

The validator orchestrates a two-phase approach to data validation:

Phase 1: Schema Management (Auto-generation). On first validation of a dataset, the validator automatically generates a Pandera schema script by introspecting the data structure (column names, data types). The generated schema is saved as an editable Python file at {config_path}/pandera_schemas/{category}/{dataset}.py, where the category and dataset name are derived from the validation name (e.g., "raw.customers" creates raw/customers.py).

Phase 2: Validation Execution (Rule Enforcement). On all validations (including first use), the validator loads the schema script and executes validation against the data using Pandera's DataFrameSchema.validate() method. If validation fails, it raises a detailed PanderaValidationError with comprehensive error information.

This design enables an iterative workflow:

Run validation immediately without manual schema creation
Review auto-generated schemas and add custom validation rules
Commit schemas to version control for team collaboration
Evolve schemas as data requirements change over time

The validator integrates seamlessly with ValidatedDataCatalog to provide automatic validation on all data load and save operations, ensuring data quality throughout the entire data pipeline.
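The two-phase flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not the toolkit's implementation: `validate_two_phase` and `SketchValidationError` are hypothetical names, the "schema" is reduced to a list of column names stored in a text file, and validation names are assumed to have the form `category.dataset`. Only the attribute names `table_name`, `schema_path`, and `original_error` are taken from the examples in this documentation.

```python
from pathlib import Path


class SketchValidationError(Exception):
    """Hypothetical stand-in for PanderaValidationError."""

    def __init__(self, table_name: str, schema_path: Path, original_error: Exception) -> None:
        super().__init__(f"Validation failed for {table_name!r} (schema: {schema_path})")
        self.table_name = table_name
        self.schema_path = schema_path
        self.original_error = original_error


def validate_two_phase(name: str, rows: list[dict], schema_dir: Path) -> list[dict]:
    """Generate a schema file on first use, then enforce it on every call."""
    # Assumption: names look like "category.dataset".
    category, dataset = name.split(".", 1)
    schema_path = schema_dir / category / f"{dataset}.txt"

    # Phase 1: create the schema only if it does not exist yet.
    if not schema_path.exists():
        schema_path.parent.mkdir(parents=True, exist_ok=True)
        columns = sorted(rows[0]) if rows else []
        schema_path.write_text("\n".join(columns))

    # Phase 2: always load the stored schema and check the data against it.
    expected = set(schema_path.read_text().split())
    try:
        for row in rows:
            missing = expected - set(row)
            if missing:
                raise ValueError(f"missing columns: {sorted(missing)}")
    except ValueError as err:
        # Wrap the low-level error with the table name and schema location.
        raise SketchValidationError(name, schema_path, err) from err
    return rows
```

The first call for a name writes the schema file and passes trivially; later calls fail if the data no longer matches the stored (possibly hand-edited) schema.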

Attributes

config_path (Path): The directory where Pandera schema scripts are stored and loaded from. The attribute holds the resolved schema directory, that is, the config_path constructor argument with pandera_schemas appended. For example, PanderaValidator("config/validators") sets config_path to Path("config/validators/pandera_schemas"), and schemas are stored at config/validators/pandera_schemas/{category}/{dataset}.py.
parameters (PanderaParameters): Configuration parameters controlling validation behavior. The primary parameter is lazy, which determines error collection strategy:
- lazy=True (default): Collects all validation errors across the entire dataset before raising an exception, providing comprehensive error reporting.
- lazy=False: Raises an exception immediately upon encountering the first validation failure, useful for debugging.
If None is provided during instantiation, defaults to PanderaParameters() with default settings (lazy=True).
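The difference between the two lazy settings can be illustrated without Pandera. The sketch below is plain Python and does not reflect Pandera's internals; `SketchParameters`, `run_checks`, `CHECKS`, and `BAD_ROW` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SketchParameters:
    """Hypothetical stand-in for PanderaParameters."""

    lazy: bool = True


# A check is a label plus a predicate over one row.
Check = tuple[str, Callable[[dict], bool]]


def run_checks(row: dict, checks: list[Check], params: SketchParameters) -> None:
    """Raise ValueError on the first failure (lazy=False) or after all checks (lazy=True)."""
    failures: list[str] = []
    for label, predicate in checks:
        if predicate(row):
            continue
        if not params.lazy:
            # Fail-fast: stop at the first failing check.
            raise ValueError(f"check failed: {label}")
        failures.append(label)
    if failures:
        # Lazy: report every failing check in one error.
        raise ValueError(f"checks failed: {', '.join(failures)}")


CHECKS: list[Check] = [
    ("id > 0", lambda r: r["id"] > 0),
    ("name not null", lambda r: r["name"] is not None),
    ("age < 120", lambda r: r["age"] < 120),
]
BAD_ROW = {"id": -2, "name": None, "age": 150}
```

With `BAD_ROW`, the lazy mode reports all three failing checks in a single error, while the fail-fast mode stops at `id > 0`.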

Parameters

config_path (str or Path): Path to the root configuration directory where Pandera schema scripts are stored. The validator will create a pandera_schemas subdirectory under this path to organize schema files. Can be provided as either a string or pathlib.Path object.
parameters (PanderaParameters or None, optional): Configuration parameters for validation behavior. If None (default), uses PanderaParameters() with default settings (lazy=True for comprehensive error reporting).

Raises

TypeError: If config_path cannot be converted to a Path object.
OSError: Raised during validation rather than construction (the constructor performs no I/O) if the schema directory does not exist and cannot be created during schema generation.

Notes

Schema Script Organization

Schema scripts are organized hierarchically based on validation names. For a validation name like "raw.customers", the schema script is created at:

{config_path}/pandera_schemas/raw/customers.py

This structure mirrors typical data lake or data warehouse naming conventions (e.g., database.table) and supports large projects with many datasets.
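The name-to-path mapping can be written as a small helper. `schema_path_for` is a hypothetical function, not part of the toolkit, and it assumes names of the form `category.dataset`; how the toolkit treats names without a dot is not covered here.

```python
from pathlib import PurePosixPath


def schema_path_for(config_path: str, name: str) -> PurePosixPath:
    """Map a validation name such as "raw.customers" to its schema script path."""
    # Assumption: the name has exactly the form "category.dataset".
    category, dataset = name.split(".", 1)
    return PurePosixPath(config_path) / "pandera_schemas" / category / f"{dataset}.py"
```

For example, `schema_path_for("config/validators", "raw.customers")` yields `config/validators/pandera_schemas/raw/customers.py`.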

Schema Customization Workflow

After first validation, edit the generated schema file to add custom checks:

# {config_path}/pandera_schemas/raw/customers.py
import pandas as pd  # needed for pd.Timestamp below
import pandera.pandas as pa

schema = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(
            "int64",
            checks=[
                pa.Check.greater_than(0),  # Must be positive
                pa.Check(lambda s: s.is_unique, element_wise=False),  # Must be unique
            ],
        ),
        "email": pa.Column(
            "object",
            checks=[
                # Vectorized check: the lambda receives the whole Series,
                # so the .str accessor is available (element_wise=False).
                pa.Check(lambda s: s.str.contains("@"), element_wise=False),
            ],
        ),
        "signup_date": pa.Column(
            "datetime64[ns]",
            checks=[
                # The timestamp is evaluated once, when the schema script is imported
                pa.Check.less_than_or_equal_to(pd.Timestamp.now()),
            ],
        ),
    }
)

Supported Data Types

The validator supports:

pandas DataFrames: Primary use case with full feature support
PySpark DataFrames: Generates PySpark-compatible schemas (requires pyspark)

Thread Safety

This class is thread-safe for validation operations on existing schemas. However, the initial schema auto-generation is not thread-safe. If multiple threads validate the same dataset for the first time concurrently, race conditions may occur. For concurrent scenarios, pre-generate schemas or implement external locking.
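One way to implement the external locking suggested above is to guard first-time generation with a lock. This is a minimal sketch, assuming all callers live in one process; `generate_schema_once` and `_generation_lock` are hypothetical names, and the `generate` callback stands in for whatever produces the schema text. Coordinating several processes would need a file lock instead.

```python
import threading
from pathlib import Path
from typing import Callable

# One process-wide lock serializes first-time schema generation.
_generation_lock = threading.Lock()


def generate_schema_once(schema_path: Path, generate: Callable[[], str]) -> bool:
    """Write the schema file if it is missing; return True if this call wrote it."""
    with _generation_lock:
        # Re-check inside the lock so only the first thread generates.
        if schema_path.exists():
            return False
        schema_path.parent.mkdir(parents=True, exist_ok=True)
        schema_path.write_text(generate())
        return True
```

Threads that arrive after the first one find the file in place and skip generation, so the schema is written exactly once.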

Performance Considerations

Schema scripts are dynamically imported on each validation call. For high-frequency scenarios, consider caching validator instances.
The lazy=True mode has slightly more overhead as it collects all errors, but provides significantly better developer experience for fixing issues.
Auto-generation only occurs once per dataset, so performance impact is negligible after initial schema creation.
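To make the cost of dynamic import concrete, the sketch below loads a Python file by path with the standard importlib machinery. It is an assumption about how such loading can be done, not a description of the toolkit's loader; `load_schema_module` is a hypothetical name.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_schema_module(schema_path: Path) -> ModuleType:
    """Import a Python file by path and return the resulting module."""
    spec = importlib.util.spec_from_file_location(schema_path.stem, schema_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load schema script: {schema_path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the script's top-level code, e.g. building a ``schema`` object.
    spec.loader.exec_module(module)
    return module
```

Every call executes the script's top-level code again, which is the per-validation overhead the first point refers to.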

Comparison with Great Expectations

Use PanderaValidator when:

You need lightweight, pandas-native validation
You prefer Python-based schema definitions over YAML/JSON
You want tight integration with type hints and static analysis
Your team is comfortable with code-based configuration

Use GXValidator when:

You need profiling and automatic expectation generation
You want data documentation websites (Data Docs)
You need enterprise features (cloud backends, data quality dashboards)
You prefer declarative YAML/JSON configuration

Examples

Create a validator using the factory method:

>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")

Create a validator using the constructor:

>>> from pathlib import Path
>>> validator = PanderaValidator(config_path=Path("config/validators"))

Create a validator with custom parameters for fail-fast mode:

>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters(lazy=False)
>>> validator = PanderaValidator(config_path="config/validators", parameters=params)

Basic validation workflow:

>>> import pandas as pd
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_df = validator.validate("raw.customers", df)
>>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
>>> # Subsequent runs: uses existing schema

Handle validation failures with comprehensive error reporting:

>>> df_invalid = pd.DataFrame(
...     {
...         "id": [1, -2, 3],  # Invalid: negative ID (if custom check added)
...         "name": ["Alice", "Bob", None],  # Invalid: null name
...         "age": [25, 30, 150],  # Invalid: unrealistic age (if custom check added)
...     }
... )
>>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
>>> try:
...     validator.validate("raw.customers", df_invalid)
... except PanderaValidationError as e:
...     print(f"Validation failed for: {e.table_name}")
...     print(f"Schema file: {e.schema_path}")
...     print(f"Errors: {e.original_error}")
...     # All validation errors are included (lazy=True)

Integration with ValidatedDataCatalog:

>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> df = catalog.load("raw.customers")  # Validates after loading
>>> catalog.save("processed.customers", df)  # Validates before saving

Iterative schema customization workflow:

>>> # Step 1: First validation auto-generates schema
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validator.validate("raw.customers", df)
>>>
>>> # Step 2: Edit generated schema to add custom checks
>>> # File: config/validators/pandera_schemas/raw/customers.py
>>> # Add: pa.Check.greater_than(0) to "id" column
>>>
>>> # Step 3: Subsequent validations use customized schema
>>> validator.validate("raw.customers", df)  # Now enforces custom rules

Validate PySpark DataFrame:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_spark = validator.validate("raw.spark_customers", spark_df)
>>> # Generates PySpark-compatible schema with pyspark.sql.types imports

Use in a data quality pipeline:

>>> def quality_check_pipeline(input_df):
...     validator = PanderaValidator.in_directory("config/validators")
...
...     # Validate raw input
...     validated_input = validator.validate("raw.data", input_df)
...
...     # Transform data
...     transformed = transform(validated_input)
...
...     # Validate transformed output
...     validated_output = validator.validate("processed.data", transformed)
...
...     return validated_output

Multiple validators for different environments:

>>> dev_validator = PanderaValidator.in_directory("config/validators/dev")
>>> prod_validator = PanderaValidator.in_directory("config/validators/prod")
>>> # Use different validation rules for dev vs. production

PanderaValidator(config_path: str | pathlib.Path, parameters: PanderaParameters | None = None)

    def __init__(self, config_path: str | Path, parameters: PanderaParameters | None = None) -> None:
        """
        Initialize a PanderaValidator instance.

        Constructs a new validator configured to use schema scripts from the
        specified directory. The constructor sets up the schema storage location
        and validation parameters, but does not perform any I/O operations or
        validation at initialization time.

        The validator creates a logical schema directory at
        ``{config_path}/pandera_schemas/`` where all Pandera schema scripts will
        be stored and loaded from. This subdirectory organization keeps Pandera
        schemas separate from other configuration files and validation frameworks
        (e.g., Great Expectations configurations).

        Parameters
        ----------
        config_path : str or Path
            Path to the root configuration directory. The validator will use a
            ``pandera_schemas`` subdirectory under this path for storing and
            loading schema scripts. Can be provided as either a string (which will
            be converted to a Path) or a pathlib.Path object. The path can be
            absolute or relative to the current working directory.

            For example, if config_path is ``"config/validators"``, schema scripts
            will be stored at ``config/validators/pandera_schemas/{category}/{dataset}.py``.
        parameters : PanderaParameters or None, optional
            Configuration parameters controlling validation behavior. If None
            (default), uses ``PanderaParameters()`` with default settings
            (``lazy=True`` for comprehensive error reporting). Provide a custom
            ``PanderaParameters`` instance to configure validation strategy:

            - ``PanderaParameters(lazy=True)``: Collect all errors (recommended)
            - ``PanderaParameters(lazy=False)``: Fail-fast on first error

        Returns
        -------
        None
            Constructor does not return a value.

        Raises
        ------
        TypeError
            If config_path cannot be converted to a Path object (e.g., if an
            invalid type is provided like int or dict).

        See Also
        --------
        in_directory : Alternative factory method for creating validators.
        validate : Perform validation on a dataset.
        PanderaParameters : Configuration parameters for validation behavior.

        Notes
        -----
        **Lazy Initialization**

        The constructor does not create the ``pandera_schemas`` directory at
        initialization time. The directory is created only when the first schema
        script is generated during validation. This lazy approach avoids
        unnecessary file system operations if the validator is created but never
        used.

        **Path Handling**

        The constructor automatically converts string paths to pathlib.Path objects
        and appends the ``pandera_schemas`` subdirectory. This means:

        .. code-block:: python

            validator = PanderaValidator(config_path="config")
            # validator.config_path is Path("config/pandera_schemas")

        **Immutability**

        While the validator instance itself is mutable (standard Python object),
        the ``parameters`` attribute uses a frozen dataclass (PanderaParameters),
        ensuring validation behavior remains consistent throughout the validator's
        lifecycle.

        **No Validation at Initialization**

        This constructor only sets up the validator configuration. No validation
        occurs until the ``validate()`` method is called. This design allows
        validators to be created cheaply and reused across multiple validation
        operations.

        Examples
        --------
        Create a validator with default parameters:

        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
        >>> validator = PanderaValidator(config_path="config/validators")
        >>> validator.config_path
        PosixPath('config/validators/pandera_schemas')
        >>> validator.parameters.lazy
        True

        Create a validator with custom parameters for fail-fast mode:

        >>> from adc_toolkit.data.validators.pandera import PanderaParameters
        >>> params = PanderaParameters(lazy=False)
        >>> validator = PanderaValidator(config_path="config/validators", parameters=params)
        >>> validator.parameters.lazy
        False

        Using pathlib.Path for config_path:

        >>> from pathlib import Path
        >>> config_dir = Path("config") / "validators"
        >>> validator = PanderaValidator(config_path=config_dir)
        >>> validator.config_path
        PosixPath('config/validators/pandera_schemas')

        Create multiple validators for different schema directories:

        >>> dev_validator = PanderaValidator(config_path="config/dev/validators")
        >>> prod_validator = PanderaValidator(config_path="config/prod/validators")
        >>> # Each validator uses a separate schema directory

        Reuse a validator for multiple validations:

        >>> validator = PanderaValidator(config_path="config/validators")
        >>> validated_df1 = validator.validate("dataset1", df1)
        >>> validated_df2 = validator.validate("dataset2", df2)
        >>> # Same validator instance, different datasets
        """
        self.config_path = Path(config_path) / "pandera_schemas"
        self.parameters = parameters or PanderaParameters()

Initialize a PanderaValidator instance.

Constructs a new validator configured to use schema scripts from the specified directory. The constructor sets up the schema storage location and validation parameters, but does not perform any I/O operations or validation at initialization time.

The validator creates a logical schema directory at {config_path}/pandera_schemas/ where all Pandera schema scripts will be stored and loaded from. This subdirectory organization keeps Pandera schemas separate from other configuration files and validation frameworks (e.g., Great Expectations configurations).

Parameters

config_path (str or Path): Path to the root configuration directory. The validator will use a pandera_schemas subdirectory under this path for storing and loading schema scripts. Can be provided as either a string (which will be converted to a Path) or a pathlib.Path object. The path can be absolute or relative to the current working directory.

For example, if config_path is "config/validators", schema scripts will be stored at config/validators/pandera_schemas/{category}/{dataset}.py.
parameters (PanderaParameters or None, optional): Configuration parameters controlling validation behavior. If None (default), uses PanderaParameters() with default settings (lazy=True for comprehensive error reporting). Provide a custom PanderaParameters instance to configure validation strategy:
- PanderaParameters(lazy=True): Collect all errors (recommended)
- PanderaParameters(lazy=False): Fail-fast on first error

Returns

None: Constructor does not return a value.

Raises

TypeError: If config_path cannot be converted to a Path object (e.g., if an invalid type is provided like int or dict).

Notes

Lazy Initialization

The constructor does not create the pandera_schemas directory at initialization time. The directory is created only when the first schema script is generated during validation. This lazy approach avoids unnecessary file system operations if the validator is created but never used.

Path Handling

The constructor automatically converts string paths to pathlib.Path objects and appends the pandera_schemas subdirectory. This means:

validator = PanderaValidator(config_path="config")
# validator.config_path is Path("config/pandera_schemas")

Immutability

While the validator instance itself is mutable (standard Python object), the parameters attribute uses a frozen dataclass (PanderaParameters), ensuring validation behavior remains consistent throughout the validator's lifecycle.
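The effect of a frozen dataclass can be shown with a stand-in. `SketchParameters` below is hypothetical and only mirrors the single `lazy` field described in this documentation; the real PanderaParameters may define more.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SketchParameters:
    """Hypothetical stand-in for PanderaParameters."""

    lazy: bool = True


params = SketchParameters()

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    params.lazy = False
    mutated = True
except FrozenInstanceError:
    mutated = False

# To change a setting, derive a new instance instead of mutating the old one.
fail_fast = replace(params, lazy=False)
```

Freezing protects the fields of the parameters object. Because the validator itself is mutable, assigning a different parameters object to the validator's `parameters` attribute remains possible.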

No Validation at Initialization

This constructor only sets up the validator configuration. No validation occurs until the validate() method is called. This design allows validators to be created cheaply and reused across multiple validation operations.

Examples

Create a validator with default parameters:

>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator(config_path="config/validators")
>>> validator.config_path
PosixPath('config/validators/pandera_schemas')
>>> validator.parameters.lazy
True

Create a validator with custom parameters for fail-fast mode:

>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters(lazy=False)
>>> validator = PanderaValidator(config_path="config/validators", parameters=params)
>>> validator.parameters.lazy
False

Using pathlib.Path for config_path:

>>> from pathlib import Path
>>> config_dir = Path("config") / "validators"
>>> validator = PanderaValidator(config_path=config_dir)
>>> validator.config_path
PosixPath('config/validators/pandera_schemas')

Create multiple validators for different schema directories:

>>> dev_validator = PanderaValidator(config_path="config/dev/validators")
>>> prod_validator = PanderaValidator(config_path="config/prod/validators")
>>> # Each validator uses a separate schema directory

Reuse a validator for multiple validations:

>>> validator = PanderaValidator(config_path="config/validators")
>>> validated_df1 = validator.validate("dataset1", df1)
>>> validated_df2 = validator.validate("dataset2", df2)
>>> # Same validator instance, different datasets

config_path

parameters

@classmethod

def in_directory(cls, path: str | pathlib.Path, parameters: PanderaParameters | None = None) -> PanderaValidator:

    @classmethod
    def in_directory(cls, path: str | Path, parameters: PanderaParameters | None = None) -> "PanderaValidator":
        """
        Create a PanderaValidator from a configuration directory (factory method).

        This is the recommended factory method for creating PanderaValidator
        instances. It provides a consistent interface with other toolkit components
        (e.g., ``ValidatedDataCatalog.in_directory()``, ``KedroDataCatalog.in_directory()``)
        and follows the factory pattern for object construction from configuration.

        The method is functionally equivalent to calling the constructor directly,
        but provides better semantic clarity in code that uses multiple toolkit
        components with directory-based configuration.

        Parameters
        ----------
        path : str or Path
            Path to the root configuration directory where Pandera schema scripts
            are stored. The validator will use a ``pandera_schemas`` subdirectory
            under this path. Can be provided as either a string or pathlib.Path
            object. The path can be absolute or relative to the current working
            directory.

            For example, if path is ``"config/validators"``, schema scripts will
            be stored at ``config/validators/pandera_schemas/{category}/{dataset}.py``.
        parameters : PanderaParameters or None, optional
            Configuration parameters controlling validation behavior. If None
            (default), uses ``PanderaParameters()`` with default settings
            (``lazy=True`` for comprehensive error reporting). Provide a custom
            ``PanderaParameters`` instance to configure validation strategy:

            - ``PanderaParameters(lazy=True)``: Collect all errors (recommended)
            - ``PanderaParameters(lazy=False)``: Fail-fast on first error

        Returns
        -------
        PanderaValidator
            A new validator instance configured to use schema scripts from the
            specified directory. The returned validator is ready to use for
            validation operations via the ``validate()`` method.

        Raises
        ------
        TypeError
            If path cannot be converted to a Path object (e.g., if an invalid
            type is provided like int or dict).

        See Also
        --------
        __init__ : Alternative constructor for creating validators.
        validate : Perform validation on a dataset.
        PanderaParameters : Configuration parameters for validation behavior.
        adc_toolkit.data.ValidatedDataCatalog.in_directory : Similar factory method
            for creating validated data catalogs.

        Notes
        -----
        **Factory Pattern**

        This method implements the factory pattern, providing a standard interface
        for creating validators from directory-based configuration. The pattern is
        used consistently across the toolkit:

        .. code-block:: python

            # Similar patterns across toolkit components
            catalog = KedroDataCatalog.in_directory("config/")
            validator = PanderaValidator.in_directory("config/validators")
            gx_validator = GXValidator.in_directory("config/gx")

        **Semantic Clarity**

        Using ``in_directory()`` instead of the constructor makes code more
        readable and self-documenting:

        .. code-block:: python

            # Clear intent: validator configured from this directory
            validator = PanderaValidator.in_directory("config/validators")

            # vs. less clear constructor call
            validator = PanderaValidator("config/validators")

        **Design Rationale**

        The factory method pattern is preferred over direct constructor calls in
        the toolkit because:

        1. Provides consistent API across all components
        2. Makes the configuration-from-directory pattern explicit
        3. Allows future extension with additional factory methods
        4. Improves code readability and maintainability

        **Usage in ValidatedDataCatalog**

        This method is commonly used when configuring ``ValidatedDataCatalog``
        with a custom validator:

        .. code-block:: python

            from adc_toolkit.data import ValidatedDataCatalog
            from adc_toolkit.data.validators.pandera import PanderaValidator

            catalog = ValidatedDataCatalog.in_directory(
                path="config", validator=PanderaValidator.in_directory("config/validators")
            )

        Examples
        --------
        Create a validator using the factory method (recommended):

        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
        >>> validator = PanderaValidator.in_directory("config/validators")
        >>> validator.config_path
        PosixPath('config/validators/pandera_schemas')

        Create a validator with custom parameters:

        >>> from adc_toolkit.data.validators.pandera import PanderaParameters
        >>> params = PanderaParameters(lazy=False)
        >>> validator = PanderaValidator.in_directory(path="config/validators", parameters=params)
        >>> validator.parameters.lazy
        False

        Using pathlib.Path:

        >>> from pathlib import Path
        >>> config_dir = Path("config") / "validators"
        >>> validator = PanderaValidator.in_directory(path=config_dir)

        Integration with ValidatedDataCatalog:

        >>> from adc_toolkit.data import ValidatedDataCatalog
        >>> catalog = ValidatedDataCatalog.in_directory(
        ...     path="config", validator=PanderaValidator.in_directory("config/validators")
        ... )
        >>> # ValidatedDataCatalog uses the PanderaValidator for all load/save ops

        Consistent API across different validators:

        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
        >>> from adc_toolkit.data.validators.gx import GXValidator
        >>> pandera_val = PanderaValidator.in_directory("config/pandera")
        >>> gx_val = GXValidator.in_directory("config/gx")
        >>> # Both use the same factory method pattern

        Multiple validators for different environments:

        >>> dev_validator = PanderaValidator.in_directory("config/dev/validators")
        >>> staging_validator = PanderaValidator.in_directory("config/staging/validators")
        >>> prod_validator = PanderaValidator.in_directory("config/prod/validators")
        >>> # Each environment can have different validation rules

        Dependency injection pattern:

        >>> def create_pipeline(validator_path: str):
        ...     validator = PanderaValidator.in_directory(validator_path)
        ...     return DataPipeline(validator=validator)
        >>> # Easy to swap validator configurations
        >>> dev_pipeline = create_pipeline("config/dev/validators")
        >>> prod_pipeline = create_pipeline("config/prod/validators")
        """
        return cls(path, parameters=parameters)

Create a PanderaValidator from a configuration directory (factory method).

This is the recommended factory method for creating PanderaValidator instances. It provides a consistent interface with other toolkit components (e.g., ValidatedDataCatalog.in_directory(), KedroDataCatalog.in_directory()) and follows the factory pattern for object construction from configuration.

The method is functionally equivalent to calling the constructor directly, but provides better semantic clarity in code that uses multiple toolkit components with directory-based configuration.
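The equivalence with the constructor follows from the method body, which simply forwards its arguments to `cls`. The sketch below reproduces that shape with stand-in classes; `SketchValidator` and `StrictSketchValidator` are hypothetical and exist only to show that a classmethod factory written this way also returns the right type for subclasses.

```python
from pathlib import Path


class SketchValidator:
    """Hypothetical stand-in mirroring the constructor and factory shown above."""

    def __init__(self, config_path, parameters=None):
        self.config_path = Path(config_path) / "pandera_schemas"
        self.parameters = parameters

    @classmethod
    def in_directory(cls, path, parameters=None):
        # Forward to the constructor of whichever class the method was called on.
        return cls(path, parameters=parameters)


class StrictSketchValidator(SketchValidator):
    """Subclass used only to show that the factory preserves the concrete type."""


direct = SketchValidator("config/validators")
via_factory = SketchValidator.in_directory("config/validators")
strict = StrictSketchValidator.in_directory("config/validators")
```

Here `direct` and `via_factory` end up with the same `config_path`, and `strict` is an instance of the subclass because the factory calls `cls` rather than a hard-coded class name.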

Parameters

path (str or Path): Path to the root configuration directory where Pandera schema scripts are stored. The validator will use a pandera_schemas subdirectory under this path. Can be provided as either a string or pathlib.Path object. The path can be absolute or relative to the current working directory.

For example, if path is "config/validators", schema scripts will be stored at config/validators/pandera_schemas/{category}/{dataset}.py.
parameters (PanderaParameters or None, optional): Configuration parameters controlling validation behavior. If None (default), uses PanderaParameters() with default settings (lazy=True for comprehensive error reporting). Provide a custom PanderaParameters instance to configure validation strategy:
- PanderaParameters(lazy=True): Collect all errors (recommended)
- PanderaParameters(lazy=False): Fail-fast on first error

Returns

PanderaValidator: A new validator instance configured to use schema scripts from the specified directory. The returned validator is ready to use for validation operations via the validate() method.

Raises

TypeError: If path cannot be converted to a Path object (e.g., if an invalid type is provided like int or dict).

Notes

Factory Pattern

This method implements the factory pattern, providing a standard interface for creating validators from directory-based configuration. The pattern is used consistently across the toolkit:

# Similar patterns across toolkit components
catalog = KedroDataCatalog.in_directory("config/")
validator = PanderaValidator.in_directory("config/validators")
gx_validator = GXValidator.in_directory("config/gx")

Semantic Clarity

Using in_directory() instead of the constructor makes code more readable and self-documenting:

# Clear intent: validator configured from this directory
validator = PanderaValidator.in_directory("config/validators")

# vs. less clear constructor call
validator = PanderaValidator("config/validators")

Design Rationale

The factory method pattern is preferred over direct constructor calls in the toolkit because:

- Provides consistent API across all components
- Makes the configuration-from-directory pattern explicit
- Allows future extension with additional factory methods
- Improves code readability and maintainability

Usage in ValidatedDataCatalog

This method is commonly used when configuring ValidatedDataCatalog with a custom validator:

from adc_toolkit.data import ValidatedDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator

catalog = ValidatedDataCatalog.in_directory(
    path="config", validator=PanderaValidator.in_directory("config/validators")
)

Examples

Create a validator using the factory method (recommended):

>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validator.config_path
PosixPath('config/validators/pandera_schemas')

Create a validator with custom parameters:

>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> params = PanderaParameters(lazy=False)
>>> validator = PanderaValidator.in_directory(path="config/validators", parameters=params)
>>> validator.parameters.lazy
False

Using pathlib.Path:

>>> from pathlib import Path
>>> config_dir = Path("config") / "validators"
>>> validator = PanderaValidator.in_directory(path=config_dir)

Integration with ValidatedDataCatalog:

>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> # ValidatedDataCatalog uses the PanderaValidator for all load/save ops

Consistent API across different validators:

>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> from adc_toolkit.data.validators.gx import GXValidator
>>> pandera_val = PanderaValidator.in_directory("config/pandera")
>>> gx_val = GXValidator.in_directory("config/gx")
>>> # Both use the same factory method pattern

Multiple validators for different environments:

>>> dev_validator = PanderaValidator.in_directory("config/dev/validators")
>>> staging_validator = PanderaValidator.in_directory("config/staging/validators")
>>> prod_validator = PanderaValidator.in_directory("config/prod/validators")
>>> # Each environment can have different validation rules

Dependency injection pattern:

>>> def create_pipeline(validator_path: str):
...     validator = PanderaValidator.in_directory(validator_path)
...     return DataPipeline(validator=validator)
>>> # Easy to swap validator configurations
>>> dev_pipeline = create_pipeline("config/dev/validators")
>>> prod_pipeline = create_pipeline("config/prod/validators")

def validate(self, name: str, data: adc_toolkit.data.abs.Data) -> adc_toolkit.data.abs.Data:

632    def validate(self, name: str, data: Data) -> Data:
633        """
634        Validate a dataset against its Pandera schema.
635
636        This is the primary validation method that implements the DataValidator
637        protocol. It orchestrates the complete validation workflow, from automatic
638        schema generation (if needed) to validation execution, providing seamless
639        data quality checking with minimal setup.
640
641        The method delegates to the ``validate_data`` function, which implements
642        a two-phase validation approach:
643
644        **Phase 1: Schema Preparation** (First Use Only)
645            If no schema script exists for the dataset name, automatically generate
646            one by introspecting the data structure. The generated schema is saved
647            at ``{self.config_path}/{category}/{dataset}.py`` and serves as an
648            editable template for adding custom validation rules.
649
650        **Phase 2: Validation Execution** (Every Use)
651            Load the schema script as a Python module, extract the
652            ``DataFrameSchema`` object, and execute validation using Pandera's
653            ``schema.validate(data, lazy=self.parameters.lazy)`` method. Return
654            validated data if all checks pass, or raise ``PanderaValidationError``
655            with comprehensive error details if validation fails.
656
657        This design enables rapid prototyping (no upfront schema creation required)
658        while supporting iterative refinement (schemas can be customized after
659        auto-generation). Schema scripts are version-controlled Python files,
660        facilitating team collaboration and schema evolution tracking.
661
662        Parameters
663        ----------
664        name : str
665            The dataset name/identifier that determines which schema script to use.
666            Should follow the convention ``"category.dataset_name"`` (e.g.,
667            ``"raw.customers"``, ``"processed.sales"``). The name serves multiple
668            purposes:
669
670            - Determines schema file location: ``{config_path}/{category}/{dataset}.py``
671            - Provides context in validation error messages
672            - Enables logical organization of schemas by data pipeline stage
673
674            Names with a single dot separator create a two-level directory
675            structure. For example, ``"raw.customers"`` creates a schema at
676            ``{self.config_path}/raw/customers.py``.
677        data : Data
678            The data object to validate. Must be a protocol-compliant Data object
679            (pandas DataFrame, PySpark DataFrame, etc.) with ``columns`` and
680            ``dtypes`` attributes. The data structure is validated against the
681            schema defined in (or auto-generated for) the corresponding schema
682            script.
683
684            Supported types:
685            - ``pandas.DataFrame``: Primary use case with full feature support
686            - ``pyspark.sql.DataFrame``: Requires pyspark installation
687
688        Returns
689        -------
690        Data
691            The validated data object. If validation passes all checks defined in
692            the schema, returns the original data object (potentially with
693            Pandera-applied type coercions if configured in the schema). The return
694            type matches the input data type (pandas in, pandas out; PySpark in,
695            PySpark out).
696
697            The returned data can be used immediately in downstream processing with
698            confidence that it meets all defined quality requirements.
699
700        Raises
701        ------
702        PanderaValidationError
703            Raised when data validation fails against the schema. This custom
704            exception wraps Pandera's underlying SchemaError or SchemaErrors and
705            enriches it with additional context:
706
707            Attributes of PanderaValidationError:
708            - ``table_name``: The dataset name that failed validation
709            - ``schema_path``: Full filesystem path to the schema script file
710            - ``original_error``: The underlying Pandera error with detailed
711              validation failure information (row indices, column names, observed
712              values, expected constraints)
713
714            With ``lazy=True`` (default), the exception includes all validation
715            errors across the entire dataset. With ``lazy=False``, it includes only
716            the first error encountered.
717        ValueError
718            Raised if the dataframe type is not supported (neither pandas nor
719            pyspark), originating from the schema compiler during auto-generation.
720        ModuleNotFoundError
721            Raised if the schema script, or a module it imports, cannot be
722            found. A Python syntax error in a manually edited schema file
723            typically surfaces as ``SyntaxError`` instead.
724        AttributeError
725            Raised if the schema script module does not define a ``schema``
726            attribute, indicating the schema file structure is invalid. The schema
727            file must contain ``schema = pa.DataFrameSchema(...)`` at module level.
728        OSError
729            Raised if there are file system permissions issues preventing schema
730            file creation (during auto-generation) or reading (during validation).
731
732        See Also
733        --------
734        validate_data : The underlying validation function called by this method.
735        PanderaParameters : Configuration parameters controlling validation behavior.
736        PanderaValidationError : Custom exception raised on validation failure.
737        in_directory : Factory method for creating validator instances.
738        adc_toolkit.data.ValidatedDataCatalog : Data catalog with integrated validation.
739
740        Notes
741        -----
742        **Validation Workflow**
743
744        The complete sequence executed by this method:
745
746        1. Check if schema file exists at ``{self.config_path}/{category}/{dataset}.py``
747        2. If missing, auto-generate schema by introspecting data structure
748        3. Load schema file as a Python module using dynamic import
749        4. Extract the ``schema`` DataFrameSchema object from the module
750        5. Call ``schema.validate(data, lazy=self.parameters.lazy)``
751        6. Return validated data if all checks pass
752        7. Raise PanderaValidationError with context if validation fails
753
754        **Schema Customization After Auto-Generation**
755
756        After first validation, edit the generated schema file to add domain-specific
757        validation rules:
758
759        .. code-block:: python
760
761            # {self.config_path}/raw/customers.py (auto-generated)
762            import pandera.pandas as pa
763
764            # Insert your additional checks to `checks` list parameter
765            schema = pa.DataFrameSchema(
766                {
767                    "customer_id": pa.Column(
768                        "int64",
769                        checks=[
770                            pa.Check.greater_than(0),  # Add: IDs must be positive
771                            pa.Check(lambda s: s.is_unique, element_wise=False),  # Add: unique
772                        ],
773                    ),
774                    "email": pa.Column(
775                        "object",
776                        checks=[
777                            pa.Check(lambda s: s.str.contains("@")),  # Add: valid email
778                        ],
779                    ),
780                    "age": pa.Column(
781                        "int64",
782                        checks=[
783                            pa.Check.in_range(0, 120),  # Add: realistic age range
784                        ],
785                    ),
786                }
787            )
788
789        **Error Reporting: Lazy vs. Fail-Fast**
790
791        The ``parameters.lazy`` setting significantly affects error reporting:
792
793        **Lazy Mode (lazy=True, default)**: Recommended for production
794            - Collects all validation errors across entire dataset
795            - Provides comprehensive error report in single validation run
796            - Higher overhead but better developer experience
797            - Example: "Found 47 validation errors in columns 'age', 'email'"
798
799        **Fail-Fast Mode (lazy=False)**: Useful for debugging
800            - Raises exception on first validation failure
801            - Lower overhead, faster failure detection
802            - Requires multiple validation runs to find all errors
803            - Example: "Row 23: age value 150 exceeds maximum 120"
804
805        **Performance Considerations**
806
807        - Schema scripts are dynamically imported on each validation call. For
808          high-frequency validation scenarios (e.g., streaming data), consider
809          caching the validator instance and reusing it across validations.
810        - Auto-generation only occurs once per dataset. After initial schema
811          creation, there's no performance penalty for the auto-generation feature.
812        - Large dataset validation can be expensive. Consider sampling strategies
813          for very large datasets if full validation is not required.
814
815        **Thread Safety**
816
817        This method is thread-safe for validation operations on existing schemas
818        (multiple threads can call ``validate()`` concurrently on different datasets).
819        However, the initial schema auto-generation is not thread-safe. If multiple
820        threads validate the same dataset for the first time concurrently, race
821        conditions may occur. For concurrent first-time validation, implement
822        external locking or pre-generate schemas.
823
824        **Integration with Data Pipelines**
825
826        This method integrates seamlessly with data pipeline workflows:
827
828        .. code-block:: python
829
830            def pipeline_stage(validator, input_data):
831                # Validate input from previous stage
832                validated_input = validator.validate("stage_input", input_data)
833
834                # Process with confidence that data meets requirements
835                processed = transform(validated_input)
836
837                # Validate output before passing to next stage
838                validated_output = validator.validate("stage_output", processed)
839
840                return validated_output
841
842        Examples
843        --------
844        Basic validation with auto-generated schema:
845
846        >>> import pandas as pd
847        >>> from adc_toolkit.data.validators.pandera import PanderaValidator
848        >>> validator = PanderaValidator.in_directory("config/validators")
849        >>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
850        >>> validated = validator.validate("raw.customers", df)
851        >>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
852        >>> # Subsequent runs: uses existing schema
853        >>> print(validated)
854           id     name  age
855        0   1    Alice   25
856        1   2      Bob   30
857        2   3  Charlie   35
858
859        Validation failure with comprehensive error reporting (lazy=True):
860
861        >>> df_invalid = pd.DataFrame(
862        ...     {
863        ...         "id": [1, -2, 3],  # Invalid: negative ID (if custom check added)
864        ...         "name": ["Alice", "Bob", None],  # Invalid: null name
865        ...         "age": [25, 30, 150],  # Invalid: unrealistic age (if custom check added)
866        ...     }
867        ... )
868        >>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
869        >>> try:
870        ...     validator.validate("raw.customers", df_invalid)
871        ... except PanderaValidationError as e:
872        ...     print(f"Validation failed for table: {e.table_name}")
873        ...     print(f"Schema location: {e.schema_path}")
874        ...     print(f"Error details: {e.original_error}")
875        ...     # All validation errors are reported together (lazy=True)
876
877        Fail-fast validation for debugging:
878
879        >>> from adc_toolkit.data.validators.pandera import PanderaParameters
880        >>> validator_debug = PanderaValidator.in_directory(
881        ...     path="config/validators", parameters=PanderaParameters(lazy=False)
882        ... )
883        >>> try:
884        ...     validator_debug.validate("raw.customers", df_invalid)
885        ... except PanderaValidationError as e:
886        ...     print(f"First error encountered: {e.original_error}")
887        ...     # Only the first validation error is reported (lazy=False)
888
889        Validate multiple datasets with same validator:
890
891        >>> validator = PanderaValidator.in_directory("config/validators")
892        >>> customers = validator.validate("raw.customers", customers_df)
893        >>> orders = validator.validate("raw.orders", orders_df)
894        >>> products = validator.validate("raw.products", products_df)
895        >>> # Reuse same validator instance for efficiency
896
897        Validation in a data processing pipeline:
898
899        >>> def process_customer_data():
900        ...     validator = PanderaValidator.in_directory("config/validators")
901        ...
902        ...     # Load raw data
903        ...     raw_df = load_raw_customers()
904        ...
905        ...     # Validate input
906        ...     validated_input = validator.validate("raw.customers", raw_df)
907        ...
908        ...     # Process with confidence
909        ...     processed_df = transform_customers(validated_input)
910        ...
911        ...     # Validate output
912        ...     validated_output = validator.validate("processed.customers", processed_df)
913        ...
914        ...     return validated_output
915
916        Using with PySpark DataFrame:
917
918        >>> from pyspark.sql import SparkSession
919        >>> spark = SparkSession.builder.getOrCreate()
920        >>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
921        >>> validator = PanderaValidator.in_directory("config/validators")
922        >>> validated_spark = validator.validate("raw.spark_customers", spark_df)
923        >>> # Returns validated PySpark DataFrame
924
925        Integration with ValidatedDataCatalog (automatic validation):
926
927        >>> from adc_toolkit.data import ValidatedDataCatalog
928        >>> catalog = ValidatedDataCatalog.in_directory(
929        ...     path="config", validator=PanderaValidator.in_directory("config/validators")
930        ... )
931        >>> # ValidatedDataCatalog internally calls validator.validate()
932        >>> df = catalog.load("raw.customers")  # Validates after loading
933        >>> catalog.save("processed.customers", df)  # Validates before saving
934        """
935        return validate_data(name, data, self.config_path, self.parameters)

Validate a dataset against its Pandera schema.

This is the primary validation method that implements the DataValidator protocol. It orchestrates the complete validation workflow, from automatic schema generation (if needed) to validation execution, providing seamless data quality checking with minimal setup.

The method delegates to the validate_data function, which implements a two-phase validation approach:

Phase 1: Schema Preparation (First Use Only) If no schema script exists for the dataset name, automatically generate one by introspecting the data structure. The generated schema is saved at {self.config_path}/{category}/{dataset}.py and serves as an editable template for adding custom validation rules.

Phase 2: Validation Execution (Every Use) Load the schema script as a Python module, extract the DataFrameSchema object, and execute validation using Pandera's schema.validate(data, lazy=self.parameters.lazy) method. Return validated data if all checks pass, or raise PanderaValidationError with comprehensive error details if validation fails.
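Loading a script by file path can fail in a few distinct ways, which the sketch below explores using only importlib from the standard library. Whether the toolkit loads schema scripts through this exact mechanism is an assumption, and `load_schema` and `failure_type` are hypothetical helpers.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_schema(schema_file: Path):
    """Import a script by file path and return its module-level `schema`."""
    spec = importlib.util.spec_from_file_location(schema_file.stem, schema_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.schema


def failure_type(source: str) -> str:
    """Write `source` to a temporary script, load it, and name what was raised."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "customers.py"
        script.write_text(source)
        try:
            load_schema(script)
        except Exception as exc:
            return type(exc).__name__
    return "ok"


outcomes = {
    "valid": failure_type("schema = object()\n"),
    "syntax error": failure_type("schema = (\n"),
    "no schema attribute": failure_type("other = 1\n"),
    "missing import": failure_type("import a_module_that_does_not_exist_xyz\n"),
}
```

In this sketch a syntax error surfaces as `SyntaxError`, a script without a `schema` name as `AttributeError`, and a failing import inside the script as `ModuleNotFoundError`.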

This design enables rapid prototyping (no upfront schema creation required) while supporting iterative refinement (schemas can be customized after auto-generation). Schema scripts are version-controlled Python files, facilitating team collaboration and schema evolution tracking.

Parameters

name (str): The dataset name/identifier that determines which schema script to use. Should follow the convention "category.dataset_name" (e.g., "raw.customers", "processed.sales"). The name serves multiple purposes:
- Determines schema file location: {config_path}/{category}/{dataset}.py
- Provides context in validation error messages
- Enables logical organization of schemas by data pipeline stage
Names with a single dot separator create a two-level directory structure. For example, "raw.customers" creates a schema at {self.config_path}/raw/customers.py.
data (Data): The data object to validate. Must be a protocol-compliant Data object (pandas DataFrame, PySpark DataFrame, etc.) with columns and dtypes attributes. The data structure is validated against the schema defined in (or auto-generated for) the corresponding schema script.

Supported types:
- pandas.DataFrame: Primary use case with full feature support
- pyspark.sql.DataFrame: Requires pyspark installation
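The naming convention for `name` can be sketched as a small path mapping. `schema_path_for` is a hypothetical helper, not a toolkit function. It rejects names that do not contain exactly one dot, because how the toolkit treats such names is not stated here.

```python
from pathlib import Path


def schema_path_for(config_path: Path, name: str) -> Path:
    """Map 'category.dataset' to '<config_path>/<category>/<dataset>.py'."""
    parts = name.split(".")
    if len(parts) != 2 or not all(parts):
        # Assumption: only the documented two-part form is handled.
        raise ValueError(f"expected 'category.dataset', got {name!r}")
    category, dataset = parts
    return config_path / category / f"{dataset}.py"


config_path = Path("config/validators") / "pandera_schemas"
customers_schema = schema_path_for(config_path, "raw.customers")
```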

Returns

Data: The validated data object. If validation passes all checks defined in the schema, returns the original data object (potentially with Pandera-applied type coercions if configured in the schema). The return type matches the input data type (pandas in, pandas out; PySpark in, PySpark out).

The returned data can be used immediately in downstream processing with confidence that it meets all defined quality requirements.

Raises

PanderaValidationError: Raised when data validation fails against the schema. This custom exception wraps Pandera's underlying SchemaError or SchemaErrors and enriches it with additional context:

Attributes of PanderaValidationError:

- table_name: The dataset name that failed validation
- schema_path: Full filesystem path to the schema script file
- original_error: The underlying Pandera error with detailed validation failure information (row indices, column names, observed values, expected constraints)

With lazy=True (default), the exception includes all validation errors across the entire dataset. With lazy=False, it includes only the first error encountered.

ValueError: Raised if the dataframe type is not supported (neither pandas nor pyspark), originating from the schema compiler during auto-generation.
ModuleNotFoundError: Raised if the schema script, or a module it imports, cannot be found. A Python syntax error in a manually edited schema file typically surfaces as SyntaxError instead.
AttributeError: Raised if the schema script module does not define a schema attribute, indicating the schema file structure is invalid. The schema file must contain schema = pa.DataFrameSchema(...) at module level.
OSError: Raised if there are file system permissions issues preventing schema file creation (during auto-generation) or reading (during validation).
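The context attributes listed for PanderaValidationError follow a common wrap-and-enrich pattern, sketched here in plain Python. `SketchValidationError`, `run_check`, and `validate_value` are hypothetical names. Only the three attribute names come from the description above, and the real exception's constructor signature may differ.

```python
class SketchValidationError(Exception):
    """Hypothetical wrapper carrying the three documented attributes."""

    def __init__(self, table_name: str, schema_path: str, original_error: Exception) -> None:
        super().__init__(
            f"Validation failed for '{table_name}' (schema: {schema_path}): {original_error}"
        )
        self.table_name = table_name
        self.schema_path = schema_path
        self.original_error = original_error


def run_check(value: int) -> int:
    """Stand-in for a schema check that rejects negative values."""
    if value < 0:
        raise ValueError(f"{value} is negative")
    return value


def validate_value(table_name: str, schema_path: str, value: int) -> int:
    try:
        return run_check(value)
    except ValueError as exc:
        # `from exc` also keeps the underlying error on __cause__ for tracebacks.
        raise SketchValidationError(table_name, schema_path, exc) from exc
```

A caller can then branch on `table_name` or log `schema_path` without parsing the error message, while `original_error` keeps the full detail.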

Notes

Validation Workflow

The complete sequence executed by this method:

1. Check if schema file exists at {self.config_path}/{category}/{dataset}.py
2. If missing, auto-generate schema by introspecting data structure
3. Load schema file as a Python module using dynamic import
4. Extract the schema DataFrameSchema object from the module
5. Call schema.validate(data, lazy=self.parameters.lazy)
6. Return validated data if all checks pass
7. Raise PanderaValidationError with context if validation fails
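The generate-once-then-load part of this sequence can be sketched with the standard library. This is an illustration under stated assumptions, not the toolkit's implementation. `get_schema`, `load_schema`, and `GENERATED_TEMPLATE` are hypothetical, and a plain dict stands in for the `pa.DataFrameSchema` a real script would define.

```python
import importlib.util
import tempfile
from pathlib import Path

# Stand-in for an auto-generated schema script (a real one would build a
# pa.DataFrameSchema; a dict keeps this sketch free of third-party imports).
GENERATED_TEMPLATE = 'schema = {{"columns": {columns!r}}}\n'


def load_schema(schema_file: Path):
    """Import the script by file path and return its `schema` attribute."""
    spec = importlib.util.spec_from_file_location(schema_file.stem, schema_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.schema


def get_schema(config_path: Path, name: str, columns: list):
    category, dataset = name.split(".")
    schema_file = config_path / category / f"{dataset}.py"
    if not schema_file.exists():
        # First use only: write a template derived from the data's columns.
        schema_file.parent.mkdir(parents=True, exist_ok=True)
        schema_file.write_text(GENERATED_TEMPLATE.format(columns=columns))
    # Every use: load whatever is on disk, including manual edits.
    return load_schema(schema_file)


with tempfile.TemporaryDirectory() as tmp:
    first = get_schema(Path(tmp), "raw.customers", ["id", "name"])
    # The file already exists, so this call ignores the new column list.
    second = get_schema(Path(tmp), "raw.customers", ["something", "else"])
```

The second call returns the schema written by the first, which is why edits to a generated file persist: generation never overwrites an existing script in this sketch.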

Schema Customization After Auto-Generation

After first validation, edit the generated schema file to add domain-specific validation rules:

# {self.config_path}/raw/customers.py (auto-generated)
import pandera.pandas as pa

# Insert your additional checks to `checks` list parameter
schema = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(
            "int64",
            checks=[
                pa.Check.greater_than(0),  # Add: IDs must be positive
                pa.Check(lambda s: s.is_unique, element_wise=False),  # Add: unique
            ],
        ),
        "email": pa.Column(
            "object",
            checks=[
                pa.Check(lambda s: s.str.contains("@")),  # Add: valid email
            ],
        ),
        "age": pa.Column(
            "int64",
            checks=[
                pa.Check.in_range(0, 120),  # Add: realistic age range
            ],
        ),
    }
)

Error Reporting: Lazy vs. Fail-Fast

The parameters.lazy setting significantly affects error reporting:

Lazy Mode (lazy=True, default): Recommended for production
- Collects all validation errors across entire dataset
- Provides comprehensive error report in single validation run
- Higher overhead but better developer experience
- Example: "Found 47 validation errors in columns 'age', 'email'"

Fail-Fast Mode (lazy=False): Useful for debugging
- Raises exception on first validation failure
- Lower overhead, faster failure detection
- Requires multiple validation runs to find all errors
- Example: "Row 23: age value 150 exceeds maximum 120"
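The difference between the two strategies can be shown with a small pure-Python sketch. It mirrors the idea of collecting versus stopping early and is not how Pandera implements `lazy`. `CheckFailure`, `validate_rows`, and the two check functions are hypothetical.

```python
def check_positive_id(index, row):
    if row["id"] <= 0:
        return f"row {index}: id {row['id']} is not positive"
    return None


def check_age_range(index, row):
    if not 0 <= row["age"] <= 120:
        return f"row {index}: age {row['age']} is outside 0-120"
    return None


CHECKS = [check_positive_id, check_age_range]


class CheckFailure(Exception):
    """Carries every failure message gathered before raising."""

    def __init__(self, failures):
        super().__init__("; ".join(failures))
        self.failures = failures


def validate_rows(rows, lazy=True):
    failures = []
    for index, row in enumerate(rows):
        for check in CHECKS:
            message = check(index, row)
            if message is None:
                continue
            if not lazy:
                # Fail-fast: stop at the first problem found.
                raise CheckFailure([message])
            failures.append(message)
    if failures:
        # Lazy: report everything found in a single pass.
        raise CheckFailure(failures)
    return rows


rows = [{"id": 1, "age": 25}, {"id": -2, "age": 30}, {"id": 3, "age": 150}]
```

With these rows, `validate_rows(rows, lazy=True)` raises once with both problems attached, while `lazy=False` raises as soon as the negative id is seen.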

Performance Considerations

- Schema scripts are dynamically imported on each validation call. For high-frequency validation scenarios (e.g., streaming data), consider caching the validator instance and reusing it across validations.
- Auto-generation only occurs once per dataset. After initial schema creation, there's no performance penalty for the auto-generation feature.
- Large dataset validation can be expensive. Consider sampling strategies for very large datasets if full validation is not required.

Thread Safety

This method is thread-safe for validation operations on existing schemas (multiple threads can call validate() concurrently on different datasets). However, the initial schema auto-generation is not thread-safe. If multiple threads validate the same dataset for the first time concurrently, race conditions may occur. For concurrent first-time validation, implement external locking or pre-generate schemas.
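One form of external locking is a process-wide lock around the check-then-create step, sketched below with the standard library. `ensure_schema` and the placeholder file content are hypothetical. In practice the locked region would wrap the first `validate()` call for a dataset, and an in-process lock offers no protection across separate processes.

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

_generation_lock = threading.Lock()
generation_count = 0  # counts how many times the file was actually written


def ensure_schema(schema_file: Path) -> Path:
    """Create the schema file once, even when called from several threads."""
    global generation_count
    with _generation_lock:
        # The existence check and the write share one lock, so only the
        # first caller generates the file.
        if not schema_file.exists():
            schema_file.parent.mkdir(parents=True, exist_ok=True)
            schema_file.write_text("schema = None  # placeholder\n")
            generation_count += 1
    return schema_file


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "customers.py"
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(ensure_schema, [target] * 8))
```

Pre-generating schemas in a single-threaded setup step, as suggested above, avoids the need for a lock altogether.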

Integration with Data Pipelines

This method integrates seamlessly with data pipeline workflows:

def pipeline_stage(validator, input_data):
    # Validate input from previous stage
    validated_input = validator.validate("stage_input", input_data)

    # Process with confidence that data meets requirements
    processed = transform(validated_input)

    # Validate output before passing to next stage
    validated_output = validator.validate("stage_output", processed)

    return validated_output

Examples

Basic validation with auto-generated schema:

>>> import pandas as pd
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> validator = PanderaValidator.in_directory("config/validators")
>>> df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>> validated = validator.validate("raw.customers", df)
>>> # First run: auto-generates schema at config/validators/pandera_schemas/raw/customers.py
>>> # Subsequent runs: uses existing schema
>>> print(validated)
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35

Validation failure with comprehensive error reporting (lazy=True):

>>> df_invalid = pd.DataFrame(
...     {
...         "id": [1, -2, 3],  # Invalid: negative ID (if custom check added)
...         "name": ["Alice", "Bob", None],  # Invalid: null name
...         "age": [25, 30, 150],  # Invalid: unrealistic age (if custom check added)
...     }
... )
>>> from adc_toolkit.data.validators.pandera.exceptions import PanderaValidationError
>>> try:
...     validator.validate("raw.customers", df_invalid)
... except PanderaValidationError as e:
...     print(f"Validation failed for table: {e.table_name}")
...     print(f"Schema location: {e.schema_path}")
...     print(f"Error details: {e.original_error}")
...     # All validation errors are reported together (lazy=True)

Fail-fast validation for debugging:

>>> from adc_toolkit.data.validators.pandera import PanderaParameters
>>> validator_debug = PanderaValidator.in_directory(
...     path="config/validators", parameters=PanderaParameters(lazy=False)
... )
>>> try:
...     validator_debug.validate("raw.customers", df_invalid)
... except PanderaValidationError as e:
...     print(f"First error encountered: {e.original_error}")
...     # Only the first validation error is reported (lazy=False)

Validate multiple datasets with same validator:

>>> validator = PanderaValidator.in_directory("config/validators")
>>> customers = validator.validate("raw.customers", customers_df)
>>> orders = validator.validate("raw.orders", orders_df)
>>> products = validator.validate("raw.products", products_df)
>>> # Reuse same validator instance for efficiency

Validation in a data processing pipeline:

>>> def process_customer_data():
...     validator = PanderaValidator.in_directory("config/validators")
...
...     # Load raw data
...     raw_df = load_raw_customers()
...
...     # Validate input
...     validated_input = validator.validate("raw.customers", raw_df)
...
...     # Process with confidence
...     processed_df = transform_customers(validated_input)
...
...     # Validate output
...     validated_output = validator.validate("processed.customers", processed_df)
...
...     return validated_output

Using with PySpark DataFrame:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, "Alice", 25), (2, "Bob", 30)], ["id", "name", "age"])
>>> validator = PanderaValidator.in_directory("config/validators")
>>> validated_spark = validator.validate("raw.spark_customers", spark_df)
>>> # Returns validated PySpark DataFrame

Integration with ValidatedDataCatalog (automatic validation):

>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory(
...     path="config", validator=PanderaValidator.in_directory("config/validators")
... )
>>> # ValidatedDataCatalog internally calls validator.validate()
>>> df = catalog.load("raw.customers")  # Validates after loading
>>> catalog.save("processed.customers", df)  # Validates before saving