adc_toolkit.data
Data handling module for the adc-toolkit.
This module provides the core data management infrastructure for adc-toolkit projects, combining configuration-driven data catalogs with automatic data validation to ensure data quality throughout ML and analytics pipelines.
The module's centerpiece is the ValidatedDataCatalog class, which transparently validates all data loading and saving operations. By enforcing data quality at catalog boundaries, it catches schema drift, data corruption, and constraint violations early in the pipeline, preventing invalid data from propagating to downstream systems.
The architecture uses protocol-based dependency injection, allowing flexible combinations of catalog implementations (for I/O) and validator implementations (for quality checks). Default implementations provide production-ready functionality with Kedro-based catalogs and Great Expectations or Pandera validators.
Classes
ValidatedDataCatalog
Main user-facing API combining a data catalog with automatic validation.
Factory method: ValidatedDataCatalog.in_directory(path).
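The factory idea can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in, not the toolkit's implementation: the `Toy*` classes are invented for illustration, and the `catalog_class` keyword is an assumption (only `validator_class` appears in the examples on this page).

```python
from pathlib import Path
from typing import Type, Union


class ToyCatalog:
    """Hypothetical catalog that only remembers its configuration directory."""

    def __init__(self, config_dir: Path) -> None:
        self.config_dir = config_dir

    @classmethod
    def in_directory(cls, path: Union[str, Path]) -> "ToyCatalog":
        return cls(Path(path))


class ToyValidator:
    """Hypothetical validator built from the same directory."""

    def __init__(self, config_dir: Path) -> None:
        self.config_dir = config_dir

    @classmethod
    def in_directory(cls, path: Union[str, Path]) -> "ToyValidator":
        return cls(Path(path))


class StrictToyValidator(ToyValidator):
    """Alternative implementation, selected through validator_class."""


class ToyValidatedCatalog:
    """Hypothetical wrapper holding one catalog and one validator."""

    def __init__(self, catalog: ToyCatalog, validator: ToyValidator) -> None:
        self.catalog = catalog
        self.validator = validator

    @classmethod
    def in_directory(
        cls,
        path: Union[str, Path],
        catalog_class: Type[ToyCatalog] = ToyCatalog,  # assumed keyword
        validator_class: Type[ToyValidator] = ToyValidator,
    ) -> "ToyValidatedCatalog":
        # Each component is built by its own factory from the same directory.
        return cls(catalog_class.in_directory(path), validator_class.in_directory(path))


default = ToyValidatedCatalog.in_directory("config/data")
strict = ToyValidatedCatalog.in_directory("config/data", validator_class=StrictToyValidator)
```

The point of the sketch is that callers name only a directory and, optionally, the classes to use. How each component reads its configuration stays inside that component's own `in_directory`.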
Protocols
Data
Protocol for data objects (requires columns and dtypes properties).
DataCatalog
Protocol for catalog implementations (load, save, in_directory methods).
DataValidator
Protocol for validator implementations (validate, in_directory methods).
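To show what structural subtyping means for these protocols, here is a rough sketch using `typing.Protocol`. The definitions are hypothetical stand-ins, not the contents of `adc_toolkit.data.abs`. The method names follow the descriptions above, while the exact signatures and the `Table` class are assumptions made for illustration.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Data(Protocol):
    """Hypothetical stand-in: any object exposing columns and dtypes."""

    @property
    def columns(self) -> Any: ...

    @property
    def dtypes(self) -> Any: ...


class DataValidator(Protocol):
    """Hypothetical stand-in: validates a named dataset and returns it."""

    def validate(self, name: str, data: Data) -> Data: ...


class Table:
    """Toy data object; note that it does not inherit from Data."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows

    @property
    def columns(self) -> list[str]:
        return list(self.rows[0]) if self.rows else []

    @property
    def dtypes(self) -> dict[str, type]:
        return {k: type(v) for k, v in self.rows[0].items()} if self.rows else {}


table = Table([{"id": 1, "name": "a"}])
# The runtime check only looks for the required attributes, not for a base class.
is_data = isinstance(table, Data)
```

Because conformance is decided by shape, any object with the required attributes, such as a pandas DataFrame, can satisfy a protocol like this without subclassing it.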
Submodules
catalogs
Data catalog implementations, including KedroDataCatalog.
validators
Data validator implementations: GXValidator, PanderaValidator, NoValidator.
abs
Protocol definitions for Data, DataCatalog, and DataValidator.
catalog
ValidatedDataCatalog implementation.
default_attributes
Factory functions for default catalog and validator instances.
See Also
adc_toolkit.data.catalog.ValidatedDataCatalog: Primary data catalog API.
adc_toolkit.data.catalogs.kedro.KedroDataCatalog: Kedro catalog implementation.
adc_toolkit.data.validators.gx.GXValidator: Great Expectations validator.
adc_toolkit.data.validators.pandera.PanderaValidator: Pandera validator.
Notes
The module follows these design patterns:
- Factory Pattern: Use in_directory(path) class methods to instantiate catalogs and validators from configuration directories.
- Strategy Pattern: Swap catalog and validator implementations without changing downstream code.
- Protocol-based Design: Type safety through structural subtyping rather than inheritance.
- Dependency Injection: Pass catalog and validator as constructor arguments for testability and flexibility.
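The strategy and dependency-injection points can be sketched as follows. This is an illustration under stated assumptions, not toolkit code: `PassThroughValidator`, `RequiredKeysValidator`, and `count_valid_rows` are hypothetical names, and plain lists of dicts stand in for real datasets.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

Rows = list[dict[str, object]]


class Validator(Protocol):
    def validate(self, name: str, data: Rows) -> Rows: ...


class PassThroughValidator:
    """Strategy 1 (hypothetical): accept everything."""

    def validate(self, name: str, data: Rows) -> Rows:
        return data


@dataclass
class RequiredKeysValidator:
    """Strategy 2 (hypothetical): every row must contain the required keys."""

    required: frozenset

    def validate(self, name: str, data: Rows) -> Rows:
        for row in data:
            missing = self.required - set(row)
            if missing:
                raise ValueError(f"{name}: missing keys {sorted(missing)}")
        return data


def count_valid_rows(validator: Validator, name: str, data: Rows) -> int:
    """Downstream code depends only on the Validator protocol."""
    return len(validator.validate(name, data))


rows: Rows = [{"id": 1, "region": "EU"}, {"id": 2}]

# Same downstream function, two injected strategies.
lenient_count = count_valid_rows(PassThroughValidator(), "sales", rows)

strict_error: Optional[str] = None
try:
    count_valid_rows(RequiredKeysValidator(frozenset({"id", "region"})), "sales", rows)
except ValueError as exc:
    strict_error = str(exc)
```

Passing the validator in as an argument is also what makes the downstream function easy to test: a test can inject a trivial validator instead of a fully configured one.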
The configuration-driven approach enables:
- Declarative dataset definitions (where data lives, how to load/save it)
- Environment-specific configurations (dev, staging, production)
- Separation of data access logic from business logic
- Reproducible data pipelines with version-controlled configurations
Data validation workflow:
- On load: catalog.load() -> validator.validate() -> return validated data
- On save: validator.validate() -> catalog.save() -> no invalid data persisted
This ensures data quality is enforced at every catalog boundary.
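The ordering of the two workflows can be made concrete with a toy wrapper. This is a sketch of the idea only, not the ValidatedDataCatalog implementation. `InMemoryCatalog`, `ValidatingCatalog`, and `no_empty_rows` are hypothetical, and a dict stands in for real storage.

```python
from typing import Callable

Rows = list[dict[str, object]]


class InMemoryCatalog:
    """Toy catalog: datasets live in a dict instead of files."""

    def __init__(self) -> None:
        self.store: dict[str, Rows] = {}

    def load(self, name: str) -> Rows:
        return self.store[name]

    def save(self, name: str, data: Rows) -> None:
        self.store[name] = data


class ValidatingCatalog:
    """Toy wrapper that validates at both catalog boundaries."""

    def __init__(self, catalog: InMemoryCatalog, validate: Callable[[str, Rows], Rows]) -> None:
        self.catalog = catalog
        self.validate = validate

    def load(self, name: str) -> Rows:
        # On load: read first, then validate what was read.
        return self.validate(name, self.catalog.load(name))

    def save(self, name: str, data: Rows) -> None:
        # On save: validate first, so a failure never reaches storage.
        self.catalog.save(name, self.validate(name, data))


def no_empty_rows(name: str, data: Rows) -> Rows:
    """Hypothetical check: reject datasets containing an empty row."""
    if any(not row for row in data):
        raise ValueError(f"{name}: empty row found")
    return data


inner = InMemoryCatalog()
catalog = ValidatingCatalog(inner, no_empty_rows)

catalog.save("clean", [{"id": 1}])
try:
    catalog.save("dirty", [{"id": 1}, {}])  # rejected before anything is written
except ValueError:
    pass

saved_names = sorted(inner.store)
```

In this sketch the rejected dataset never appears in the underlying store, which is the property the save workflow above describes.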
Examples
Basic usage with default catalog and validator:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory("config/data")
>>> df = catalog.load("customer_data")  # Automatically validated
>>> processed = transform(df)  # transform: your own processing function, defined elsewhere
>>> catalog.save("processed_data", processed)  # Validated before saving
Using a custom validator:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> catalog = ValidatedDataCatalog.in_directory("config/data", validator_class=PanderaValidator)
Working directly with protocols for type annotations:
>>> from adc_toolkit.data.abs import DataCatalog, DataValidator, Data
>>> def pipeline(catalog: DataCatalog, validator: DataValidator) -> Data:
... raw = catalog.load("raw_data")
... validated = validator.validate("raw_data", raw)
... return validated
Complete pipeline example:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>>
>>> # Initialize validated catalog
>>> catalog = ValidatedDataCatalog.in_directory("config/production")
>>>
>>> # Load and process with automatic validation
>>> raw_sales = catalog.load("raw_sales")
>>> cleaned_sales = raw_sales.dropna()
>>> catalog.save("cleaned_sales", cleaned_sales)
>>>
>>> aggregated = cleaned_sales.groupby("region").sum()
>>> catalog.save("sales_summary", aggregated)
>>> # All loads validated on read, all saves validated on write