adc_toolkit
ADC Toolkit: A Python Framework for Validated Data Pipelines
The adc-toolkit provides a structured approach to data handling in data science and machine learning projects. It combines configuration-driven data catalogs with automatic schema validation, ensuring data quality throughout your pipeline.
The toolkit's core value proposition is seamless validated data I/O: load data with automatic validation, save data with automatic validation, and detect schema drift without writing manual validation code.
Key Features
- ValidatedDataCatalog: Main abstraction combining Kedro DataCatalog (I/O) with automatic data validation. Factory method: ValidatedDataCatalog.in_directory(path).
- Dual Validator Support: Choose between Great Expectations (GX) for powerful expectation suites or Pandera for lightweight Python-based schema validation.
- Auto Schema Detection: Automatically generates a schema on the first data load and enforces it afterwards. The schema is "frozen" to prevent silent data drift in subsequent operations.
- Processing Pipeline: Chainable, reusable data transformation steps with a fluent API.
- Flexible Logging: Unified logging interface with support for Python logging and Loguru backends.
- Hydra Configuration: YAML-based hierarchical configuration management for reproducible pipelines.
- Cloud Support: AWS, GCP, and Azure data context support for Great Expectations.
- CLI Tools: Scaffold catalog structure with a single command: adc-toolkit init-catalog.
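The "frozen" schema idea from the feature list can be illustrated with a minimal standard-library sketch. It is not the toolkit's implementation: the function names, the JSON schema file, and the list-of-dicts data format are all assumptions made for illustration. The schema is inferred and written on the first load, and every later load is compared against it.

```python
import json
import tempfile
from pathlib import Path


def infer_schema(rows: list[dict]) -> dict[str, str]:
    """Map each column name to the type name of its value in the first row."""
    return {column: type(value).__name__ for column, value in rows[0].items()}


def validate_against_frozen(rows: list[dict], schema_file: Path) -> list[str]:
    """Freeze the schema on first use; afterwards return a list of drift problems."""
    current = infer_schema(rows)
    if not schema_file.exists():
        # First load: persist ("freeze") the inferred schema.
        schema_file.write_text(json.dumps(current, indent=2))
        return []
    frozen = json.loads(schema_file.read_text())
    problems = []
    for column, expected in frozen.items():
        if column not in current:
            problems.append(f"missing column: {column}")
        elif current[column] != expected:
            problems.append(f"{column}: expected {expected}, got {current[column]}")
    problems += [f"unexpected column: {c}" for c in current if c not in frozen]
    return problems


# Hypothetical schema location and toy data, used only for this sketch.
schema_path = Path(tempfile.mkdtemp()) / "my_dataset.schema.json"
first_load = [{"id": 1, "amount": 9.5}]
drifted_load = [{"id": "1", "amount": 9.5, "note": "x"}]

first_result = validate_against_frozen(first_load, schema_path)
drift_result = validate_against_frozen(drifted_load, schema_path)
print(first_result)
print(drift_result)
```

The first call reports no problems because it only writes the schema. The second call reports the changed type of id and the extra note column, which is the kind of silent drift a frozen schema is meant to surface.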
Quick Start
Installation
# Basic installation
pip install adc-toolkit
# With Kedro and Great Expectations (recommended)
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "adc-toolkit[kedro,gx]"
# With Kedro and Pandera
pip install "adc-toolkit[kedro,pandera]"
Initialize Catalog Structure
adc-toolkit init-catalog ./config
Load and Save Data with Validation
from adc_toolkit.data.catalog import ValidatedDataCatalog
catalog = ValidatedDataCatalog.in_directory("./config")
df = catalog.load("my_dataset") # Validated after loading
catalog.save("processed_data", df) # Validated before saving
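To make the ordering in the comments above concrete (validate after loading, validate before saving), here is a rough sketch of a wrapper around a plain I/O catalog. InMemoryCatalog, ValidatedCatalogSketch, and require_non_empty are hypothetical names invented for this example; the real ValidatedDataCatalog may be structured differently.

```python
from typing import Any, Callable


class InMemoryCatalog:
    """Hypothetical stand-in for an I/O catalog: keeps datasets in a dict."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._store[name]

    def save(self, name: str, data: Any) -> None:
        self._store[name] = data


class ValidatedCatalogSketch:
    """Wraps a catalog so that every load and save passes through a validator."""

    def __init__(self, catalog: InMemoryCatalog, validate: Callable[[str, Any], Any]) -> None:
        self._catalog = catalog
        self._validate = validate

    def load(self, name: str) -> Any:
        data = self._catalog.load(name)
        return self._validate(name, data)  # validated after loading

    def save(self, name: str, data: Any) -> None:
        checked = self._validate(name, data)  # validated before saving
        self._catalog.save(name, checked)


def require_non_empty(name: str, data: Any) -> Any:
    """Toy validation rule: reject empty datasets."""
    if len(data) == 0:
        raise ValueError(f"dataset {name!r} is empty")
    return data


store = InMemoryCatalog()
store.save("my_dataset", [1, 2, 3])
catalog = ValidatedCatalogSketch(store, require_non_empty)

loaded = catalog.load("my_dataset")
try:
    catalog.save("processed_data", [])  # validation fails, so nothing is written
    rejected = False
except ValueError:
    rejected = True
print(loaded, rejected)
```

Because validation runs before the underlying save, invalid data never reaches storage in this sketch.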
Modules
- data: Core data module providing the ValidatedDataCatalog for validated data pipelines. Includes Kedro catalog integration and validators (Great Expectations, Pandera, NoValidator).
- logger: Flexible logging infrastructure with a unified interface. Supports both standard Python logging and Loguru backends. Main export: Logger.
- processing: Data processing utilities with chainable transformation pipelines. Main classes: ProcessingPipeline, PipelineStep.
- eda: Exploratory data analysis utilities for time series and cross-sectional data. Uses Hydra for configuration-driven EDA. Note: This module is partially implemented and shows a FutureWarning on import.
- cli: Command-line interface for toolkit operations. Provides the init-catalog command for scaffolding configuration directories.
- configuration: Hydra-based configuration management with base templates and local overrides. Configuration files are in YAML format.
- utils: Shared utility functions including custom exceptions, configuration loaders, module management, and filesystem operations.
Examples
Basic Usage
from adc_toolkit.data.catalog import ValidatedDataCatalog
catalog = ValidatedDataCatalog.in_directory("./config")
df = catalog.load("customer_data") # Automatically validated
processed = df.dropna()
catalog.save("clean_customers", processed) # Validated before saving
Custom Validators
Using Pandera validator:
from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator
catalog = ValidatedDataCatalog.in_directory("./config", validator_class=PanderaValidator)
Using Great Expectations validator:
from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.validators.gx import GXValidator
catalog = ValidatedDataCatalog.in_directory("./config", validator_class=GXValidator)
Processing Pipeline
from adc_toolkit.processing.pipeline import ProcessingPipeline
from adc_toolkit.processing.steps.pandas import (
    remove_duplicates,
    fill_missing_values,
    make_columns_snake_case,
)
pipeline = ProcessingPipeline()
pipeline = (
    pipeline.add(remove_duplicates, subset=["id"])
    .add(fill_missing_values, method="median")
    .add(make_columns_snake_case)
)
processed_df = pipeline.run(df)
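The chaining shown above works when add returns the pipeline itself. The following sketch shows one way such a fluent API could be built; MiniPipeline and both step functions are hypothetical and operate on lists of dicts rather than DataFrames, so they say nothing definite about how ProcessingPipeline is implemented.

```python
from functools import partial
from typing import Any, Callable


class MiniPipeline:
    """Collects steps in order and applies them one after another."""

    def __init__(self) -> None:
        self._steps: list[Callable[[Any], Any]] = []

    def add(self, func: Callable[..., Any], **kwargs: Any) -> "MiniPipeline":
        # Bind keyword arguments now; the data is supplied later by run().
        self._steps.append(partial(func, **kwargs))
        return self  # returning self is what makes chaining possible

    def run(self, data: Any) -> Any:
        for step in self._steps:
            data = step(data)
        return data


def snake_case_keys(rows: list[dict]) -> list[dict]:
    """Rename keys such as 'Full Name' to 'full_name'."""
    return [
        {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
        for row in rows
    ]


def drop_duplicates(rows: list[dict], key: str) -> list[dict]:
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


rows = [
    {"Id": 1, "Full Name": "Ada"},
    {"Id": 1, "Full Name": "Ada"},
    {"Id": 2, "Full Name": "Alan"},
]
result = MiniPipeline().add(snake_case_keys).add(drop_duplicates, key="id").run(rows)
print(result)
```

Step order matters here: drop_duplicates looks up the lowercase id key, so it has to run after the renaming step.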
Logging
from adc_toolkit.logger import Logger
logger = Logger()
logger.info("Processing started")
Logger.set_level("debug") # Set global log level
CLI Usage
# Initialize catalog structure
adc-toolkit init-catalog ./config
# Overwrite existing files
adc-toolkit init-catalog ./config --overwrite
# Skip credentials file
adc-toolkit init-catalog ./config --no-credentials
Complete Pipeline
from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.processing.pipeline import ProcessingPipeline
from adc_toolkit.processing.steps.pandas import remove_duplicates
from adc_toolkit.logger import Logger
logger = Logger()
logger.info("Starting data pipeline")
# Load with validation
catalog = ValidatedDataCatalog.in_directory("./config")
raw_data = catalog.load("sales_raw")
# Process
pipeline = ProcessingPipeline().add(remove_duplicates)
clean_data = pipeline.run(raw_data)
# Save with validation
catalog.save("sales_clean", clean_data)
logger.info("Pipeline complete")
See Also
- adc_toolkit.data.catalog.ValidatedDataCatalog: Main data catalog API.
- adc_toolkit.data.catalogs.kedro.KedroDataCatalog: Kedro catalog implementation.
- adc_toolkit.data.validators.gx.GXValidator: Great Expectations validator.
- adc_toolkit.data.validators.pandera.PanderaValidator: Pandera validator.
- adc_toolkit.logger.Logger: Logging interface.
- adc_toolkit.processing.pipeline.ProcessingPipeline: Data transformation pipelines.
- adc_toolkit.cli.main: CLI entry point.
Notes
Optional Dependencies
The toolkit uses optional dependency groups to keep the base installation lightweight. Install only what you need:
| Group | Description |
|---|---|
| kedro | Kedro DataCatalog for data I/O (required for ValidatedDataCatalog) |
| gx | Great Expectations validation |
| pandera | Pandera validation |
| eda | Exploratory data analysis tools |
| spark | PySpark support |
| gcp | Google Cloud Platform integration |
| logging | Loguru logging backend |
| preprocessing | scikit-learn transformations |
Install the groups you need together, for example: pip install "adc-toolkit[kedro,gx]" (the quotes keep shells such as zsh from expanding the brackets).
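Since features depend on which groups are installed, it can help to check for the underlying packages before running a pipeline. The sketch below uses importlib from the standard library. The mapping from group names to importable module names is an assumption based on each package's usual import name, not something defined by the toolkit, and missing_groups is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from dependency group to the top-level module it provides.
GROUP_MODULES = {
    "kedro": "kedro",
    "gx": "great_expectations",
    "pandera": "pandera",
    "spark": "pyspark",
    "logging": "loguru",
    "preprocessing": "sklearn",
}


def missing_groups(required: list[str], modules: dict[str, str] = GROUP_MODULES) -> list[str]:
    """Return the groups whose module cannot be found in this environment."""
    return [group for group in required if find_spec(modules[group]) is None]


missing = missing_groups(["kedro", "gx"])
if missing:
    print("Missing optional groups:", ", ".join(missing))
else:
    print("All requested optional groups are available.")
```

find_spec only looks the module up and does not import it, so the check stays cheap and has no import side effects.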
Design Patterns
- Factory Pattern: Use in_directory(path) class methods to instantiate catalogs and validators from configuration directories.
- Strategy Pattern: Swap catalog and validator implementations without changing downstream code.
- Protocol-based Design: Type safety through structural subtyping (PEP 544).
- Dependency Injection: Pass catalog and validator as constructor arguments.
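The four patterns above can be seen working together in a small sketch. Every class here (Validator, PassThrough, RejectEmpty, CatalogSketch) is hypothetical and only mirrors the shape of the in_directory(path, validator_class=...) call used in the examples; the toolkit's actual protocols and signatures may differ.

```python
from pathlib import Path
from typing import Any, Protocol


class Validator(Protocol):
    """Structural type (PEP 544): any class with these methods qualifies."""

    @classmethod
    def in_directory(cls, path: str) -> "Validator": ...

    def validate(self, name: str, data: Any) -> Any: ...


class PassThrough:
    """Strategy 1: accept everything. Does not inherit from Validator."""

    def __init__(self, config_dir: Path) -> None:
        self.config_dir = config_dir

    @classmethod
    def in_directory(cls, path: str) -> "PassThrough":
        return cls(Path(path))  # factory: build from a configuration directory

    def validate(self, name: str, data: Any) -> Any:
        return data


class RejectEmpty:
    """Strategy 2: refuse empty datasets."""

    def __init__(self, config_dir: Path) -> None:
        self.config_dir = config_dir

    @classmethod
    def in_directory(cls, path: str) -> "RejectEmpty":
        return cls(Path(path))

    def validate(self, name: str, data: Any) -> Any:
        if len(data) == 0:
            raise ValueError(f"dataset {name!r} is empty")
        return data


class CatalogSketch:
    def __init__(self, validator: Validator) -> None:
        self.validator = validator  # dependency injection via the constructor

    @classmethod
    def in_directory(
        cls, path: str, validator_class: type[Validator] = PassThrough
    ) -> "CatalogSketch":
        return cls(validator_class.in_directory(path))

    def check(self, name: str, data: Any) -> Any:
        return self.validator.validate(name, data)


lenient = CatalogSketch.in_directory("./config")
strict = CatalogSketch.in_directory("./config", validator_class=RejectEmpty)

lenient_result = lenient.check("my_dataset", [])
try:
    strict.check("my_dataset", [])
    strict_rejected = False
except ValueError:
    strict_rejected = True
print(lenient_result, strict_rejected)
```

Neither validator subclasses the protocol; a static type checker accepts them because their methods match, and the calling code stays the same whichever strategy is chosen.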
Configuration Structure
The toolkit expects this configuration directory structure:
config/
├── base/
│   ├── globals.yml      # Global variables (bucket paths, dataset types)
│   └── catalog.yml      # Kedro dataset definitions
└── local/
    └── credentials.yml  # Secrets and credentials (gitignored)
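The base/local split implies that local values are layered on top of base values. As a rough illustration of that layering idea only (not of how Kedro or Hydra actually resolve configuration), here is a recursive merge over plain dictionaries. The function name and the sample keys are hypothetical.

```python
from copy import deepcopy


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for parsed base/ and local/ YAML files.
base = {"bucket": "s3://example-bucket", "storage": {"region": "eu-west-1", "token": None}}
local = {"storage": {"token": "local-secret"}}

config = merge_config(base, local)
print(config)
```

Only the overridden token changes; the bucket and region from the base layer are kept, and neither input dictionary is modified.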
Limitations
- The EDA module is partially implemented and shows a FutureWarning on import.
- Great Expectations is constrained to version <1.0.0 due to API changes in GX 1.0.
- The kedro optional dependency is required for ValidatedDataCatalog to function.
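For the FutureWarning mentioned in the first limitation, Python's warnings module can capture or silence a warning within a limited scope. In the sketch below, import_partial_module is a hypothetical stand-in that raises the same warning category; in practice the import of the EDA module would go inside the with block instead.

```python
import warnings


def import_partial_module() -> str:
    """Hypothetical stand-in for importing a module that warns on import."""
    warnings.warn("this module is partially implemented", FutureWarning)
    return "module"


# Capture the warning so it can be inspected, for example in a test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    import_partial_module()

# Silence only FutureWarning inside this block. record=True is used here
# solely so the sketch can show that nothing was emitted.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("ignore", FutureWarning)
    import_partial_module()

print(len(caught), caught[0].category.__name__, len(silenced))
```

catch_warnings restores the previous filters on exit, so warnings raised elsewhere in the program are not affected.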
Version Information
- Current version: 1.1.0
- Python support: 3.10, 3.11, 3.12, 3.13