adc_toolkit

ADC Toolkit: A Python Framework for Validated Data Pipelines

The adc-toolkit provides a structured approach to data handling in data science and machine learning projects. It combines configuration-driven data catalogs with automatic schema validation, ensuring data quality throughout your pipeline.

The toolkit's core value proposition is seamless validated data I/O: load data with automatic validation, save data with automatic validation, and detect schema drift without writing manual validation code.


Key Features

  • ValidatedDataCatalog: Main abstraction combining Kedro DataCatalog (I/O) with automatic data validation. Factory method: ValidatedDataCatalog.in_directory(path).
  • Dual Validator Support: Choose between Great Expectations (GX) for powerful expectation suites or Pandera for lightweight Python-based schema validation.
  • Auto Schema Detection: Automatically generates and enforces schemas on first data load. Schema is "frozen" to prevent silent data drift in subsequent operations.
  • Processing Pipeline: Chainable, reusable data transformation steps with a fluent API.
  • Flexible Logging: Unified logging interface with support for Python logging and Loguru backends.
  • Hydra Configuration: YAML-based hierarchical configuration management for reproducible pipelines.
  • Cloud Support: AWS, GCP, and Azure data context support for Great Expectations.
  • CLI Tools: Scaffold catalog structure with a single command: adc-toolkit init-catalog.
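
To make the "frozen schema" idea from the Auto Schema Detection feature concrete, here is a minimal sketch using only the standard library. It is not the toolkit's implementation: `infer_schema`, `check_against_frozen`, and the list-of-dicts data representation are hypothetical names chosen for illustration. The schema is inferred once from the first load, and every later load is compared against that fixed snapshot.

```python
from typing import Any

Schema = dict[str, str]  # column name -> Python type name


def infer_schema(rows: list[dict[str, Any]]) -> Schema:
    """Derive a column -> type-name mapping from the first row (hypothetical helper)."""
    if not rows:
        raise ValueError("cannot infer a schema from empty data")
    return {column: type(value).__name__ for column, value in rows[0].items()}


def check_against_frozen(rows: list[dict[str, Any]], frozen: Schema) -> list[str]:
    """Return readable drift problems; an empty list means the data conforms."""
    problems: list[str] = []
    for index, row in enumerate(rows):
        # Columns that disappeared or appeared since the schema was frozen.
        for column in sorted(frozen.keys() - row.keys()):
            problems.append(f"row {index}: missing column {column!r}")
        for column in sorted(row.keys() - frozen.keys()):
            problems.append(f"row {index}: unexpected column {column!r}")
        # Columns present in both: compare the value type with the frozen type.
        for column in sorted(frozen.keys() & row.keys()):
            actual = type(row[column]).__name__
            if actual != frozen[column]:
                problems.append(
                    f"row {index}: column {column!r} is {actual}, expected {frozen[column]}"
                )
    return problems


# First load: the schema is inferred and then treated as frozen.
first_load = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}]
frozen_schema = infer_schema(first_load)

# Later load: "id" silently became a string, which the frozen schema catches.
later_load = [{"id": "3", "amount": 4.0}]
drift = check_against_frozen(later_load, frozen_schema)
```

In this sketch the drift is reported as a list rather than raised, which keeps the example easy to inspect; a real validator would more likely fail the load or save.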

Quick Start

Installation

# Basic installation
pip install adc-toolkit

# With Kedro and Great Expectations (recommended)
pip install adc-toolkit[kedro,gx]

# With Kedro and Pandera
pip install adc-toolkit[kedro,pandera]

Initialize Catalog Structure

adc-toolkit init-catalog ./config

Load and Save Data with Validation

from adc_toolkit.data.catalog import ValidatedDataCatalog

catalog = ValidatedDataCatalog.in_directory("./config")
df = catalog.load("my_dataset")  # Validated after loading
catalog.save("processed_data", df)  # Validated before saving

Modules

  • data: Core data module providing the ValidatedDataCatalog for validated data pipelines. Includes Kedro catalog integration and validators (Great Expectations, Pandera, NoValidator).
  • logger: Flexible logging infrastructure with a unified interface. Supports both standard Python logging and Loguru backends. Main export: Logger.
  • processing: Data processing utilities with chainable transformation pipelines. Main classes: ProcessingPipeline, PipelineStep.
  • eda: Exploratory data analysis utilities for time series and cross-sectional data. Uses Hydra for configuration-driven EDA. Note: This module is partially implemented and shows a FutureWarning on import.
  • cli: Command-line interface for toolkit operations. Provides init-catalog command for scaffolding configuration directories.
  • configuration: Hydra-based configuration management with base templates and local overrides. Configuration files are in YAML format.
  • utils: Shared utility functions including custom exceptions, configuration loaders, module management, and filesystem operations.

Examples

Basic Usage

from adc_toolkit.data.catalog import ValidatedDataCatalog

catalog = ValidatedDataCatalog.in_directory("./config")
df = catalog.load("customer_data")  # Automatically validated
processed = df.dropna()
catalog.save("clean_customers", processed)  # Validated before saving

Custom Validators

Using Pandera validator:

from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator

catalog = ValidatedDataCatalog.in_directory("./config", validator_class=PanderaValidator)

Using Great Expectations validator:

from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.validators.gx import GXValidator

catalog = ValidatedDataCatalog.in_directory("./config", validator_class=GXValidator)

Processing Pipeline

from adc_toolkit.processing.pipeline import ProcessingPipeline
from adc_toolkit.processing.steps.pandas import (
    remove_duplicates,
    fill_missing_values,
    make_columns_snake_case,
)

pipeline = ProcessingPipeline()
pipeline = (
    pipeline.add(remove_duplicates, subset=["id"])
    .add(fill_missing_values, method="median")
    .add(make_columns_snake_case)
)
processed_df = pipeline.run(df)
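
The chaining used above can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the toolkit's `ProcessingPipeline`: `MiniPipeline`, `drop_duplicates`, and `scale` are hypothetical, and whether the real `add` returns the same object or a new one is not specified here.

```python
from typing import Any, Callable


class MiniPipeline:
    """Hypothetical stand-in that illustrates a fluent, chainable pipeline."""

    def __init__(self) -> None:
        self._steps: list[tuple[Callable[..., Any], dict[str, Any]]] = []

    def add(self, func: Callable[..., Any], **kwargs: Any) -> "MiniPipeline":
        # Record the step with its keyword arguments; returning self enables chaining.
        self._steps.append((func, kwargs))
        return self

    def run(self, data: Any) -> Any:
        # Feed the output of each step into the next one, in insertion order.
        for func, kwargs in self._steps:
            data = func(data, **kwargs)
        return data


def drop_duplicates(values: list[int]) -> list[int]:
    # dict.fromkeys keeps the first occurrence of each value and preserves order.
    return list(dict.fromkeys(values))


def scale(values: list[int], factor: int) -> list[int]:
    return [value * factor for value in values]


result = MiniPipeline().add(drop_duplicates).add(scale, factor=10).run([1, 1, 2, 3, 3])
```

Storing each step as a function plus keyword arguments is what makes steps reusable: the same function can be added to several pipelines with different parameters.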

Logging

from adc_toolkit.logger import Logger

logger = Logger()
logger.info("Processing started")
Logger.set_level("debug")  # Set global log level

CLI Usage

# Initialize catalog structure
adc-toolkit init-catalog ./config

# Overwrite existing files
adc-toolkit init-catalog ./config --overwrite

# Skip credentials file
adc-toolkit init-catalog ./config --no-credentials

Complete Pipeline

from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.processing.pipeline import ProcessingPipeline
from adc_toolkit.processing.steps.pandas import remove_duplicates
from adc_toolkit.logger import Logger

logger = Logger()
logger.info("Starting data pipeline")

# Load with validation
catalog = ValidatedDataCatalog.in_directory("./config")
raw_data = catalog.load("sales_raw")

# Process
pipeline = ProcessingPipeline().add(remove_duplicates)
clean_data = pipeline.run(raw_data)

# Save with validation
catalog.save("sales_clean", clean_data)
logger.info("Pipeline complete")

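See Also

  • adc_toolkit.data.catalog.ValidatedDataCatalog: Main data catalog API.
  • adc_toolkit.data.catalogs.kedro.KedroDataCatalog: Kedro catalog implementation.
  • adc_toolkit.data.validators.gx.GXValidator: Great Expectations validator.
  • adc_toolkit.data.validators.pandera.PanderaValidator: Pandera validator.
  • adc_toolkit.logger.Logger: Logging interface.
  • adc_toolkit.processing.pipeline.ProcessingPipeline: Data transformation pipelines.
  • adc_toolkit.cli.main: CLI entry point.
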
Notes

Optional Dependencies

The toolkit uses optional dependency groups to keep the base installation lightweight. Install only what you need:

| Group           | Description                                                        |
|-----------------|--------------------------------------------------------------------|
| `kedro`         | Kedro DataCatalog for data I/O (required for ValidatedDataCatalog) |
| `gx`            | Great Expectations validation                                      |
| `pandera`       | Pandera validation                                                 |
| `eda`           | Exploratory data analysis tools                                    |
| `spark`         | PySpark support                                                    |
| `gcp`           | Google Cloud Platform integration                                  |
| `logging`       | Loguru logging backend                                             |
| `preprocessing` | scikit-learn transformations                                       |

Install with: pip install adc-toolkit[kedro,gx] or similar.
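
Because the extras are optional, it can help to check up front whether the packages behind them are importable. The sketch below uses only `importlib` from the standard library; `missing_modules` is a hypothetical helper, and the mapping from the `kedro` and `gx` extras to the import names `kedro` and `great_expectations` is an assumption about what those groups install.

```python
from importlib.util import find_spec


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level modules from module_names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the `kedro` and `gx` extras.
required = ["kedro", "great_expectations"]
missing = missing_modules(required)
if missing:
    print(f"Missing optional dependencies: {', '.join(missing)}")
```

`find_spec` only looks the module up without importing it, so the check stays cheap even for heavy packages.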

Design Patterns

  • Factory Pattern: Use in_directory(path) class methods to instantiate catalogs and validators from configuration directories.
  • Strategy Pattern: Swap catalog and validator implementations without changing downstream code.
  • Protocol-based Design: Type safety through structural subtyping (PEP 544).
  • Dependency Injection: Pass catalog and validator as constructor arguments.
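
The Strategy, Protocol, and Dependency Injection patterns listed above combine naturally. The following is a minimal sketch, not the toolkit's code: `Validator`, `PassThroughValidator`, `NonEmptyValidator`, and `InMemoryCatalog` are hypothetical names, and the `validate(name, data)` signature is an assumption made for illustration.

```python
from typing import Any, Protocol


class Validator(Protocol):
    """Structural interface (PEP 544): any object with a matching validate() fits."""

    def validate(self, name: str, data: Any) -> Any: ...


class PassThroughValidator:
    """Strategy 1: accept everything unchanged."""

    def validate(self, name: str, data: Any) -> Any:
        return data


class NonEmptyValidator:
    """Strategy 2: reject empty datasets."""

    def validate(self, name: str, data: Any) -> Any:
        if not data:
            raise ValueError(f"dataset {name!r} is empty")
        return data


class InMemoryCatalog:
    """Hypothetical catalog that receives its validator through the constructor."""

    def __init__(self, validator: Validator) -> None:
        self._validator = validator
        self._store: dict[str, Any] = {}

    def save(self, name: str, data: Any) -> None:
        # Validate before saving.
        self._store[name] = self._validator.validate(name, data)

    def load(self, name: str) -> Any:
        # Validate after loading.
        return self._validator.validate(name, self._store[name])


lenient = InMemoryCatalog(PassThroughValidator())
lenient.save("events", [])

strict = InMemoryCatalog(NonEmptyValidator())
strict.save("events", [1, 2])
```

Neither validator inherits from `Validator`; they satisfy it purely by shape, which is what lets implementations be swapped without touching the catalog.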

Configuration Structure

The toolkit expects this configuration directory structure:

config/
├── base/
│   ├── globals.yml      # Global variables (bucket paths, dataset types)
│   └── catalog.yml      # Kedro dataset definitions
└── local/
    └── credentials.yml  # Secrets and credentials (gitignored)
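
A quick way to confirm that a directory matches this layout is to check for the three files. This is a sketch using `pathlib` only; `EXPECTED_FILES` and `missing_config_files` are hypothetical names and not part of the toolkit.

```python
from pathlib import Path

# Relative paths taken from the directory tree shown above.
EXPECTED_FILES = (
    Path("base") / "globals.yml",
    Path("base") / "catalog.yml",
    Path("local") / "credentials.yml",
)


def missing_config_files(config_dir: Path) -> list[Path]:
    """Return the expected files (relative paths) that are absent under config_dir."""
    root = Path(config_dir)
    return [relative for relative in EXPECTED_FILES if not (root / relative).is_file()]


for relative in missing_config_files(Path("./config")):
    print(f"missing: {relative}")
```

Note that a directory scaffolded with `--no-credentials` will report `local/credentials.yml` as missing, which is expected in that case.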

Limitations

  • The EDA module is partially implemented and shows a FutureWarning on import.
  • Great Expectations is constrained to version <1.0.0 due to API changes in GX 1.0.
  • The kedro optional dependency is required for ValidatedDataCatalog to function.

Version Information

  • Current version: 1.1.0
  • Python support: 3.10, 3.11, 3.12, 3.13

__version__ = "1.1.0"

__doc__ = f"""# ADC Toolkit: A Python Framework for Validated Data Pipelines

The adc-toolkit provides a structured approach to data handling in data science
and machine learning projects. It combines configuration-driven data catalogs
with automatic schema validation, ensuring data quality throughout your pipeline.

The toolkit's core value proposition is seamless validated data I/O: load data
with automatic validation, save data with automatic validation, and detect
schema drift without writing manual validation code.

---

## Key Features

- **ValidatedDataCatalog**: Main abstraction combining Kedro DataCatalog (I/O)
  with automatic data validation.
  Factory method: `ValidatedDataCatalog.in_directory(path)`.
- **Dual Validator Support**: Choose between Great Expectations (GX) for
  powerful expectation suites or Pandera for lightweight Python-based schema
  validation.
- **Auto Schema Detection**: Automatically generates and enforces schemas on
  first data load. Schema is "frozen" to prevent silent data drift in
  subsequent operations.
- **Processing Pipeline**: Chainable, reusable data transformation steps with
  a fluent API.
- **Flexible Logging**: Unified logging interface with support for Python
  logging and Loguru backends.
- **Hydra Configuration**: YAML-based hierarchical configuration management
  for reproducible pipelines.
- **Cloud Support**: AWS, GCP, and Azure data context support for Great
  Expectations.
- **CLI Tools**: Scaffold catalog structure with a single command:
  `adc-toolkit init-catalog`.

---

## Quick Start

### Installation

```bash
# Basic installation
pip install adc-toolkit

# With Kedro and Great Expectations (recommended)
pip install adc-toolkit[kedro,gx]

# With Kedro and Pandera
pip install adc-toolkit[kedro,pandera]
```

### Initialize Catalog Structure

```bash
adc-toolkit init-catalog ./config
```

### Load and Save Data with Validation

```python
from adc_toolkit.data.catalog import ValidatedDataCatalog

catalog = ValidatedDataCatalog.in_directory("./config")
df = catalog.load("my_dataset")  # Validated after loading
catalog.save("processed_data", df)  # Validated before saving
```

---

## Modules

- **data**: Core data module providing the ValidatedDataCatalog for validated
  data pipelines. Includes Kedro catalog integration and validators (Great
  Expectations, Pandera, NoValidator).
- **logger**: Flexible logging infrastructure with a unified interface.
  Supports both standard Python logging and Loguru backends.
  Main export: `Logger`.
- **processing**: Data processing utilities with chainable transformation
  pipelines. Main classes: `ProcessingPipeline`, `PipelineStep`.
- **eda**: Exploratory data analysis utilities for time series and
  cross-sectional data. Uses Hydra for configuration-driven EDA.
  Note: This module is partially implemented and shows a FutureWarning on
  import.
- **cli**: Command-line interface for toolkit operations. Provides
  `init-catalog` command for scaffolding configuration directories.
- **configuration**: Hydra-based configuration management with base templates
  and local overrides. Configuration files are in YAML format.
- **utils**: Shared utility functions including custom exceptions,
  configuration loaders, module management, and filesystem operations.

---

## Examples

### Basic Usage

```python
from adc_toolkit.data.catalog import ValidatedDataCatalog

catalog = ValidatedDataCatalog.in_directory("./config")
df = catalog.load("customer_data")  # Automatically validated
processed = df.dropna()
catalog.save("clean_customers", processed)  # Validated before saving
```

### Custom Validators

**Using Pandera validator:**

```python
from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.validators.pandera import PanderaValidator

catalog = ValidatedDataCatalog.in_directory("./config", validator_class=PanderaValidator)
```

**Using Great Expectations validator:**

```python
from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.data.validators.gx import GXValidator

catalog = ValidatedDataCatalog.in_directory("./config", validator_class=GXValidator)
```

### Processing Pipeline

```python
from adc_toolkit.processing.pipeline import ProcessingPipeline
from adc_toolkit.processing.steps.pandas import (
    remove_duplicates,
    fill_missing_values,
    make_columns_snake_case,
)

pipeline = ProcessingPipeline()
pipeline = (
    pipeline.add(remove_duplicates, subset=["id"])
    .add(fill_missing_values, method="median")
    .add(make_columns_snake_case)
)
processed_df = pipeline.run(df)
```

### Logging

```python
from adc_toolkit.logger import Logger

logger = Logger()
logger.info("Processing started")
Logger.set_level("debug")  # Set global log level
```

### CLI Usage

```bash
# Initialize catalog structure
adc-toolkit init-catalog ./config

# Overwrite existing files
adc-toolkit init-catalog ./config --overwrite

# Skip credentials file
adc-toolkit init-catalog ./config --no-credentials
```

### Complete Pipeline

```python
from adc_toolkit.data.catalog import ValidatedDataCatalog
from adc_toolkit.processing.pipeline import ProcessingPipeline
from adc_toolkit.processing.steps.pandas import remove_duplicates
from adc_toolkit.logger import Logger

logger = Logger()
logger.info("Starting data pipeline")

# Load with validation
catalog = ValidatedDataCatalog.in_directory("./config")
raw_data = catalog.load("sales_raw")

# Process
pipeline = ProcessingPipeline().add(remove_duplicates)
clean_data = pipeline.run(raw_data)

# Save with validation
catalog.save("sales_clean", clean_data)
logger.info("Pipeline complete")
```

---

## See Also

- `adc_toolkit.data.catalog.ValidatedDataCatalog`: Main data catalog API.
- `adc_toolkit.data.catalogs.kedro.KedroDataCatalog`: Kedro catalog implementation.
- `adc_toolkit.data.validators.gx.GXValidator`: Great Expectations validator.
- `adc_toolkit.data.validators.pandera.PanderaValidator`: Pandera validator.
- `adc_toolkit.logger.Logger`: Logging interface.
- `adc_toolkit.processing.pipeline.ProcessingPipeline`: Data transformation pipelines.
- `adc_toolkit.cli.main`: CLI entry point.

---

## Notes

### Optional Dependencies

The toolkit uses optional dependency groups to keep the base installation
lightweight. Install only what you need:

| Group           | Description                                              |
|-----------------|----------------------------------------------------------|
| `kedro`         | Kedro DataCatalog for data I/O (required for ValidatedDataCatalog) |
| `gx`            | Great Expectations validation                            |
| `pandera`       | Pandera validation                                       |
| `eda`           | Exploratory data analysis tools                          |
| `spark`         | PySpark support                                          |
| `gcp`           | Google Cloud Platform integration                        |
| `logging`       | Loguru logging backend                                   |
| `preprocessing` | scikit-learn transformations                             |

Install with: `pip install adc-toolkit[kedro,gx]` or similar.

### Design Patterns

- **Factory Pattern**: Use `in_directory(path)` class methods to instantiate
  catalogs and validators from configuration directories.
- **Strategy Pattern**: Swap catalog and validator implementations without
  changing downstream code.
- **Protocol-based Design**: Type safety through structural subtyping (PEP 544).
- **Dependency Injection**: Pass catalog and validator as constructor arguments.

### Configuration Structure

The toolkit expects this configuration directory structure:

```
config/
├── base/
│   ├── globals.yml      # Global variables (bucket paths, dataset types)
│   └── catalog.yml      # Kedro dataset definitions
└── local/
    └── credentials.yml  # Secrets and credentials (gitignored)
```

### Limitations

- The EDA module is partially implemented and shows a FutureWarning on import.
- Great Expectations is constrained to version <1.0.0 due to API changes in GX 1.0.
- The `kedro` optional dependency is required for ValidatedDataCatalog to function.

### Version Information

- **Current version**: {__version__}
- **Python support**: 3.10, 3.11, 3.12, 3.13
"""