adc_toolkit.data.catalogs
Data catalog implementations for the adc-toolkit.
This module provides concrete implementations of the DataCatalog protocol, enabling configuration-driven data I/O operations in ML and analytics pipelines. Data catalogs abstract away the details of data storage, file formats, and access patterns, providing a simple name-based API for loading and saving datasets.
The module currently includes Kedro-based catalog implementations, which support YAML-based dataset definitions, multiple file formats (CSV, Parquet, JSON, Excel, Pickle, HDF5), diverse storage backends (local filesystem, S3, GCS, Azure Blob), and advanced features like versioning, partitioning, and dynamic SQL queries.
Submodules
kedro: Kedro-based data catalog implementation with the KedroDataCatalog class and scaffolding utilities for creating catalog directory structures.
See Also
adc_toolkit.data.catalogs.kedro.KedroDataCatalog: Main Kedro catalog class.
adc_toolkit.data.abs.DataCatalog: Protocol definition for catalogs.
adc_toolkit.data.catalog.ValidatedDataCatalog: Catalog with automatic validation.
Notes
Data catalogs provide several key benefits:
- Configuration-Driven: Dataset locations, formats, and parameters are defined in YAML files rather than hardcoded in application logic.
- Environment Flexibility: Dev, staging, and production environments can each use their own configuration without code changes.
- Reproducibility: Version-controlled configurations ensure consistent data access across team members and deployments.
- Testability: Catalogs are easy to mock or swap in unit tests.
- Separation of Concerns: Data access logic is decoupled from business logic.
The catalog pattern is particularly valuable in data science and ML workflows where data sources, formats, and locations frequently change across environments and project phases.
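The testability and separation-of-concerns points can be illustrated with a small in-memory stand-in. This is a sketch only: it assumes a catalog needs just name-based `load` and `save` methods, as the examples in this module suggest. `CatalogLike`, `InMemoryCatalog`, and `normalize_names` are hypothetical names and are not part of adc-toolkit.

```python
from typing import Any, Dict, Optional, Protocol


class CatalogLike(Protocol):
    """Assumed minimal catalog shape: name-based load and save."""

    def load(self, name: str) -> Any: ...

    def save(self, name: str, data: Any) -> None: ...


class InMemoryCatalog:
    """Hypothetical test double that keeps datasets in a dict."""

    def __init__(self, datasets: Optional[Dict[str, Any]] = None) -> None:
        self._datasets: Dict[str, Any] = dict(datasets or {})

    def load(self, name: str) -> Any:
        if name not in self._datasets:
            raise KeyError(f"Dataset {name!r} is not registered")
        return self._datasets[name]

    def save(self, name: str, data: Any) -> None:
        self._datasets[name] = data


def normalize_names(catalog: CatalogLike) -> None:
    """Business logic that depends only on the catalog interface."""
    raw = catalog.load("raw_names")
    catalog.save("clean_names", [name.strip().lower() for name in raw])


# In a unit test, the in-memory catalog replaces a file-backed one.
catalog = InMemoryCatalog({"raw_names": ["  Ada ", "GRACE"]})
normalize_names(catalog)
print(catalog.load("clean_names"))
```

Because `normalize_names` touches data only through `load` and `save`, the same function could run against a configuration-backed catalog in production without modification.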
Catalog Configuration Structure
Catalogs typically use a hierarchical configuration structure:
- base/: Shared dataset definitions for all environments
- local/: Environment-specific overrides and credentials (gitignored)
- globals.yml: Global variables and parameters
- credentials.yml: Credentials for databases and cloud storage (gitignored)
This structure enables configuration composition: base definitions provide defaults, while local overrides customize behavior for specific environments.
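The composition idea can be modelled with plain dictionaries. The sketch below assumes the simplest rule, where a local entry replaces the whole base entry of the same name; the merge rules of the actual configuration loader may differ. `compose_config`, the dataset type strings, and all file paths are illustrative placeholders.

```python
from typing import Any, Dict


def compose_config(base: Dict[str, Any], local: Dict[str, Any]) -> Dict[str, Any]:
    """Return base definitions with same-named local entries replacing them."""
    merged = dict(base)   # copy so the base definitions stay untouched
    merged.update(local)  # local entries win on name collisions
    return merged


# Shared defaults, as they might appear under base/ (illustrative values).
base = {
    "training_data": {"type": "pandas.CSVDataset", "filepath": "s3://bucket/train.csv"},
    "predictions": {"type": "pandas.ParquetDataset", "filepath": "s3://bucket/preds.parquet"},
}

# A developer's override, as it might appear under local/ (illustrative values).
local = {
    "training_data": {"type": "pandas.CSVDataset", "filepath": "data/train_sample.csv"},
}

config = compose_config(base, local)
print(config["training_data"]["filepath"])
print(config["predictions"]["filepath"])
```

Only `training_data` is redirected to a local sample file; `predictions` keeps its base definition.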
Supported Dataset Types
Kedro-based catalogs support diverse dataset types:
- Tabular: CSV, Parquet, Excel, Feather, ORC
- Serialized: Pickle, JSON, YAML, HDF5
- Database: SQL queries, table reads/writes
- Big Data: Spark DataFrames with various formats
- Cloud Storage: S3, GCS, Azure Blob via fsspec
- Specialized: NetworkX graphs, Matplotlib figures, text files
Each dataset type has configurable load and save arguments for fine-grained control over I/O behavior.
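To make the role of load and save arguments concrete, here is a minimal sketch of a dataset that forwards them as keyword arguments to its underlying I/O calls. It uses only the standard-library `csv` module; `TinyCsvDataset` is a hypothetical class written for illustration and does not reflect how Kedro datasets are implemented.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


class TinyCsvDataset:
    """Hypothetical dataset that forwards load/save arguments to the csv module."""

    def __init__(
        self,
        filepath: Path,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = Path(filepath)
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})

    def save(self, rows: List[List[str]]) -> None:
        # save_args are passed straight to csv.writer (e.g. delimiter).
        with self._filepath.open("w", newline="") as handle:
            csv.writer(handle, **self._save_args).writerows(rows)

    def load(self) -> List[List[str]]:
        # load_args are passed straight to csv.reader.
        with self._filepath.open(newline="") as handle:
            return list(csv.reader(handle, **self._load_args))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sales.csv"
    dataset = TinyCsvDataset(
        path,
        load_args={"delimiter": ";"},
        save_args={"delimiter": ";"},
    )
    dataset.save([["region", "amount"], ["north", "10"]])
    rows = dataset.load()
    raw_text = path.read_text()

print(rows)
```

In a configuration-driven catalog, the two argument dictionaries would come from the dataset's YAML entry rather than from code, so I/O behavior can change without touching the pipeline.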
Examples
Create a Kedro catalog from a configuration directory:
>>> from adc_toolkit.data.catalogs.kedro import KedroDataCatalog
>>> catalog = KedroDataCatalog.in_directory("config/catalog")
>>> df = catalog.load("training_data")
>>> catalog.save("predictions", predictions_df)
Initialize a catalog directory structure with templates:
>>> from adc_toolkit.data.catalogs.kedro import KedroDataCatalog
>>> result = KedroDataCatalog.init_catalog("./config/catalog")
>>> print(f"Created {len(result.created_files)} configuration files")
Use a catalog in a data pipeline:
>>> from adc_toolkit.data.catalogs.kedro import KedroDataCatalog
>>>
>>> def etl_pipeline(catalog: KedroDataCatalog) -> None:
...     # Load raw data
...     raw = catalog.load("raw_sales")
...
...     # Transform (enrich_with_features is a user-defined helper, not shown)
...     cleaned = raw.dropna()
...     enriched = enrich_with_features(cleaned)
...
...     # Save intermediate and final results
...     catalog.save("cleaned_sales", cleaned)
...     catalog.save("enriched_sales", enriched)
>>>
>>> catalog = KedroDataCatalog.in_directory("config/production")
>>> etl_pipeline(catalog)
Load data with dynamic SQL query parameters:
>>> catalog = KedroDataCatalog.in_directory("config/catalog")
>>> # Query defined in catalog.yml: SELECT * FROM sales WHERE year={year}
>>> sales_2024 = catalog.load("sales_data", year=2024)
>>> sales_2023 = catalog.load("sales_data", year=2023)
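The placeholder mechanism in the query above can be sketched independently of the catalog. The example below assumes the `{year}` placeholder is filled by plain string formatting, which is a guess at the behavior and not a description of the library's implementation. `render_query` is a hypothetical helper, and an in-memory SQLite table stands in for the real sales database.

```python
import sqlite3

QUERY_TEMPLATE = "SELECT region, amount FROM sales WHERE year = {year} ORDER BY region"


def render_query(template: str, **params: int) -> str:
    """Fill named placeholders; only ints are accepted to avoid SQL injection."""
    for key, value in params.items():
        if not isinstance(value, int):
            raise TypeError(f"Parameter {key!r} must be an int, got {type(value).__name__}")
    return template.format(**params)


# Stand-in database with a few illustrative rows.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, amount INTEGER, year INTEGER)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 10, 2023), ("south", 20, 2024), ("east", 30, 2024)],
)

sales_2024 = connection.execute(render_query(QUERY_TEMPLATE, year=2024)).fetchall()
sales_2023 = connection.execute(render_query(QUERY_TEMPLATE, year=2023)).fetchall()
connection.close()

print(sales_2024)
print(sales_2023)
```

String formatting is used here only to mirror the `{year}` syntax shown in the catalog entry. For values that come from users, bound parameters (as in the `INSERT` statement above) are the safer choice.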