adc_toolkit.data.catalogs

Data catalog implementations for the adc-toolkit.

This module provides concrete implementations of the DataCatalog protocol, enabling configuration-driven data I/O operations in ML and analytics pipelines. Data catalogs abstract away the details of data storage, file formats, and access patterns, providing a simple name-based API for loading and saving datasets.
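
As a rough illustration of the name-based API described above, the following sketch defines a minimal catalog protocol and a dictionary-backed implementation. The names `CatalogLike` and `InMemoryCatalog` are hypothetical and are not part of adc-toolkit; the actual DataCatalog protocol may declare different signatures.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CatalogLike(Protocol):
    """Hypothetical stand-in for a name-based catalog protocol."""

    def load(self, name: str, **kwargs: Any) -> Any: ...

    def save(self, name: str, data: Any) -> None: ...


class InMemoryCatalog:
    """Toy implementation that keeps datasets in a dict instead of on disk."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def load(self, name: str, **kwargs: Any) -> Any:
        if name not in self._store:
            raise KeyError(f"Dataset {name!r} is not registered")
        return self._store[name]

    def save(self, name: str, data: Any) -> None:
        self._store[name] = data


catalog = InMemoryCatalog()
catalog.save("training_data", [1, 2, 3])
loaded = catalog.load("training_data")

# Structural typing: InMemoryCatalog satisfies the protocol without inheriting from it.
satisfies_protocol = isinstance(catalog, CatalogLike)
```

Callers written against such a protocol only need dataset names, so the storage backend can change without touching them.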

The module currently includes Kedro-based catalog implementations, which support YAML-based dataset definitions, multiple file formats (CSV, Parquet, JSON, Excel, Pickle, HDF5), diverse storage backends (local filesystem, S3, GCS, Azure Blob), and advanced features like versioning, partitioning, and dynamic SQL queries.

Submodules

kedro: Kedro-based data catalog implementation with the KedroDataCatalog class and scaffolding utilities for creating catalog directory structures.

Notes

Data catalogs provide several key benefits:

Configuration-Driven: Dataset locations, formats, and parameters are defined in YAML files rather than hardcoded in application logic.
Environment Flexibility: Dev, staging, and production environments can use different configurations without code changes.
Reproducibility: Version-controlled configurations ensure consistent data access across team members and deployments.
Testability: Catalogs are easy to mock or swap in unit tests.
Separation of Concerns: Data access logic is decoupled from business logic.

The catalog pattern is particularly valuable in data science and ML workflows where data sources, formats, and locations frequently change across environments and project phases.
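
The testability point can be made concrete with a small sketch. Assuming pipeline code depends only on `load` and `save`, a test can pass in an in-memory stand-in instead of a file-backed catalog. `FakeCatalog` and `drop_incomplete_rows` are hypothetical names used for illustration only, and plain lists of dicts stand in for DataFrames.

```python
from typing import Any


class FakeCatalog:
    """Test double that serves preloaded datasets and records saves."""

    def __init__(self, datasets: dict[str, Any]) -> None:
        self.datasets = dict(datasets)
        self.saved: list[str] = []

    def load(self, name: str) -> Any:
        return self.datasets[name]

    def save(self, name: str, data: Any) -> None:
        self.datasets[name] = data
        self.saved.append(name)


def drop_incomplete_rows(catalog: Any) -> None:
    """Logic under test: it knows dataset names, not storage details."""
    rows = catalog.load("raw_sales")
    cleaned = [row for row in rows if None not in row.values()]
    catalog.save("cleaned_sales", cleaned)


fake = FakeCatalog({"raw_sales": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": None}]})
drop_incomplete_rows(fake)
```

After the call, `fake.saved` and `fake.datasets` can be inspected to check what the function wrote, with no files or credentials involved.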

Catalog Configuration Structure

Catalogs typically use a hierarchical configuration structure:

base/: Shared dataset definitions for all environments
local/: Environment-specific overrides and credentials (gitignored)
globals.yml: Global variables and parameters
credentials.yml: Credentials for databases and cloud storage (gitignored)

This structure enables configuration composition: base definitions provide defaults, while local overrides customize behavior for specific environments.
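
The composition idea can be sketched with plain dictionaries. This example assumes that a whole top-level entry in the local configuration replaces the entry of the same name in the base configuration; the actual merge rules depend on the configuration loader in use. All dataset names, keys, and paths below are hypothetical.

```python
from typing import Any

# Hypothetical dataset entries; keys and values are illustrative only.
base_config: dict[str, Any] = {
    "training_data": {"type": "csv", "filepath": "s3://bucket/training.csv"},
    "predictions": {"type": "parquet", "filepath": "s3://bucket/predictions.parquet"},
}
local_config: dict[str, Any] = {
    # Point one dataset at a local sample file during development.
    "training_data": {"type": "csv", "filepath": "data/sample_training.csv"},
}


def compose(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new mapping in which override entries replace base entries."""
    merged = dict(base)
    merged.update(override)
    return merged


effective = compose(base_config, local_config)
```

Here `effective["training_data"]` comes from the local override, while `effective["predictions"]` falls back to the base definition; `base_config` itself is left unchanged.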

Supported Dataset Types

Kedro-based catalogs support diverse dataset types:

Tabular: CSV, Parquet, Excel, Feather, ORC
Serialized: Pickle, JSON, YAML, HDF5
Database: SQL queries, table reads/writes
Big Data: Spark DataFrames with various formats
Cloud Storage: S3, GCS, Azure Blob via fsspec
Specialized: NetworkX graphs, Matplotlib figures, text files

Each dataset type has configurable load and save arguments for fine-grained control over I/O behavior.
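
To show what configurable load and save arguments amount to, here is a minimal sketch in which a toy dataset class forwards its configured arguments to the standard-library `csv` module. `ToyCsvDataset` is a hypothetical class, not a Kedro or adc-toolkit dataset; real dataset types forward their arguments to their own underlying readers and writers.

```python
import csv
import io
from typing import Any, Optional


class ToyCsvDataset:
    """Hypothetical dataset that passes configured arguments to the csv module."""

    def __init__(
        self,
        load_args: Optional[dict[str, Any]] = None,
        save_args: Optional[dict[str, Any]] = None,
    ) -> None:
        self.load_args = load_args or {}
        self.save_args = save_args or {}

    def load(self, text: str) -> list[dict[str, str]]:
        # load_args are passed straight through to the reader.
        return list(csv.DictReader(io.StringIO(text), **self.load_args))

    def save(self, rows: list[dict[str, Any]]) -> str:
        # save_args are passed straight through to the writer.
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), **self.save_args)
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()


dataset = ToyCsvDataset(
    load_args={"delimiter": ";"},
    save_args={"delimiter": "\t", "lineterminator": "\n"},
)
rows = dataset.load("sku;qty\nA;2\nB;5\n")
output = dataset.save(rows)
```

The same data is read as semicolon-separated text and written back tab-separated, purely by changing configuration rather than code.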

Examples

Create a Kedro catalog from a configuration directory:

>>> from adc_toolkit.data.catalogs.kedro import KedroDataCatalog
>>> catalog = KedroDataCatalog.in_directory("config/catalog")
>>> df = catalog.load("training_data")
>>> # predictions_df is assumed to be a DataFrame produced elsewhere (not shown)
>>> catalog.save("predictions", predictions_df)

Initialize a catalog directory structure with templates:

>>> from adc_toolkit.data.catalogs.kedro import KedroDataCatalog
>>> result = KedroDataCatalog.init_catalog("./config/catalog")
>>> print(f"Created {len(result.created_files)} configuration files")

Use a catalog in a data pipeline:

>>> from adc_toolkit.data.catalogs.kedro import KedroDataCatalog
>>>
>>> def etl_pipeline(catalog: KedroDataCatalog) -> None:
...     # Load raw data
...     raw = catalog.load("raw_sales")
...
...     # Transform (enrich_with_features is a user-defined function, not shown)
...     cleaned = raw.dropna()
...     enriched = enrich_with_features(cleaned)
...
...     # Save intermediate and final results
...     catalog.save("cleaned_sales", cleaned)
...     catalog.save("enriched_sales", enriched)
>>>
>>> catalog = KedroDataCatalog.in_directory("config/production")
>>> etl_pipeline(catalog)

Load data with dynamic SQL query parameters:

>>> catalog = KedroDataCatalog.in_directory("config/catalog")
>>> # Query defined in catalog.yml: SELECT * FROM sales WHERE year={year}
>>> sales_2024 = catalog.load("sales_data", year=2024)
>>> sales_2023 = catalog.load("sales_data", year=2023)
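
One plausible way a parameterized load like the one above could work is plain string templating: keyword arguments fill the `{year}` placeholder before the query runs. The sketch below uses the standard-library `sqlite3` module and hypothetical names (`QUERY_TEMPLATE`, `load_sales`, and the table contents); it illustrates the idea only and is not necessarily how KedroDataCatalog implements it.

```python
import sqlite3

# Hypothetical query template using the same placeholder style as the example above.
QUERY_TEMPLATE = "SELECT region, amount FROM sales WHERE year = {year} ORDER BY region"


def load_sales(connection: sqlite3.Connection, year: int) -> list[tuple[str, float]]:
    """Fill the template and run it; int() rejects non-numeric input."""
    query = QUERY_TEMPLATE.format(year=int(year))
    return connection.execute(query).fetchall()


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, amount REAL, year INTEGER)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 10.0, 2023), ("south", 7.5, 2024), ("north", 12.0, 2024)],
)

sales_2024 = load_sales(connection, year=2024)
sales_2023 = load_sales(connection, year=2023)
```

Formatting values directly into SQL text is only safe when the values are validated first, which is why the sketch coerces `year` to an integer; for untrusted input, bound query parameters are the safer choice.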

See Also

adc_toolkit.data.catalogs.kedro.KedroDataCatalog: Main Kedro catalog class.
adc_toolkit.data.abs.DataCatalog: Protocol definition for catalogs.
adc_toolkit.data.catalog.ValidatedDataCatalog: Catalog with automatic validation.