adc_toolkit.data.default_attributes
Factory functions for creating default catalog and validator instances.
This module provides factory functions that return sensible default implementations of the data catalog and data validator abstractions used throughout the adc-toolkit. These defaults follow a priority-based selection strategy and use lazy imports to avoid requiring all optional dependencies.
Default Selection Logic
Data Catalog: Returns KedroDataCatalog when the kedro package is
installed and raises ImportError otherwise. KedroDataCatalog provides
YAML-based configuration and supports multiple data formats (CSV, Parquet, JSON, etc.).
Data Validator: Follows a priority hierarchy:
1. GXValidator (Great Expectations) - preferred default if installed
2. PanderaValidator (Pandera) - fallback if GX is not installed
3. ImportError - raised if neither validation package is available
The lazy import mechanism ensures that only the actually-used implementation is imported, allowing users to install only the optional dependencies they need.
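The priority-based selection described above can be sketched in isolation. This is a minimal illustration, not the module's actual code: the helper name `first_installed` is hypothetical, and standard-library module names stand in for the optional dependencies so the snippet runs anywhere.

```python
from importlib.util import find_spec


def first_installed(candidates: list[str]) -> str:
    """Return the first top-level package in `candidates` that can be found.

    Hypothetical helper for illustration; the real module hard-codes its checks.
    """
    for name in candidates:
        # For a top-level name, find_spec returns None when the package is
        # missing and does not import the package when it is present.
        if find_spec(name) is not None:
            return name
    raise ImportError(f"None of these packages is installed: {', '.join(candidates)}")


# "json" stands in for an installed optional dependency; the first name is
# assumed not to exist on the system.
chosen = first_installed(["package_that_does_not_exist_xyz", "json"])
print(chosen)  # prints "json"
```

Only top-level package names are passed to `find_spec` here, because a dotted name makes it import the parent package first.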
Functions
default_catalog(config_path): Return the default data catalog implementation (KedroDataCatalog).
default_validator(config_path): Return the default data validator implementation (GX or Pandera).
See Also
adc_toolkit.data.abs.DataCatalog: Abstract base class for data catalogs.
adc_toolkit.data.abs.DataValidator: Abstract base class for data validators.
adc_toolkit.data.ValidatedDataCatalog: Main validated data catalog abstraction.
Notes
These factory functions are primarily used by ValidatedDataCatalog.in_directory()
to automatically construct a validated catalog with sensible defaults when the user
doesn't explicitly specify catalog or validator implementations.
Users can always bypass these defaults by directly instantiating specific
implementations (e.g., KedroDataCatalog, GXValidator, PanderaValidator,
or NoValidator).
Examples
The default factories are typically used indirectly through ValidatedDataCatalog:
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory("config/")
>>> # This uses default_catalog() and default_validator() internally
Direct usage of the factory functions:
>>> from adc_toolkit.data.default_attributes import default_catalog, default_validator
>>> catalog = default_catalog("config/")
>>> validator = default_validator("config/")
>>> # Manually construct ValidatedDataCatalog with defaults
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> validated_catalog = ValidatedDataCatalog(catalog, validator)
"""
Factory functions for creating default catalog and validator instances.

This module provides factory functions that return sensible default implementations
of the data catalog and data validator abstractions used throughout the adc-toolkit.
These defaults follow a priority-based selection strategy and use lazy imports to
avoid requiring all optional dependencies.

Default Selection Logic
-----------------------
**Data Catalog**: Always returns ``KedroDataCatalog`` if the kedro package is
installed. This provides YAML-based configuration and supports multiple data
formats (CSV, Parquet, JSON, etc.).

**Data Validator**: Follows a priority hierarchy:
    1. ``GXValidator`` (Great Expectations) - preferred default if installed
    2. ``PanderaValidator`` (Pandera) - fallback if GX is not installed
    3. ``ImportError`` - raised if neither validation package is available

The lazy import mechanism ensures that only the actually-used implementation is
imported, allowing users to install only the optional dependencies they need.

Functions
---------
default_catalog(config_path)
    Return the default data catalog implementation (KedroDataCatalog).
default_validator(config_path)
    Return the default data validator implementation (GX or Pandera).

See Also
--------
adc_toolkit.data.abs.DataCatalog : Abstract base class for data catalogs.
adc_toolkit.data.abs.DataValidator : Abstract base class for data validators.
adc_toolkit.data.ValidatedDataCatalog : Main validated data catalog abstraction.

Notes
-----
These factory functions are primarily used by ``ValidatedDataCatalog.in_directory()``
to automatically construct a validated catalog with sensible defaults when the user
doesn't explicitly specify catalog or validator implementations.

Users can always bypass these defaults by directly instantiating specific
implementations (e.g., ``KedroDataCatalog``, ``GXValidator``, ``PanderaValidator``,
or ``NoValidator``).

Examples
--------
The default factories are typically used indirectly through ValidatedDataCatalog:

>>> from adc_toolkit.data import ValidatedDataCatalog
>>> catalog = ValidatedDataCatalog.in_directory("config/")
>>> # This uses default_catalog() and default_validator() internally

Direct usage of the factory functions:

>>> from adc_toolkit.data.default_attributes import default_catalog, default_validator
>>> catalog = default_catalog("config/")
>>> validator = default_validator("config/")
>>> # Manually construct ValidatedDataCatalog with defaults
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> validated_catalog = ValidatedDataCatalog(catalog, validator)
"""

import warnings
from importlib.util import find_spec
from pathlib import Path

from adc_toolkit.data.abs import DataCatalog, DataValidator


def default_catalog(config_path: str | Path) -> DataCatalog:
    """
    Return the default data catalog implementation initialized from configuration.

    This factory function provides the default ``DataCatalog`` implementation for
    the adc-toolkit. It uses a lazy import mechanism to check for the kedro package
    and returns a ``KedroDataCatalog`` instance if available.

    The function performs runtime package detection using ``importlib.util.find_spec``
    to avoid hard dependencies on kedro. This allows users to install only the
    catalog implementations they need.

    Parameters
    ----------
    config_path : str or pathlib.Path
        Path to the configuration directory containing the data catalog YAML file.
        For ``KedroDataCatalog``, this directory should contain a ``catalog.yaml``
        file that defines dataset configurations in Kedro format. The path can be
        either absolute or relative to the current working directory.

    Returns
    -------
    DataCatalog
        An instance of ``KedroDataCatalog`` initialized with the configuration
        found in the specified directory. The returned object implements the
        ``DataCatalog`` abstract interface, providing ``load()`` and ``save()``
        methods for data I/O operations.

    Raises
    ------
    ImportError
        If the kedro package is not installed. The error message provides
        installation instructions using the uv package manager (formerly poetry).
        Users can install kedro by running ``uv sync --group kedro`` or implement
        their own custom ``DataCatalog`` subclass.

    See Also
    --------
    adc_toolkit.data.catalogs.kedro.KedroDataCatalog : The Kedro-based catalog implementation.
    adc_toolkit.data.abs.DataCatalog : Abstract base class for data catalogs.
    adc_toolkit.data.ValidatedDataCatalog : Validated catalog using this default.

    Notes
    -----
    **Lazy Import Mechanism**: The function uses ``importlib.util.find_spec`` to
    check for kedro's availability before importing. This allows the module to be
    imported even when kedro is not installed, with the ImportError only raised
    when the function is actually called.

    **Alternative Implementations**: Users who don't want to use Kedro can:
    1. Implement a custom ``DataCatalog`` subclass
    2. Directly instantiate their catalog and pass it to ``ValidatedDataCatalog``

    **Configuration Format**: The KedroDataCatalog expects a ``catalog.yaml`` file
    in the specified directory. See the Kedro documentation for the full
    specification of the catalog configuration format.

    Examples
    --------
    Basic usage to get a default catalog:

    >>> from adc_toolkit.data.default_attributes import default_catalog
    >>> catalog = default_catalog("path/to/config")
    >>> # catalog is now a KedroDataCatalog instance
    >>> df = catalog.load("my_dataset")

    Using with ValidatedDataCatalog (typical usage pattern):

    >>> from adc_toolkit.data import ValidatedDataCatalog
    >>> validated_cat = ValidatedDataCatalog.in_directory("config/")
    >>> # This internally calls default_catalog("config/")

    Handling the ImportError when kedro is not installed:

    >>> try:
    ...     catalog = default_catalog("config/")
    ... except ImportError as e:
    ...     print("Kedro not installed, using custom catalog")
    ...     catalog = MyCustomCatalog("config/")

    Working with Path objects:

    >>> from pathlib import Path
    >>> config_dir = Path(__file__).parent / "config"
    >>> catalog = default_catalog(config_dir)
    """
    is_kedro_installed = find_spec("kedro") is not None
    if not is_kedro_installed:
        # Each fragment ends with a space so the concatenated message reads cleanly.
        raise ImportError(
            "Default data catalog is KedroDataCatalog. "
            "You must install kedro to use KedroDataCatalog. "
            "Run `uv sync --group kedro` to do so. "
            "Alternatively, you can implement your own data catalog."
        )

    from adc_toolkit.data.catalogs.kedro import KedroDataCatalog

    return KedroDataCatalog(config_path)


def default_validator(config_path: str | Path) -> DataValidator:
    """
    Return the default data validator implementation with priority-based selection.

    This factory function provides the default ``DataValidator`` implementation for
    the adc-toolkit by attempting to load validation libraries in priority order.
    It implements a fallback chain: Great Expectations (preferred) → Pandera
    (fallback) → ImportError (if neither is available).

    The function uses lazy imports and runtime package detection to check for
    available validation libraries, allowing users to install only the validator
    they prefer. When Great Expectations is not available but Pandera is, a
    warning is issued to inform users they are using the fallback implementation.

    Priority Selection Logic
    ------------------------
    1. **GXValidator (Great Expectations)**: Preferred default. Provides comprehensive
       data validation with extensive built-in expectations, profiling capabilities,
       and data documentation features.

    2. **PanderaValidator (Pandera)**: Fallback option. Provides DataFrame schema
       validation with a more lightweight, Pythonic API. Used automatically when
       Great Expectations is not installed.

    3. **ImportError**: Raised when neither validation library is available, with
       detailed installation instructions.

    Parameters
    ----------
    config_path : str or pathlib.Path
        Path to the configuration directory containing validator configuration files.
        The expected file structure depends on the validator:

        - **GXValidator**: Expects a Great Expectations project structure with
          ``great_expectations.yml`` or expectations suite configurations.
        - **PanderaValidator**: Expects Pandera schema definition files (Python
          modules or YAML files depending on configuration).

        The path can be either absolute or relative to the current working directory.

    Returns
    -------
    DataValidator
        An instance of either ``GXValidator`` or ``PanderaValidator`` (in priority
        order), initialized with the configuration found in the specified directory.
        The returned object implements the ``DataValidator`` abstract interface,
        providing ``validate()`` methods for data quality checks.

    Raises
    ------
    ImportError
        Raised when neither the great_expectations nor pandera packages are
        installed. The error message provides detailed installation instructions
        for both options using the uv package manager, and also mentions the
        alternative of implementing a custom validator or using ``NoValidator``
        (though the latter is not recommended for production use).

    Warns
    -----
    UserWarning
        Issued when Great Expectations is not installed but Pandera is available.
        This warning informs users that they are using the fallback validator
        implementation rather than the preferred default. The warning includes
        stacklevel=2 to show the calling code location rather than the factory
        function itself.

    See Also
    --------
    adc_toolkit.data.validators.gx.GXValidator : Great Expectations validator implementation.
    adc_toolkit.data.validators.pandera.PanderaValidator : Pandera validator implementation.
    adc_toolkit.data.validators.no_validator.NoValidator : No-op validator (not recommended).
    adc_toolkit.data.abs.DataValidator : Abstract base class for data validators.
    adc_toolkit.data.ValidatedDataCatalog : Validated catalog using this default.

    Notes
    -----
    **Lazy Import Mechanism**: The function uses ``importlib.util.find_spec`` to
    check for package availability before importing. This allows the module to be
    imported even when validation libraries are not installed, with the ImportError
    only raised when the function is actually called.

    **Installation Options**: Users should install the validation library that best
    fits their needs:

    - For comprehensive validation and data documentation: ``uv sync --group gx``
    - For lightweight DataFrame validation: ``uv sync --group pandera``
    - For both (if needed): ``uv sync --group gx --group pandera``

    **Alternative Implementations**: Users who don't want to use the defaults can:

    1. Implement a custom ``DataValidator`` subclass
    2. Use the ``NoValidator`` class (bypasses all validation, not recommended)
    3. Directly instantiate a specific validator and pass it to ``ValidatedDataCatalog``

    **Warning Behavior**: The fallback warning uses ``stacklevel=2`` to ensure the
    warning appears to originate from the user's code that called this function,
    not from within the factory function itself. This makes it easier for users
    to identify where the fallback is being triggered.

    Examples
    --------
    Basic usage to get a default validator:

    >>> from adc_toolkit.data.default_attributes import default_validator
    >>> validator = default_validator("path/to/config")
    >>> # validator is either GXValidator or PanderaValidator
    >>> validated_df = validator.validate("my_dataset", df)

    Using with ValidatedDataCatalog (typical usage pattern):

    >>> from adc_toolkit.data import ValidatedDataCatalog
    >>> validated_cat = ValidatedDataCatalog.in_directory("config/")
    >>> # This internally calls default_validator("config/")
    >>> df = validated_cat.load("my_dataset")  # Validates after loading

    Handling the fallback warning:

    >>> import warnings
    >>> warnings.filterwarnings("ignore", message=".*PanderaValidator.*")
    >>> validator = default_validator("config/")
    >>> # Warning is suppressed if only Pandera is installed

    Explicitly choosing a validator to avoid the default behavior:

    >>> from adc_toolkit.data.validators.pandera import PanderaValidator
    >>> from adc_toolkit.data.validators.gx import GXValidator
    >>> # Choose GXValidator explicitly
    >>> validator = GXValidator.in_directory("config/")

    Handling the ImportError when no validators are installed:

    >>> from adc_toolkit.data.validators.no_validator import NoValidator
    >>> try:
    ...     validator = default_validator("config/")
    ... except ImportError:
    ...     print("No validators installed, using NoValidator")
    ...     validator = NoValidator()

    Working with Path objects:

    >>> from pathlib import Path
    >>> config_dir = Path(__file__).parent / "config"
    >>> validator = default_validator(config_dir)
    """
    is_great_expectations_installed = find_spec("great_expectations") is not None
    is_pandera_installed = find_spec("pandera") is not None

    if is_great_expectations_installed:
        from adc_toolkit.data.validators.gx import GXValidator

        return GXValidator.in_directory(config_path)
    elif is_pandera_installed:
        warnings.warn(
            "Default data validator is GXValidator. "
            "Great Expectations is not installed. "
            "Using PanderaValidator instead.",
            stacklevel=2,
        )
        from adc_toolkit.data.validators.pandera import PanderaValidator

        return PanderaValidator.in_directory(config_path)
    else:
        raise ImportError(
            "Default data validators are GXValidator and PanderaValidator. "
            "You must install either great_expectations or pandera to use them. "
            "Neither package is installed. "
            "Run `uv sync --group gx` or "
            "`uv sync --group pandera` to do so. "
            "Alternatively, you can implement your own data validator. "
            "If you don't want to validate data, use NoValidator class (not recommended)."
        )
Return the default data catalog implementation initialized from configuration.
This factory function provides the default DataCatalog implementation for
the adc-toolkit. It uses a lazy import mechanism to check for the kedro package
and returns a KedroDataCatalog instance if available.
The function performs runtime package detection using importlib.util.find_spec
to avoid hard dependencies on kedro. This allows users to install only the
catalog implementations they need.
Parameters
- config_path (str or pathlib.Path): Path to the configuration directory containing the data catalog YAML file. For KedroDataCatalog, this directory should contain a catalog.yaml file that defines dataset configurations in Kedro format. The path can be either absolute or relative to the current working directory.
Returns
- DataCatalog: An instance of KedroDataCatalog initialized with the configuration found in the specified directory. The returned object implements the DataCatalog abstract interface, providing load() and save() methods for data I/O operations.
Raises
- ImportError: If the kedro package is not installed. The error message provides installation instructions using the uv package manager (formerly poetry). Users can install kedro by running uv sync --group kedro or implement their own custom DataCatalog subclass.
See Also
adc_toolkit.data.catalogs.kedro.KedroDataCatalog: The Kedro-based catalog implementation.
adc_toolkit.data.abs.DataCatalog: Abstract base class for data catalogs.
adc_toolkit.data.ValidatedDataCatalog: Validated catalog using this default.
Notes
Lazy Import Mechanism: The function uses importlib.util.find_spec to
check for kedro's availability before importing. This allows the module to be
imported even when kedro is not installed, with the ImportError only raised
when the function is actually called.
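The deferred-failure behaviour can be seen in a small standalone sketch. The function name `make_backend` and the package name `hypothetical_optional_backend` are assumptions made for illustration; they are not part of adc-toolkit.

```python
import importlib
from importlib.util import find_spec


def make_backend(package: str = "hypothetical_optional_backend"):
    """Import `package` only when called; raise a guiding error if it is missing."""
    if find_spec(package) is None:
        raise ImportError(
            f"{package} is not installed. "
            "Install it or supply your own implementation."
        )
    # Deferred import: reached only when the package is known to be available.
    return importlib.import_module(package)


# Defining make_backend never touches the optional package, so the enclosing
# module imports cleanly. The error surfaces only at call time.
try:
    make_backend()
except ImportError as exc:
    print(f"call failed: {exc}")
```

Calling `make_backend("json")` succeeds and returns the standard-library `json` module, which shows the success path without any optional dependency.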
Alternative Implementations: Users who don't want to use Kedro can:
1. Implement a custom DataCatalog subclass
2. Directly instantiate their catalog and pass it to ValidatedDataCatalog
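A custom catalog might look like the sketch below. Because adc-toolkit may not be installed where this runs, `CatalogInterface` is a hypothetical stand-in for `adc_toolkit.data.abs.DataCatalog`; the `load(name)` call matches the examples in this page, while the `save(name, data)` signature is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Any


class CatalogInterface(ABC):
    """Hypothetical stand-in for adc_toolkit.data.abs.DataCatalog."""

    @abstractmethod
    def load(self, name: str) -> Any:
        """Return the dataset registered under `name`."""

    @abstractmethod
    def save(self, name: str, data: Any) -> None:
        """Store `data` under `name` (assumed signature)."""


class InMemoryCatalog(CatalogInterface):
    """Toy catalog that keeps datasets in a dictionary instead of on disk."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        if name not in self._store:
            raise KeyError(f"Dataset {name!r} has not been saved")
        return self._store[name]

    def save(self, name: str, data: Any) -> None:
        self._store[name] = data


catalog = InMemoryCatalog()
catalog.save("my_dataset", [1, 2, 3])
print(catalog.load("my_dataset"))  # prints [1, 2, 3]
```

In real use the class would subclass `DataCatalog` directly and the instance would be passed to `ValidatedDataCatalog` together with a validator, as in the module-level example.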
Configuration Format: The KedroDataCatalog expects a catalog.yaml file
in the specified directory. See the Kedro documentation for the full
specification of the catalog configuration format.
Examples
Basic usage to get a default catalog:
>>> from adc_toolkit.data.default_attributes import default_catalog
>>> catalog = default_catalog("path/to/config")
>>> # catalog is now a KedroDataCatalog instance
>>> df = catalog.load("my_dataset")
Using with ValidatedDataCatalog (typical usage pattern):
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> validated_cat = ValidatedDataCatalog.in_directory("config/")
>>> # This internally calls default_catalog("config/")
Handling the ImportError when kedro is not installed:
>>> try:
... catalog = default_catalog("config/")
... except ImportError as e:
... print("Kedro not installed, using custom catalog")
... catalog = MyCustomCatalog("config/")
Working with Path objects:
>>> from pathlib import Path
>>> config_dir = Path(__file__).parent / "config"
>>> catalog = default_catalog(config_dir)
Return the default data validator implementation with priority-based selection.
This factory function provides the default DataValidator implementation for
the adc-toolkit by attempting to load validation libraries in priority order.
It implements a fallback chain: Great Expectations (preferred) → Pandera
(fallback) → ImportError (if neither is available).
The function uses lazy imports and runtime package detection to check for available validation libraries, allowing users to install only the validator they prefer. When Great Expectations is not available but Pandera is, a warning is issued to inform users they are using the fallback implementation.
Priority Selection Logic
1. GXValidator (Great Expectations): Preferred default. Provides comprehensive data validation with extensive built-in expectations, profiling capabilities, and data documentation features.
2. PanderaValidator (Pandera): Fallback option. Provides DataFrame schema validation with a more lightweight, Pythonic API. Used automatically when Great Expectations is not installed.
3. ImportError: Raised when neither validation library is available, with detailed installation instructions.
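The priority chain above can be sketched in isolation. This is a minimal illustration, not the toolkit's implementation: `pick_first_available` and `VALIDATOR_PRIORITY` are hypothetical names introduced here, and the function returns a package name rather than a validator instance.

```python
from importlib.util import find_spec


def pick_first_available(candidates: list[str]) -> str:
    """Return the first top-level package name in `candidates` that can be located."""
    for name in candidates:
        # find_spec returns None when a top-level package cannot be found;
        # it does not import the package itself.
        if find_spec(name) is not None:
            return name
    raise ImportError(f"None of these packages is installed: {', '.join(candidates)}")


# Hypothetical constant mirroring the order described above:
# Great Expectations first, then Pandera.
VALIDATOR_PRIORITY = ["great_expectations", "pandera"]
```

Calling `pick_first_available(VALIDATOR_PRIORITY)` would name the library that `default_validator` is expected to prefer in the current environment, or raise `ImportError` if neither is present.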
Parameters
config_path (str or pathlib.Path): Path to the configuration directory containing validator configuration files. The expected file structure depends on the validator:
- GXValidator: Expects a Great Expectations project structure with great_expectations.yml or expectation suite configurations.
- PanderaValidator: Expects Pandera schema definition files (Python modules or YAML files depending on configuration).
The path can be either absolute or relative to the current working directory.
Returns
- DataValidator: An instance of either GXValidator or PanderaValidator (in priority order), initialized with the configuration found in the specified directory. The returned object implements the DataValidator abstract interface, providing validate() methods for data quality checks.
Raises
- ImportError: Raised when neither the great_expectations nor the pandera package is
installed. The error message provides detailed installation instructions
for both options using the uv package manager, and also mentions the
alternative of implementing a custom validator or using
NoValidator (though the latter is not recommended for production use).
Warns
- UserWarning: Issued when Great Expectations is not installed but Pandera is available. This warning informs users that they are using the fallback validator implementation rather than the preferred default. The warning includes stacklevel=2 to show the calling code location rather than the factory function itself.
See Also
adc_toolkit.data.validators.gx.GXValidator: Great Expectations validator implementation.
adc_toolkit.data.validators.pandera.PanderaValidator: Pandera validator implementation.
adc_toolkit.data.validators.no_validator.NoValidator: No-op validator (not recommended).
adc_toolkit.data.abs.DataValidator: Abstract base class for data validators.
adc_toolkit.data.ValidatedDataCatalog: Validated catalog using this default.
Notes
Lazy Import Mechanism: The function uses importlib.util.find_spec to
check for package availability before importing. This allows the module to be
imported even when validation libraries are not installed, with the ImportError
only raised when the function is actually called.
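A minimal sketch of this deferred-import pattern, using only the standard library. The name `load_backend` and the package name `hypothetical_validation_lib` are assumptions made for illustration and are not part of adc-toolkit.

```python
from importlib import import_module
from importlib.util import find_spec


def load_backend(package: str):
    """Import `package` only when called, and only if it can be located."""
    if find_spec(package) is None:
        raise ImportError(f"{package} is not installed")
    # The import happens here, at call time, not when this module is loaded.
    return import_module(package)


def load_hypothetical_validator():
    # Defining this function succeeds even though the (made-up) package is
    # missing; the ImportError surfaces only when the function is called.
    return load_backend("hypothetical_validation_lib")
```

Because the availability check runs inside the function body, a module containing these definitions imports cleanly in an environment without the optional dependency.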
Installation Options: Users should install the validation library that best fits their needs:
- For comprehensive validation and data documentation: uv sync --group gx
- For lightweight DataFrame validation: uv sync --group pandera
- For both (if needed): uv sync --group gx --group pandera
Alternative Implementations: Users who don't want to use the defaults can:
- Implement a custom DataValidator subclass
- Use the NoValidator class (bypasses all validation, not recommended)
- Directly instantiate a specific validator and pass it to ValidatedDataCatalog
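As a rough illustration of the first option, the sketch below shows the shape a custom validator might take. It is an assumption-heavy stand-in: `NonEmptyValidator` is a hypothetical class, it does not subclass the real `DataValidator` so that it stays self-contained, and the `in_directory(path)` and `validate(name, data)` signatures are inferred only from the usage shown on this page. The actual abstract base class may require more.

```python
from pathlib import Path
from typing import Any, Union


class NonEmptyValidator:
    """Hypothetical validator: rejects empty datasets, passes others through."""

    def __init__(self, config_path: Path) -> None:
        # Kept for parity with the directory-based constructors; unused here.
        self.config_path = config_path

    @classmethod
    def in_directory(cls, path: Union[str, Path]) -> "NonEmptyValidator":
        # Assumed constructor shape, mirroring GXValidator.in_directory(...).
        return cls(Path(path))

    def validate(self, name: str, data: Any) -> Any:
        # Assumed signature, mirroring validator.validate("my_dataset", df).
        if len(data) == 0:
            raise ValueError(f"Dataset {name!r} is empty")
        return data
```

A real implementation would inherit from `adc_toolkit.data.abs.DataValidator` and follow whatever methods that base class declares.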
Warning Behavior: The fallback warning uses stacklevel=2 to ensure the
warning appears to originate from the user's code that called this function,
not from within the factory function itself. This makes it easier for users
to identify where the fallback is being triggered.
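The effect of stacklevel=2 can be observed with the standard library alone. In this sketch, `factory_with_fallback` and `user_code` are hypothetical stand-ins for the factory function and the caller's code.

```python
import warnings


def factory_with_fallback() -> str:
    # stacklevel=2 attributes the warning to the caller of this function.
    warnings.warn("Using fallback implementation.", stacklevel=2)
    return "fallback"


def user_code() -> str:
    return factory_with_fallback()  # the warning is reported at this line


# Record warnings instead of printing them so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = user_code()

# The recorded line number points into user_code, not into the factory.
reported_line = caught[0].lineno
```

With the default stacklevel=1, the reported line would instead be the `warnings.warn` call inside the factory, which tells the user little about where the fallback was triggered.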
Examples
Basic usage to get a default validator:
>>> from adc_toolkit.data.default_attributes import default_validator
>>> validator = default_validator("path/to/config")
>>> # validator is either GXValidator or PanderaValidator
>>> validated_df = validator.validate("my_dataset", df)
Using with ValidatedDataCatalog (typical usage pattern):
>>> from adc_toolkit.data import ValidatedDataCatalog
>>> validated_cat = ValidatedDataCatalog.in_directory("config/")
>>> # This internally calls default_validator("config/")
>>> df = validated_cat.load("my_dataset") # Validates after loading
Handling the fallback warning:
>>> import warnings
>>> warnings.filterwarnings("ignore", message=".*PanderaValidator.*")
>>> validator = default_validator("config/")
>>> # Warning is suppressed if only Pandera is installed
Explicitly choosing a validator to avoid the default behavior:
>>> from adc_toolkit.data.validators.pandera import PanderaValidator
>>> from adc_toolkit.data.validators.gx import GXValidator
>>> # Choose GXValidator explicitly
>>> validator = GXValidator.in_directory("config/")
Handling the ImportError when no validators are installed:
>>> from adc_toolkit.data.validators.no_validator import NoValidator
>>> try:
...     validator = default_validator("config/")
... except ImportError:
...     print("No validators installed, using NoValidator")
...     validator = NoValidator()
Working with Path objects:
>>> from pathlib import Path
>>> config_dir = Path(__file__).parent / "config"
>>> validator = default_validator(config_dir)