adc_toolkit.data.validators.table_utils.table_properties

Extract DataFrame properties for validation and catalog operations.

This module provides utility functions for extracting metadata and type information from data objects conforming to the Data protocol. These utilities support both pandas and PySpark DataFrames, enabling framework-agnostic data handling.

The functions in this module are primarily used by validators and data catalogs to inspect DataFrame structure, determine appropriate processing strategies, and ensure compatibility with validation schemas.

Functions

extract_dataframe_type(data)
    Determine the DataFrame framework type (pandas, pyspark, etc.).
extract_dataframe_schema(data)
    Extract column names and data types as a dictionary.
extract_dataframe_schema_spark_native_format(data)
    Extract Spark DataFrame schema in native Spark format.

Examples

Determine DataFrame type:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> extract_dataframe_type(df)
'pandas'

Extract schema from a pandas DataFrame:

>>> df = pd.DataFrame({"col1": [1, 2], "col2": [3.0, 4.0]})
>>> extract_dataframe_schema(df)
{'col1': 'int64', 'col2': 'float64'}

See Also

adc_toolkit.data.abs.Data: Protocol defining the Data interface.
adc_toolkit.data.validators.gx.batch_managers: Uses these utilities for datasource detection.
adc_toolkit.data.validators.pandera: Uses these utilities for schema compilation.
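
The functions documented here only rely on a small structural interface (`columns` and `dtypes`). As a rough illustration of what "conforming to the Data protocol" can mean, the sketch below defines a hypothetical `DataLike` protocol and a hypothetical `TinyTable` class. Both names are assumptions made for this example; the actual definition in `adc_toolkit.data.abs` may differ.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataLike(Protocol):
    """Hypothetical protocol; not the real adc_toolkit.data.abs.Data."""

    @property
    def columns(self) -> Any: ...

    @property
    def dtypes(self) -> Any: ...


class TinyTable:
    """Hypothetical table-like object exposing the two attributes."""

    def __init__(self) -> None:
        self.columns = ["a"]
        self.dtypes = [("a", "bigint")]


# isinstance() only checks that the attributes exist, not their types.
print(isinstance(TinyTable(), DataLike))  # True
print(isinstance(object(), DataLike))     # False
```

Because the check is structural, any object with these two attributes passes, whichever framework it comes from.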

"""
Extract DataFrame properties for validation and catalog operations.

This module provides utility functions for extracting metadata and type information
from data objects conforming to the Data protocol. These utilities support both
pandas and PySpark DataFrames, enabling framework-agnostic data handling.

The functions in this module are primarily used by validators and data catalogs
to inspect DataFrame structure, determine appropriate processing strategies, and
ensure compatibility with validation schemas.

Functions
---------
extract_dataframe_type(data)
    Determine the DataFrame framework type (pandas, pyspark, etc.).
extract_dataframe_schema(data)
    Extract column names and data types as a dictionary.
extract_dataframe_schema_spark_native_format(data)
    Extract Spark DataFrame schema in native Spark format.

Examples
--------
Determine DataFrame type:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> extract_dataframe_type(df)
'pandas'

Extract schema from a pandas DataFrame:

>>> df = pd.DataFrame({"col1": [1, 2], "col2": [3.0, 4.0]})
>>> extract_dataframe_schema(df)
{'col1': 'int64', 'col2': 'float64'}

See Also
--------
adc_toolkit.data.abs.Data : Protocol defining the Data interface.
adc_toolkit.data.validators.gx.batch_managers : Uses these utilities for datasource detection.
adc_toolkit.data.validators.pandera : Uses these utilities for schema compilation.
"""

from adc_toolkit.data.abs import Data


def extract_dataframe_type(data: Data) -> str:
    """
    Determine the DataFrame framework type from its module name.

    This function identifies the data processing framework (pandas, PySpark, etc.)
    by inspecting the module path of the data object's type. It extracts the
    top-level module name, which typically indicates the framework being used.

    The function is framework-agnostic and works with any data object conforming
    to the Data protocol. It is commonly used to determine which processing
    strategy or validator to apply to a dataset.

    Parameters
    ----------
    data : Data
        A data object conforming to the Data protocol. Typically a pandas
        DataFrame, Spark DataFrame, or other compatible data structure with
        `columns` and `dtypes` properties.

    Returns
    -------
    str
        The top-level module name identifying the DataFrame framework.
        Common return values include:
        - "pandas" for pandas DataFrames
        - "pyspark" for PySpark DataFrames
        - Other framework names for alternative implementations

    See Also
    --------
    extract_dataframe_schema : Extract column names and types from a DataFrame.
    extract_dataframe_schema_spark_native_format : Extract Spark schema in native format.

    Notes
    -----
    The function operates by:
    1. Getting the type of the data object using `type(data)`
    2. Accessing the `__module__` attribute, which contains the full module path
       (e.g., "pandas.core.frame", "pyspark.sql.dataframe")
    3. Splitting on "." and returning the first component (the framework name)

    This approach is robust across different versions of the same framework, as
    the top-level module name typically remains stable even when internal module
    structure changes.

    The function does not validate that the data object actually conforms to
    the Data protocol. It simply extracts the module name from whatever object
    is passed.

    Examples
    --------
    Identify a pandas DataFrame:

    >>> import pandas as pd
    >>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
    >>> extract_dataframe_type(df)
    'pandas'

    Identify a PySpark DataFrame:

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> spark_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
    >>> extract_dataframe_type(spark_df)
    'pyspark'

    Use in conditional logic to apply framework-specific processing:

    >>> if extract_dataframe_type(df) == "pandas":
    ...     # Apply pandas-specific operations
    ...     result = df.groupby("x").sum()
    ... elif extract_dataframe_type(df) == "pyspark":
    ...     # Apply Spark-specific operations
    ...     result = df.groupBy("x").sum()

    Determine appropriate validator based on DataFrame type:

    >>> framework = extract_dataframe_type(data)
    >>> if framework == "pandas":
    ...     datasource = PandasDatasource(data_context)
    ... elif framework == "pyspark":
    ...     datasource = SparkDFDatasource(data_context)
    """
    return type(data).__module__.split(".")[0]


def extract_dataframe_schema(data: Data) -> dict[str, str]:
    """
    Extract DataFrame schema as a dictionary mapping column names to type strings.

    This function extracts the complete schema information from a DataFrame by
    converting the `dtypes` attribute into a dictionary with column names as keys
    and string representations of data types as values. This format is
    framework-agnostic and works with both pandas and PySpark DataFrames.

    The resulting dictionary is useful for schema comparison, validation setup,
    logging, and generating schema documentation. It provides a simple,
    serializable representation of the DataFrame's structure.

    Parameters
    ----------
    data : Data
        A data object conforming to the Data protocol with a `dtypes` attribute.
        Typically a pandas DataFrame (with dtypes as a pandas.Series) or a
        PySpark DataFrame (with dtypes as a list of tuples).

    Returns
    -------
    dict of str to str
        A dictionary mapping each column name to its data type as a string.
        For pandas DataFrames, types include "int64", "float64", "object", etc.
        For PySpark DataFrames, types include "bigint", "double", "string", etc.
        The specific type strings depend on the DataFrame framework being used.

    See Also
    --------
    extract_dataframe_type : Determine the DataFrame framework type.
    extract_dataframe_schema_spark_native_format : Extract Spark schema with native type names.

    Notes
    -----
    The function operates differently depending on the DataFrame type:

    For pandas DataFrames:
    - `data.dtypes` returns a pandas.Series with column names as index and
      numpy/pandas dtype objects as values
    - Converting to dict gives {column: dtype_object}
    - String conversion yields familiar type names like "int64", "float64"

    For PySpark DataFrames:
    - `data.dtypes` returns a list of tuples: [(column_name, type_string), ...]
    - Converting to dict gives {column: type_string}
    - Type strings are in Spark SQL format like "bigint", "double", "string"

    The function uses `dict(data.dtypes)`, which works for both pandas (Series)
    and PySpark (list of tuples), making it framework-agnostic. The additional
    `str()` conversion ensures type objects are converted to readable strings.

    This function does not validate the data or check for schema consistency.
    It simply extracts and formats whatever schema information is present in
    the data object.

    Examples
    --------
    Extract schema from a pandas DataFrame:

    >>> import pandas as pd
    >>> df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.3, 30.7], "name": ["Alice", "Bob", "Charlie"]})
    >>> schema = extract_dataframe_schema(df)
    >>> schema
    {'id': 'int64', 'value': 'float64', 'name': 'object'}

    Extract schema from a PySpark DataFrame:

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> spark_df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
    >>> schema = extract_dataframe_schema(spark_df)
    >>> schema
    {'id': 'bigint', 'value': 'double', 'name': 'string'}

    Compare schemas between DataFrames:

    >>> df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
    >>> df2 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    >>> schema1 = extract_dataframe_schema(df1)
    >>> schema2 = extract_dataframe_schema(df2)
    >>> schema1 == schema2
    False
    >>> schema1["b"], schema2["b"]
    ('float64', 'object')

    Use for validation schema generation:

    >>> current_schema = extract_dataframe_schema(df)
    >>> expected_schema = {"id": "int64", "value": "float64", "name": "object"}
    >>> if current_schema != expected_schema:
    ...     raise ValueError(f"Schema mismatch: {current_schema} != {expected_schema}")

    Log schema information (INFO messages are shown only once the logging
    level is lowered from its default of WARNING):

    >>> import logging
    >>> logging.basicConfig(level=logging.INFO)
    >>> schema = extract_dataframe_schema(df)
    >>> logging.info(f"Processing DataFrame with schema: {schema}")
    INFO:root:Processing DataFrame with schema: {'id': 'int64', 'value': 'float64', 'name': 'object'}
    """
    return {col_name: str(col_type) for col_name, col_type in dict(data.dtypes).items()}


def extract_dataframe_schema_spark_native_format(data: Data) -> dict[str, str]:
    """
    Extract Spark DataFrame schema using native Spark type names.

    This function extracts schema information from a PySpark DataFrame using
    Spark's native schema representation. It accesses the `schema` attribute
    (a StructType object) and extracts the type name for each field, removing
    the trailing "()" suffix that Spark type objects include when converted
    to strings.

    This function is specifically designed for PySpark DataFrames and produces
    type names in Spark's native format (e.g., "LongType", "StringType",
    "DoubleType") rather than the SQL format used by `data.dtypes`
    (e.g., "bigint", "string", "double").

    Use this function when you need Spark-native type names for compatibility
    with Great Expectations Spark datasources, custom Spark validation logic,
    or when working with Spark's type system directly.

    Parameters
    ----------
    data : Data
        A PySpark DataFrame with a `schema` attribute. The schema should be
        a StructType object containing StructField objects with name and
        dataType attributes. Using this function with non-Spark data objects
        will raise an AttributeError.

    Returns
    -------
    dict of str to str
        A dictionary mapping each column name to its Spark-native type name.
        Type names follow Spark's naming convention with the "Type" suffix:
        - "LongType" for 64-bit integers
        - "IntegerType" for 32-bit integers
        - "DoubleType" for double-precision floats
        - "StringType" for strings
        - "BooleanType" for booleans
        - And other Spark data types

    Raises
    ------
    AttributeError
        If the data object does not have a `schema` attribute (i.e., it is
        not a PySpark DataFrame or compatible Spark data structure).

    See Also
    --------
    extract_dataframe_schema : Extract schema in a framework-agnostic format.
    extract_dataframe_type : Determine the DataFrame framework type.

    Notes
    -----
    The function operates by:
    1. Accessing `data.schema`, which returns a StructType object for Spark DataFrames
    2. Iterating over the StructField objects in the schema
    3. For each field, extracting the name and converting dataType to string
    4. Removing the trailing "()" from the type string representation
       (e.g., "LongType()" becomes "LongType")

    Only an empty trailing "()" is removed. Parameterized types such as
    ArrayType or DecimalType keep their arguments in the returned string.

    Spark type names differ from SQL type names:
    - `data.dtypes` returns [("col", "bigint"), ...] (SQL format)
    - This function returns {"col": "LongType"} (native Spark format)

    The SQL format is generally preferred for portability, but the native
    format is needed when:
    - Instantiating Spark DataType objects programmatically
    - Working with Great Expectations Spark expectations
    - Performing type matching with Spark's type system
    - Debugging Spark-specific type issues

    The `# type: ignore` comment suppresses mypy warnings about the `schema`
    attribute, which is not part of the Data protocol but is present in
    PySpark DataFrames.

    Examples
    --------
    Extract schema from a PySpark DataFrame:

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
    >>> schema = extract_dataframe_schema_spark_native_format(df)
    >>> schema
    {'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

    Compare with SQL-format schema:

    >>> sql_schema = extract_dataframe_schema(df)
    >>> sql_schema
    {'id': 'bigint', 'value': 'double', 'name': 'string'}
    >>> native_schema = extract_dataframe_schema_spark_native_format(df)
    >>> native_schema
    {'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

    Use with Great Expectations Spark datasources:

    >>> native_schema = extract_dataframe_schema_spark_native_format(df)
    >>> # Configure GX expectations using Spark-native type names
    >>> for col, spark_type in native_schema.items():
    ...     if spark_type == "LongType":
    ...         # Add expectations specific to long integer columns
    ...         pass

    Error when used with pandas DataFrame:

    >>> import pandas as pd
    >>> pandas_df = pd.DataFrame({"a": [1, 2, 3]})
    >>> extract_dataframe_schema_spark_native_format(pandas_df)
    Traceback (most recent call last):
        ...
    AttributeError: 'DataFrame' object has no attribute 'schema'

    Extract complex nested types (output as printed by recent PySpark releases):

    >>> from pyspark.sql.types import StructType, StructField, StringType, ArrayType
    >>> schema = StructType([StructField("name", StringType()), StructField("tags", ArrayType(StringType()))])
    >>> df = spark.createDataFrame([("Alice", ["tag1", "tag2"])], schema)
    >>> extract_dataframe_schema_spark_native_format(df)
    {'name': 'StringType', 'tags': 'ArrayType(StringType(), True)'}
    """
    # removesuffix only strips an empty "()", so parameterized types are not truncated.
    return {x.name: str(x.dataType).removesuffix("()") for x in list(data.schema)}  # type: ignore
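
The trailing "()" removal described for extract_dataframe_schema_spark_native_format can be explored on plain strings, without a Spark session. The sketch below compares two ways of dropping the suffix. The input strings are assumed to be representative of how recent PySpark releases print their types, and both helper names are hypothetical. `str.removesuffix` requires Python 3.9 or later.

```python
# Assumed string forms of Spark types, hard-coded so pyspark is not needed.
TYPE_STRINGS = [
    "LongType()",
    "StringType()",
    "ArrayType(StringType(), True)",
    "DecimalType(10,2)",
]


def strip_by_slice(type_string: str) -> str:
    """Drop the last two characters unconditionally."""
    return type_string[:-2]


def strip_empty_parens(type_string: str) -> str:
    """Drop a trailing '()' only when it is actually present."""
    return type_string.removesuffix("()")


for text in TYPE_STRINGS:
    print(f"{text!r}: slice={strip_by_slice(text)!r}, suffix={strip_empty_parens(text)!r}")
```

Both helpers agree on simple types such as "LongType()". They differ on parameterized types: the unconditional slice cuts into the arguments (for example "DecimalType(10,"), while the suffix-aware version leaves the string unchanged.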
def extract_dataframe_type(data: adc_toolkit.data.abs.Data) -> str:

Determine the DataFrame framework type from its module name.

This function identifies the data processing framework (pandas, PySpark, etc.) by inspecting the module path of the data object's type. It extracts the top-level module name, which typically indicates the framework being used.

The function is framework-agnostic and works with any data object conforming to the Data protocol. It is commonly used to determine which processing strategy or validator to apply to a dataset.

Parameters

  • data (Data): A data object conforming to the Data protocol. Typically a pandas DataFrame, Spark DataFrame, or other compatible data structure with columns and dtypes properties.

Returns

  • str: The top-level module name identifying the DataFrame framework. Common return values include:
    • "pandas" for pandas DataFrames
    • "pyspark" for PySpark DataFrames
    • Other framework names for alternative implementations

See Also

extract_dataframe_schema: Extract column names and types from a DataFrame.
extract_dataframe_schema_spark_native_format: Extract Spark schema in native format.

Notes

The function operates by:

  1. Getting the type of the data object using type(data)
  2. Accessing the __module__ attribute, which contains the full module path (e.g., "pandas.core.frame", "pyspark.sql.dataframe")
  3. Splitting on "." and returning the first component (the framework name)

This approach is robust across different versions of the same framework, as the top-level module name typically remains stable even when internal module structure changes.

The function does not validate that the data object actually conforms to the Data protocol. It simply extracts the module name from whatever object is passed.
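
The three steps in the Notes can be reproduced on any Python object. The following is a minimal sketch, not the library code itself: `top_level_module` is a hypothetical helper name, and `FakeFrame` is a stand-in class whose module path imitates a pandas DataFrame so that the example runs without pandas installed.

```python
from collections import OrderedDict
from fractions import Fraction


def top_level_module(obj: object) -> str:
    """Return the first component of the module path of obj's type."""
    return type(obj).__module__.split(".")[0]


# Stand-in class (hypothetical): only its __module__ mimics pandas.
FakeFrame = type("DataFrame", (), {"__module__": "pandas.core.frame"})

print(top_level_module(FakeFrame()))     # pandas
print(top_level_module(OrderedDict()))   # collections
print(top_level_module(Fraction(1, 2)))  # fractions
print(top_level_module([1, 2, 3]))       # builtins
```

As the last three calls show, the result is meaningful for any object, which is why passing something that is not a DataFrame returns a module name rather than raising an error.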

Examples

Identify a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> extract_dataframe_type(df)
'pandas'

Identify a PySpark DataFrame:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> extract_dataframe_type(spark_df)
'pyspark'

Use in conditional logic to apply framework-specific processing:

>>> if extract_dataframe_type(df) == "pandas":
...     # Apply pandas-specific operations
...     result = df.groupby("x").sum()
... elif extract_dataframe_type(df) == "pyspark":
...     # Apply Spark-specific operations
...     result = df.groupBy("x").sum()

Determine appropriate validator based on DataFrame type:

>>> framework = extract_dataframe_type(data)
>>> if framework == "pandas":
...     datasource = PandasDatasource(data_context)
... elif framework == "pyspark":
...     datasource = SparkDFDatasource(data_context)
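
When more than two frameworks are involved, the if/elif chains above can be replaced by a lookup table. This is one possible sketch under stated assumptions: the handler functions, `HANDLERS`, `dispatch`, and the stand-in class are all hypothetical names, and the framework name is derived with the same `type(data).__module__` expression the function uses.

```python
from typing import Callable


def handle_pandas(data: object) -> str:
    return "pandas branch"


def handle_pyspark(data: object) -> str:
    return "pyspark branch"


# Hypothetical registry mapping framework names to handlers.
HANDLERS: dict[str, Callable[[object], str]] = {
    "pandas": handle_pandas,
    "pyspark": handle_pyspark,
}


def dispatch(data: object) -> str:
    """Pick a handler by top-level module name, failing loudly if unknown."""
    framework = type(data).__module__.split(".")[0]
    try:
        handler = HANDLERS[framework]
    except KeyError:
        raise ValueError(f"Unsupported DataFrame framework: {framework!r}") from None
    return handler(data)


# Stand-in object whose type claims to live in pyspark.sql.dataframe.
FakeSparkFrame = type("DataFrame", (), {"__module__": "pyspark.sql.dataframe"})
print(dispatch(FakeSparkFrame()))  # pyspark branch
```

Unlike an if/elif chain without an else branch, the table raises a ValueError for an unrecognized framework instead of silently leaving the result undefined.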
def extract_dataframe_schema(data: adc_toolkit.data.abs.Data) -> dict[str, str]:

Extract DataFrame schema as a dictionary mapping column names to type strings.

This function extracts the complete schema information from a DataFrame by converting the dtypes attribute into a dictionary with column names as keys and string representations of data types as values. This format is framework-agnostic and works with both pandas and PySpark DataFrames.

The resulting dictionary is useful for schema comparison, validation setup, logging, and generating schema documentation. It provides a simple, serializable representation of the DataFrame's structure.

Parameters

  • data (Data): A data object conforming to the Data protocol with a dtypes attribute. Typically a pandas DataFrame (with dtypes as a pandas.Series) or a PySpark DataFrame (with dtypes as a list of tuples).

Returns

  • dict of str to str: A dictionary mapping each column name to its data type as a string. For pandas DataFrames, types include "int64", "float64", "object", etc. For PySpark DataFrames, types include "bigint", "double", "string", etc. The specific type strings depend on the DataFrame framework being used.

See Also

extract_dataframe_type: Determine the DataFrame framework type.
extract_dataframe_schema_spark_native_format: Extract Spark schema with native type names.

Notes

The function operates differently depending on the DataFrame type:

For pandas DataFrames:

  • data.dtypes returns a pandas.Series with column names as index and numpy/pandas dtype objects as values
  • Converting to dict gives {column: dtype_object}
  • String conversion yields familiar type names like "int64", "float64"

For PySpark DataFrames:

  • data.dtypes returns a list of tuples: [(column_name, type_string), ...]
  • Converting to dict gives {column: type_string}
  • Type strings are in Spark SQL format like "bigint", "double", "string"

The function uses dict(data.dtypes), which works for both pandas (Series) and PySpark (list of tuples), making it framework-agnostic. The additional str() conversion ensures type objects are converted to readable strings.

This function does not validate the data or check for schema consistency. It simply extracts and formats whatever schema information is present in the data object.
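
To see why a single `dict(...)` call covers both cases, the sketch below feeds it the two shapes described above. It assumes nothing about pandas or PySpark themselves: `SeriesLike` and `Dtype` are hypothetical stand-ins that only imitate the relevant behaviour (a `keys()` method with item access, and a dtype object with a string form), and `to_schema` is a hypothetical helper name.

```python
class Dtype:
    """Stand-in for a dtype object whose str() is its type name."""

    def __init__(self, name: str) -> None:
        self._name = name

    def __str__(self) -> str:
        return self._name


class SeriesLike:
    """Stand-in for a Series: dict() accepts anything with keys() and []."""

    def __init__(self, mapping: dict) -> None:
        self._mapping = dict(mapping)

    def keys(self):
        return self._mapping.keys()

    def __getitem__(self, key):
        return self._mapping[key]


def to_schema(dtypes) -> dict[str, str]:
    """Normalize either dtypes shape into {column: type string}."""
    return {name: str(dtype) for name, dtype in dict(dtypes).items()}


spark_like = [("id", "bigint"), ("value", "double")]
pandas_like = SeriesLike({"id": Dtype("int64"), "value": Dtype("float64")})

print(to_schema(spark_like))   # {'id': 'bigint', 'value': 'double'}
print(to_schema(pandas_like))  # {'id': 'int64', 'value': 'float64'}
```

For the list of tuples, `str()` is a no-op because the values are already strings. For the mapping-like object, it turns each dtype object into its name.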

Examples

Extract schema from a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.3, 30.7], "name": ["Alice", "Bob", "Charlie"]})
>>> schema = extract_dataframe_schema(df)
>>> schema
{'id': 'int64', 'value': 'float64', 'name': 'object'}

Extract schema from a PySpark DataFrame:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
>>> schema = extract_dataframe_schema(spark_df)
>>> schema
{'id': 'bigint', 'value': 'double', 'name': 'string'}

Compare schemas between DataFrames:

>>> df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
>>> df2 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
>>> schema1 = extract_dataframe_schema(df1)
>>> schema2 = extract_dataframe_schema(df2)
>>> schema1 == schema2
False
>>> schema1["b"], schema2["b"]
('float64', 'object')

Use for validation schema generation:

>>> current_schema = extract_dataframe_schema(df)
>>> expected_schema = {"id": "int64", "value": "float64", "name": "object"}
>>> if current_schema != expected_schema:
...     raise ValueError(f"Schema mismatch: {current_schema} != {expected_schema}")

Log schema information:

>>> import logging
>>> logging.basicConfig(level=logging.INFO)
>>> schema = extract_dataframe_schema(df)
>>> logging.info(f"Processing DataFrame with schema: {schema}")
INFO:root:Processing DataFrame with schema: {'id': 'int64', 'value': 'float64', 'name': 'object'}
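
A plain inequality check, as in the validation example above, reports that two schemas differ but not where. One possible refinement is sketched below. `diff_schemas` is a hypothetical helper, not part of this module, and it operates on the plain dictionaries that extract_dataframe_schema returns.

```python
def diff_schemas(actual: dict[str, str], expected: dict[str, str]) -> dict[str, object]:
    """Summarize how two {column: type string} schemas differ."""
    shared = sorted(actual.keys() & expected.keys())
    return {
        # Columns the expected schema has but the data lacks.
        "missing": sorted(expected.keys() - actual.keys()),
        # Columns present in the data but not expected.
        "unexpected": sorted(actual.keys() - expected.keys()),
        # Shared columns whose types disagree: column -> (actual, expected).
        "mismatched": {
            col: (actual[col], expected[col])
            for col in shared
            if actual[col] != expected[col]
        },
    }


actual = {"id": "int64", "value": "object", "extra": "float64"}
expected = {"id": "int64", "value": "float64", "name": "object"}

print(diff_schemas(actual, expected))
# {'missing': ['name'], 'unexpected': ['extra'], 'mismatched': {'value': ('object', 'float64')}}
```

An error message built from this summary points directly at the offending columns, which is easier to act on than two full schema dumps.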
def extract_dataframe_schema_spark_native_format(data: adc_toolkit.data.abs.Data) -> dict[str, str]:
def extract_dataframe_schema_spark_native_format(data: Data) -> dict[str, str]:
    """
    Extract Spark DataFrame schema using native Spark type names.

    This function extracts schema information from a PySpark DataFrame using
    Spark's native schema representation. It accesses the `schema` attribute
    (a StructType object) and extracts the type name for each field, removing
    the trailing "()" suffix that Spark type objects include when converted
    to strings.

    This function is specifically designed for PySpark DataFrames and produces
    type names in Spark's native format (e.g., "LongType", "StringType",
    "DoubleType") rather than the SQL format used by `data.dtypes`
    (e.g., "bigint", "string", "double").

    Use this function when you need Spark-native type names for compatibility
    with Great Expectations Spark datasources, custom Spark validation logic,
    or when working with Spark's type system directly.

    Parameters
    ----------
    data : Data
        A PySpark DataFrame with a `schema` attribute. The schema should be
        a StructType object containing StructField objects with name and
        dataType attributes. Using this function with non-Spark data objects
        will raise an AttributeError.

    Returns
    -------
    dict of str to str
        A dictionary mapping each column name to its Spark-native type name.
        Type names follow Spark's naming convention with the "Type" suffix:
        - "LongType" for 64-bit integers
        - "IntegerType" for 32-bit integers
        - "DoubleType" for double-precision floats
        - "StringType" for strings
        - "BooleanType" for booleans
        - And other Spark data types

    Raises
    ------
    AttributeError
        If the data object does not have a `schema` attribute (i.e., it is
        not a PySpark DataFrame or compatible Spark data structure).

    See Also
    --------
    extract_dataframe_schema : Extract schema in a framework-agnostic format.
    extract_dataframe_type : Determine the DataFrame framework type.

    Notes
    -----
    The function operates by:
    1. Accessing `data.schema`, which returns a StructType object for Spark DataFrames
    2. Iterating over the StructField objects in the schema
    3. For each field, extracting the name and converting dataType to string
    4. Removing the trailing "()" from the type string representation
       (e.g., "LongType()" becomes "LongType")

    Spark type names differ from SQL type names:
    - `data.dtypes` returns [("col", "bigint"), ...] (SQL format)
    - This function returns {"col": "LongType"} (native Spark format)

    The SQL format is generally preferred for portability, but the native
    format is needed when:
    - Instantiating Spark DataType objects programmatically
    - Working with Great Expectations Spark expectations
    - Performing type matching with Spark's type system
    - Debugging Spark-specific type issues

    The `# type: ignore` comment suppresses mypy warnings about the `schema`
    attribute, which is not part of the Data protocol but is present in
    PySpark DataFrames.

    Examples
    --------
    Extract schema from a PySpark DataFrame:

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
    >>> schema = extract_dataframe_schema_spark_native_format(df)
    >>> schema
    {'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

    Compare with SQL-format schema:

    >>> sql_schema = extract_dataframe_schema(df)
    >>> sql_schema
    {'id': 'bigint', 'value': 'double', 'name': 'string'}
    >>> native_schema = extract_dataframe_schema_spark_native_format(df)
    >>> native_schema
    {'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

    Use with Great Expectations Spark datasources:

    >>> native_schema = extract_dataframe_schema_spark_native_format(df)
    >>> # Configure GX expectations using Spark-native type names
    >>> for col, spark_type in native_schema.items():
    ...     if spark_type == "LongType":
    ...         # Add expectations specific to long integer columns
    ...         pass

    Error when used with pandas DataFrame:

    >>> import pandas as pd
    >>> pandas_df = pd.DataFrame({"a": [1, 2, 3]})
    >>> extract_dataframe_schema_spark_native_format(pandas_df)
    Traceback (most recent call last):
        ...
    AttributeError: 'DataFrame' object has no attribute 'schema'

    Parameterized types such as ArrayType keep their arguments in the string
    form. Only the last two characters are stripped, so the result for such
    a column is not a bare type name:

    >>> from pyspark.sql.types import StructType, StructField, StringType, ArrayType
    >>> schema = StructType([StructField("name", StringType()), StructField("tags", ArrayType(StringType()))])
    >>> df = spark.createDataFrame([("Alice", ["tag1", "tag2"])], schema)
    >>> nested = extract_dataframe_schema_spark_native_format(df)
    >>> nested["name"]
    'StringType'
    >>> nested["tags"].startswith("ArrayType(")
    True
    """
    # str(dataType) looks like "LongType()"; dropping the last two characters removes the "()".
    return {x.name: str(x.dataType)[:-2] for x in list(data.schema)}  # type: ignore

Extract Spark DataFrame schema using native Spark type names.

This function extracts schema information from a PySpark DataFrame using Spark's native schema representation. It accesses the schema attribute (a StructType object) and extracts the type name for each field, removing the trailing "()" suffix that Spark type objects include when converted to strings.

This function is specifically designed for PySpark DataFrames and produces type names in Spark's native format (e.g., "LongType", "StringType", "DoubleType") rather than the SQL format used by data.dtypes (e.g., "bigint", "string", "double").

Use this function when you need Spark-native type names for compatibility with Great Expectations Spark datasources, custom Spark validation logic, or when working with Spark's type system directly.

Parameters
  • data (Data): A PySpark DataFrame with a schema attribute. The schema should be a StructType object containing StructField objects with name and dataType attributes. Using this function with non-Spark data objects will raise an AttributeError.
Returns
  • dict of str to str: A dictionary mapping each column name to its Spark-native type name. Type names follow Spark's naming convention with the "Type" suffix:
    • "LongType" for 64-bit integers
    • "IntegerType" for 32-bit integers
    • "DoubleType" for double-precision floats
    • "StringType" for strings
    • "BooleanType" for booleans
    • And other Spark data types
Raises
  • AttributeError: If the data object does not have a schema attribute (i.e., it is not a PySpark DataFrame or compatible Spark data structure).
See Also

extract_dataframe_schema: Extract schema in a framework-agnostic format.
extract_dataframe_type: Determine the DataFrame framework type.

Notes

The function operates by:

  1. Accessing data.schema, which returns a StructType object for Spark DataFrames
  2. Iterating over the StructField objects in the schema
  3. For each field, extracting the name and converting dataType to string
  4. Removing the trailing "()" from the type string representation (e.g., "LongType()" becomes "LongType")
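
The four steps above can be traced without a Spark session. The sketch below uses hypothetical stand-in classes (`LongType`, `StringType`, `StructField`, `FakeDataFrame`) whose string form is assumed to mirror the "LongType()" shape described in step 4. They are not the PySpark classes, and `native_schema` is an illustrative re-statement of the steps rather than the module's code.

```python
from dataclasses import dataclass


class LongType:
    """Stand-in for a Spark type; its string form is assumed to be 'LongType()'."""

    def __repr__(self) -> str:
        return "LongType()"


class StringType:
    """Stand-in for a Spark type; its string form is assumed to be 'StringType()'."""

    def __repr__(self) -> str:
        return "StringType()"


@dataclass
class StructField:
    """Stand-in field carrying the two attributes the function reads."""

    name: str
    dataType: object


class FakeDataFrame:
    """Stand-in DataFrame exposing only an iterable `schema` attribute."""

    def __init__(self, fields):
        self.schema = fields


def native_schema(data) -> dict:
    result = {}
    for field in data.schema:                  # steps 1 and 2: read schema, iterate fields
        type_string = str(field.dataType)      # step 3: e.g. "LongType()"
        result[field.name] = type_string[:-2]  # step 4: drop the trailing "()"
    return result


fake_df = FakeDataFrame([StructField("id", LongType()), StructField("name", StringType())])
print(native_schema(fake_df))
```

Because step 4 is a fixed two-character slice, the result depends entirely on the string form ending in "()".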

Spark type names differ from SQL type names:

  • data.dtypes returns [("col", "bigint"), ...] (SQL format)
  • This function returns {"col": "LongType"} (native Spark format)
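
To relate the two formats, a small lookup table is often enough. The sketch below covers only the five types listed in this section. The names `SQL_TO_NATIVE` and `sql_schema_to_native` are hypothetical and not part of this module, and unknown SQL names map to `None` rather than being guessed.

```python
# Hypothetical lookup table: SQL-format name -> Spark-native class name.
SQL_TO_NATIVE = {
    "bigint": "LongType",
    "int": "IntegerType",
    "double": "DoubleType",
    "string": "StringType",
    "boolean": "BooleanType",
}


def sql_schema_to_native(sql_schema: dict) -> dict:
    """Translate a SQL-format schema dict; unmapped types become None."""
    return {col: SQL_TO_NATIVE.get(sql_type) for col, sql_type in sql_schema.items()}


sql_schema = {"id": "bigint", "value": "double", "name": "string"}
print(sql_schema_to_native(sql_schema))
```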

The SQL format is generally preferred for portability, but the native format is needed when:

  • Instantiating Spark DataType objects programmatically
  • Working with Great Expectations Spark expectations
  • Performing type matching with Spark's type system
  • Debugging Spark-specific type issues

The # type: ignore comment suppresses mypy warnings about the schema attribute, which is not part of the Data protocol but is present in PySpark DataFrames.

Examples

Extract schema from a PySpark DataFrame:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
>>> schema = extract_dataframe_schema_spark_native_format(df)
>>> schema
{'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

Compare with SQL-format schema:

>>> sql_schema = extract_dataframe_schema(df)
>>> sql_schema
{'id': 'bigint', 'value': 'double', 'name': 'string'}
>>> native_schema = extract_dataframe_schema_spark_native_format(df)
>>> native_schema
{'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

Use with Great Expectations Spark datasources:

>>> native_schema = extract_dataframe_schema_spark_native_format(df)
>>> # Configure GX expectations using Spark-native type names
>>> for col, spark_type in native_schema.items():
...     if spark_type == "LongType":
...         # Add expectations specific to long integer columns
...         pass

Error when used with pandas DataFrame:

>>> import pandas as pd
>>> pandas_df = pd.DataFrame({"a": [1, 2, 3]})
>>> extract_dataframe_schema_spark_native_format(pandas_df)
Traceback (most recent call last):
    ...
AttributeError: 'DataFrame' object has no attribute 'schema'
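
Code that may receive either framework can check for the `schema` attribute up front and fail with a more descriptive message. This is a minimal sketch: `require_spark_schema`, `PandasLike`, and `SparkLike` are hypothetical names used for illustration, and the stand-in classes only imitate the presence or absence of the attribute.

```python
def require_spark_schema(data):
    """Hypothetical guard: raise early if `data` has no `schema` attribute."""
    if not hasattr(data, "schema"):
        raise TypeError(
            f"Expected an object with a 'schema' attribute, got {type(data).__name__}"
        )
    return data


class PandasLike:
    """Stand-in for an object without a `schema` attribute."""


class SparkLike:
    """Stand-in for an object that exposes `schema`."""

    schema = []


require_spark_schema(SparkLike())  # passes through unchanged

try:
    require_spark_schema(PandasLike())
except TypeError as exc:
    print(exc)
```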

Parameterized types such as ArrayType keep their arguments in the string form. Only the last two characters are stripped, so the result for such a column is not a bare type name:

>>> from pyspark.sql.types import StructType, StructField, StringType, ArrayType
>>> schema = StructType([StructField("name", StringType()), StructField("tags", ArrayType(StringType()))])
>>> df = spark.createDataFrame([("Alice", ["tag1", "tag2"])], schema)
>>> nested = extract_dataframe_schema_spark_native_format(df)
>>> nested["name"]
'StringType'
>>> nested["tags"].startswith("ArrayType(")
True
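
When only the outer type matters, reading the class name avoids any dependence on the string form. The sketch below contrasts the two approaches using hypothetical stand-in classes. The `ArrayType` string form shown here is an assumption made for the example and may not match a given PySpark version, whereas `type(obj).__name__` is plain Python and works on any object.

```python
class StringType:
    """Stand-in type; string form assumed to be 'StringType()'."""

    def __repr__(self) -> str:
        return "StringType()"


class ArrayType:
    """Stand-in parameterized type; the repr format is an assumption."""

    def __init__(self, elementType, containsNull=True):
        self.elementType = elementType
        self.containsNull = containsNull

    def __repr__(self) -> str:
        return f"ArrayType({self.elementType!r}, {self.containsNull})"


def type_class_name(data_type) -> str:
    """Return the outer class name, ignoring any type arguments."""
    return type(data_type).__name__


tags_type = ArrayType(StringType())

sliced = str(tags_type)[:-2]          # fixed two-character slice of the string form
by_class = type_class_name(tags_type)  # outer class name only

print(sliced)
print(by_class)
```

The class-name approach discards the element type, so it suits checks such as "is this column an array" rather than full type comparison.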