adc_toolkit.data.validators.table_utils.table_properties
Extract DataFrame properties for validation and catalog operations.
This module provides utility functions for extracting metadata and type information from data objects conforming to the Data protocol. These utilities support both pandas and PySpark DataFrames, enabling framework-agnostic data handling.
The functions in this module are primarily used by validators and data catalogs to inspect DataFrame structure, determine appropriate processing strategies, and ensure compatibility with validation schemas.
Functions
extract_dataframe_type(data): Determine the DataFrame framework type (pandas, pyspark, etc.).
extract_dataframe_schema(data): Extract column names and data types as a dictionary.
extract_dataframe_schema_spark_native_format(data): Extract Spark DataFrame schema in native Spark format.
Examples
Determine DataFrame type:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> extract_dataframe_type(df)
'pandas'
Extract schema from a pandas DataFrame:
>>> df = pd.DataFrame({"col1": [1, 2], "col2": [3.0, 4.0]})
>>> extract_dataframe_schema(df)
{'col1': 'int64', 'col2': 'float64'}
See Also
adc_toolkit.data.abs.Data: Protocol defining the Data interface.
adc_toolkit.data.validators.gx.batch_managers: Uses these utilities for datasource detection.
adc_toolkit.data.validators.pandera: Uses these utilities for schema compilation.
def extract_dataframe_type(data: Data) -> str:
    """Determine the DataFrame framework type from its module name."""
    return type(data).__module__.split(".")[0]
Determine the DataFrame framework type from its module name.
This function identifies the data processing framework (pandas, PySpark, etc.) by inspecting the module path of the data object's type. It extracts the top-level module name, which typically indicates the framework being used.
The function is framework-agnostic and works with any data object conforming to the Data protocol. It is commonly used to determine which processing strategy or validator to apply to a dataset.
Parameters
- data (Data):
A data object conforming to the Data protocol. Typically a pandas DataFrame, Spark DataFrame, or other compatible data structure with columns and dtypes properties.
Returns
- str: The top-level module name identifying the DataFrame framework.
Common return values include:
- "pandas" for pandas DataFrames
- "pyspark" for PySpark DataFrames
- Other framework names for alternative implementations
See Also
extract_dataframe_schema: Extract column names and types from a DataFrame.
extract_dataframe_schema_spark_native_format: Extract Spark schema in native format.
Notes
The function operates by:
- Getting the type of the data object using type(data)
- Accessing the __module__ attribute, which contains the full module path (e.g., "pandas.core.frame", "pyspark.sql.dataframe")
- Splitting on "." and returning the first component (the framework name)
This approach is robust across different versions of the same framework, as the top-level module name typically remains stable even when internal module structure changes.
The function does not validate that the data object actually conforms to the Data protocol. It simply extracts the module name from whatever object is passed.
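The steps in the notes above can be checked without pandas or PySpark installed. The following is a minimal sketch: `top_level_module` reproduces the one-line implementation under a different name, and `FakeFrame` is a hypothetical stand-in whose module path is set by hand to imitate a pandas DataFrame.

```python
from collections import OrderedDict
from fractions import Fraction


def top_level_module(obj: object) -> str:
    """Same one-line logic as extract_dataframe_type, reproduced for illustration."""
    return type(obj).__module__.split(".")[0]


class FakeFrame:
    """Hypothetical stand-in that pretends to live in pandas' internals."""

    __module__ = "pandas.core.frame"


# The nested module path collapses to its first component.
assert top_level_module(FakeFrame()) == "pandas"

# No protocol check happens: objects that are not DataFrames are accepted too.
assert top_level_module(OrderedDict()) == "collections"
assert top_level_module(Fraction(1, 2)) == "fractions"
assert top_level_module(42) == "builtins"
```

Because any object yields some module name, callers that need a real DataFrame should compare the result against the framework names they support rather than treat a returned string as proof of compatibility.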
Examples
Identify a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> extract_dataframe_type(df)
'pandas'
Identify a PySpark DataFrame:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> extract_dataframe_type(spark_df)
'pyspark'
Use in conditional logic to apply framework-specific processing:
>>> framework = extract_dataframe_type(df)
>>> if framework == "pandas":
...     # Apply pandas-specific operations
...     result = df.groupby("x").sum()
... elif framework == "pyspark":
...     # Apply Spark-specific operations
...     result = df.groupBy("x").sum()
Determine appropriate validator based on DataFrame type:
>>> framework = extract_dataframe_type(data)
>>> if framework == "pandas":
...     datasource = PandasDatasource(data_context)
... elif framework == "pyspark":
...     datasource = SparkDFDatasource(data_context)
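The if/elif dispatch in the examples above can also be written as a lookup table, which makes unsupported frameworks fail explicitly. This is only a sketch: `HANDLERS`, `handle_pandas`, `handle_pyspark`, `dispatch`, and `FakeSparkFrame` are hypothetical names, and `top_level_module` reproduces the module's one-line logic so the snippet runs on its own.

```python
from typing import Callable


def top_level_module(obj: object) -> str:
    """Same one-line logic as extract_dataframe_type, reproduced for illustration."""
    return type(obj).__module__.split(".")[0]


# Hypothetical handlers; real code would build a datasource or validator here.
def handle_pandas(data: object) -> str:
    return "pandas handler"


def handle_pyspark(data: object) -> str:
    return "pyspark handler"


HANDLERS: dict[str, Callable[[object], str]] = {
    "pandas": handle_pandas,
    "pyspark": handle_pyspark,
}


def dispatch(data: object) -> str:
    framework = top_level_module(data)
    try:
        handler = HANDLERS[framework]
    except KeyError:
        # Fail loudly instead of silently skipping unknown frameworks.
        raise ValueError(f"Unsupported DataFrame framework: {framework!r}") from None
    return handler(data)


class FakeSparkFrame:
    """Hypothetical stand-in whose module path mimics a PySpark DataFrame."""

    __module__ = "pyspark.sql.dataframe"


print(dispatch(FakeSparkFrame()))  # pyspark handler
```

Compared with the if/elif form, an unknown framework raises a ValueError here instead of leaving the result variable unassigned.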
def extract_dataframe_schema(data: Data) -> dict[str, str]:
    """Extract DataFrame schema as a dictionary mapping column names to type strings."""
    return {col_name: str(col_type) for col_name, col_type in dict(data.dtypes).items()}
Extract DataFrame schema as a dictionary mapping column names to type strings.
This function extracts the complete schema information from a DataFrame by
converting the dtypes attribute into a dictionary with column names as keys
and string representations of data types as values. This format is
framework-agnostic and works with both pandas and PySpark DataFrames.
The resulting dictionary is useful for schema comparison, validation setup, logging, and generating schema documentation. It provides a simple, serializable representation of the DataFrame's structure.
Parameters
- data (Data):
A data object conforming to the Data protocol with a dtypes attribute. Typically a pandas DataFrame (with dtypes as a pandas.Series) or a PySpark DataFrame (with dtypes as a list of tuples).
Returns
- dict of str to str: A dictionary mapping each column name to its data type as a string. For pandas DataFrames, types include "int64", "float64", "object", etc. For PySpark DataFrames, types include "bigint", "double", "string", etc. The specific type strings depend on the DataFrame framework being used.
See Also
extract_dataframe_type: Determine the DataFrame framework type.
extract_dataframe_schema_spark_native_format: Extract Spark schema with native type names.
Notes
The function operates differently depending on the DataFrame type:
For pandas DataFrames:
- data.dtypes returns a pandas.Series with column names as index and numpy/pandas dtype objects as values
- Converting to dict gives {column: dtype_object}
- String conversion yields familiar type names like "int64", "float64"
For PySpark DataFrames:
- data.dtypes returns a list of tuples: [(column_name, type_string), ...]
- Converting to dict gives {column: type_string}
- Type strings are in Spark SQL format like "bigint", "double", "string"
The function uses dict(data.dtypes) which works for both pandas (Series)
and PySpark (list of tuples), making it framework-agnostic. The additional
str() conversion ensures type objects are converted to readable strings.
This function does not validate the data or check for schema consistency. It simply extracts and formats whatever schema information is present in the data object.
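The notes above rest on `dict()` accepting both a mapping-like object and a list of pairs. The sketch below illustrates that with stand-ins only: `FakeDtype`, `PandasLike`, and `SparkLike` are hypothetical, a plain dict takes the place of the pandas Series, and `schema_of` repeats the expression from the source under a different name.

```python
class FakeDtype:
    """Hypothetical stand-in for a dtype object; only str() matters here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __str__(self) -> str:
        return self.name


class PandasLike:
    # A plain dict stands in for the pandas Series of dtype objects.
    dtypes = {"id": FakeDtype("int64"), "value": FakeDtype("float64")}


class SparkLike:
    # PySpark exposes dtypes as a list of (column, type_string) tuples.
    dtypes = [("id", "bigint"), ("value", "double")]


def schema_of(data) -> dict[str, str]:
    """Same expression as extract_dataframe_schema, reproduced for illustration."""
    return {name: str(dtype) for name, dtype in dict(data.dtypes).items()}


print(schema_of(PandasLike()))  # {'id': 'int64', 'value': 'float64'}
print(schema_of(SparkLike()))   # {'id': 'bigint', 'value': 'double'}
```

In the Spark-like case the values are already strings, so the `str()` call is a no-op there and only does real work on dtype objects.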
Examples
Extract schema from a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.3, 30.7], "name": ["Alice", "Bob", "Charlie"]})
>>> schema = extract_dataframe_schema(df)
>>> schema
{'id': 'int64', 'value': 'float64', 'name': 'object'}
Extract schema from a PySpark DataFrame:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
>>> schema = extract_dataframe_schema(spark_df)
>>> schema
{'id': 'bigint', 'value': 'double', 'name': 'string'}
Compare schemas between DataFrames:
>>> df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
>>> df2 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
>>> schema1 = extract_dataframe_schema(df1)
>>> schema2 = extract_dataframe_schema(df2)
>>> schema1 == schema2
False
>>> schema1["b"], schema2["b"]
('float64', 'object')
Use for validation schema generation:
>>> current_schema = extract_dataframe_schema(df)
>>> expected_schema = {"id": "int64", "value": "float64", "name": "object"}
>>> if current_schema != expected_schema:
...     raise ValueError(f"Schema mismatch: {current_schema} != {expected_schema}")
Log schema information:
>>> import logging
>>> logging.basicConfig(level=logging.INFO)
>>> schema = extract_dataframe_schema(df)
>>> logging.info(f"Processing DataFrame with schema: {schema}")
INFO:root:Processing DataFrame with schema: {'id': 'int64', 'value': 'float64', 'name': 'object'}
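Since the returned schema is a plain dictionary, a mismatch can be reported column by column instead of printing both dictionaries in full. The helper below is a hypothetical sketch, not part of this module; `diff_schemas` and its result keys are names chosen for illustration.

```python
def diff_schemas(actual: dict[str, str], expected: dict[str, str]) -> dict[str, object]:
    """Hypothetical helper: describe how two schema dictionaries differ."""
    shared = actual.keys() & expected.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "type_mismatch": {
            col: (actual[col], expected[col])
            for col in sorted(shared)
            if actual[col] != expected[col]
        },
    }


actual = {"id": "int64", "value": "object", "extra": "float64"}
expected = {"id": "int64", "value": "float64", "name": "object"}
report = diff_schemas(actual, expected)
print(report)
# {'missing': ['name'], 'unexpected': ['extra'], 'type_mismatch': {'value': ('object', 'float64')}}
```

Each mismatch entry is an (actual, expected) pair, so the report can be logged or embedded in an error message as is.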
def extract_dataframe_schema_spark_native_format(data: Data) -> dict[str, str]:
    """
    Extract Spark DataFrame schema using native Spark type names.

    This function extracts schema information from a PySpark DataFrame using
    Spark's native schema representation. It accesses the `schema` attribute
    (a StructType object) and extracts the type name for each field, removing
    the trailing "()" suffix that Spark type objects include when converted
    to strings.

    This function is specifically designed for PySpark DataFrames and produces
    type names in Spark's native format (e.g., "LongType", "StringType",
    "DoubleType") rather than the SQL format used by `data.dtypes`
    (e.g., "bigint", "string", "double").

    Use this function when you need Spark-native type names for compatibility
    with Great Expectations Spark datasources, custom Spark validation logic,
    or when working with Spark's type system directly.

    Parameters
    ----------
    data : Data
        A PySpark DataFrame with a `schema` attribute. The schema should be
        a StructType object containing StructField objects with name and
        dataType attributes. Using this function with non-Spark data objects
        will raise an AttributeError.

    Returns
    -------
    dict of str to str
        A dictionary mapping each column name to its Spark-native type name.
        Type names follow Spark's naming convention with the "Type" suffix:
        - "LongType" for 64-bit integers
        - "IntegerType" for 32-bit integers
        - "DoubleType" for double-precision floats
        - "StringType" for strings
        - "BooleanType" for booleans
        - And other Spark data types

    Raises
    ------
    AttributeError
        If the data object does not have a `schema` attribute (i.e., it is
        not a PySpark DataFrame or compatible Spark data structure).

    See Also
    --------
    extract_dataframe_schema : Extract schema in a framework-agnostic format.
    extract_dataframe_type : Determine the DataFrame framework type.

    Notes
    -----
    The function operates by:
    1. Accessing `data.schema`, which returns a StructType object for Spark DataFrames
    2. Iterating over the StructField objects in the schema
    3. For each field, extracting the name and converting dataType to string
    4. Removing the last two characters of the type string, which are the
       trailing "()" for parameterless types (e.g., "LongType()" becomes "LongType")

    Spark type names differ from SQL type names:
    - `data.dtypes` returns [("col", "bigint"), ...] (SQL format)
    - This function returns {"col": "LongType"} (native Spark format)

    The SQL format is generally preferred for portability, but the native
    format is needed when:
    - Instantiating Spark DataType objects programmatically
    - Working with Great Expectations Spark expectations
    - Performing type matching with Spark's type system
    - Debugging Spark-specific type issues

    The `# type: ignore` comment suppresses mypy warnings about the `schema`
    attribute, which is not part of the Data protocol but is present in
    PySpark DataFrames.

    Examples
    --------
    Extract schema from a PySpark DataFrame:

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
    >>> schema = extract_dataframe_schema_spark_native_format(df)
    >>> schema
    {'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

    Compare with SQL-format schema:

    >>> sql_schema = extract_dataframe_schema(df)
    >>> sql_schema
    {'id': 'bigint', 'value': 'double', 'name': 'string'}
    >>> native_schema = extract_dataframe_schema_spark_native_format(df)
    >>> native_schema
    {'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}

    Use with Great Expectations Spark datasources:

    >>> native_schema = extract_dataframe_schema_spark_native_format(spark_df)
    >>> # Configure GX expectations using Spark-native type names
    >>> for col, spark_type in native_schema.items():
    ...     if spark_type == "LongType":
    ...         # Add expectations specific to long integer columns
    ...         pass

    Error when used with pandas DataFrame:

    >>> import pandas as pd
    >>> pandas_df = pd.DataFrame({"a": [1, 2, 3]})
    >>> extract_dataframe_schema_spark_native_format(pandas_df)
    Traceback (most recent call last):
    ...
    AttributeError: 'DataFrame' object has no attribute 'schema'

    Complex nested types are not stripped cleanly, because the two-character
    slice assumes a type string that ends in "()". For parameterized types such
    as ArrayType the value comes back truncated, and its exact form depends on
    how the installed PySpark version renders the type, so no output is shown:

    >>> from pyspark.sql.types import StructType, StructField, StringType, ArrayType
    >>> schema = StructType([StructField("name", StringType()), StructField("tags", ArrayType(StringType()))])
    >>> df = spark.createDataFrame([("Alice", ["tag1", "tag2"])], schema)
    >>> extract_dataframe_schema_spark_native_format(df)  # doctest: +SKIP
    """
    return {x.name: str(x.dataType)[:-2] for x in list(data.schema)}  # type: ignore
Extract Spark DataFrame schema using native Spark type names.
This function extracts schema information from a PySpark DataFrame using
Spark's native schema representation. It accesses the schema attribute
(a StructType object) and extracts the type name for each field, removing
the trailing "()" suffix that Spark type objects include when converted
to strings.
This function is specifically designed for PySpark DataFrames and produces
type names in Spark's native format (e.g., "LongType", "StringType",
"DoubleType") rather than the SQL format used by data.dtypes
(e.g., "bigint", "string", "double").
Use this function when you need Spark-native type names for compatibility with Great Expectations Spark datasources, custom Spark validation logic, or when working with Spark's type system directly.
Parameters
- data (Data): A PySpark DataFrame with a schema attribute. The schema should be a StructType object containing StructField objects with name and dataType attributes. Using this function with non-Spark data objects will raise an AttributeError.
Returns
- dict of str to str: A dictionary mapping each column name to its Spark-native type name.
Type names follow Spark's naming convention with the "Type" suffix:
- "LongType" for 64-bit integers
- "IntegerType" for 32-bit integers
- "DoubleType" for double-precision floats
- "StringType" for strings
- "BooleanType" for booleans
- And other Spark data types
Raises
- AttributeError: If the data object does not have a schema attribute (i.e., it is not a PySpark DataFrame or compatible Spark data structure).
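Callers that may receive either pandas or PySpark objects can check for the schema attribute up front instead of catching the exception. The following is a minimal sketch only: native_schema_or_none and the stand-in classes are hypothetical and not part of adc_toolkit, and the stand-in type assumes a string form of "LongType()".

```python
from dataclasses import dataclass


class FakeLongType:
    """Stand-in for a Spark type whose string form is 'LongType()' (assumed)."""

    def __repr__(self) -> str:
        return "LongType()"


@dataclass
class FakeField:
    """Stand-in for a StructField with name and dataType attributes."""

    name: str
    dataType: object


@dataclass
class FakeSparkFrame:
    """Stand-in for a DataFrame that exposes a schema attribute."""

    schema: list


class FakePandasFrame:
    """Stand-in for an object without a schema attribute."""


def native_schema_or_none(data):
    """Return the native-format schema, or None when data has no schema."""
    schema = getattr(data, "schema", None)
    if schema is None:
        return None
    # Same slicing rule as described in this section: drop the trailing "()".
    return {f.name: str(f.dataType)[:-2] for f in schema}


spark_like = FakeSparkFrame(schema=[FakeField("id", FakeLongType())])
print(native_schema_or_none(spark_like))
print(native_schema_or_none(FakePandasFrame()))
```

Returning None keeps the decision about unsupported inputs with the caller; raising a more descriptive error would be an equally reasonable choice.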
See Also
extract_dataframe_schema: Extract schema in a framework-agnostic format.
extract_dataframe_type: Determine the DataFrame framework type.
Notes
The function operates by:
- Accessing data.schema, which returns a StructType object for Spark DataFrames
- Iterating over the StructField objects in the schema
- For each field, extracting the name and converting dataType to string
- Removing the trailing "()" from the type string representation (e.g., "LongType()" becomes "LongType")
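The four steps can be traced without a Spark session by using small stand-in classes. This is a sketch under stated assumptions: FakeType, FakeField, FakeFrame, and native_schema_stepwise are hypothetical names, and the stand-in types assume that converting a type to a string yields "ClassName()".

```python
class FakeType:
    """Hypothetical stand-in for a Spark DataType; string form is 'ClassName()'."""

    def __repr__(self) -> str:
        return f"{type(self).__name__}()"


class LongType(FakeType):
    pass


class StringType(FakeType):
    pass


class FakeField:
    """Stand-in for a StructField."""

    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType


class FakeFrame:
    """Stand-in for a DataFrame; schema is a plain list of fields here."""

    def __init__(self, fields):
        self.schema = fields


def native_schema_stepwise(data):
    result = {}
    for field in data.schema:                # steps 1 and 2: read schema, iterate fields
        type_text = str(field.dataType)      # step 3: e.g. "LongType()"
        result[field.name] = type_text[:-2]  # step 4: drop the last two characters
    return result


frame = FakeFrame([FakeField("id", LongType()), FakeField("name", StringType())])
steps_result = native_schema_stepwise(frame)
print(steps_result)
```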
Spark type names differ from SQL type names:
- data.dtypes returns [("col", "bigint"), ...] (SQL format)
- This function returns {"col": "LongType"} (native Spark format)
The SQL format is generally preferred for portability, but the native format is needed when:
- Instantiating Spark DataType objects programmatically
- Working with Great Expectations Spark expectations
- Performing type matching with Spark's type system
- Debugging Spark-specific type issues
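When both formats are in play, a small lookup table can translate SQL names into native names. The sketch below covers only the five types listed under Returns; SQL_TO_NATIVE and dtypes_to_native are hypothetical names, and SQL names that are not in the table are passed through unchanged.

```python
# SQL-format name -> Spark-native class name, for a handful of common types.
SQL_TO_NATIVE = {
    "bigint": "LongType",
    "int": "IntegerType",
    "double": "DoubleType",
    "string": "StringType",
    "boolean": "BooleanType",
}


def dtypes_to_native(dtypes):
    """Convert [(column, sql_type), ...] pairs to {column: native_name}."""
    return {col: SQL_TO_NATIVE.get(sql_type, sql_type) for col, sql_type in dtypes}


sql_pairs = [("id", "bigint"), ("value", "double"), ("name", "string")]
converted = dtypes_to_native(sql_pairs)
print(converted)
```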
The # type: ignore comment suppresses mypy warnings about the schema
attribute, which is not part of the Data protocol but is present in
PySpark DataFrames.
Examples
Extract schema from a PySpark DataFrame:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([(1, 10.5, "Alice"), (2, 20.3, "Bob")], ["id", "value", "name"])
>>> schema = extract_dataframe_schema_spark_native_format(df)
>>> schema
{'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}
Compare with SQL-format schema:
>>> sql_schema = extract_dataframe_schema(df)
>>> sql_schema
{'id': 'bigint', 'value': 'double', 'name': 'string'}
>>> native_schema = extract_dataframe_schema_spark_native_format(df)
>>> native_schema
{'id': 'LongType', 'value': 'DoubleType', 'name': 'StringType'}
Use with Great Expectations Spark datasources:
>>> native_schema = extract_dataframe_schema_spark_native_format(df)
>>> # Configure GX expectations using Spark-native type names
>>> for col, spark_type in native_schema.items():
...     if spark_type == "LongType":
...         # Add expectations specific to long integer columns
...         pass
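The loop above branches on one type at a time. When several types need different handling, grouping column names by type first can be tidier. A sketch on a literal dictionary shaped like this function's return value; columns_by_type is a hypothetical helper, not part of adc_toolkit:

```python
from collections import defaultdict


def columns_by_type(native_schema):
    """Group column names by their Spark-native type name."""
    groups = defaultdict(list)
    for column, spark_type in native_schema.items():
        groups[spark_type].append(column)
    return dict(groups)


example_schema = {"id": "LongType", "count": "LongType", "name": "StringType"}
grouped = columns_by_type(example_schema)
print(grouped)
```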
Error when used with pandas DataFrame:
>>> import pandas as pd
>>> pandas_df = pd.DataFrame({"a": [1, 2, 3]})
>>> extract_dataframe_schema_spark_native_format(pandas_df)
Traceback (most recent call last):
...
AttributeError: 'DataFrame' object has no attribute 'schema'
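Complex nested types are not reduced to a bare type name. Because the function drops exactly two characters from str(dataType), the entry for a parameterized type such as ArrayType is a truncated form of its string representation, and the exact text depends on the PySpark version. Simple types in the same schema are unaffected:
>>> from pyspark.sql.types import StructType, StructField, StringType, ArrayType
>>> schema = StructType([StructField("name", StringType()), StructField("tags", ArrayType(StringType()))])
>>> df = spark.createDataFrame([("Alice", ["tag1", "tag2"])], schema)
>>> nested = extract_dataframe_schema_spark_native_format(df)
>>> nested["name"]
'StringType'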
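Parameterized types deserve a check, because the function always removes exactly two characters. The sketch below uses stand-in classes whose string forms are assumptions modeled on recent PySpark releases (for example "ArrayType(StringType(), True)"). It compares the two-character slice with taking the class name via type(...).__name__, which is an alternative shown for illustration and not what this module does.

```python
class StringType:
    """Stand-in for a simple Spark type (assumed string form: 'StringType()')."""

    def __repr__(self) -> str:
        return "StringType()"


class ArrayType:
    """Stand-in for a parameterized Spark type (assumed string form below)."""

    def __init__(self, elementType, containsNull=True):
        self.elementType = elementType
        self.containsNull = containsNull

    def __repr__(self) -> str:
        return f"ArrayType({self.elementType!r}, {self.containsNull})"


array_type = ArrayType(StringType())

# The slicing rule from this section: drop the last two characters.
sliced = str(array_type)[:-2]

# Alternative: keep only the class name and discard the parameters.
class_name = type(array_type).__name__

print(sliced)
print(class_name)
```

With these stand-ins the slice yields a truncated string for the array column, while the class name drops the element type entirely. Which result is preferable depends on whether the consumer needs the type parameters.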