adc_toolkit.processing.steps.pandas
Prebuilt transformation functions for pandas DataFrames.
This module provides a comprehensive library of data transformation functions specifically designed for pandas DataFrames. All functions follow the standard step contract: they accept a DataFrame as the first positional argument and return a transformed DataFrame.
These functions are designed to work with
`adc_toolkit.processing.ProcessingPipeline`, but can also be used
as standalone functions for ad-hoc data transformations.
Function Categories
Cleaning (from clean.py):
Functions for data quality and standardization.
- `remove_duplicates()`: Remove duplicate rows based on subset of columns
- `fill_missing_values()`: Fill NaN values using various strategies
(mean, median, mode, constant, interpolate)
- `make_columns_snake_case()`: Standardize column names to snake_case
Filtering (from filter.py):
Functions for row and column selection.
- `filter_rows()`: Filter rows using a callable condition
- `select_columns()`: Select specific columns by name
Transforming (from transform.py):
Functions for data transformation and feature engineering.
- `scale_data()`: Scale numerical columns (minmax or zscore)
- `encode_categorical()`: Encode categorical columns (onehot or label)
- `divide_one_column_by_another()`: Create ratio columns
Combining (from combine.py):
Functions for aggregation and grouping.
- `group_and_aggregate()`: Group by columns and apply aggregation functions
Validating (from validate.py):
Functions for data validation.
- `validate_is_dataframe()`: Assert input is a pandas DataFrame
Examples
Using with ProcessingPipeline:
>>> from adc_toolkit.processing import ProcessingPipeline
>>> from adc_toolkit.processing.steps.pandas import (
...     remove_duplicates,
...     fill_missing_values,
...     make_columns_snake_case,
...     scale_data,
... )
>>>
>>> pipeline = (
...     ProcessingPipeline()
...     .add(remove_duplicates, subset=["CustomerID"])
...     .add(fill_missing_values, method="mean", columns=["Revenue"])
...     .add(make_columns_snake_case)
...     .add(scale_data, columns=["revenue"], method="minmax")
... )
>>> clean_data = pipeline.run(raw_data)
Using functions standalone:
>>> import pandas as pd
>>> from adc_toolkit.processing.steps.pandas import fill_missing_values
>>>
>>> df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})
>>> filled = fill_missing_values(df, method="mean")
>>> filled
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  4.5
Filtering with a condition:
>>> from adc_toolkit.processing.steps.pandas import filter_rows
>>>
>>> df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>> adults = filter_rows(df, condition=lambda d: d["age"] >= 30)
>>> adults
      name  age
1      Bob   30
2  Charlie   35
Scaling numerical features:
>>> from adc_toolkit.processing.steps.pandas import scale_data
>>>
>>> df = pd.DataFrame({"price": [100, 200, 300]})
>>> scaled = scale_data(df, columns=["price"], method="minmax")
>>> scaled["price"].tolist()
[0.0, 0.5, 1.0]
See Also
adc_toolkit.processing.ProcessingPipeline: Pipeline for chaining transformations.
adc_toolkit.processing.steps: Parent module with convenience re-exports.
Notes
The Step Contract
All functions in this module follow this signature pattern::
    def step_function(data: pd.DataFrame, **kwargs) -> pd.DataFrame:
        # Transform data
        return transformed_data
This makes them compatible with `ProcessingPipeline.add()`.
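To make the contract concrete, here is a minimal sketch of how a pipeline could bind keyword arguments to steps and apply them in order. `MiniPipeline`, `drop_below`, and `double` are hypothetical names written for illustration; this is not the implementation of `ProcessingPipeline`. Only the `add(step, **kwargs)` and `run(data)` call shapes are taken from the examples on this page, and plain lists stand in for DataFrames so the sketch has no dependencies.

```python
from functools import partial
from typing import Any, Callable


class MiniPipeline:
    """Hypothetical stand-in illustrating the step contract (not ProcessingPipeline)."""

    def __init__(self) -> None:
        self._steps: list[Callable[[Any], Any]] = []

    def add(self, step: Callable[..., Any], **kwargs: Any) -> "MiniPipeline":
        # Bind keyword arguments now; the data argument is supplied at run time.
        self._steps.append(partial(step, **kwargs))
        return self  # returning self allows chained .add() calls

    def run(self, data: Any) -> Any:
        # Feed the output of each step into the next one, in insertion order.
        for step in self._steps:
            data = step(data)
        return data


def drop_below(data: list[int], threshold: int) -> list[int]:
    # A toy step: data first, options as keyword arguments, new object returned.
    return [x for x in data if x >= threshold]


def double(data: list[int]) -> list[int]:
    # A toy step without options.
    return [x * 2 for x in data]


result = MiniPipeline().add(drop_below, threshold=2).add(double).run([1, 2, 3])
print(result)  # [4, 6]
```

Because every step takes the data first and returns the transformed data, the pipeline needs no knowledge of what an individual step does.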
Immutability
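Several functions return a new DataFrame rather than modifying the input in place; `fill_missing_values`, for example, works on a copy. This keeps behavior predictable when steps are chained in a pipeline. Others assign into the DataFrame they receive and return that same object: in the source listings on this page, `divide_one_column_by_another`, `make_columns_snake_case`, `scale_data`, and `encode_categorical` with `method="label"` behave this way. Check individual function documentation for specific behavior.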
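Where an input must stay untouched, passing a copy is a simple safeguard. The following sketch uses a local function, `add_ratio`, that assigns into its input the way an in-place step would; the function and column names are illustrative only.

```python
import pandas as pd


def add_ratio(data: pd.DataFrame) -> pd.DataFrame:
    # Illustrative in-place step: assigns a new column on the object it receives.
    data["ratio"] = data["a"] / data["b"]
    return data


original = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 4.0]})

# Passing a copy keeps the caller's DataFrame unchanged.
result = add_ratio(original.copy())

print(list(original.columns))  # ['a', 'b']
print(list(result.columns))    # ['a', 'b', 'ratio']
```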
Optional Dependencies
Some functions depend on scikit-learn. In the source listings on this page,
encode_categorical checks for it only when method="label", while scale_data
checks for it on every call. Install the preprocessing extra
to enable these features::
uv sync --extra preprocessing
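The helper `_check_sklearn_available()` that the listings call is not shown on this page. A guard of that kind could look roughly like the sketch below; `is_installed` and `require` are hypothetical names, and the real helper may differ.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    # find_spec returns None when a top-level module cannot be located.
    return find_spec(module_name) is not None


def require(module_name: str, extra: str) -> None:
    # Raise a descriptive ImportError instead of failing deep inside a step.
    if not is_installed(module_name):
        raise ImportError(
            f"{module_name} is required for this step; install the '{extra}' extra."
        )


require("json", extra="preprocessing")  # stdlib module, so this passes silently
print(is_installed("surely_not_a_real_module_xyz"))  # False
```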
from .clean import fill_missing_values, make_columns_snake_case, remove_duplicates
from .combine import group_and_aggregate
from .filter import filter_rows, select_columns
from .transform import divide_one_column_by_another, encode_categorical, scale_data
from .validate import validate_is_dataframe


__all__ = [
    "divide_one_column_by_another",
    "encode_categorical",
    "fill_missing_values",
    "filter_rows",
    "group_and_aggregate",
    "make_columns_snake_case",
    "remove_duplicates",
    "scale_data",
    "select_columns",
    "validate_is_dataframe",
]
def divide_one_column_by_another(
    data: pd.DataFrame, numerator: str, denominator: str, new_column_name: str
) -> pd.DataFrame:
    """
    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame containing the data to be transformed.
    numerator : str
        The name of the column to be used as the numerator.
    denominator : str
        The name of the column to be used as the denominator.
    new_column_name : str
        The name of the new column to be created.

    Returns
    -------
    pd.DataFrame
        The transformed DataFrame with the new column added.
    """
    data[new_column_name] = data[numerator] / data[denominator]
    return data
Parameters
- data (pd.DataFrame): The input DataFrame containing the data to be transformed.
- numerator (str): The name of the column to be used as the numerator.
- denominator (str): The name of the column to be used as the denominator.
- new_column_name (str): The name of the new column to be created.
Returns
- pd.DataFrame: The transformed DataFrame with the new column added.
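Because the new column is a plain element-wise division, zero denominators follow pandas' floating-point rules rather than raising an error. A short sketch with made-up column names (`revenue`, `units`) shows the effect and one possible clean-up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [10.0, 20.0, 0.0], "units": [2.0, 0.0, 0.0]})

# Same expression as the listing: element-wise division of two columns.
ratio = df["revenue"] / df["units"]
print(ratio.tolist())  # [5.0, inf, nan]

# One possible clean-up: treat infinite ratios as missing.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
print(cleaned.isna().tolist())  # [False, True, True]
```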
def encode_categorical(data: pd.DataFrame, columns: list[str], method: str = "onehot") -> pd.DataFrame:
    """
    Encode categorical features using the specified encoding method.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    columns : List[str]
        Columns to encode.
    method : str
        Encoding method ('onehot' or 'label').

    Returns
    -------
    pd.DataFrame
        DataFrame with encoded columns.

    Raises
    ------
    ImportError
        If scikit-learn is not installed and method is 'label'.
    """
    if method == "onehot":
        return pd.get_dummies(data, columns=columns)
    elif method == "label":
        _check_sklearn_available()
        encoder = LabelEncoder()
        for col in columns:
            data[col] = encoder.fit_transform(data[col])
        return data
    else:
        raise ValueError(f"Invalid encoding method: {method}")
Encode categorical features using the specified encoding method.
Parameters
- data (pd.DataFrame): The input DataFrame.
- columns (List[str]): Columns to encode.
- method (str): Encoding method ('onehot' or 'label').
Returns
- pd.DataFrame: DataFrame with encoded columns.
Raises
- ImportError: If scikit-learn is not installed and method is 'label'.
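The `'onehot'` branch in the listing delegates to `pandas.get_dummies`, so its effect can be previewed with pandas alone. The sketch below uses made-up data; each encoded column is replaced by one indicator column per category, while untouched columns are kept.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "qty": [1, 2, 3]})

# Same call as the 'onehot' branch: one indicator column per category value.
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))  # ['qty', 'color_blue', 'color_red']

# The input DataFrame is left as it was; get_dummies returns a new object.
print(list(df.columns))  # ['color', 'qty']
```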
def fill_missing_values(
    data: pd.DataFrame,
    method: str = FillMethod.MEAN.value,
    value: Any = None,
    columns: list[str] | None = None,
) -> pd.DataFrame:
    """
    Fill or interpolate missing values in the DataFrame.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    method : str
        The method to fill missing values ("mean", "median", "mode", "constant" or "interpolate").
    value : Any
        Specific value to use for filling if `method="constant"`.
    columns : Optional[List[str]]
        List of columns to apply the filling method to. If None, applies to all columns.

    Returns
    -------
    pd.DataFrame
        DataFrame with missing values filled.
    """
    try:
        fill_method = FillMethod(method)
    except ValueError as e:
        raise ValueError(f"Invalid method: {method}") from e

    if columns is None:
        columns = list(data.columns)

    missing_columns = [col for col in columns if col not in data.columns]
    if missing_columns:
        raise ValueError(
            f"The following columns are not in the DataFrame: {missing_columns}. Available columns: {data.columns}"
        )

    data = data.copy()
    if fill_method == FillMethod.MEAN:
        data[columns] = data[columns].fillna(data[columns].mean())
    elif fill_method == FillMethod.MEDIAN:
        data[columns] = data[columns].fillna(data[columns].median())
    elif fill_method == FillMethod.MODE:
        data[columns] = data[columns].apply(lambda col: col.fillna(col.mode().iloc[0]))
    elif fill_method == FillMethod.CONSTANT and value is not None:
        data[columns] = data[columns].fillna(value)
    elif fill_method == FillMethod.INTERPOLATE:
        data[columns] = data[columns].interpolate()

    return data
Fill or interpolate missing values in the DataFrame.
Parameters
- data (pd.DataFrame): The input DataFrame.
- method (str): The method to fill missing values ("mean", "median", "mode", "constant" or "interpolate").
- value (Any): Specific value to use for filling if method="constant".
- columns (Optional[List[str]]): List of columns to apply the filling method to. If None, applies to all columns.
Returns
- pd.DataFrame: DataFrame with missing values filled.
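The `"constant"` method together with `columns` can be previewed with plain pandas. The sketch below mirrors the corresponding branch of the listing on made-up data; note that, going by the listing, `method="constant"` with `value=None` matches no branch and returns an unchanged copy.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None, 3.0], "B": ["x", None, "z"]})

# Equivalent of method="constant", value=0, columns=["A"] in the listing:
# work on a copy and fill only the selected columns.
filled = df.copy()
filled[["A"]] = filled[["A"]].fillna(0)
print(filled["A"].tolist())  # [1.0, 0.0, 3.0]

# Column "B" was not selected, so its missing value remains.
print(filled["B"].isna().tolist())  # [False, True, False]
```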
def filter_rows(data: pd.DataFrame, condition: Callable[[pd.DataFrame], pd.Series]) -> pd.DataFrame:
    """
    Filter rows based on a condition.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    condition : Callable[[pd.DataFrame], pd.Series]
        A function that returns a boolean Series indicating rows to keep.

    Returns
    -------
    pd.DataFrame
        Filtered DataFrame.

    Example
    --------
    >>> data = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "d"]})
    >>> condition = lambda df: df["A"] > 2
    >>> filter_rows(data, condition)
       A  B
    2  3  c
    3  4  d
    """
    return data[condition(data)]
Filter rows based on a condition.
Parameters
- data (pd.DataFrame): The input DataFrame.
- condition (Callable[[pd.DataFrame], pd.Series]): A function that returns a boolean Series indicating rows to keep.
Returns
- pd.DataFrame: Filtered DataFrame.
Example
>>> data = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "d"]})
>>> condition = lambda df: df["A"] > 2
>>> filter_rows(data, condition)
   A  B
2  3  c
3  4  d
def group_and_aggregate(
    data: pd.DataFrame, group_by_columns: list[str], agg_funcs: dict[str, Callable]
) -> pd.DataFrame:
    """
    Group data by specified columns and apply aggregation functions.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    group_by_columns : List[str]
        Columns to group by.
    agg_funcs : Dict[str, Callable]
        Dictionary mapping column names to aggregation functions.

    Returns
    -------
    pd.DataFrame
        Aggregated DataFrame.
    """
    return data.groupby(group_by_columns).agg(agg_funcs).reset_index()
Group data by specified columns and apply aggregation functions.
Parameters
- data (pd.DataFrame): The input DataFrame.
- group_by_columns (List[str]): Columns to group by.
- agg_funcs (Dict[str, Callable]): Dictionary mapping column names to aggregation functions.
Returns
- pd.DataFrame: Aggregated DataFrame.
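As an illustration of the `agg_funcs` mapping, the sketch below runs the same `groupby(...).agg(...).reset_index()` chain as the listing on made-up data. The column names and the choice of lambdas are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["north", "north", "south"],
        "sales": [10, 20, 5],
        "units": [1, 2, 3],
    }
)

# Each key names a column to aggregate; each value is the function applied per group.
agg_funcs = {"sales": lambda s: s.sum(), "units": lambda s: s.max()}

# Same chain as the listing; reset_index turns the group key back into a column.
summary = df.groupby(["region"]).agg(agg_funcs).reset_index()

print(list(summary.columns))      # ['region', 'sales', 'units']
print(summary["sales"].tolist())  # [30, 5]
print(summary["units"].tolist())  # [2, 3]
```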
def make_columns_snake_case(data: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize column names to snake case.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.

    Returns
    -------
    pd.DataFrame
        DataFrame with standardized column names.
    """
    data.columns = [_convert_camel_case_to_snake_case(col) for col in data.columns]
    return data
Standardize column names to snake case.
Parameters
- data (pd.DataFrame): The input DataFrame.
Returns
- pd.DataFrame: DataFrame with standardized column names.
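The private helper `_convert_camel_case_to_snake_case` is not shown on this page. One common way to perform such a conversion is a two-pass regular expression, sketched below under the hypothetical name `to_snake_case`; the toolkit's helper may treat edge cases (acronyms, spaces, digits) differently.

```python
import re


def to_snake_case(name: str) -> str:
    # Hypothetical converter; the toolkit's private helper may behave differently.
    # Pass 1: split before a capitalised word, e.g. "totalRevenue" -> "total_Revenue".
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Pass 2: split between a lowercase letter or digit and a capital, then lowercase.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


print(to_snake_case("CustomerID"))       # customer_id
print(to_snake_case("totalRevenueUSD"))  # total_revenue_usd
print(to_snake_case("Revenue"))          # revenue
```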
def remove_duplicates(
    data: pd.DataFrame,
    subset: list[str] | None = None,
    keep: Literal["first", "last", False] = "first",
) -> pd.DataFrame:
    """
    Remove duplicate rows from the DataFrame.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    subset : Optional[List[str]]
        Columns to consider for identifying duplicates. By default, considers all columns.
    keep : str
        Which duplicates to keep ('first', 'last', or False for dropping all).

    Returns
    -------
    pd.DataFrame
        DataFrame without duplicate rows.
    """
    return data.drop_duplicates(subset=subset, keep=keep)
Remove duplicate rows from the DataFrame.
Parameters
- data (pd.DataFrame): The input DataFrame.
- subset (Optional[List[str]]): Columns to consider for identifying duplicates. By default, considers all columns.
- keep (str): Which duplicates to keep ('first', 'last', or False for dropping all).
Returns
- pd.DataFrame: DataFrame without duplicate rows.
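Since the function forwards `subset` and `keep` to `DataFrame.drop_duplicates`, the three `keep` options can be compared directly in pandas. The data below is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "b", "c"]})

# keep="first": retain the first row of each duplicate group.
first = df.drop_duplicates(subset=["id"], keep="first")
# keep="last": retain the last row of each duplicate group.
last = df.drop_duplicates(subset=["id"], keep="last")
# keep=False: drop every row that has a duplicate.
unique_only = df.drop_duplicates(subset=["id"], keep=False)

print(first["value"].tolist())        # ['a', 'c']
print(last["value"].tolist())         # ['b', 'c']
print(unique_only["value"].tolist())  # ['c']
```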
def scale_data(
    data: pd.DataFrame,
    columns: list[str],
    method: str | Callable[[pd.DataFrame, list[str]], pd.DataFrame] = "minmax",
) -> pd.DataFrame:
    """
    Scale numerical features using specified scaling method.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    columns : List[str]
        Columns to scale.
    method : Union[str, Callable[[pd.DataFrame, List[str]], pd.DataFrame]]
        Scaling method ('minmax', 'zscore', or custom scaler function).

    Returns
    -------
    pd.DataFrame
        DataFrame with scaled columns.

    Raises
    ------
    ImportError
        If scikit-learn is not installed (required for built-in scalers).
    """
    _check_sklearn_available()
    if isinstance(method, str):
        if method == "minmax":
            scaler = MinMaxScaler()
        elif method == "zscore":
            scaler = StandardScaler()
        else:
            raise ValueError(f"Invalid scaling method: {method}")
    elif callable(method):
        scaler = method
    else:
        raise TypeError("Invalid method type. Method must be a string or a callable function.")

    data[columns] = scaler.fit_transform(data[columns])
    return data
Scale numerical features using specified scaling method.
Parameters
- data (pd.DataFrame): The input DataFrame.
- columns (List[str]): Columns to scale.
- method (Union[str, Callable[[pd.DataFrame, List[str]], pd.DataFrame]]): Scaling method ('minmax', 'zscore', or custom scaler function).
Returns
- pd.DataFrame: DataFrame with scaled columns.
Raises
- ImportError: If scikit-learn is not installed (required for built-in scalers).
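For reference, the two built-in methods correspond to standard formulas that can be written in pandas alone. The sketch below is not the toolkit's code path (the listing uses scikit-learn's `MinMaxScaler` and `StandardScaler`); it only shows the arithmetic those scalers perform with their default settings.

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 200, 300]})
col = df["price"]

# Min-max scaling: (x - min) / (max - min), mapping the column onto [0, 1].
minmax = (col - col.min()) / (col.max() - col.min())
print(minmax.tolist())  # [0.0, 0.5, 1.0]

# Z-score scaling: (x - mean) / std, using the population standard deviation (ddof=0).
zscore = (col - col.mean()) / col.std(ddof=0)
```

After z-score scaling the column has mean 0 and population standard deviation 1.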
def select_columns(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """
    Retain only specified columns in the DataFrame.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    columns : List[str]
        Columns to retain.

    Returns
    -------
    pd.DataFrame
        DataFrame with only the selected columns.
    """
    return data[columns]
Retain only specified columns in the DataFrame.
Parameters
- data (pd.DataFrame): The input DataFrame.
- columns (List[str]): Columns to retain.
Returns
- pd.DataFrame: DataFrame with only the selected columns.
def validate_is_dataframe(
    data: pd.DataFrame,
) -> pd.DataFrame:
    """
    Validates if the input is a pandas DataFrame.

    Args:
        data (pd.DataFrame): The input data to validate.

    Returns:
        pd.DataFrame: The validated DataFrame if the input is valid.

    Raises:
        ValueError: If the input is not a pandas DataFrame.
    """
    if not isinstance(data, pd.DataFrame):
        raise ValueError("Input data is not a pandas DataFrame.")

    return data
Validates if the input is a pandas DataFrame.
Args: data (pd.DataFrame): The input data to validate.
Returns: pd.DataFrame: The validated DataFrame if the input is valid.
Raises: ValueError: If the input is not a pandas DataFrame.