adc_toolkit.processing.steps.pandas

Prebuilt transformation functions for pandas DataFrames.

This module provides data transformation functions for pandas DataFrames. All functions follow the standard step contract: they accept a DataFrame as the first positional argument and return a transformed DataFrame.

These functions are designed to work with adc_toolkit.processing.ProcessingPipeline, but can also be called on their own for ad-hoc data transformations.

Function Categories

Cleaning (from clean.py): Functions for data quality and standardization.

- `remove_duplicates()`: Remove duplicate rows based on subset of columns
- `fill_missing_values()`: Fill NaN values using various strategies
  (mean, median, mode, constant, interpolate)
- `make_columns_snake_case()`: Standardize column names to snake_case

Filtering (from filter.py): Functions for row and column selection.

- `filter_rows()`: Filter rows using a callable condition
- `select_columns()`: Select specific columns by name

Transforming (from transform.py): Functions for data transformation and feature engineering.

- `scale_data()`: Scale numerical columns (minmax or zscore)
- `encode_categorical()`: Encode categorical columns (onehot or label)
- `divide_one_column_by_another()`: Create ratio columns

Combining (from combine.py): Functions for aggregation and grouping.

- `group_and_aggregate()`: Group by columns and apply aggregation functions

Validating (from validate.py): Functions for data validation.

- `validate_is_dataframe()`: Assert input is a pandas DataFrame

Examples

Using with ProcessingPipeline:

>>> from adc_toolkit.processing import ProcessingPipeline
>>> from adc_toolkit.processing.steps.pandas import (
...     remove_duplicates,
...     fill_missing_values,
...     make_columns_snake_case,
...     scale_data,
... )
>>>
>>> pipeline = (
...     ProcessingPipeline()
...     .add(remove_duplicates, subset=["CustomerID"])
...     .add(fill_missing_values, method="mean", columns=["Revenue"])
...     .add(make_columns_snake_case)
...     .add(scale_data, columns=["revenue"], method="minmax")
... )
>>> clean_data = pipeline.run(raw_data)

Using functions standalone:

>>> import pandas as pd
>>> from adc_toolkit.processing.steps.pandas import fill_missing_values
>>>
>>> df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})
>>> filled = fill_missing_values(df, method="mean")
>>> filled
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  4.5

Filtering with a condition:

>>> from adc_toolkit.processing.steps.pandas import filter_rows
>>>
>>> df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
>>> adults = filter_rows(df, condition=lambda d: d["age"] >= 30)
>>> adults
      name  age
1      Bob   30
2  Charlie   35

Scaling numerical features:

>>> from adc_toolkit.processing.steps.pandas import scale_data
>>>
>>> df = pd.DataFrame({"price": [100, 200, 300]})
>>> scaled = scale_data(df, columns=["price"], method="minmax")
>>> scaled["price"].tolist()
[0.0, 0.5, 1.0]

See Also

adc_toolkit.processing.ProcessingPipeline: Pipeline for chaining transformations.
adc_toolkit.processing.steps: Parent module with convenience re-exports.

Notes

The Step Contract

All functions in this module follow this signature pattern:

    def step_function(data: pd.DataFrame, **kwargs) -> pd.DataFrame:
        # Transform data
        return transformed_data

This makes them compatible with ProcessingPipeline.add() (adc_toolkit.processing.ProcessingPipeline.add).
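
As a rough illustration of the contract, the sketch below defines a custom step and chains steps by feeding each result into the next. The step `add_total` and the helper `run_steps` are hypothetical names made up for this example; they are not part of adc_toolkit and do not reproduce how ProcessingPipeline is implemented.

```python
from functools import partial, reduce

import pandas as pd


def add_total(data: pd.DataFrame, columns: list[str], new_column: str = "total") -> pd.DataFrame:
    """Hypothetical custom step: sum `columns` row-wise into `new_column`."""
    result = data.copy()  # leave the caller's DataFrame untouched
    result[new_column] = result[columns].sum(axis=1)
    return result


def run_steps(data: pd.DataFrame, steps) -> pd.DataFrame:
    """Hypothetical stand-in for a pipeline: apply each step to the previous result."""
    return reduce(lambda frame, step: step(frame), steps, data)


raw = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
steps = [
    # Keyword arguments are bound up front so each step takes only the DataFrame.
    partial(add_total, columns=["a", "b"]),
    lambda frame: frame[frame["total"] > 11],
]
result = run_steps(raw, steps)
```

Any callable that takes a DataFrame first and returns a DataFrame fits this shape, which is why a bound `partial` and a lambda can sit side by side in the list.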

Immutability

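Most functions return a new DataFrame rather than modifying the input in place, which keeps behavior predictable when they are used in pipelines. Some do assign into the input and return the same object: in the source shown on this page, divide_one_column_by_another, make_columns_snake_case, scale_data, and encode_categorical with method="label" work this way. Check individual function documentation for specific behavior.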
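
A caller who needs the original data preserved can pass a copy to any step that assigns into its input. The sketch below uses a hypothetical step, `add_flag`, written in the in-place style purely to show the difference; it is not part of adc_toolkit.

```python
import pandas as pd


def add_flag(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical in-place step: assigns a column on the input and returns it."""
    data["flag"] = True
    return data


original = pd.DataFrame({"x": [1, 2, 3]})

# Passing a copy keeps `original` unchanged.
safe = add_flag(original.copy())
columns_after_safe_call = list(original.columns)

# Passing the frame itself mutates it: both names refer to the same object.
same = add_flag(original)
```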

Optional Dependencies

Some functions (like scale_data and encode_categorical) require scikit-learn when using certain methods. Install the preprocessing extra to enable these features:

    uv sync --extra preprocessing

from .clean import fill_missing_values, make_columns_snake_case, remove_duplicates
from .combine import group_and_aggregate
from .filter import filter_rows, select_columns
from .transform import divide_one_column_by_another, encode_categorical, scale_data
from .validate import validate_is_dataframe


__all__ = [
    "divide_one_column_by_another",
    "encode_categorical",
    "fill_missing_values",
    "filter_rows",
    "group_and_aggregate",
    "make_columns_snake_case",
    "remove_duplicates",
    "scale_data",
    "select_columns",
    "validate_is_dataframe",
]
def divide_one_column_by_another( data: pandas.core.frame.DataFrame, numerator: str, denominator: str, new_column_name: str) -> pandas.core.frame.DataFrame:
def divide_one_column_by_another(
    data: pd.DataFrame, numerator: str, denominator: str, new_column_name: str
) -> pd.DataFrame:
    """
    Divide one column by another and store the result in a new column.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame containing the data to be transformed.
    numerator : str
        The name of the column to be used as the numerator.
    denominator : str
        The name of the column to be used as the denominator.
    new_column_name : str
        The name of the new column to be created.

    Returns
    -------
    pd.DataFrame
        The input DataFrame, modified in place, with the new column added.
    """
    # Element-wise division; the result is assigned onto the input DataFrame.
    data[new_column_name] = data[numerator] / data[denominator]
    return data

Divide one column by another and store the result in a new column.

Parameters
  • data (pd.DataFrame): The input DataFrame containing the data to be transformed.
  • numerator (str): The name of the column to be used as the numerator.
  • denominator (str): The name of the column to be used as the denominator.
  • new_column_name (str): The name of the new column to be created.
Returns
  • pd.DataFrame: The transformed DataFrame with the new column added.
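
Because the function relies on pandas element-wise division, a zero denominator does not raise an error. The sketch below uses plain pandas rather than the toolkit function to show what that division yields; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 50.0, 0.0], "orders": [4, 0, 0]})

# Same expression the function evaluates: numerator column / denominator column.
ratio = df["revenue"] / df["orders"]

# Non-zero / zero gives inf and zero / zero gives NaN.
# Replace the infinities if a single "missing" marker is preferred downstream.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
```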
def encode_categorical( data: pandas.core.frame.DataFrame, columns: list[str], method: str = 'onehot') -> pandas.core.frame.DataFrame:
def encode_categorical(data: pd.DataFrame, columns: list[str], method: str = "onehot") -> pd.DataFrame:
    """
    Encode categorical features using the specified encoding method.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    columns : List[str]
        Columns to encode.
    method : str
        Encoding method ('onehot' or 'label').

    Returns
    -------
    pd.DataFrame
        DataFrame with encoded columns.

    Raises
    ------
    ImportError
        If scikit-learn is not installed and method is 'label'.
    ValueError
        If method is neither 'onehot' nor 'label'.
    """
    if method == "onehot":
        # get_dummies returns a new DataFrame; the input is left unchanged.
        return pd.get_dummies(data, columns=columns)
    elif method == "label":
        _check_sklearn_available()
        encoder = LabelEncoder()
        # Each column is refit and overwritten on the input DataFrame.
        for col in columns:
            data[col] = encoder.fit_transform(data[col])
        return data
    else:
        raise ValueError(f"Invalid encoding method: {method}")

Encode categorical features using the specified encoding method.

Parameters
  • data (pd.DataFrame): The input DataFrame.
  • columns (List[str]): Columns to encode.
  • method (str): Encoding method ('onehot' or 'label').
Returns
  • pd.DataFrame: DataFrame with encoded columns.
Raises
  • ImportError: If scikit-learn is not installed and method is 'label'.
  • ValueError: If method is neither 'onehot' nor 'label'.
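
To make the two encodings concrete, the sketch below shows one-hot encoding with `pd.get_dummies` (the call used for method="onehot") and, as a scikit-learn-free stand-in for label encoding, `pd.factorize` with sorted categories. The stand-in is an assumption chosen so the example runs with pandas alone; it is not what the toolkit calls.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# One-hot: each category becomes its own indicator column, named <column>_<category>.
onehot = pd.get_dummies(df, columns=["color"])

# Label-style codes: sorted unique values map to 0, 1, ...
codes, categories = pd.factorize(df["color"], sort=True)
```

One-hot encoding widens the frame by one column per category, while label codes keep a single column but impose an arbitrary ordering on the categories.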
def fill_missing_values( data: pandas.core.frame.DataFrame, method: str = 'mean', value: Any = None, columns: list[str] | None = None) -> pandas.core.frame.DataFrame:
def fill_missing_values(
    data: pd.DataFrame,
    method: str = FillMethod.MEAN.value,
    value: Any = None,
    columns: list[str] | None = None,
) -> pd.DataFrame:
    """
    Fill or interpolate missing values in the DataFrame.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    method : str
        The method to fill missing values ("mean", "median", "mode", "constant" or "interpolate").
    value : Any
        Specific value to use for filling if `method="constant"`.
    columns : Optional[List[str]]
        List of columns to apply the filling method to. If None, applies to all columns.

    Returns
    -------
    pd.DataFrame
        DataFrame with missing values filled.

    Raises
    ------
    ValueError
        If `method` is not one of the supported methods, or if any of `columns`
        is not in the DataFrame.
    """
    try:
        fill_method = FillMethod(method)
    except ValueError as e:
        raise ValueError(f"Invalid method: {method}") from e

    if columns is None:
        columns = list(data.columns)

    missing_columns = [col for col in columns if col not in data.columns]
    if missing_columns:
        raise ValueError(
            f"The following columns are not in the DataFrame: {missing_columns}. Available columns: {data.columns}"
        )

    # Work on a copy so the caller's DataFrame is not modified.
    data = data.copy()
    if fill_method == FillMethod.MEAN:
        data[columns] = data[columns].fillna(data[columns].mean())
    elif fill_method == FillMethod.MEDIAN:
        data[columns] = data[columns].fillna(data[columns].median())
    elif fill_method == FillMethod.MODE:
        data[columns] = data[columns].apply(lambda col: col.fillna(col.mode().iloc[0]))
    elif fill_method == FillMethod.CONSTANT and value is not None:
        # A constant fill is applied only when a value is supplied.
        data[columns] = data[columns].fillna(value)
    elif fill_method == FillMethod.INTERPOLATE:
        data[columns] = data[columns].interpolate()

    return data

Fill or interpolate missing values in the DataFrame.

Parameters
  • data (pd.DataFrame): The input DataFrame.
  • method (str): The method to fill missing values ("mean", "median", "mode", "constant" or "interpolate").
  • value (Any): Specific value to use for filling if method="constant".
  • columns (Optional[List[str]]): List of columns to apply the filling method to. If None, applies to all columns.
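Returns
  • pd.DataFrame: DataFrame with missing values filled.
Raises
  • ValueError: If method is not one of the supported methods, or if any of columns is not in the DataFrame.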
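
The choice of method matters for ordered data. The sketch below applies the same pandas expressions the source uses for "mean" and "interpolate" to one made-up column, so the two results can be compared side by side.

```python
import pandas as pd

df = pd.DataFrame({"t": [1.0, None, 3.0, 10.0]})

# method="mean": every gap gets the mean of the observed values in the column.
mean_filled = df["t"].fillna(df["t"].mean())

# method="interpolate": gaps are filled linearly between neighbouring values.
interpolated = df["t"].interpolate()
```

Here the mean is pulled upward by the value 10.0, while interpolation fills the gap with the midpoint of its neighbours, 1.0 and 3.0.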
def filter_rows( data: pandas.core.frame.DataFrame, condition: Callable[[pandas.core.frame.DataFrame], pandas.core.series.Series]) -> pandas.core.frame.DataFrame:
def filter_rows(data: pd.DataFrame, condition: Callable[[pd.DataFrame], pd.Series]) -> pd.DataFrame:
    """
    Filter rows based on a condition.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    condition : Callable[[pd.DataFrame], pd.Series]
        A function that returns a boolean Series indicating rows to keep.

    Returns
    -------
    pd.DataFrame
        Filtered DataFrame.

    Examples
    --------
    >>> data = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "d"]})
    >>> condition = lambda df: df["A"] > 2
    >>> filter_rows(data, condition)
       A  B
    2  3  c
    3  4  d
    """
    return data[condition(data)]

Filter rows based on a condition.

Parameters
  • data (pd.DataFrame): The input DataFrame.
  • condition (Callable[[pd.DataFrame], pd.Series]): A function that returns a boolean Series indicating rows to keep.
Returns
  • pd.DataFrame: Filtered DataFrame.
Example
>>> data = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "c", "d"]})
>>> condition = lambda df: df["A"] > 2
>>> filter_rows(data, condition)
   A  B
2  3  c
3  4  d
def group_and_aggregate( data: pandas.core.frame.DataFrame, group_by_columns: list[str], agg_funcs: dict[str, Callable]) -> pandas.core.frame.DataFrame:
def group_and_aggregate(
    data: pd.DataFrame, group_by_columns: list[str], agg_funcs: dict[str, Callable]
) -> pd.DataFrame:
    """
    Group data by specified columns and apply aggregation functions.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    group_by_columns : List[str]
        Columns to group by.
    agg_funcs : Dict[str, Callable]
        Dictionary mapping column names to aggregation functions.

    Returns
    -------
    pd.DataFrame
        Aggregated DataFrame.
    """
    return data.groupby(group_by_columns).agg(agg_funcs).reset_index()

Group data by specified columns and apply aggregation functions.

Parameters
  • data (pd.DataFrame): The input DataFrame.
  • group_by_columns (List[str]): Columns to group by.
  • agg_funcs (Dict[str, Callable]): Dictionary mapping column names to aggregation functions.
Returns
  • pd.DataFrame: Aggregated DataFrame.
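
A small worked example may help show the shape of `agg_funcs`. The sketch below evaluates the same groupby expression as the source on a made-up sales table, using plain pandas; the column names and aggregation choices are illustrative only.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["north", "north", "south"],
        "revenue": [100, 150, 80],
        "units": [1, 3, 2],
    }
)

# agg_funcs maps each column name to the callable that aggregates it.
agg_funcs = {
    "revenue": lambda s: s.sum(),
    "units": lambda s: s.max(),
}

# Same expression the function evaluates; reset_index turns the group keys back into columns.
summary = sales.groupby(["region"]).agg(agg_funcs).reset_index()
```

Columns that are neither grouped on nor listed in `agg_funcs` do not appear in the result.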
def make_columns_snake_case(data: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame:
def make_columns_snake_case(data: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize column names to snake case.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.

    Returns
    -------
    pd.DataFrame
        DataFrame with standardized column names.
    """
    data.columns = [_convert_camel_case_to_snake_case(col) for col in data.columns]
    return data

Standardize column names to snake case.

Parameters
  • data (pd.DataFrame): The input DataFrame.
Returns
  • pd.DataFrame: DataFrame with standardized column names.
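
The private helper `_convert_camel_case_to_snake_case` is not shown on this page. As an assumption about what such a conversion could look like, the sketch below uses two regular-expression passes; `camel_to_snake` is a hypothetical name, and the toolkit's helper may treat edge cases differently.

```python
import re


def camel_to_snake(name: str) -> str:
    """Hypothetical converter; the toolkit's private helper may behave differently."""
    # Put "_" between a lowercase letter or digit and an uppercase letter:
    # "customerName" -> "customer_Name".
    step_one = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split an acronym from a following capitalised word:
    # "HTTPResponse" -> "HTTP_Response".
    step_two = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", step_one)
    return step_two.lower()


renamed = [camel_to_snake(col) for col in ["CustomerID", "totalRevenue", "HTTPResponseCode"]]
```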
def remove_duplicates( data: pandas.core.frame.DataFrame, subset: list[str] | None = None, keep: Literal['first', 'last', False] = 'first') -> pandas.core.frame.DataFrame:
def remove_duplicates(
    data: pd.DataFrame,
    subset: list[str] | None = None,
    keep: Literal["first", "last", False] = "first",
) -> pd.DataFrame:
    """
    Remove duplicate rows from the DataFrame.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    subset : Optional[List[str]]
        Columns to consider for identifying duplicates. By default, considers all columns.
    keep : str
        Which duplicates to keep ('first', 'last', or False for dropping all).

    Returns
    -------
    pd.DataFrame
        DataFrame without duplicate rows.
    """
    return data.drop_duplicates(subset=subset, keep=keep)

Remove duplicate rows from the DataFrame.

Parameters
  • data (pd.DataFrame): The input DataFrame.
  • subset (Optional[List[str]]): Columns to consider for identifying duplicates. By default, considers all columns.
  • keep (str): Which duplicates to keep ('first', 'last', or False for dropping all).
Returns
  • pd.DataFrame: DataFrame without duplicate rows.
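
The three `keep` options differ in which of the duplicated rows survive. The sketch below calls `DataFrame.drop_duplicates` directly, as the function does, on a made-up table with one repeated key.

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 1, 2], "visit": ["a", "b", "c"]})

# keep="first": the earliest row of each duplicate group survives.
first = df.drop_duplicates(subset=["customer_id"], keep="first")

# keep="last": the latest row of each duplicate group survives.
last = df.drop_duplicates(subset=["customer_id"], keep="last")

# keep=False: every row that has a duplicate is dropped.
none_kept = df.drop_duplicates(subset=["customer_id"], keep=False)
```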
def scale_data( data: pandas.core.frame.DataFrame, columns: list[str], method: str | Callable[[pandas.core.frame.DataFrame, list[str]], pandas.core.frame.DataFrame] = 'minmax') -> pandas.core.frame.DataFrame:
def scale_data(
    data: pd.DataFrame,
    columns: list[str],
    method: str | Callable[[pd.DataFrame, list[str]], pd.DataFrame] = "minmax",
) -> pd.DataFrame:
    """
    Scale numerical features using specified scaling method.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    columns : List[str]
        Columns to scale.
    method : Union[str, Callable[[pd.DataFrame, List[str]], pd.DataFrame]]
        Scaling method ('minmax', 'zscore', or custom scaler function).

    Returns
    -------
    pd.DataFrame
        DataFrame with scaled columns.

    Raises
    ------
    ImportError
        If scikit-learn is not installed (required for built-in scalers).
    ValueError
        If method is a string other than 'minmax' or 'zscore'.
    TypeError
        If method is neither a string nor a callable.
    """
    if isinstance(method, str):
        # scikit-learn is only needed for the built-in scalers.
        _check_sklearn_available()
        if method == "minmax":
            scaler = MinMaxScaler()
        elif method == "zscore":
            scaler = StandardScaler()
        else:
            raise ValueError(f"Invalid scaling method: {method}")
        data[columns] = scaler.fit_transform(data[columns])
        return data
    if callable(method):
        # A custom scaler is a plain function (data, columns) -> DataFrame, per the
        # type annotation, so it is called directly rather than via fit_transform.
        return method(data, columns)
    raise TypeError("Invalid method type. Method must be a string or a callable function.")

Scale numerical features using specified scaling method.

Parameters
  • data (pd.DataFrame): The input DataFrame.
  • columns (List[str]): Columns to scale.
  • method (Union[str, Callable[[pd.DataFrame, List[str]], pd.DataFrame]]): Scaling method ('minmax', 'zscore', or custom scaler function).
Returns
  • pd.DataFrame: DataFrame with scaled columns.
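Raises
  • ImportError: If scikit-learn is not installed (required for built-in scalers).
  • ValueError: If method is a string other than 'minmax' or 'zscore'.
  • TypeError: If method is neither a string nor a callable.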
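
For readers who want to see the arithmetic behind the two built-in methods, the sketch below reproduces min-max and z-score scaling with plain pandas on a made-up column. It illustrates the formulas only and does not go through scikit-learn or the toolkit function.

```python
import pandas as pd

prices = pd.Series([100.0, 200.0, 300.0])

# Min-max scaling: (x - min) / (max - min), mapping the column onto [0, 1].
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# Z-score scaling: (x - mean) / std.
# scikit-learn's StandardScaler uses the population standard deviation (ddof=0),
# whereas pandas' Series.std defaults to ddof=1, so ddof is set explicitly here.
zscore = (prices - prices.mean()) / prices.std(ddof=0)
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range, whereas z-scores are unbounded but centred on zero.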
def select_columns( data: pandas.core.frame.DataFrame, columns: list[str]) -> pandas.core.frame.DataFrame:
def select_columns(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """
    Retain only specified columns in the DataFrame.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.
    columns : List[str]
        Columns to retain.

    Returns
    -------
    pd.DataFrame
        DataFrame with only the selected columns.
    """
    return data[columns]

Retain only specified columns in the DataFrame.

Parameters
  • data (pd.DataFrame): The input DataFrame.
  • columns (List[str]): Columns to retain.
Returns
  • pd.DataFrame: DataFrame with only the selected columns.
def validate_is_dataframe(data: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame:
def validate_is_dataframe(
    data: pd.DataFrame,
) -> pd.DataFrame:
    """
    Validates if the input is a pandas DataFrame.

    Args:
        data (pd.DataFrame): The input data to validate.

    Returns:
        pd.DataFrame: The validated DataFrame if the input is valid.

    Raises:
        ValueError: If the input is not a pandas DataFrame.
    """
    if not isinstance(data, pd.DataFrame):
        raise ValueError("Input data is not a pandas DataFrame.")

    return data

Validates if the input is a pandas DataFrame.

Args
  • data (pd.DataFrame): The input data to validate.
Returns
  • pd.DataFrame: The validated DataFrame if the input is valid.
Raises
  • ValueError: If the input is not a pandas DataFrame.