CUSTOM_TRANSFORM

Apply custom data transformations with flexible operations and error handling.

Purpose

This action provides powerful data transformation capabilities for complex data processing that doesn’t fit standard action patterns. It supports:

  • Chained transformation operations

  • Multiple transformation types

  • Conditional transformations

  • Schema validation

  • Comprehensive error handling

  • Flexible output options

Parameters

Required Parameters

input_key (string)

Key of the input dataset to transform, read from context['datasets'].

output_key (string)

Key where the transformed dataset will be stored.

transformations (list of objects)

List of transformation operations to apply sequentially. Each transformation contains:

  • type (string): Transformation type (see types below)

  • params (object): Parameters specific to the transformation type

  • condition (string, optional): Conditional expression for applying transformation
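The sequential, conditional application described above can be sketched as follows. This is a minimal illustration, not the action's actual implementation: the `HANDLERS` table and `apply_transformations` function are hypothetical names, and the sketch assumes a condition is a Python expression evaluated with the current DataFrame bound to `df`, as the condition strings in the examples suggest.

```python
import pandas as pd

# Hypothetical handlers keyed by transformation type (illustration only).
HANDLERS = {
    "column_rename": lambda df, p: df.rename(columns=p["mapping"]),
    "column_drop": lambda df, p: df.drop(columns=p["columns"]),
}

def apply_transformations(df, transformations):
    """Apply each transformation in order, skipping those whose condition is false."""
    for spec in transformations:
        condition = spec.get("condition")
        # Assumption: the condition is a Python expression with `df` in scope.
        # eval() is unsafe on untrusted input; it is used here only to illustrate.
        if condition is not None and not eval(condition, {}, {"df": df}):
            continue
        df = HANDLERS[spec["type"]](df, spec.get("params", {}))
    return df

raw = pd.DataFrame({"UniProt": ["P12345", "Q67890"], "tmp": [1, 2]})
steps = [
    {"type": "column_rename", "params": {"mapping": {"UniProt": "uniprot_id"}}},
    # Skipped here because the frame has only two rows.
    {"type": "column_drop", "params": {"columns": ["tmp"]}, "condition": "len(df) > 100"},
]
result = apply_transformations(raw, steps)
print(list(result.columns))
```

Because each step receives the output of the previous one, a condition is evaluated against the data as it stands at that point in the sequence, not against the original input.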

Optional Parameters

validate_schema (boolean)

Whether to validate output schema matches expectations. Default: true

expected_columns (list of strings)

Expected columns in output dataset (for validation). Default: None
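As a rough sketch of what validating against expected_columns could involve, the helper below reports which expected columns are absent from the output. The function name `missing_columns` is hypothetical, and whether the real action also rejects extra columns is not stated here.

```python
import pandas as pd

def missing_columns(df, expected_columns):
    """Return the expected column names that are absent from df, in order."""
    if not expected_columns:  # None or empty list: nothing to check
        return []
    return [name for name in expected_columns if name not in df.columns]

df = pd.DataFrame({"uniprot_id": ["P12345"], "gene_name": ["EXAMPLE1"]})
missing = missing_columns(df, ["uniprot_id", "gene_name", "confidence"])
print(missing)
```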

preserve_index (boolean)

Whether to preserve original DataFrame index. Default: true

error_handling (string)

How to handle transformation errors: 'strict', 'warn', or 'ignore'. Default: 'strict'

Transformation Types

Column Operations

column_rename

Rename columns using a mapping dictionary.

Parameters:

  • mapping: Dictionary of {old_name: new_name}

column_add

Add new columns with specified values or functions.

Parameters:

  • columns: Dictionary of {column_name: value_or_function}

column_drop

Remove specified columns.

Parameters:

  • columns: List of column names to drop

column_transform

Transform values in a specific column.

Parameters:

  • column: Column name to transform

  • function: Transformation function (string or callable)
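For orientation, the four column operations correspond roughly to the pandas calls below. This is a sketch of plausible equivalents, not a statement of how the action is implemented; the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame({"UniProt": ["p12345"], "Gene": ["brca1"], "tmp": [0]})

# column_rename: apply an {old_name: new_name} mapping
df = df.rename(columns={"UniProt": "uniprot_id", "Gene": "gene_name"})

# column_add: add a column holding a constant value
df = df.assign(data_source="nmr_platform")

# column_drop: remove the listed columns
df = df.drop(columns=["tmp"])

# column_transform with function "upper"
df["gene_name"] = df["gene_name"].str.upper()

print(df)
```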

Data Operations

filter_rows

Filter rows based on conditions.

Parameters (provide one of):

  • query: pandas query string

  • conditions: Dictionary of column-based conditions, where each column maps to an operator and a value
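The two ways of expressing a filter can be compared with a small sketch. The `filter_by_conditions` helper and the `OPERATORS` table are hypothetical; the sketch assumes the operator/value shape used in the examples on this page and assumes that multiple conditions are combined with AND.

```python
import operator
import pandas as pd

# Hypothetical mapping from operator strings to comparison functions.
OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}

def filter_by_conditions(df, conditions):
    """Keep rows that satisfy every column condition (assumed AND semantics)."""
    mask = pd.Series(True, index=df.index)
    for column, cond in conditions.items():
        mask &= OPERATORS[cond["operator"]](df[column], cond["value"])
    return df[mask]

df = pd.DataFrame({"confidence": [0.95, 0.5, 0.8]})
by_conditions = filter_by_conditions(df, {"confidence": {"operator": ">=", "value": 0.8}})
by_query = df.query("confidence >= 0.8")
print(by_conditions.equals(by_query))
```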

merge_columns

Combine multiple columns into a new column.

Parameters:

  • new_column: Name of new column

  • source_columns: List of columns to merge

  • separator: String to join values (default: "_")

split_column

Split a column into multiple new columns.

Parameters:

  • source_column: Column to split

  • separator: Split delimiter (default: "_")

  • new_columns: List of new column names
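Merging and splitting are inverse operations, which a short pandas sketch makes concrete. The identifiers below are placeholders, and the calls are plausible equivalents rather than the action's confirmed internals.

```python
import pandas as pd

df = pd.DataFrame({"hmdb_id": ["HMDB_A", "HMDB_B"], "chebi_id": ["CHEBI_A", "CHEBI_B"]})

# merge_columns: join the source columns row by row with the separator.
df["compound_identifier"] = df[["hmdb_id", "chebi_id"]].astype(str).agg("|".join, axis=1)

# split_column: split on the separator and spread the parts into named columns.
parts = df["compound_identifier"].str.split("|", expand=True)
parts.columns = ["primary_id", "secondary_id"]
df = df.join(parts)

print(df[["compound_identifier", "primary_id", "secondary_id"]])
```

Note that a single-character separator such as "|" is treated as a literal by `str.split`; a longer separator containing regex metacharacters may need explicit escaping in plain pandas.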

Data Cleaning

deduplicate

Remove duplicate rows.

Parameters:

  • subset: Columns to consider for duplication (optional)

  • keep: Which duplicate to keep ('first', 'last', False)

fill_na

Fill missing values.

Parameters:

  • method: Fill method ('value', 'forward', 'backward')

  • value: Fill value (if method='value')

sort

Sort dataset by columns.

Parameters:

  • by: List of columns to sort by

  • ascending: Sort order (boolean or list of booleans)
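The cleaning operations have close pandas counterparts, sketched below on made-up data. Mapping the 'forward' fill method to `ffill` is an assumption based on the method name, not something this page confirms.

```python
import pandas as pd

df = pd.DataFrame({
    "compound_identifier": ["b", "a", "a", "c"],
    "confidence": [0.7, 0.9, 0.6, None],
})

# deduplicate: keep the first row for each compound_identifier
df = df.drop_duplicates(subset=["compound_identifier"], keep="first")

# fill_na with method "forward" (assumed to behave like a forward fill)
df = df.ffill()

# sort: order rows by the listed columns
df = df.sort_values(by=["compound_identifier"], ascending=True)

print(df)
```

The order of these steps changes the result: a forward fill takes its value from whichever row precedes the gap at that point, so filling before or after sorting can produce different values.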

Example Usage

Basic Column Operations

- name: clean_protein_data
  action:
    type: CUSTOM_TRANSFORM
    params:
      input_key: "raw_proteins"
      output_key: "cleaned_proteins"
      transformations:
        - type: "column_rename"
          params:
            mapping:
              "UniProt": "uniprot_id"
              "Gene": "gene_name"
        - type: "column_transform"
          params:
            column: "gene_name"
            function: "upper"
        - type: "fill_na"
          params:
            method: "value"
            value: "UNKNOWN"

Complex Data Processing

- name: process_metabolite_data
  action:
    type: CUSTOM_TRANSFORM
    params:
      input_key: "metabolite_raw"
      output_key: "metabolite_processed"
      transformations:
        - type: "column_add"
          params:
            columns:
              "processing_date": "2024-01-01"
              "data_source": "nmr_platform"
        - type: "merge_columns"
          params:
            new_column: "compound_identifier"
            source_columns: ["hmdb_id", "chebi_id"]
            separator: "|"
        - type: "filter_rows"
          params:
            conditions:
              confidence:
                operator: ">="
                value: 0.8
        - type: "deduplicate"
          params:
            subset: ["compound_identifier"]
            keep: "first"

String Transformations

- name: standardize_names
  action:
    type: CUSTOM_TRANSFORM
    params:
      input_key: "compound_names"
      output_key: "standardized_names"
      transformations:
        - type: "column_transform"
          params:
            column: "compound_name"
            function: "lower"
        - type: "column_transform"
          params:
            column: "compound_name"
            function: "strip"
        - type: "column_transform"
          params:
            column: "compound_name"
            function: "replace:_: "  # Replace underscores with spaces

Conditional Transformations

- name: conditional_processing
  action:
    type: CUSTOM_TRANSFORM
    params:
      input_key: "mixed_data"
      output_key: "processed_data"
      transformations:
        - type: "column_add"
          params:
            columns:
              "high_confidence": "True"
          condition: "df['confidence'].mean() > 0.8"
        - type: "filter_rows"
          params:
            query: "confidence >= 0.7"
          condition: "len(df) > 100"

Advanced Column Splitting

- name: split_identifiers
  action:
    type: CUSTOM_TRANSFORM
    params:
      input_key: "compound_data"
      output_key: "split_data"
      transformations:
        - type: "split_column"
          params:
            source_column: "compound_ids"
            separator: "|"
            new_columns: ["primary_id", "secondary_id", "tertiary_id"]
        - type: "column_drop"
          params:
            columns: ["compound_ids"]  # Remove original column

Schema Validation

- name: validated_transform
  action:
    type: CUSTOM_TRANSFORM
    params:
      input_key: "input_data"
      output_key: "validated_data"
      validate_schema: true
      expected_columns: ["uniprot_id", "gene_name", "confidence"]
      transformations:
        - type: "column_rename"
          params:
            mapping:
              "UniProt": "uniprot_id"
              "Gene": "gene_name"

Error Handling Examples

- name: robust_transform
  action:
    type: CUSTOM_TRANSFORM
    params:
      input_key: "noisy_data"
      output_key: "cleaned_data"
      error_handling: "warn"  # Continue on errors
      transformations:
        - type: "column_transform"
          params:
            column: "numeric_field"
            function: "float"  # May fail on non-numeric values
        - type: "filter_rows"
          params:
            query: "numeric_field > 0"  # Only valid after conversion

Transformation Functions

String Functions

  • lower - Convert to lowercase

  • upper - Convert to uppercase

  • strip - Remove leading/trailing whitespace

  • replace:old:new - Replace substring
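One way to interpret these function strings is sketched below. The `apply_string_function` helper is hypothetical, and it assumes that the replace spec is split on its first two colons, so that "replace:_: " means replacing "_" with a space.

```python
import pandas as pd

def apply_string_function(series, spec):
    """Apply a string function spec such as 'lower' or 'replace:old:new'."""
    name, _, rest = spec.partition(":")
    if name == "replace":
        old, _, new = rest.partition(":")
        # Literal (non-regex) replacement of old with new.
        return series.str.replace(old, new, regex=False)
    if name in ("lower", "upper", "strip"):
        return getattr(series.str, name)()
    raise ValueError(f"Unsupported function spec: {spec!r}")

names = pd.Series(["  Alpha_Ketoglutarate "])
for spec in ["lower", "strip", "replace:_: "]:
    names = apply_string_function(names, spec)
print(names[0])
```

Under this reading, a replacement involving a literal colon could not be expressed, which is worth checking against the real parser before relying on it.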

Custom Functions

Functions can be provided as Python callables for complex transformations.

Query Expressions

Use pandas query syntax for complex row filtering:

  • confidence > 0.8 and category == 'reviewed'

  • gene_name.str.contains('BRCA')

  • @external_variable > threshold
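The expressions above behave as follows in plain pandas. The data is invented, and how the action makes external variables available to a query is not described on this page, so the `@threshold` line only illustrates standard `DataFrame.query` behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    "gene_name": ["BRCA1", "TP53", "BRCA2"],
    "confidence": [0.95, 0.85, 0.6],
    "category": ["reviewed", "reviewed", "unreviewed"],
})
threshold = 0.8

# Boolean combination of column comparisons
reviewed = df.query("confidence > 0.8 and category == 'reviewed'")

# String method on a column; the python engine supports method calls
brca = df.query("gene_name.str.contains('BRCA')", engine="python")

# '@' refers to a Python variable in the calling scope
above = df.query("confidence > @threshold")

print(list(reviewed["gene_name"]), list(brca["gene_name"]), list(above["gene_name"]))
```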

Output Format

The action stores the transformed dataset in the context:

# Context after execution
{
    "datasets": {
        "processed_data": [
            {
                "uniprot_id": "P12345",
                "gene_name": "EXAMPLE1",
                "confidence": 0.95,
                "processing_date": "2024-01-01"
            }
            # ... transformed rows
        ]
    }
}

Transformation Result

The action returns detailed information about the transformation:

{
    "success": True,
    "rows_processed": 1000,
    "columns_before": 5,
    "columns_after": 7,
    "transformations_applied": 4,
    "transformations_failed": 0,
    "warnings": [],
    "schema_validation_passed": True
}

Error Handling Modes

Strict Mode (strict)

Stops execution on first error. Best for critical transformations.

Warning Mode (warn)

Logs errors but continues processing. Best for exploratory analysis.

Ignore Mode (ignore)

Silently continues on errors. Use with caution.
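The difference between the three modes can be sketched with a small wrapper around a single step. The `run_step` function is hypothetical, and the sketch assumes that a failed step leaves the data unchanged in the warn and ignore modes, which this page does not state explicitly.

```python
import logging
import pandas as pd

logger = logging.getLogger("custom_transform_sketch")

def run_step(df, func, error_handling="strict", warnings=None):
    """Run one transformation under the given error handling mode."""
    try:
        return func(df)
    except Exception as exc:
        if error_handling == "strict":
            raise  # stop on the first error
        if error_handling == "warn":
            logger.warning("Transformation failed: %s", exc)
            if warnings is not None:
                warnings.append(str(exc))
        # 'warn' and 'ignore': continue with the data as it was (assumption)
        return df

df = pd.DataFrame({"numeric_field": ["1.5", "n/a"]})
to_float = lambda d: d.assign(numeric_field=d["numeric_field"].astype(float))

collected = []
unchanged = run_step(df, to_float, error_handling="warn", warnings=collected)
print(len(collected), unchanged["numeric_field"].tolist())
```

In the warn and ignore modes, later steps that depend on a failed step (such as a numeric filter after a failed float conversion) may then fail or behave unexpectedly in turn.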

Best Practices

  1. Plan transformation sequences carefully - order matters

  2. Use descriptive step names in complex pipelines

  3. Validate schemas for critical data transformations

  4. Handle missing data explicitly with fill_na operations

  5. Test transformations on sample data before production

  6. Use appropriate error handling based on data quality expectations

  7. Document complex transformations with clear parameter descriptions

Performance Notes

  • Transformations are applied sequentially using pandas operations

  • Large datasets (>100K rows) process efficiently

  • String operations may be slower than numeric transformations

  • Memory usage scales with dataset size and transformation complexity

  • Consider chunking for extremely large datasets
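Chunking is not described further here, so the following is only a sketch of one common approach with pandas: reading the input in pieces and applying row-wise transformations to each piece. The `clean` function and the inline CSV are illustrative.

```python
import io
import pandas as pd

# Inline stand-in for a large CSV file.
csv_text = "UniProt,confidence\nP1,0.9\nP2,0.5\nP3,0.85\n"

def clean(chunk):
    """Row-wise steps that are safe to apply to each chunk independently."""
    return chunk.rename(columns={"UniProt": "uniprot_id"}).query("confidence >= 0.8")

chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
processed = pd.concat([clean(chunk) for chunk in chunks], ignore_index=True)
print(processed)
```

Only operations that treat rows independently chunk cleanly; deduplicate and sort need to see the whole dataset, so they would have to run after the chunks are recombined.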

Common Use Cases

Data Standardization

Normalize column names, formats, and value representations

Data Enrichment

Add computed columns, metadata, or derived values

Quality Control

Remove duplicates, handle missing values, filter invalid data

Format Conversion

Transform data between different structural representations

Experimental Preprocessing

Apply domain-specific transformations for analysis

Integration

This action typically follows data loading and precedes specific analysis:

steps:
  # 1. Load raw data
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/raw_proteins.csv"
        identifier_column: "UniProt"
        output_key: "raw_data"

  # 2. Custom transformations
  - name: clean_and_process
    action:
      type: CUSTOM_TRANSFORM
      params:
        input_key: "raw_data"
        output_key: "processed_data"
        transformations:
          - type: "column_rename"
            params:
              mapping: {"UniProt": "uniprot_id"}
          - type: "column_transform"
            params:
              column: "confidence"
              function: "float"
          - type: "filter_rows"
            params:
              query: "confidence >= 0.8"

  # 3. Continue with analysis
  - name: analyze_data
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_key: "processed_data"

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

  • /biomapper/src/actions/utils/data_processing/custom_transform_expression.py (actual implementation with expression-based transformations)

  • /biomapper/src/actions/typed_base.py (TypedStrategyAction base class)

  • /biomapper/src/actions/registry.py (dual registration for CUSTOM_TRANSFORM and CUSTOM_TRANSFORM_EXPRESSION)

  • /biomapper/CLAUDE.md (2025 standardizations and parameter naming conventions)

  • /biomapper/pyproject.toml (pandas dependency for DataFrame operations)