Action System Architecture

The action system provides the core functionality for biological data processing in BioMapper through a self-registering, type-safe architecture.

Core Data Operations

Fundamental actions for data loading and analysis:

LOAD_DATASET_IDENTIFIERS

Generic data loader supporting CSV/TSV files with intelligent identifier handling, automatic format detection, prefix stripping, and regex-based filtering.

MERGE_DATASETS

Combine multiple datasets with intelligent deduplication and conflict resolution strategies.

FILTER_DATASET

Apply complex filtering criteria using Python expressions for data subsetting.

EXPORT_DATASET

Export results to CSV, TSV, or JSON formats with comprehensive metadata preservation.

CUSTOM_TRANSFORM_EXPRESSION

Apply Python expressions to transform data columns dynamically without code changes.
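
The prefix stripping and regex-based filtering described for LOAD_DATASET_IDENTIFIERS can be pictured with a small standard-library sketch. This is not the action's implementation: the function `load_identifiers` and its parameters `column`, `strip_prefix`, and `filter_pattern` are hypothetical names chosen for illustration, and the delimiter detection is deliberately naive.

```python
import csv
import io
import re
from typing import List, Optional


def load_identifiers(
    text: str,
    column: str,
    strip_prefix: Optional[str] = None,    # hypothetical parameter name
    filter_pattern: Optional[str] = None,  # hypothetical parameter name
) -> List[str]:
    """Read one identifier column from CSV/TSV text, clean it, and filter it."""
    # Naive format detection: a tab in the header line means TSV.
    header = text.splitlines()[0]
    delimiter = "\t" if "\t" in header else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    pattern = re.compile(filter_pattern) if filter_pattern else None

    identifiers: List[str] = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if strip_prefix and value.startswith(strip_prefix):
            value = value[len(strip_prefix):]
        if not value:
            continue
        # Keep only values that match the whole pattern.
        if pattern and not pattern.fullmatch(value):
            continue
        identifiers.append(value)
    return identifiers


tsv = "id\tname\nUniProtKB:P12345\tA\nUniProtKB:Q9Y6K9\tB\nnot_an_id\tC\n"
ids = load_identifiers(
    tsv,
    column="id",
    strip_prefix="UniProtKB:",
    filter_pattern=r"[A-Z][0-9][A-Z0-9]{3}[0-9]",
)
```

Here `ids` keeps the two accession-like values and drops the row that does not match the pattern. The regular expression is a simplified stand-in, not the full UniProt accession grammar.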

Action Registry System

Actions self-register at import time using the @register_action decorator:

from actions.registry import register_action
from actions.typed_base import TypedStrategyAction

@register_action("ACTION_NAME")
class MyAction(TypedStrategyAction[ParamsModel, ActionResult]):
    # ParamsModel and ActionResult stand for the action's own Pydantic models.
    pass

The registry (ACTION_REGISTRY) is a global dictionary that enables dynamic action lookup based on YAML strategy configurations. No manual registration is required.
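
To make the self-registration mechanism concrete, here is a minimal sketch of a decorator-based registry. It is not the contents of BioMapper's registry.py: `ACTION_REGISTRY` and `register_action` are the names used above, while `EchoAction`, `run_step`, and the layout of the step dictionary are hypothetical and only show how a step parsed from a YAML strategy could be resolved by name.

```python
from typing import Any, Callable, Dict, Type

# Global name -> class mapping, filled in as decorated classes are defined.
ACTION_REGISTRY: Dict[str, Type[Any]] = {}


def register_action(name: str) -> Callable[[Type[Any]], Type[Any]]:
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls: Type[Any]) -> Type[Any]:
        if name in ACTION_REGISTRY:
            raise ValueError(f"Action {name!r} is already registered")
        ACTION_REGISTRY[name] = cls
        return cls
    return decorator


@register_action("ECHO")  # hypothetical action, for illustration only
class EchoAction:
    def execute(self, params: Dict[str, Any]) -> Dict[str, Any]:
        return {"success": True, "echo": params}


def run_step(step: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve a step (as it might look after YAML parsing) and run it."""
    action_name = step["action"]["type"]
    try:
        action_cls = ACTION_REGISTRY[action_name]
    except KeyError:
        raise KeyError(f"Unknown action type: {action_name!r}") from None
    return action_cls().execute(step["action"].get("params", {}))


result = run_step({"name": "demo", "action": {"type": "ECHO", "params": {"x": 1}}})
```

Because registration happens when the class body is executed, importing the module that defines an action is enough to make it available for lookup. Rejecting duplicate names is a design choice of this sketch, not a documented behavior of the real registry.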

Type Safety

Pydantic Models

All action parameters and results use Pydantic models for validation.

TypedStrategyAction Base

The TypedStrategyAction base class provides type-safe parameter handling.

Backward Compatibility

The legacy dict-based interface is maintained during the migration.
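
One way a typed action can keep serving dict-based callers is to validate the incoming dictionary and then delegate to the typed entry point. The sketch below illustrates that idea only. It uses a standard-library dataclass as a stand-in for a Pydantic model so that it runs without dependencies, and `ThresholdParams`, `ThresholdAction`, and the shape of the returned dictionaries are hypothetical rather than BioMapper's actual classes.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ThresholdParams:
    """Stand-in for a Pydantic params model (hypothetical)."""
    input_key: str
    threshold: float = 0.8

    def __post_init__(self) -> None:
        # Mimics a range constraint such as Field(ge=0.0, le=1.0).
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")


class ThresholdAction:
    def execute_typed(self, params: ThresholdParams) -> Dict[str, Any]:
        """Typed entry point: params have already been validated."""
        return {
            "success": True,
            "input_key": params.input_key,
            "threshold": params.threshold,
        }

    def execute(self, raw_params: Dict[str, Any]) -> Dict[str, Any]:
        """Legacy entry point: validate the dict, then delegate."""
        try:
            params = ThresholdParams(**raw_params)
        except (TypeError, ValueError) as exc:
            # TypeError covers missing or unexpected keys.
            return {"success": False, "error": str(exc)}
        return self.execute_typed(params)


action = ThresholdAction()
ok = action.execute({"input_key": "ukbb_proteins", "threshold": 0.5})
bad = action.execute({"input_key": "ukbb_proteins", "threshold": 1.5})
```

With this arrangement, invalid parameters are rejected before any processing starts, and both old and new callers end up in the same typed code path.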

Execution Context

Shared Dictionary

Actions communicate through a shared context object.

Data Storage

Results are stored under descriptive keys such as "ukbb_proteins".

Metadata Tracking

Automatic collection of execution statistics and timing.

Error Handling

Comprehensive error reporting with context preservation.
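
The four points above can be sketched with a plain dictionary passed from step to step. This is an illustrative sketch, not BioMapper's executor: the "datasets" key matches the development pattern in this document, while the "statistics" and "errors" keys, the step functions, and `run_pipeline` are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


def load_ids(context: Context) -> None:
    # Store results under a descriptive key.
    context.setdefault("datasets", {})["ukbb_proteins"] = [
        "P12345", "Q67890", "P12345"
    ]


def deduplicate(context: Context) -> None:
    ids: List[str] = context["datasets"]["ukbb_proteins"]
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    context["datasets"]["ukbb_proteins_unique"] = list(dict.fromkeys(ids))


def run_pipeline(steps: List[Callable[[Context], None]]) -> Context:
    context: Context = {"datasets": {}, "statistics": {}, "errors": []}
    for step in steps:
        started = time.perf_counter()
        try:
            step(context)
        except Exception as exc:
            # Record which step failed; earlier results stay in the context.
            context["errors"].append({"step": step.__name__, "error": repr(exc)})
            break
        finally:
            # Timing is recorded for every step, including a failing one.
            context["statistics"][step.__name__] = {
                "seconds": time.perf_counter() - started
            }
    return context


context = run_pipeline([load_ids, deduplicate])
```

Each step reads what earlier steps wrote and adds its own output, so the order of steps in a strategy determines which keys are available at any point.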

Action Development Pattern

Follow Test-Driven Development (TDD) when creating new actions:

from typing import Any, Dict

import pandas as pd  # required for the pd.DataFrame calls below
from pydantic import BaseModel, Field
from actions.typed_base import TypedStrategyAction, StandardActionResult
from actions.registry import register_action

class MyActionParams(BaseModel):
    """Parameters for custom action with validation."""
    input_key: str = Field(..., description="Input dataset key")
    threshold: float = Field(0.8, ge=0.0, le=1.0, description="Processing threshold")
    output_key: str = Field(..., description="Output dataset key")

@register_action("MY_ACTION")
class MyAction(TypedStrategyAction[MyActionParams, StandardActionResult]):
    """Process biological data with threshold filtering."""

    def get_params_model(self) -> type[MyActionParams]:
        return MyActionParams

    async def execute_typed(
        self,
        params: MyActionParams,
        context: Dict[str, Any]
    ) -> StandardActionResult:
        # Access input data from context datasets
        input_data = context.get("datasets", {}).get(params.input_key, pd.DataFrame())

        # Process data using pandas operations
        if not input_data.empty:
            processed = input_data[input_data["score"] >= params.threshold]
        else:
            processed = pd.DataFrame()

        # Store results in context
        if "datasets" not in context:
            context["datasets"] = {}
        context["datasets"][params.output_key] = processed

        return StandardActionResult(
            success=True,
            message=f"Processed {len(processed)} items from {len(input_data)} total",
            data={"filtered_count": len(input_data) - len(processed)}
        )

Entity-Specific Actions

Actions are organized by biological entity type:

Protein Actions (entities/proteins/)
  • PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract UniProt IDs from compound fields

  • PROTEIN_NORMALIZE_ACCESSIONS - Standardize protein identifier formats

  • PROTEIN_MULTI_BRIDGE - Multi-source protein resolution

  • MERGE_WITH_UNIPROT_RESOLUTION - Historical UniProt ID mapping

Metabolite Actions (entities/metabolites/)
  • METABOLITE_CTS_BRIDGE - Chemical Translation Service integration

  • METABOLITE_EXTRACT_IDENTIFIERS - Extract metabolite IDs from text

  • METABOLITE_NORMALIZE_HMDB - Standardize HMDB formats

  • METABOLITE_MULTI_BRIDGE - Multi-database metabolite resolution

  • NIGHTINGALE_NMR_MATCH - Nightingale NMR platform matching

  • SEMANTIC_METABOLITE_MATCH - AI-powered semantic matching

  • VECTOR_ENHANCED_MATCH - Vector embedding similarity

  • METABOLITE_API_ENRICHMENT - External API enrichment

  • COMBINE_METABOLITE_MATCHES - Merge multiple matching strategies

Chemistry Actions (entities/chemistry/)
  • CHEMISTRY_EXTRACT_LOINC - Extract LOINC codes from clinical data

  • CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy matching for clinical tests

  • CHEMISTRY_VENDOR_HARMONIZATION - Harmonize vendor-specific codes

  • CHEMISTRY_TO_PHENOTYPE_BRIDGE - Link chemistry to phenotypes

Report Actions (reports/)
  • GENERATE_MAPPING_VISUALIZATIONS - Create visualization reports for mapping results

  • GENERATE_LLM_ANALYSIS - Generate AI-powered analysis reports using LLM providers

Benefits

  • Modularity: Each action is self-contained and independently testable

  • Reusability: Actions work in any strategy combination

  • Type Safety: Runtime parameter validation with Pydantic models, with type hints available to static checkers

  • Extensibility: Simple to add new action types without modifying core

  • Discoverability: Entity-based organization improves navigation

  • Error Handling: Comprehensive validation and error reporting

Infrastructure Actions (io/ and utils/)
  • SYNC_TO_GOOGLE_DRIVE_V2 - Upload results to Google Drive with chunked transfer

  • PARSE_COMPOSITE_IDENTIFIERS - Parse complex identifier formats from compound fields

  • CUSTOM_TRANSFORM - Apply custom Python expressions to transform data columns

Verification Sources

Last verified: 2025-01-18

This documentation was verified against the following project resources:

  • /biomapper/src/actions/registry.py (Global ACTION_REGISTRY dictionary with @register_action decorator)

  • /biomapper/src/actions/typed_base.py (TypedStrategyAction base class with execute_typed method)

  • /biomapper/src/actions/load_dataset_identifiers.py (LOAD_DATASET_IDENTIFIERS action implementation)

  • /biomapper/src/actions/merge_datasets.py (MERGE_DATASETS action with deduplication logic)

  • /biomapper/src/actions/semantic_metabolite_match.py (SEMANTIC_METABOLITE_MATCH AI-powered matching)

  • /biomapper/src/actions/reports/generate_mapping_visualizations.py (GENERATE_MAPPING_VISUALIZATIONS action)

  • /biomapper/src/actions/reports/generate_llm_analysis.py (GENERATE_LLM_ANALYSIS action)

  • /biomapper/src/actions/utils/data_processing/filter_dataset.py (FILTER_DATASET action implementation)

  • /biomapper/src/actions/utils/data_processing/custom_transform_expression.py (CUSTOM_TRANSFORM and CUSTOM_TRANSFORM_EXPRESSION actions)

  • /biomapper/src/actions/io/sync_to_google_drive_v2.py (SYNC_TO_GOOGLE_DRIVE_V2 implementation)

  • /biomapper/CLAUDE.md (2025 standardizations and TDD development patterns)