BioMapper Architecture

Overview

BioMapper is a YAML-based workflow platform for biological data harmonization and ontology mapping. Built on a self-registering action system with comprehensive specialized actions, it provides workflows for mapping proteins, metabolites, chemistry data, and other biological entities.

The architecture follows a three-layer design (Client → API → Core) with type-safe actions, automatic validation, and extensibility through simple decorator-based registration.

Core Components

Self-Registering Action System

Actions automatically register at import time using the @register_action decorator, eliminating manual registration. The global ACTION_REGISTRY enables dynamic action lookup from YAML strategies.

YAML Strategy System

Declarative workflow definition with variable substitution, metadata tracking, and parameter validation. Strategies execute sequentially with shared context between steps.

Available Actions

Organized by biological entity type:

Data Operations:

LOAD_DATASET_IDENTIFIERS - Generic CSV/TSV loader
MERGE_DATASETS - Combine with deduplication
FILTER_DATASET - Complex filtering
EXPORT_DATASET - Multi-format export
CUSTOM_TRANSFORM_EXPRESSION - Dynamic transformations

Protein Actions:

PROTEIN_NORMALIZE_ACCESSIONS - Standardize UniProt IDs
PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract from compound fields

Metabolite Actions:

NIGHTINGALE_NMR_MATCH - Nightingale platform matching
SEMANTIC_METABOLITE_MATCH - AI-powered matching

Chemistry Actions:

CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy clinical test matching

Analysis & Reporting:

GENERATE_MAPPING_VISUALIZATIONS - Create visualization reports
GENERATE_LLM_ANALYSIS - AI-powered analysis reports

Infrastructure Actions:

SYNC_TO_GOOGLE_DRIVE_V2 - Upload to Google Drive
PARSE_COMPOSITE_IDENTIFIERS - Parse complex identifiers
CUSTOM_TRANSFORM - Apply custom transformations

REST API Layer

FastAPI service with:

Strategy execution endpoints (/api/strategies/v2/)
Job management with SQLite persistence
Background processing with checkpointing
Server-Sent Events for real-time progress
OpenAPI documentation

Python Client Library

BiomapperClient in src/client/client_v2.py provides:

Synchronous wrapper for async operations
Automatic retry and error handling
Progress streaming support
Simple interface: client.run("strategy_name")

Core Execution Engine

MinimalStrategyService in src/core/minimal_strategy_service.py:

Direct YAML loading from src/configs/strategies/
Sequential action execution with error handling
Variable substitution (${parameters.key}, ${env.VAR})
Shared execution context management
No database dependencies

Directory Structure

biomapper/
├── src/                            # Main source directory
│   ├── actions/                    # Self-registering actions
│   │   ├── entities/               # Entity-specific actions
│   │   │   ├── proteins/           # UniProt, Ensembl actions
│   │   │   ├── metabolites/        # HMDB, CHEBI, KEGG actions
│   │   │   └── chemistry/          # LOINC, clinical test actions
│   │   ├── algorithms/             # Analysis algorithms  
│   │   ├── io/                     # Import/export actions
│   │   ├── utils/                  # Utility actions
│   │   ├── workflows/              # High-level workflows
│   │   ├── typed_base.py           # TypedStrategyAction base
│   │   ├── registry.py             # Global ACTION_REGISTRY
│   │   └── base.py                 # BaseStrategyAction
│   ├── api/                        # FastAPI service
│   │   ├── main.py                 # Server configuration
│   │   ├── routes/                 # REST endpoints
│   │   └── services/
│   │       └── mapper_service.py   # Job orchestration
│   ├── client/                     # Python client
│   │   └── client_v2.py            # BiomapperClient
│   ├── core/                       # Core library
│   │   ├── minimal_strategy_service.py  # Execution engine
│   │   ├── models/                 # Data models
│   │   ├── standards/              # 2025 standardizations
│   │   └── algorithms/             # Core algorithms
│   └── configs/
│       └── strategies/             # YAML strategies
│           ├── experimental/       # Development strategies
│           ├── metabolite/         # Metabolite-specific
│           └── protein/            # Protein-specific
├── tests/
│   └── unit/                       # Unit tests
│       ├── core/
│       │   └── strategy_actions/   # Action tests
│       └── strategy_actions/       # Legacy test location
└── docs/                           # Documentation
    └── source/
        └── architecture/           # Architecture docs

System Architecture

┌─────────────────────────────────────────────────────┐
│                   Client Layer                      │
│  • BiomapperClient (Python)                        │
│  • CLI Scripts                                     │
│  • Jupyter Notebooks                               │
└───────────────────┬─────────────────────────────────┘
                    │ HTTP/REST
┌───────────────────▼─────────────────────────────────┐
│                    API Layer                        │
│  • FastAPI Server (port 8000)                      │
│  • MapperService (job orchestration)               │
│  • Background job processing                        │
│  • SQLite persistence (biomapper.db)               │
└───────────────────┬─────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────┐
│                   Core Layer                        │
│  • MinimalStrategyService (execution engine)       │
│  • ACTION_REGISTRY (global action registry)        │
│  • TypedStrategyAction (base class)                │
│  • Execution Context (shared state)                │
└─────────────────────────────────────────────────────┘

Execution Flow

Client Request: BiomapperClient.run("strategy_name")
Job Creation: API creates background job with unique ID
Strategy Loading: MinimalStrategyService loads YAML from configs/
Action Resolution: ACTION_REGISTRY lookup for each step
Parameter Validation: Pydantic models validate action params
Sequential Execution: Actions execute via execute_typed()
Context Updates: Each action modifies shared context
Checkpointing: Progress saved to SQLite for recovery
Result Return: Via REST response or SSE stream

Key Design Principles

1. Self-Registration

Actions register automatically via @register_action decorator
No manual registration or executor modifications needed
Plugin-style extensibility

2. Type Safety

Pydantic models for parameter validation
TypedStrategyAction generic base class
Compile-time type hints with runtime validation
Backward compatibility during migration

3. Shared Execution Context

Actions communicate through shared Dict[str, Any]
Standard keys: datasets, statistics, output_files
Data flows between steps via named keys

4. Entity-Based Organization

Actions organized by biological entity type
Clear navigation: entities/proteins/, entities/metabolites/
Reusable algorithms in dedicated directories

5. Test-Driven Development

Write tests first, then implementation
Minimum 80% coverage requirement
All new actions must use TypedStrategyAction pattern

Creating New Actions

from actions.typed_base import TypedStrategyAction, StandardActionResult
from actions.registry import register_action
from pydantic import BaseModel, Field
from typing import Dict, Any
import pandas as pd

class MyActionParams(BaseModel):
    input_key: str = Field(..., description="Input dataset key")
    threshold: float = Field(0.8, ge=0.0, le=1.0)
    output_key: str = Field(..., description="Output dataset key")

@register_action("MY_ACTION")
class MyAction(TypedStrategyAction[MyActionParams, StandardActionResult]):
    def get_params_model(self) -> type[MyActionParams]:
        return MyActionParams
    
    async def execute_typed(self, params: MyActionParams, context: Dict[str, Any]) -> StandardActionResult:
        # Access input data
        datasets = context.get("datasets", {})
        input_data = datasets.get(params.input_key, pd.DataFrame())
        
        # Process data using pandas
        if not input_data.empty:
            processed = input_data[input_data["score"] >= params.threshold]
        else:
            processed = pd.DataFrame()
        
        # Store output
        if "datasets" not in context:
            context["datasets"] = {}
        context["datasets"][params.output_key] = processed
        
        return StandardActionResult(
            success=True,
            message=f"Processed {len(processed)} items",
            data={"input_count": len(input_data), "output_count": len(processed)}
        )

Action will auto-register - no other changes needed!

Strategy Configuration

Strategies are defined in YAML files:

name: "STRATEGY_NAME"
description: "Strategy description"

metadata:
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"

parameters:
  input_file: "${DATA_DIR}/input.tsv"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"

steps:
  - name: step_name
    action:
      type: ACTION_TYPE
      params:
        input_key: "dataset_key"
        output_key: "result_key"

See YAML Strategies for complete documentation.

Performance Considerations

Chunking: Large datasets processed via CHUNK_PROCESSOR action
Async Execution: All actions implement async execute_typed()
Caching: SQLite persistence for job recovery
Streaming: SSE for real-time progress without polling
Memory Management: Iterative processing for large files

Current Status

Comprehensive Actions: Core coverage of biological entities
Type Safety Migration: Most actions use TypedStrategyAction
Production Ready: Used in multiple research projects
Active Development: Regular additions based on research needs

Deployment

The system runs as a containerized FastAPI service with:

Async HTTP handling via uvicorn
Automatic API documentation via OpenAPI/Swagger
Health check endpoints
CORS support for web applications
SQLite job persistence
Background job processing

Future Enhancements

Planned Features

JSON schema generation for YAML validation
OpenAPI integration for auto-documentation
Web UI for strategy creation and monitoring
Advanced caching strategies
Parallel action execution support

Extensibility Points

Custom action types via registry
Alternative execution strategies
Different storage backends
Integration with external workflow systems

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

/biomapper/src/actions/ (Self-registering actions organized by entity type with comprehensive registry system)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class with execute_typed method pattern)
/biomapper/src/core/minimal_strategy_service.py (MinimalStrategyService execution engine implementation)
/biomapper/src/api/main.py (FastAPI server configuration with src-layout imports and uvicorn)
/biomapper/src/api/services/mapper_service.py (MapperService job orchestration and background processing)
/biomapper/src/client/client_v2.py (BiomapperClient synchronous wrapper with run() method)
/biomapper/src/configs/strategies/ (YAML strategy definitions and experimental configurations)
/biomapper/src/actions/load_dataset_identifiers.py (LOAD_DATASET_IDENTIFIERS implementation)
/biomapper/src/actions/registry.py (Global ACTION_REGISTRY and @register_action decorator)
/biomapper/CLAUDE.md (2025 standardizations, TDD development patterns, and validation rules)