BioMapper Architecture

Overview

BioMapper is a YAML-based workflow platform for biological data harmonization and ontology mapping. It is built on a self-registering action system with specialized actions for each entity type, and it provides workflows for mapping proteins, metabolites, chemistry data, and other biological entities.

The architecture follows a three-layer design (Client → API → Core) with type-safe actions, automatic validation, and extensibility through simple decorator-based registration.

Core Components

Self-Registering Action System

Actions automatically register at import time using the @register_action decorator, eliminating manual registration. The global ACTION_REGISTRY enables dynamic action lookup from YAML strategies.
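
A minimal sketch of how decorator-based registration can work. The names ACTION_REGISTRY and register_action come from this document; the bodies below, the duplicate-name check, and the EXAMPLE_ACTION class are assumptions for illustration, not the code in src/actions/registry.py.

```python
from typing import Callable, Dict

# Stand-in for the global registry described above.
ACTION_REGISTRY: Dict[str, type] = {}


def register_action(name: str) -> Callable[[type], type]:
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls: type) -> type:
        if name in ACTION_REGISTRY:
            # Assumption: this sketch treats duplicate names as an error.
            raise ValueError(f"Action {name!r} is already registered")
        ACTION_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


@register_action("EXAMPLE_ACTION")  # hypothetical action name
class ExampleAction:
    def describe(self) -> str:
        return "example"


# The `type` string from a YAML step is enough to find and build the action.
action = ACTION_REGISTRY["EXAMPLE_ACTION"]()
```

Because the decorator runs when the class statement is executed, registration only happens once the module that defines the action has been imported.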

YAML Strategy System

Strategies are declarative YAML workflow definitions that support variable substitution, metadata tracking, and parameter validation. Steps execute sequentially and share a single context.

Available Actions

Actions are grouped by purpose and biological entity type:

Data Operations:

  • LOAD_DATASET_IDENTIFIERS - Generic CSV/TSV loader

  • MERGE_DATASETS - Combine with deduplication

  • FILTER_DATASET - Complex filtering

  • EXPORT_DATASET - Multi-format export

  • CUSTOM_TRANSFORM_EXPRESSION - Dynamic transformations

Protein Actions:

  • PROTEIN_NORMALIZE_ACCESSIONS - Standardize UniProt IDs

  • PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract from compound fields

Metabolite Actions:

  • NIGHTINGALE_NMR_MATCH - Nightingale platform matching

  • SEMANTIC_METABOLITE_MATCH - AI-powered matching

Chemistry Actions:

  • CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy clinical test matching

Analysis & Reporting:

  • GENERATE_MAPPING_VISUALIZATIONS - Create visualization reports

  • GENERATE_LLM_ANALYSIS - AI-powered analysis reports

Infrastructure Actions:

  • SYNC_TO_GOOGLE_DRIVE_V2 - Upload to Google Drive

  • PARSE_COMPOSITE_IDENTIFIERS - Parse complex identifiers

  • CUSTOM_TRANSFORM - Apply custom transformations

REST API Layer

FastAPI service with:

  • Strategy execution endpoints (/api/strategies/v2/)

  • Job management with SQLite persistence

  • Background processing with checkpointing

  • Server-Sent Events for real-time progress

  • OpenAPI documentation
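
The progress stream can be pictured with the sketch below, which encodes events in the standard Server-Sent Events wire format (an `event:` line, a `data:` line, and a blank line). The event names, payload fields, and function names are hypothetical; the actual messages emitted by the API are not specified in this document.

```python
import json
from typing import Any, Dict, Iterator


def format_sse(event: str, payload: Dict[str, Any]) -> str:
    """Encode one Server-Sent Events message (event name plus JSON data)."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def progress_events(job_id: str, total_steps: int) -> Iterator[str]:
    """Yield one message per completed step, then a final 'done' message."""
    for step in range(1, total_steps + 1):
        yield format_sse(
            "progress", {"job_id": job_id, "step": step, "total": total_steps}
        )
    yield format_sse("done", {"job_id": job_id})


# Hypothetical job with two steps: two progress messages and one done message.
messages = list(progress_events("job-123", total_steps=2))
```

In a FastAPI service, a generator like this would typically be wrapped in a streaming response with the text/event-stream media type.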

Python Client Library

BiomapperClient in src/client/client_v2.py provides:

  • Synchronous wrapper for async operations

  • Automatic retry and error handling

  • Progress streaming support

  • Simple interface: client.run("strategy_name")
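
As a rough illustration of a synchronous wrapper with retry, the sketch below drives an async call to completion with asyncio.run and retries on connection errors. SyncClientSketch, flaky_send, and the retry policy are hypothetical stand-ins; they do not reproduce BiomapperClient.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional


class SyncClientSketch:
    """Hypothetical blocking facade over an async request function."""

    def __init__(
        self,
        send: Callable[[str], Awaitable[Dict[str, Any]]],
        max_retries: int = 3,
    ) -> None:
        self._send = send  # async function that performs the request
        self._max_retries = max_retries

    async def _run_async(self, strategy_name: str) -> Dict[str, Any]:
        last_error: Optional[Exception] = None
        for _ in range(self._max_retries):
            try:
                return await self._send(strategy_name)
            except ConnectionError as exc:  # retry only transient failures
                last_error = exc
                await asyncio.sleep(0)  # a real client would back off here
        raise RuntimeError(
            f"Gave up after {self._max_retries} attempts"
        ) from last_error

    def run(self, strategy_name: str) -> Dict[str, Any]:
        """Blocking entry point: runs the coroutine to completion."""
        return asyncio.run(self._run_async(strategy_name))


attempts = {"count": 0}


async def flaky_send(strategy_name: str) -> Dict[str, Any]:
    """Fake transport that fails once, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("temporary failure")
    return {"strategy": strategy_name, "status": "completed"}


result = SyncClientSketch(flaky_send).run("strategy_name")
```

Note that asyncio.run cannot be called from inside an already running event loop, so a wrapper built this way needs extra care in environments such as Jupyter.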

Core Execution Engine

MinimalStrategyService in src/core/minimal_strategy_service.py:

  • Direct YAML loading from src/configs/strategies/

  • Sequential action execution with error handling

  • Variable substitution (${parameters.key}, ${env.VAR})

  • Shared execution context management

  • No database dependencies
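
The two placeholder forms listed above could be resolved roughly as follows. This is a sketch under assumptions: the function name, the error raised for unresolved placeholders, and the recursive handling of nested values are illustrative and are not taken from MinimalStrategyService.

```python
import re
from typing import Any, Mapping

_PLACEHOLDER = re.compile(r"\$\{(parameters|env)\.([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(value: Any, parameters: Mapping[str, Any], env: Mapping[str, str]) -> Any:
    """Recursively replace ${parameters.key} and ${env.VAR} inside strings."""
    if isinstance(value, str):
        def lookup(match: "re.Match[str]") -> str:
            namespace, key = match.group(1), match.group(2)
            source = parameters if namespace == "parameters" else env
            if key not in source:
                raise KeyError(f"Unresolved placeholder: {match.group(0)}")
            return str(source[key])
        return _PLACEHOLDER.sub(lookup, value)
    if isinstance(value, dict):
        return {k: substitute(v, parameters, env) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute(v, parameters, env) for v in value]
    return value  # numbers, booleans, and None pass through untouched


# Hypothetical step parameters; in practice `env` would be os.environ.
step_params = {
    "file_path": "${env.DATA_DIR}/${parameters.input_name}.tsv",
    "columns": ["${parameters.id_column}"],
    "threshold": 0.8,
}
resolved = substitute(
    step_params,
    parameters={"input_name": "proteins", "id_column": "uniprot"},
    env={"DATA_DIR": "/data"},
)
```

Failing loudly on an unresolved placeholder is a design choice of this sketch; it surfaces configuration mistakes before any action runs.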

Directory Structure

biomapper/
├── src/                            # Main source directory
│   ├── actions/                    # Self-registering actions
│   │   ├── entities/               # Entity-specific actions
│   │   │   ├── proteins/           # UniProt, Ensembl actions
│   │   │   ├── metabolites/        # HMDB, CHEBI, KEGG actions
│   │   │   └── chemistry/          # LOINC, clinical test actions
│   │   ├── algorithms/             # Analysis algorithms  
│   │   ├── io/                     # Import/export actions
│   │   ├── utils/                  # Utility actions
│   │   ├── workflows/              # High-level workflows
│   │   ├── typed_base.py           # TypedStrategyAction base
│   │   ├── registry.py             # Global ACTION_REGISTRY
│   │   └── base.py                 # BaseStrategyAction
│   ├── api/                        # FastAPI service
│   │   ├── main.py                 # Server configuration
│   │   ├── routes/                 # REST endpoints
│   │   └── services/
│   │       └── mapper_service.py   # Job orchestration
│   ├── client/                     # Python client
│   │   └── client_v2.py            # BiomapperClient
│   ├── core/                       # Core library
│   │   ├── minimal_strategy_service.py  # Execution engine
│   │   ├── models/                 # Data models
│   │   ├── standards/              # 2025 standardizations
│   │   └── algorithms/             # Core algorithms
│   └── configs/
│       └── strategies/             # YAML strategies
│           ├── experimental/       # Development strategies
│           ├── metabolite/         # Metabolite-specific
│           └── protein/            # Protein-specific
├── tests/
│   └── unit/                       # Unit tests
│       ├── core/
│       │   └── strategy_actions/   # Action tests
│       └── strategy_actions/       # Legacy test location
└── docs/                           # Documentation
    └── source/
        └── architecture/           # Architecture docs

System Architecture

┌─────────────────────────────────────────────────────┐
│                   Client Layer                      │
│  • BiomapperClient (Python)                        │
│  • CLI Scripts                                     │
│  • Jupyter Notebooks                               │
└───────────────────┬─────────────────────────────────┘
                    │ HTTP/REST
┌───────────────────▼─────────────────────────────────┐
│                    API Layer                        │
│  • FastAPI Server (port 8000)                      │
│  • MapperService (job orchestration)               │
│  • Background job processing                        │
│  • SQLite persistence (biomapper.db)               │
└───────────────────┬─────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────┐
│                   Core Layer                        │
│  • MinimalStrategyService (execution engine)       │
│  • ACTION_REGISTRY (global action registry)        │
│  • TypedStrategyAction (base class)                │
│  • Execution Context (shared state)                │
└─────────────────────────────────────────────────────┘

Execution Flow

  1. Client Request: BiomapperClient.run("strategy_name")

  2. Job Creation: API creates background job with unique ID

  3. Strategy Loading: MinimalStrategyService loads YAML from src/configs/strategies/

  4. Action Resolution: ACTION_REGISTRY lookup for each step

  5. Parameter Validation: Pydantic models validate action params

  6. Sequential Execution: Actions execute via execute_typed()

  7. Context Updates: Each action modifies shared context

  8. Checkpointing: Progress saved to SQLite for recovery

  9. Result Return: Via REST response or SSE stream
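
Steps 3 through 7 can be sketched as a small loop: look up each step's action type, run it, and let it update the shared context. Everything here is a simplified stand-in. REGISTRY replaces ACTION_REGISTRY, the two toy actions and their plain `execute` method are hypothetical, and the container types chosen for the context keys are assumptions.

```python
import asyncio
from typing import Any, Dict

REGISTRY: Dict[str, type] = {}  # stand-in for ACTION_REGISTRY


class LoadNumbers:
    """Toy action: writes a list of values into the shared context."""
    async def execute(self, params: Dict[str, Any], context: Dict[str, Any]) -> None:
        context["datasets"][params["output_key"]] = list(params["values"])


class KeepAtLeast:
    """Toy action: filters a dataset written by an earlier step."""
    async def execute(self, params: Dict[str, Any], context: Dict[str, Any]) -> None:
        data = context["datasets"][params["input_key"]]
        context["datasets"][params["output_key"]] = [
            x for x in data if x >= params["minimum"]
        ]


REGISTRY["LOAD_NUMBERS"] = LoadNumbers
REGISTRY["KEEP_AT_LEAST"] = KeepAtLeast


async def run_strategy(strategy: Dict[str, Any]) -> Dict[str, Any]:
    # Standard keys from this document; the container types are assumptions.
    context: Dict[str, Any] = {"datasets": {}, "statistics": {}, "output_files": []}
    for step in strategy["steps"]:
        action_type = step["action"]["type"]
        action_cls = REGISTRY.get(action_type)  # action resolution
        if action_cls is None:
            raise KeyError(f"Unknown action type: {action_type}")
        await action_cls().execute(step["action"].get("params", {}), context)
        context["statistics"][step["name"]] = "completed"
    return context


# A parsed strategy, shaped like the YAML layout used by this project.
strategy = {
    "name": "DEMO",
    "steps": [
        {"name": "load", "action": {"type": "LOAD_NUMBERS",
            "params": {"values": [0.2, 0.9, 0.85], "output_key": "raw"}}},
        {"name": "filter", "action": {"type": "KEEP_AT_LEAST",
            "params": {"input_key": "raw", "minimum": 0.8, "output_key": "kept"}}},
    ],
}
final_context = asyncio.run(run_strategy(strategy))
```

The second step finds its input only because the first step stored it under the key "raw"; that named-key handoff is the whole communication mechanism between steps.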

Key Design Principles

1. Self-Registration

  • Actions register automatically via @register_action decorator

  • No manual registration or executor modifications needed

  • Plugin-style extensibility

2. Type Safety

  • Pydantic models for parameter validation

  • TypedStrategyAction generic base class

  • Static type hints with runtime validation

  • Backward compatibility during migration
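
To make the pattern concrete, here is a hedged sketch of a generic base class that validates raw parameters before calling a typed method. TypedActionSketch and CountAbove are hypothetical, and a standard-library dataclass stands in for the Pydantic model so the example has no third-party dependencies; the real TypedStrategyAction may differ.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Generic, Type, TypeVar

TParams = TypeVar("TParams")
TResult = TypeVar("TResult")


class TypedActionSketch(ABC, Generic[TParams, TResult]):
    """Hypothetical base: validate a raw dict, then call the typed method."""

    @abstractmethod
    def get_params_model(self) -> Type[TParams]:
        ...

    @abstractmethod
    async def execute_typed(self, params: TParams, context: Dict[str, Any]) -> TResult:
        ...

    async def execute(self, raw_params: Dict[str, Any], context: Dict[str, Any]) -> TResult:
        # Building the model is the validation step; bad input raises here,
        # before any action logic runs.
        params = self.get_params_model()(**raw_params)
        return await self.execute_typed(params, context)


@dataclass
class ThresholdParams:  # dataclass stand-in for a Pydantic BaseModel
    input_key: str
    threshold: float = 0.8

    def __post_init__(self) -> None:
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")


class CountAbove(TypedActionSketch[ThresholdParams, int]):
    def get_params_model(self) -> Type[ThresholdParams]:
        return ThresholdParams

    async def execute_typed(self, params: ThresholdParams, context: Dict[str, Any]) -> int:
        scores = context["datasets"][params.input_key]
        return sum(1 for s in scores if s >= params.threshold)


context = {"datasets": {"scores": [0.5, 0.8, 0.95]}}
count = asyncio.run(CountAbove().execute({"input_key": "scores"}, context))
```

The type parameters give static checkers something to verify, while the model constructor enforces the same contract on untyped YAML input at runtime.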

3. Shared Execution Context

  • Actions communicate through shared Dict[str, Any]

  • Standard keys: datasets, statistics, output_files

  • Data flows between steps via named keys

4. Entity-Based Organization

  • Actions organized by biological entity type

  • Clear navigation: entities/proteins/, entities/metabolites/

  • Reusable algorithms in dedicated directories

5. Test-Driven Development

  • Write tests first, then implementation

  • Minimum 80% coverage requirement

  • All new actions must use TypedStrategyAction pattern

Creating New Actions

from actions.typed_base import TypedStrategyAction, StandardActionResult
from actions.registry import register_action
from pydantic import BaseModel, Field
from typing import Dict, Any
import pandas as pd

class MyActionParams(BaseModel):
    input_key: str = Field(..., description="Input dataset key")
    threshold: float = Field(0.8, ge=0.0, le=1.0)
    output_key: str = Field(..., description="Output dataset key")

@register_action("MY_ACTION")
class MyAction(TypedStrategyAction[MyActionParams, StandardActionResult]):
    def get_params_model(self) -> type[MyActionParams]:
        return MyActionParams
    
    async def execute_typed(self, params: MyActionParams, context: Dict[str, Any]) -> StandardActionResult:
        # Read the input dataset; a missing key yields an empty DataFrame
        datasets = context.setdefault("datasets", {})
        input_data = datasets.get(params.input_key, pd.DataFrame())
        
        # Keep only rows whose "score" meets the threshold
        if not input_data.empty:
            processed = input_data[input_data["score"] >= params.threshold]
        else:
            processed = pd.DataFrame()
        
        # Store the result in the shared context for later steps
        datasets[params.output_key] = processed
        
        return StandardActionResult(
            success=True,
            message=f"Processed {len(processed)} items",
            data={"input_count": len(input_data), "output_count": len(processed)}
        )

The action registers itself when its module is imported; no other changes are needed.

Strategy Configuration

Strategies are defined in YAML files:

name: "STRATEGY_NAME"
description: "Strategy description"

metadata:
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"

parameters:
  input_file: "${DATA_DIR}/input.tsv"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"

steps:
  - name: step_name
    action:
      type: ACTION_TYPE
      params:
        input_key: "dataset_key"
        output_key: "result_key"

See YAML Strategies for complete documentation.
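
The `${DATA_DIR}` and `${OUTPUT_DIR:-/tmp/results}` values in the parameters block resemble shell parameter expansion. The sketch below expands them with shell-like rules (a default applies when the variable is unset or empty). Whether BioMapper resolves them exactly this way is an assumption, and expand_env is a hypothetical helper.

```python
import re
from typing import Mapping

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(text: str, env: Mapping[str, str]) -> str:
    """Expand ${VAR} and ${VAR:-default} using shell-like rules."""
    def replace(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if default is None:
            # Plain ${VAR}: the variable must exist.
            if value is None:
                raise KeyError(f"{name} is not set and has no default")
            return value
        # ${VAR:-default}: fall back when unset or empty.
        return value if value else default
    return _ENV_REF.sub(replace, text)


env = {"DATA_DIR": "/data"}  # in practice this would be os.environ
input_file = expand_env("${DATA_DIR}/input.tsv", env)
output_dir = expand_env("${OUTPUT_DIR:-/tmp/results}", env)
```

With only DATA_DIR defined, the input path uses the environment value and the output directory falls back to its default.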

Performance Considerations

  • Chunking: Large datasets processed via CHUNK_PROCESSOR action

  • Async Execution: All actions implement async execute_typed()

  • Persistence: SQLite checkpoints allow interrupted jobs to be recovered

  • Streaming: SSE for real-time progress without polling

  • Memory Management: Iterative processing for large files
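
SQLite-backed job recovery can be sketched with the standard sqlite3 module. The table name, columns, and helper functions are hypothetical, and the context is stored as JSON for simplicity; a real context holding DataFrames would need a different serialization.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


def init_store(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) checkpoint table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "job_id TEXT PRIMARY KEY, "
        "step_index INTEGER NOT NULL, "
        "context_json TEXT NOT NULL)"
    )


def save_checkpoint(
    conn: sqlite3.Connection, job_id: str, step_index: int, context: Dict[str, Any]
) -> None:
    """Overwrite the job's checkpoint with the latest completed step."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (job_id, step_index, context_json) "
            "VALUES (?, ?, ?)",
            (job_id, step_index, json.dumps(context)),
        )


def load_checkpoint(conn: sqlite3.Connection, job_id: str) -> Optional[Dict[str, Any]]:
    """Return the last saved step and context, or None for an unknown job."""
    row = conn.execute(
        "SELECT step_index, context_json FROM checkpoints WHERE job_id = ?",
        (job_id,),
    ).fetchone()
    if row is None:
        return None
    return {"step_index": row[0], "context": json.loads(row[1])}


conn = sqlite3.connect(":memory:")  # in-memory database for the example
init_store(conn)
save_checkpoint(conn, "job-123", 1, {"statistics": {"load": "completed"}})
save_checkpoint(
    conn, "job-123", 2, {"statistics": {"load": "completed", "filter": "completed"}}
)
restored = load_checkpoint(conn, "job-123")
```

On restart, a job runner could resume from the step after `restored["step_index"]` instead of starting over.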

Current Status

  • Action Coverage: Actions cover proteins, metabolites, and chemistry data

  • Type Safety Migration: Most actions use TypedStrategyAction

  • Production Ready: Used in multiple research projects

  • Active Development: Regular additions based on research needs

Deployment

The system runs as a containerized FastAPI service with:

  • Async HTTP handling via uvicorn

  • Automatic API documentation via OpenAPI/Swagger

  • Health check endpoints

  • CORS support for web applications

  • SQLite job persistence

  • Background job processing

Future Enhancements

Planned Features

  • JSON schema generation for YAML validation

  • OpenAPI integration for auto-documentation

  • Web UI for strategy creation and monitoring

  • Advanced caching strategies

  • Parallel action execution support

Extensibility Points

  • Custom action types via registry

  • Alternative execution strategies

  • Different storage backends

  • Integration with external workflow systems


Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

  • /biomapper/src/actions/ (Self-registering actions organized by entity type with comprehensive registry system)

  • /biomapper/src/actions/typed_base.py (TypedStrategyAction base class with execute_typed method pattern)

  • /biomapper/src/core/minimal_strategy_service.py (MinimalStrategyService execution engine implementation)

  • /biomapper/src/api/main.py (FastAPI server configuration with src-layout imports and uvicorn)

  • /biomapper/src/api/services/mapper_service.py (MapperService job orchestration and background processing)

  • /biomapper/src/client/client_v2.py (BiomapperClient synchronous wrapper with run() method)

  • /biomapper/src/configs/strategies/ (YAML strategy definitions and experimental configurations)

  • /biomapper/src/actions/load_dataset_identifiers.py (LOAD_DATASET_IDENTIFIERS implementation)

  • /biomapper/src/actions/registry.py (Global ACTION_REGISTRY and @register_action decorator)

  • /biomapper/CLAUDE.md (2025 standardizations, TDD development patterns, and validation rules)