BioMapper Architecture
Overview
BioMapper is a YAML-based workflow platform for biological data harmonization and ontology mapping. Built on a self-registering action system with specialized actions for each entity type, it provides workflows for mapping proteins, metabolites, chemistry data, and other biological entities.
The architecture follows a three-layer design (Client → API → Core) with type-safe actions, automatic validation, and extensibility through simple decorator-based registration.
Core Components
Self-Registering Action System
Actions automatically register at import time using the @register_action decorator, eliminating manual registration. The global ACTION_REGISTRY enables dynamic action lookup from YAML strategies.
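The registration mechanism can be pictured with a minimal sketch. This is an assumption about how a decorator-based registry might be structured, not the contents of registry.py: only the names register_action and ACTION_REGISTRY come from this document, while ExampleAction, "EXAMPLE_ACTION", and the duplicate-name check are hypothetical.

```python
from typing import Callable, Dict, Type

# Global mapping from action name (as referenced in YAML) to the implementing class.
ACTION_REGISTRY: Dict[str, Type] = {}


def register_action(name: str) -> Callable[[Type], Type]:
    """Return a class decorator that records the decorated class under `name`."""

    def decorator(cls: Type) -> Type:
        if name in ACTION_REGISTRY:
            # Hypothetical policy: refuse to silently overwrite an existing name.
            raise ValueError(f"Action {name!r} is already registered")
        ACTION_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged

    return decorator


@register_action("EXAMPLE_ACTION")  # hypothetical action name
class ExampleAction:
    """Placeholder action used only to demonstrate registration."""


# Looking an action up by the name a YAML step would reference:
action_cls = ACTION_REGISTRY["EXAMPLE_ACTION"]
```

Because the decorator runs when the class body is evaluated, importing the module that defines an action is enough to make it available for lookup.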
YAML Strategy System
Strategies are declarative workflow definitions with variable substitution, metadata tracking, and parameter validation. The steps of a strategy execute sequentially and share a single context.
Available Actions
Organized by biological entity type:
Data Operations:
LOAD_DATASET_IDENTIFIERS - Generic CSV/TSV loader
MERGE_DATASETS - Combine with deduplication
FILTER_DATASET - Complex filtering
EXPORT_DATASET - Multi-format export
CUSTOM_TRANSFORM_EXPRESSION - Dynamic transformations
Protein Actions:
PROTEIN_NORMALIZE_ACCESSIONS - Standardize UniProt IDs
PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract from compound fields
Metabolite Actions:
NIGHTINGALE_NMR_MATCH - Nightingale platform matching
SEMANTIC_METABOLITE_MATCH - AI-powered matching
Chemistry Actions:
CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy clinical test matching
Analysis & Reporting:
GENERATE_MAPPING_VISUALIZATIONS - Create visualization reports
GENERATE_LLM_ANALYSIS - AI-powered analysis reports
Infrastructure Actions:
SYNC_TO_GOOGLE_DRIVE_V2 - Upload to Google Drive
PARSE_COMPOSITE_IDENTIFIERS - Parse complex identifiers
CUSTOM_TRANSFORM - Apply custom transformations
REST API Layer
FastAPI service with:
Strategy execution endpoints (/api/strategies/v2/)
Job management with SQLite persistence
Background processing with checkpointing
Server-Sent Events for real-time progress
OpenAPI documentation
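To make the Server-Sent Events item concrete, the sketch below formats progress updates in the SSE wire format (an event line, a data line, and a blank line). The payload fields (job_id, step, completed, total) and the event names are hypothetical and are not taken from the BioMapper API.

```python
import json
from typing import Any, Dict, Iterator, List


def format_sse(data: Dict[str, Any], event: str = "progress") -> str:
    """Encode one SSE message: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_stream(job_id: str, step_names: List[str]) -> Iterator[str]:
    """Yield one SSE message per completed step, then a final 'done' message."""
    total = len(step_names)
    for index, name in enumerate(step_names, start=1):
        yield format_sse({"job_id": job_id, "step": name, "completed": index, "total": total})
    yield format_sse({"job_id": job_id, "status": "completed"}, event="done")


messages = list(progress_stream("job-1", ["load", "normalize"]))
```

In a FastAPI route, a generator like this could be returned through a StreamingResponse with media type text/event-stream; whether BioMapper does so is not stated here.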
Python Client Library
BiomapperClient in src/client/client_v2.py provides:
Synchronous wrapper for async operations
Automatic retry and error handling
Progress streaming support
Simple interface: client.run("strategy_name")
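The "synchronous wrapper with automatic retry" idea can be illustrated with a small stand-in class. This is a sketch of the pattern only: SyncClientSketch, its constructor arguments, and flaky_transport are hypothetical and do not reflect the actual BiomapperClient implementation or signature.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional


class SyncClientSketch:
    """Hypothetical stand-in showing the sync-over-async pattern with retries."""

    def __init__(self, transport: Callable[[str], Awaitable[Dict[str, Any]]],
                 max_retries: int = 3, backoff_seconds: float = 0.0) -> None:
        self._transport = transport  # async callable that submits a strategy
        self._max_retries = max_retries
        self._backoff_seconds = backoff_seconds

    async def _run_async(self, strategy_name: str) -> Dict[str, Any]:
        last_error: Optional[Exception] = None
        for attempt in range(1, self._max_retries + 1):
            try:
                return await self._transport(strategy_name)
            except ConnectionError as exc:  # retry only transient failures
                last_error = exc
                await asyncio.sleep(self._backoff_seconds * attempt)
        raise RuntimeError(f"Gave up after {self._max_retries} attempts") from last_error

    def run(self, strategy_name: str) -> Dict[str, Any]:
        """Blocking entry point: drives the coroutine to completion."""
        return asyncio.run(self._run_async(strategy_name))


calls = {"count": 0}


async def flaky_transport(strategy_name: str) -> Dict[str, Any]:
    """Fake transport that fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("temporary failure")
    return {"strategy": strategy_name, "status": "completed"}


result = SyncClientSketch(flaky_transport).run("strategy_name")
```

The caller sees a plain blocking call, while the retry loop and the event loop stay hidden inside the wrapper.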
Core Execution Engine
MinimalStrategyService in src/core/minimal_strategy_service.py:
Direct YAML loading from src/configs/strategies/
Sequential action execution with error handling
Variable substitution (${parameters.key}, ${env.VAR})
Shared execution context management
No database dependencies
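The variable substitution listed above can be sketched as a recursive walk over step parameters. The function name substitute, the choice to raise on unknown references, and the sample values are assumptions for illustration; the behaviour of MinimalStrategyService may differ.

```python
import re
from typing import Any, Mapping

# Matches ${parameters.key} or ${env.VAR}; group 1 is the namespace, group 2 the name.
_REFERENCE = re.compile(r"\$\{(parameters|env)\.([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(value: Any, parameters: Mapping[str, Any], env: Mapping[str, str]) -> Any:
    """Recursively replace ${parameters.key} and ${env.VAR} references in strings."""
    if isinstance(value, str):
        def replace(match: "re.Match[str]") -> str:
            namespace, name = match.group(1), match.group(2)
            source = parameters if namespace == "parameters" else env
            if name not in source:
                # Hypothetical policy: fail loudly on unknown references.
                raise KeyError(f"Unresolved reference: {match.group(0)}")
            return str(source[name])

        return _REFERENCE.sub(replace, value)
    if isinstance(value, dict):
        return {k: substitute(v, parameters, env) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute(v, parameters, env) for v in value]
    return value  # numbers, booleans, and None pass through untouched


step_params = {
    "file_path": "${env.DATA_DIR}/${parameters.input_name}.tsv",
    "threshold": 0.8,
}
resolved = substitute(
    step_params,
    parameters={"input_name": "proteins"},
    env={"DATA_DIR": "/data"},
)
# resolved["file_path"] is "/data/proteins.tsv"; threshold is unchanged
```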
Directory Structure
biomapper/
├── src/                                 # Main source directory
│   ├── actions/                         # Self-registering actions
│   │   ├── entities/                    # Entity-specific actions
│   │   │   ├── proteins/                # UniProt, Ensembl actions
│   │   │   ├── metabolites/             # HMDB, CHEBI, KEGG actions
│   │   │   └── chemistry/               # LOINC, clinical test actions
│   │   ├── algorithms/                  # Analysis algorithms
│   │   ├── io/                          # Import/export actions
│   │   ├── utils/                       # Utility actions
│   │   ├── workflows/                   # High-level workflows
│   │   ├── typed_base.py                # TypedStrategyAction base
│   │   ├── registry.py                  # Global ACTION_REGISTRY
│   │   └── base.py                      # BaseStrategyAction
│   ├── api/                             # FastAPI service
│   │   ├── main.py                      # Server configuration
│   │   ├── routes/                      # REST endpoints
│   │   └── services/
│   │       └── mapper_service.py        # Job orchestration
│   ├── client/                          # Python client
│   │   └── client_v2.py                 # BiomapperClient
│   ├── core/                            # Core library
│   │   ├── minimal_strategy_service.py  # Execution engine
│   │   ├── models/                      # Data models
│   │   ├── standards/                   # 2025 standardizations
│   │   └── algorithms/                  # Core algorithms
│   └── configs/
│       └── strategies/                  # YAML strategies
│           ├── experimental/            # Development strategies
│           ├── metabolite/              # Metabolite-specific
│           └── protein/                 # Protein-specific
├── tests/
│   └── unit/                            # Unit tests
│       ├── core/
│       │   └── strategy_actions/        # Action tests
│       └── strategy_actions/            # Legacy test location
└── docs/                                # Documentation
    └── source/
        └── architecture/                # Architecture docs
System Architecture
┌─────────────────────────────────────────────────────┐
│                    Client Layer                     │
│  • BiomapperClient (Python)                         │
│  • CLI Scripts                                      │
│  • Jupyter Notebooks                                │
└───────────────────┬─────────────────────────────────┘
                    │ HTTP/REST
┌───────────────────▼─────────────────────────────────┐
│                      API Layer                      │
│  • FastAPI Server (port 8000)                       │
│  • MapperService (job orchestration)                │
│  • Background job processing                        │
│  • SQLite persistence (biomapper.db)                │
└───────────────────┬─────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────┐
│                     Core Layer                      │
│  • MinimalStrategyService (execution engine)        │
│  • ACTION_REGISTRY (global action registry)         │
│  • TypedStrategyAction (base class)                 │
│  • Execution Context (shared state)                 │
└─────────────────────────────────────────────────────┘
Execution Flow
Client Request: BiomapperClient.run("strategy_name")
Job Creation: API creates background job with unique ID
Strategy Loading: MinimalStrategyService loads YAML from src/configs/strategies/
Action Resolution: ACTION_REGISTRY lookup for each step
Parameter Validation: Pydantic models validate action params
Sequential Execution: Actions execute via execute_typed()
Context Updates: Each action modifies shared context
Checkpointing: Progress saved to SQLite for recovery
Result Return: Via REST response or SSE stream
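The core of this flow (registry lookup, sequential execution, shared context) can be sketched in a few lines. This is a simplified illustration, not the MinimalStrategyService code: it passes plain dicts instead of validated Pydantic models, omits job creation and checkpointing, and uses a hypothetical UPPERCASE action with a method named execute rather than execute_typed.

```python
import asyncio
from typing import Any, Dict, Type

ACTION_REGISTRY: Dict[str, Type] = {}  # stand-in for the global registry


class UppercaseAction:  # hypothetical action used only for this sketch
    async def execute(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        items = context["datasets"][params["input_key"]]
        context["datasets"][params["output_key"]] = [s.upper() for s in items]
        return {"success": True}


ACTION_REGISTRY["UPPERCASE"] = UppercaseAction


async def run_strategy(strategy: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Run each step in order, sharing one mutable context between them."""
    for step in strategy["steps"]:
        action_type = step["action"]["type"]
        action_cls = ACTION_REGISTRY.get(action_type)
        if action_cls is None:
            raise ValueError(f"Unknown action type: {action_type}")
        result = await action_cls().execute(step["action"].get("params", {}), context)
        if not result.get("success"):
            raise RuntimeError(f"Step {step['name']!r} failed")
    return context


strategy = {  # the shape a parsed YAML strategy might have
    "name": "DEMO",
    "steps": [
        {
            "name": "upper",
            "action": {
                "type": "UPPERCASE",
                "params": {"input_key": "raw", "output_key": "clean"},
            },
        },
    ],
}
final_context = asyncio.run(
    run_strategy(strategy, {"datasets": {"raw": ["p12345", "q67890"]}})
)
```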
Key Design Principles
1. Self-Registration
Actions register automatically via the @register_action decorator
No manual registration or executor modifications needed
Plugin-style extensibility
2. Type Safety
Pydantic models for parameter validation
TypedStrategyAction generic base class
Static type hints with runtime validation
Backward compatibility during migration
4. Entity-Based Organization
Actions organized by biological entity type
Clear navigation: entities/proteins/, entities/metabolites/
Reusable algorithms in dedicated directories
5. Test-Driven Development
Write tests first, then implementation
Minimum 80% coverage requirement
All new actions must use TypedStrategyAction pattern
Creating New Actions
from actions.typed_base import TypedStrategyAction, StandardActionResult
from actions.registry import register_action
from pydantic import BaseModel, Field
from typing import Dict, Any
import pandas as pd


class MyActionParams(BaseModel):
    input_key: str = Field(..., description="Input dataset key")
    threshold: float = Field(0.8, ge=0.0, le=1.0)
    output_key: str = Field(..., description="Output dataset key")


@register_action("MY_ACTION")
class MyAction(TypedStrategyAction[MyActionParams, StandardActionResult]):

    def get_params_model(self) -> type[MyActionParams]:
        return MyActionParams

    async def execute_typed(self, params: MyActionParams, context: Dict[str, Any]) -> StandardActionResult:
        # Access input data
        datasets = context.get("datasets", {})
        input_data = datasets.get(params.input_key, pd.DataFrame())

        # Process data using pandas
        if not input_data.empty:
            processed = input_data[input_data["score"] >= params.threshold]
        else:
            processed = pd.DataFrame()

        # Store output
        if "datasets" not in context:
            context["datasets"] = {}
        context["datasets"][params.output_key] = processed

        return StandardActionResult(
            success=True,
            message=f"Processed {len(processed)} items",
            data={"input_count": len(input_data), "output_count": len(processed)}
        )
The action registers itself when its module is imported; no other changes are needed.
Strategy Configuration
Strategies are defined in YAML files:
name: "STRATEGY_NAME"
description: "Strategy description"
metadata:
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"
parameters:
  input_file: "${DATA_DIR}/input.tsv"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"
steps:
  - name: step_name
    action:
      type: ACTION_TYPE
      params:
        input_key: "dataset_key"
        output_key: "result_key"
See YAML Strategies for complete documentation.
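The parameters in the example use shell-style references such as ${DATA_DIR} and ${OUTPUT_DIR:-/tmp/results}. The sketch below shows one way such references could be expanded, assuming the usual shell meaning of ":-" (use the default when the variable is unset or empty). The function name expand and the error raised for a missing variable are assumptions, not documented BioMapper behaviour.

```python
import re
from typing import Mapping

# ${NAME} or ${NAME:-default}; group 1 is the variable, group 2 the optional default.
_ENV_REFERENCE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(text: str, env: Mapping[str, str]) -> str:
    """Expand shell-style references, falling back to the default when needed."""

    def replace(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:  # set and non-empty
            return value
        if default is not None:  # unset or empty, but a fallback was given
            return default
        if value is None:
            raise KeyError(f"Environment variable {name} is not set and has no default")
        return value  # set but empty, no default

    return _ENV_REFERENCE.sub(replace, text)


env = {"DATA_DIR": "/data"}  # OUTPUT_DIR deliberately left unset
input_file = expand("${DATA_DIR}/input.tsv", env)
output_dir = expand("${OUTPUT_DIR:-/tmp/results}", env)
```

With this environment, input_file resolves under /data and output_dir falls back to /tmp/results.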
Performance Considerations
Chunking: Large datasets processed via CHUNK_PROCESSOR action
Async Execution: All actions implement async execute_typed()
Caching: SQLite persistence for job recovery
Streaming: SSE for real-time progress without polling
Memory Management: Iterative processing for large files
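Chunked, iterative processing can be illustrated with a generator that reads a delimited file a fixed number of rows at a time. This is a generic sketch using the standard csv module; it does not describe the internals of the CHUNK_PROCESSOR action, and iter_chunks and the sample data are hypothetical.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunk_size: int, delimiter: str = "\t") -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunk_size` rows so only one chunk is held in memory."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    chunk: List[Dict[str, str]] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


# In-memory TSV standing in for a large file on disk.
sample = io.StringIO("id\tscore\nP1\t0.9\nP2\t0.4\nP3\t0.85\n")
kept = 0
chunk_sizes = []
for chunk in iter_chunks(sample, chunk_size=2):
    chunk_sizes.append(len(chunk))
    # Count rows at or above a 0.8 score threshold, one chunk at a time.
    kept += sum(1 for row in chunk if float(row["score"]) >= 0.8)
```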
Current Status
Comprehensive Actions: Core coverage of biological entities
Type Safety Migration: Most actions use TypedStrategyAction
Production Ready: Used in multiple research projects
Active Development: Regular additions based on research needs
Deployment
The system runs as a containerized FastAPI service with:
Async HTTP handling via uvicorn
Automatic API documentation via OpenAPI/Swagger
Health check endpoints
CORS support for web applications
SQLite job persistence
Background job processing
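SQLite job persistence with checkpointing could look roughly like the sketch below, which stores the index of the last completed step together with a JSON snapshot of the context. The table name, columns, and helper functions are hypothetical; the real schema of biomapper.db is not described in this document, and contexts holding DataFrames would need a different serialization than JSON.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) checkpoint table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "job_id TEXT PRIMARY KEY, step_index INTEGER NOT NULL, context_json TEXT NOT NULL)"
    )


def save_checkpoint(conn: sqlite3.Connection, job_id: str, step_index: int,
                    context: Dict[str, Any]) -> None:
    """Insert or overwrite the latest checkpoint for a job."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (job_id, step_index, context_json) "
            "VALUES (?, ?, ?)",
            (job_id, step_index, json.dumps(context)),
        )


def load_checkpoint(conn: sqlite3.Connection, job_id: str) -> Optional[Dict[str, Any]]:
    """Return the last saved step index and context, or None if the job is unknown."""
    row = conn.execute(
        "SELECT step_index, context_json FROM checkpoints WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None
    return {"step_index": row[0], "context": json.loads(row[1])}


conn = sqlite3.connect(":memory:")  # a real service would use a file on disk
init_db(conn)
save_checkpoint(conn, "job-1", 0, {"datasets": {"raw": ["P12345"]}})
save_checkpoint(conn, "job-1", 1, {"datasets": {"raw": ["P12345"], "clean": ["P12345"]}})
resume_point = load_checkpoint(conn, "job-1")
```

A recovering worker could read resume_point and continue from the step after step_index instead of restarting the strategy.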
Future Enhancements
Planned Features
JSON schema generation for YAML validation
OpenAPI integration for auto-documentation
Web UI for strategy creation and monitoring
Advanced caching strategies
Parallel action execution support
Extensibility Points
Custom action types via registry
Alternative execution strategies
Different storage backends
Integration with external workflow systems
Verification Sources
Last verified: 2025-08-22
This documentation was verified against the following project resources:
/biomapper/src/actions/ (Self-registering actions organized by entity type with comprehensive registry system)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class with execute_typed method pattern)
/biomapper/src/core/minimal_strategy_service.py (MinimalStrategyService execution engine implementation)
/biomapper/src/api/main.py (FastAPI server configuration with src-layout imports and uvicorn)
/biomapper/src/api/services/mapper_service.py (MapperService job orchestration and background processing)
/biomapper/src/client/client_v2.py (BiomapperClient synchronous wrapper with run() method)
/biomapper/src/configs/strategies/ (YAML strategy definitions and experimental configurations)
/biomapper/src/actions/load_dataset_identifiers.py (LOAD_DATASET_IDENTIFIERS implementation)
/biomapper/src/actions/registry.py (Global ACTION_REGISTRY and @register_action decorator)
/biomapper/CLAUDE.md (2025 standardizations, TDD development patterns, and validation rules)