BioMapper Documentationο
BioMapper is a general-purpose plugin- and strategy-based orchestration framework, with its first application in biological data harmonization. Architecturally, it blends elements of workflow engines (Nextflow, Snakemake, Kedro, Dagster) with a lightweight service-oriented API and a plugin registry backed by a unified UniversalContext. Its standout differentiator is an AI-native developer experience: CLAUDE.md, .claude/ scaffolding, custom slash commands, and the BioSherpa guide. This potentially makes it the first open-source bioinformatics orchestration platform with built-in LLM-assisted contributor workflows.
The result is a platform that is modular, extensible, and uniquely AI-augmented, well-positioned for long-term ecosystem growth. Built on a self-registering action system and YAML-based workflow definitions, it features a modern src-layout architecture with comprehensive test coverage and 2025 standardizations for production reliability.
π― Key Featuresο
Self-registering action system - Actions automatically register via decorators
Type-safe parameters - Pydantic models provide validation and IDE support
YAML workflow definition - Declarative strategies without coding
Real-time progress tracking - SSE events for long-running jobs
Extensible architecture - Easy to add new actions and entity types
AI-ready design - Built for integration with Claude Code and LLM assistance
π Quick Startο
# Install with Poetry
poetry install --with dev,docs,api
poetry shell
# Start the API server
cd biomapper-api && poetry run uvicorn app.main:app --reload --port 8000
# Or use the CLI (from root directory)
poetry run biomapper --help
poetry run biomapper health
# Python client usage
from src.client.client_v2 import BiomapperClient
client = BiomapperClient(base_url="http://localhost:8000")
result = client.run("test_metabolite_simple", parameters={
"input_file": "/data/metabolites.csv",
"output_dir": "/tmp/results"
})
print(f"Success: {result.success}") # StrategyResult object
ποΈ Architectureο
BioMapper follows a modern microservices architecture with clear separation of concerns:
Core Design:
YAML Strategies - Declarative configs defining pipelines of actions
Action Registry - Self-registering via decorators; plug-and-play extensibility
UniversalContext - Normalizes state access across heterogeneous action types
Pydantic Models (v2) - Typed parameter models per action category
Progressive Mapping - Iterative enrichment stages (65% β 80% coverage)
Comparison to Known Patterns:
Similar to: Nextflow & Snakemake (declarative pipelines), Kedro (typed configs + reproducibility), Dagster (observability and orchestration)
Different from: Heavy orchestrators (Airflow, Beam) β BioMapper is lighter, service/API-first, domain-agnostic, and tailored for interactive workflows
Unique: Combines API service with strategy-based pipeline engine; domain-specific operations first (bio), but extensible beyond
Three-Layer Design:
Client Layer - Python client library (
src.client.client_v2) for programmatic accessAPI Layer - FastAPI service with SQLite job persistence and SSE progress tracking
Core Layer - Self-registering actions with strategy execution engine
The system uses a registry pattern where actions self-register via @register_action decorators, a strategy pattern for YAML-based workflow configuration, and a pipeline pattern for data flow through shared execution context. Actions are organized by biological entity (proteins, metabolites, chemistry) and automatically discovered at runtime.
Getting Started
Actions Reference
- Actions Reference
- LOAD_DATASET_IDENTIFIERS
- MERGE_DATASETS
- FILTER_DATASET
- EXPORT_DATASET
- CUSTOM_TRANSFORM
- PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
- protein_normalize_accessions
- NIGHTINGALE_NMR_MATCH
- semantic_metabolite_match
- HMDB Vector Match
- Metabolite Fuzzy String Match
- Metabolite RampDB Bridge
- Progressive Semantic Match
- CHEMISTRY_FUZZY_TEST_MATCH
- Quick Reference
- Usage Example
- 2025 Standardizations
- Strategy File Locations
- HMDB Vector Match
- Sync to Google Drive
- Parse Composite Identifiers
- Metabolite Fuzzy String Match
- Metabolite RampDB Bridge
- Progressive Semantic Match
Workflows
Integrations
Examples
Performance
- Performance Optimization Guide
- Overview
- Performance Fundamentals
- Strategy-Level Optimizations
- Stage-Specific Optimizations
- Memory Management
- Parallel Processing
- Caching Strategies
- Database Optimizations
- Monitoring and Profiling
- Troubleshooting Performance Issues
- Performance Testing Framework
- Production Optimization Checklist
- Algorithm Complexity Resources
- See Also
API Reference
- API Reference
- Architecture Documentation
- BioMapper Architecture
- Action System Architecture
- YAML Strategy System
- Typed Strategy Actions
- BioMapper Test Architecture
- YAML Strategy Schema Documentation
- UniProt Historical ID Resolution
- Core Concepts
- Quick Architecture Overview
- Component Responsibilities
- Action System
- YAML Strategy System
- Performance Considerations
- Testing Architecture
- Security Considerations
- Future Architecture Goals
AI-Assisted Development
- AI-Assisted Development
- BioMapper Framework Triad
- Framework Triggering Mechanics
- Detection Pipeline
- Pattern Matching (Primary Detection)
- Confidence Scoring Algorithm
- Keyword Boosting (Secondary)
- Framework Name Detection
- Ambiguity Resolution
- Target Extraction
- Real-World Scoring Examples
- Performance Characteristics
- Debugging Framework Selection
- Tips for Better Detection
- Manual Override
- Slash Commands Reference
- Real-World Usage Examples
π€ AI Integrationο
BioMapper features an AI-native developer experience that sets it apart from traditional orchestration frameworks:
Current AI Features:
CLAUDE.md - Project βconstitutionβ providing role-defining guidance for AI agents
.claude/ folder - Structured agent configs and scaffolding
BiOMapper Framework Triad - Three automatic isolation frameworks for safe development
Hook System - Automatic TDD enforcement and validation
Type-safe actions - Enable better code completion and error detection
Self-documenting - Pydantic models include descriptions
BiOMapper Framework Triad:
The system includes three complementary frameworks that automatically activate based on natural language:
π Surgical: Fix internal action logic while preserving all external interfaces. Automatically activates when you describe counting, calculation, or statistics issues.
π Circuitous: Repair pipeline orchestration and parameter flow. Automatically activates when you describe parameters not passing between steps or substitution failures.
π Interstitial: Ensure 100% backward compatibility during interface evolution. Automatically activates when you describe compatibility issues or parameter changes breaking existing code.
Automatic Activation: You donβt need to know framework names - just describe the problem naturally and the appropriate framework activates automatically. See Framework Triggering Mechanics for details on how this works.
Development Discipline: Separate from the frameworks, a hook system enforces:
TDD Requirements - Tests must exist before implementation
Parameter Validation - All
${parameters.x}must resolveImport Verification - All modules must load cleanly
Quality Gates - Blocks premature success declarations
Comparisons:
Copilot/Cody: Offer IDE assistance but donβt ship with per-project scaffolding
Claude-Orchestrator/Flow frameworks: Orchestrate multiple Claude agents, but not tied to strategy orchestration
BioMapper: First to embed LLM-native scaffolding inside an orchestration framework repo, making the AI βpart of the project contractβ
π Available Actionsο
BioMapper includes actions across multiple categories:
Data Operations: Load, merge, filter, export, transform
Protein Actions: UniProt extraction, accession normalization, multi-bridge resolution
Metabolite Actions: CTS bridge, Nightingale NMR matching, semantic matching, vector matching, API enrichment
Chemistry Actions: LOINC extraction, fuzzy test matching, vendor harmonization
Analysis & Reporting: Set overlap, mapping quality, comprehensive reports
Integration Actions: Google Drive sync with chunked transfer
β 2025 Standardizationsο
Production-Ready Architecture Achieved:
Barebones Architecture: Client β API β MinimalStrategyService β Self-Registering Actions
Comprehensive Test Suite: 1,217 passing tests with 79.69% coverage
Type Safety: Comprehensive Pydantic v2 migration
Standards Compliance: All 10 biomapper 2025 standardizations implemented
Biological Data Testing: Real-world protein, metabolite, and chemistry data patterns
Architectural Strengths:
Clean modularity (strategy vs action vs context)
Low barrier for extension (just register a new action)
Declarative configuration approachable to non-programmers
Pragmatic service orientation (FastAPI, Poetry, pytest, Pydantic)
Gaps & Opportunities:
No DAG/conditional execution in YAML
Limited provenance/lineage tracking
Potential performance bottlenecks at scale (10Kβ1M records)
Observability/logging not yet first-class
Single-agent AI model; opportunity for multi-agent orchestration
Indices and tablesο
β
β
Verification Sourcesο
Last verified: 2025-08-22
This documentation was verified against the following project resources:
/biomapper/README.md(Project overview and architectural analysis with 1,217 passing tests)/biomapper/CLAUDE.md(Commands, patterns, and 2025 standardizations)/biomapper/src/actions/registry.py(Self-registering action system implementation)/biomapper/src/client/client_v2.py(BiomapperClient class with correct import path and methods)/biomapper/src/api/main.py(FastAPI server configuration and endpoint routing)/biomapper/pyproject.toml(Project configuration, Python 3.11+ requirement, src-layout packages)