YAML Strategy System

The strategy system enables declarative workflow definition through YAML configurations, providing BioMapper’s configuration-driven approach to biological data harmonization.

Strategy Structure

Every strategy follows this comprehensive pattern:

name: "strategy_name"
description: "Clear description of what this strategy accomplishes"

metadata:
  id: "entity_source_to_target_bridge_v1_tier"
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"
  author: "researcher@institution.edu"
  tags: ["proteins", "uniprot", "harmonization"]

parameters:
  input_file: "${DATA_DIR}/input.tsv"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"
  threshold: 0.85

steps:
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"
        identifier_column: "protein_id"
        output_key: "raw_proteins"

  - name: normalize
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "raw_proteins"
        output_key: "normalized_proteins"

  - name: export
    action:
      type: EXPORT_DATASET_V2
      params:
        input_key: "normalized_proteins"
        output_file: "${parameters.output_dir}/results.csv"
        format: "csv"

Execution Model

Sequential Processing

Steps execute in the order defined in the YAML file, with each step building on previous results.

Shared Execution Context

A Dict[str, Any] is passed between all steps containing:

  • datasets - Named datasets from action outputs

  • current_identifiers - Active identifier set being processed

  • statistics - Accumulated metrics and counts

  • output_files - Paths to generated files

  • metadata - Strategy metadata and parameters

Key-Value Storage Pattern

Each action stores results using output_key parameters, enabling downstream actions to reference data via input_key.

Automatic Tracking

Execution statistics, timing information, and checkpoints are automatically collected for recovery and monitoring.

Variable Substitution

Supports multiple substitution patterns:

  • ${parameters.key} - Access strategy parameters

  • ${env.VAR_NAME} - Environment variables

  • ${VAR_NAME} - Shorthand for environment variables

  • ${metadata.field} - Access metadata fields

  • ${VAR:-default} - Default values if variable not set

Common Strategy Patterns

Entity Harmonization

Load identifiers → Normalize → Enrich → Export

Multi-Source Integration

Load multiple datasets → Merge → Deduplicate → Analyze overlap

Quality Assessment

Load data → Validate → Calculate metrics → Generate report

API Enrichment

Load identifiers → Call external APIs → Combine results → Export

See configs/strategies/ for production examples and the Configuration Guide guide for best practices.

Strategy Loading and Discovery

The MinimalStrategyService loads strategies from multiple sources:

  1. Config Directory (src/configs/strategies/) Automatically discovered at startup, organized by entity and tier:

    • experimental/ - Active development and testing strategies

    • metabolite/ - Metabolite-specific strategies

    • protein/ - Protein-specific strategies

    • templates/ - Reusable strategy templates

  2. Direct File Paths Absolute paths specified in API calls or client requests

  3. YAML String Content Strategy definitions passed directly as strings

  4. URL Loading (planned) Remote strategy loading from version-controlled repositories

Integration Points

REST API Endpoints
  • POST /api/strategies/v2/ - Execute strategy with parameters

  • GET /api/strategies/ - List available strategies

  • GET /api/strategies/{name} - Get strategy definition

  • GET /api/jobs/{job_id} - Check job status and results

  • GET /api/jobs/{job_id}/stream - Server-Sent Events progress stream

Python Client Library
from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")
result = client.run("strategy_name", parameters={
    "input_file": "/data/proteins.csv",
    "threshold": 0.9
})
print(f"Job completed: {result['status']}")
CLI Tools
poetry run python scripts/run_strategy.py --strategy my_strategy
poetry run biomapper execute --strategy production/protein_harmonization

Benefits

  • Version Control: Plain text YAML files work with Git

  • Reproducibility: Identical YAML + data produces identical results

  • Collaboration: Non-programmers can create and modify workflows

  • Testing: Simple to create test strategies with mock data

  • Documentation: Self-documenting with descriptions and metadata

  • Modularity: Strategies can reference and build on each other

  • Portability: YAML strategies work across environments

  • Validation: Schema validation ensures correctness before execution

## Verification Sources Last verified: 2025-01-18

This documentation was verified against the following project resources:

  • /home/ubuntu/biomapper/src/core/minimal_strategy_service.py (MinimalStrategyService with parameter resolver and dual context)

  • /home/ubuntu/biomapper/src/configs/strategies/ (YAML strategies organized by entity type)

  • /home/ubuntu/biomapper/src/api/routes/strategies_v2_simple.py (FastAPI strategy execution endpoints)

  • /home/ubuntu/biomapper/src/client/client_v2.py (BiomapperClient with synchronous run() wrapper)

  • /home/ubuntu/biomapper/README.md (Strategy execution examples and Python client usage)

  • /home/ubuntu/biomapper/CLAUDE.md (Variable substitution patterns: ${parameters.key}, ${env.VAR}, ${VAR:-default})