Usage Guide

This guide demonstrates how to use Biomapper’s YAML strategy system for biological entity mapping through the REST API.

Installation

Install Biomapper using Poetry:

# Clone the repository
git clone https://github.com/arpanauts/biomapper.git
cd biomapper

# Install dependencies
poetry install --with dev,docs,api

# Activate the environment
poetry shell

Quick Start

Biomapper uses YAML strategies executed through a REST API. Here’s the basic workflow:

  1. Start the API Server

cd biomapper-api
poetry run uvicorn app.main:app --reload --port 8000

  2. Create or Use a Strategy YAML

Place the strategy in configs/strategies/ so it can be run by name, or create a standalone file such as my_strategy.yaml:

# Optional metadata for tracking
metadata:
  entity_type: "proteins"
  quality_tier: "experimental"
  expected_match_rate: 0.85

# Optional runtime parameters
parameters:
  data_dir: "${DATA_DIR:-/data}"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"

# Required strategy definition
name: "BASIC_PROTEIN_MAPPING"
description: "Map proteins between datasets"

steps:
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.data_dir}/proteins.csv"
        identifier_column: "uniprot"
        output_key: "proteins"
        additional_columns: ["gene_name", "description"]

  - name: normalize
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "proteins"
        output_key: "normalized_proteins"

  - name: calculate_quality
    action:
      type: CALCULATE_MAPPING_QUALITY
      params:
        dataset_key: "normalized_proteins"
        output_key: "quality_metrics"

  3. Execute via Python Client

from biomapper_client import BiomapperClient

# Simple synchronous usage (recommended)
client = BiomapperClient("http://localhost:8000")

# Execute strategy by name (if in configs/strategies/)
result = client.run("BASIC_PROTEIN_MAPPING")

# Or execute with custom YAML file
result = client.run("/path/to/my_strategy.yaml")

# Check results
print(f"Status: {result['status']}")
if result['status'] == 'success':
    stats = result['results'].get('overlap_stats', {})
    print(f"Overlap: {stats.get('jaccard_similarity', 0):.2%}")

  4. Execute via CLI

# Using the biomapper CLI
poetry run biomapper --help
poetry run biomapper health
poetry run biomapper metadata list

# Or use the client directly
poetry run python -c "from biomapper_client import BiomapperClient; print(BiomapperClient().run('test_metabolite_simple'))"

Core Concepts

Core Actions

Biomapper provides 30+ self-registering actions organized by category:

Data Operations
  • LOAD_DATASET_IDENTIFIERS: Load identifiers from CSV/TSV files

  • MERGE_DATASETS: Combine multiple datasets

  • FILTER_DATASET: Apply filtering criteria

  • EXPORT_DATASET: Export to various formats

  • CUSTOM_TRANSFORM: Apply Python expressions

Protein Actions
  • MERGE_WITH_UNIPROT_RESOLUTION: Historical UniProt ID resolution

  • PROTEIN_EXTRACT_UNIPROT_FROM_XREFS: Extract IDs from compound fields

  • PROTEIN_NORMALIZE_ACCESSIONS: Standardize protein identifiers

Metabolite Actions
  • CTS_ENRICHED_MATCH: Chemical Translation Service matching

  • SEMANTIC_METABOLITE_MATCH: AI-powered semantic matching

  • NIGHTINGALE_NMR_MATCH: Nightingale reference matching

Analysis Actions
  • CALCULATE_SET_OVERLAP: Jaccard similarity and Venn diagrams

  • CALCULATE_THREE_WAY_OVERLAP: Three-dataset comparison

  • GENERATE_METABOLOMICS_REPORT: Comprehensive reports
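
"Self-registering" typically means that each action adds itself to a central lookup table when its module is imported, so a strategy can refer to it by its type string. The following is a minimal sketch of that pattern, assuming a decorator-based registry. The names ACTION_REGISTRY, register_action, and UPPERCASE_IDENTIFIERS are hypothetical and are not taken from Biomapper's source.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps an action type string to the class implementing it.
ACTION_REGISTRY: Dict[str, type] = {}


def register_action(action_type: str) -> Callable[[type], type]:
    """Class decorator that records the decorated class under action_type."""
    def decorator(cls: type) -> type:
        if action_type in ACTION_REGISTRY:
            raise ValueError(f"Action {action_type!r} is already registered")
        ACTION_REGISTRY[action_type] = cls
        return cls
    return decorator


@register_action("UPPERCASE_IDENTIFIERS")  # illustrative only, not a real action type
class UppercaseIdentifiers:
    def execute(self, identifiers: List[str]) -> List[str]:
        return [identifier.upper() for identifier in identifiers]


# A step's action.type would be resolved by name, roughly like this.
action = ACTION_REGISTRY["UPPERCASE_IDENTIFIERS"]()
print(action.execute(["p12345", "q9y6k9"]))
```

With a registry of this kind, adding an action requires no change to the dispatcher: defining and importing the decorated class is enough.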

Strategy Configuration

Strategies are defined in YAML files with these sections:

Required Fields:

  • name: Strategy identifier (use UPPERCASE_WITH_UNDERSCORES)

  • description: Human-readable description

  • steps: Ordered list of actions to execute

Optional Fields:

  • metadata: Tracking information (version, quality tier, expected match rates)

  • parameters: Runtime parameters with environment variable support

Each step contains:

  • name: Step identifier

  • action.type: One of the registered action types

  • action.params: Parameters specific to the action
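
The field rules above can be checked before a strategy is submitted. This is a rough sketch, assuming the YAML has already been parsed into a plain dictionary (for example by a YAML library). The function find_strategy_problems is a hypothetical helper, not part of Biomapper, and the server's own validation may be stricter.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("name", "description", "steps")


def find_strategy_problems(strategy: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = [f"missing required field: {field}"
                for field in REQUIRED_FIELDS if field not in strategy]
    steps = strategy.get("steps", [])
    if not isinstance(steps, list):
        return problems + ["steps must be a list"]
    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {index}: must be a mapping")
            continue
        if "name" not in step:
            problems.append(f"step {index}: missing name")
        action = step.get("action")
        if not isinstance(action, dict):
            problems.append(f"step {index}: missing action")
            continue
        if "type" not in action:
            problems.append(f"step {index}: missing action.type")
        if not isinstance(action.get("params", {}), dict):
            problems.append(f"step {index}: action.params must be a mapping")
    return problems


strategy = {
    "name": "BASIC_PROTEIN_MAPPING",
    "steps": [{"name": "load_data", "action": {"params": {"output_key": "proteins"}}}],
}
for problem in find_strategy_problems(strategy):
    print(problem)
```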

Data Flow

  1. Data is loaded into a shared context dictionary

  2. Each action reads from and writes to this context

  3. Actions use output_key to store results

  4. Subsequent actions reference data using these keys

  5. Final results include all context data plus execution metadata
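
The data flow above can be illustrated with a toy executor. This is only a sketch of the shared-context idea: the two functions and the LOAD and NORMALIZE type names are stand-ins invented for the example, and the real actions take different parameters.

```python
from typing import Any, Callable, Dict

Context = Dict[str, Any]


def load_identifiers(context: Context, params: Dict[str, Any]) -> None:
    # Stand-in for a loader: stores the given identifiers under output_key.
    context[params["output_key"]] = list(params["identifiers"])


def normalize(context: Context, params: Dict[str, Any]) -> None:
    # Reads the dataset written by an earlier step and stores a cleaned copy.
    raw = context[params["input_key"]]
    context[params["output_key"]] = [value.strip().upper() for value in raw]


# Hypothetical type names mapped to the toy actions above.
ACTIONS: Dict[str, Callable[[Context, Dict[str, Any]], None]] = {
    "LOAD": load_identifiers,
    "NORMALIZE": normalize,
}

steps = [
    {"type": "LOAD",
     "params": {"identifiers": [" p12345", "q9y6k9 "], "output_key": "proteins"}},
    {"type": "NORMALIZE",
     "params": {"input_key": "proteins", "output_key": "normalized_proteins"}},
]

context: Context = {}
for step in steps:
    ACTIONS[step["type"]](context, step["params"])

print(sorted(context))                  # both keys remain available afterwards
print(context["normalized_proteins"])
```

Because earlier outputs stay in the context, a later step can refer to either the raw or the normalized dataset by key.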

Working with Real Data

Protein Mapping Example

Here’s a complete example mapping UKBB proteins to HPA:

name: "UKBB_HPA_PROTEIN_MAPPING"
description: "Map UK Biobank proteins to Human Protein Atlas"

steps:
  - name: load_ukbb_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/UKBB_Protein_Meta.tsv"
        identifier_column: "UniProt"
        output_key: "ukbb_proteins"

  - name: load_hpa_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/hpa_osps.csv"
        identifier_column: "uniprot"
        output_key: "hpa_proteins"

  - name: merge_ukbb_uniprot
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        source_dataset_key: "ukbb_proteins"
        target_dataset_key: "hpa_proteins"
        source_id_column: "UniProt"
        target_id_column: "uniprot"
        output_key: "ukbb_merged"

  - name: calculate_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_a_key: "ukbb_merged"
        dataset_b_key: "hpa_proteins"
        output_key: "overlap_analysis"

Multi-Dataset Analysis

Compare multiple datasets by loading each one and calculating pairwise overlaps:

name: "MULTI_DATASET_ANALYSIS"
description: "Compare proteins across multiple sources"

steps:
  # Load all datasets
  - name: load_arivale
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/arivale/proteomics_metadata.tsv"
        identifier_column: "uniprot"
        output_key: "arivale_proteins"

  - name: load_qin
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/qin_osps.csv"
        identifier_column: "uniprot"
        output_key: "qin_proteins"

  # Calculate overlaps
  - name: arivale_vs_qin
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_a_key: "arivale_proteins"
        dataset_b_key: "qin_proteins"
        output_key: "arivale_qin_overlap"
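
For more than two datasets, the same comparison is repeated for every pair. The sketch below shows the underlying arithmetic (Jaccard similarity, the size of the intersection divided by the size of the union) on small in-memory sets. The identifiers are made-up sample values, and this is not the implementation of CALCULATE_SET_OVERLAP.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Return |A intersect B| / |A union B|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative identifier sets keyed by the context keys used in a strategy.
datasets: Dict[str, Set[str]] = {
    "arivale_proteins": {"P12345", "Q9Y6K9", "O00533"},
    "qin_proteins": {"Q9Y6K9", "O00533", "P05067"},
    "hpa_proteins": {"P05067"},
}

# One entry per unordered pair of datasets.
pairwise: Dict[Tuple[str, str], float] = {
    (left, right): jaccard_similarity(datasets[left], datasets[right])
    for left, right in combinations(sorted(datasets), 2)
}

for (left, right), score in pairwise.items():
    print(f"{left} vs {right}: {score:.2%}")
```

In a strategy, each pair corresponds to one CALCULATE_SET_OVERLAP step with its own output_key, so n datasets need n(n-1)/2 overlap steps.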

Error Handling

Common Issues and Solutions

File not found errors

Check that file paths are absolute and that the files exist and are readable by the API server process.

Column not found errors

Verify the identifier_column matches your CSV headers exactly.
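
A quick way to check this is to read the header row yourself. The sketch below uses only the standard library. check_identifier_column is a hypothetical helper, and the sample data is inlined so the example runs without a file.

```python
import csv
import io
from typing import List, TextIO


def read_header(handle: TextIO, delimiter: str = ",") -> List[str]:
    """Return the first row of a delimited file as a list of column names."""
    return next(csv.reader(handle, delimiter=delimiter), [])


def check_identifier_column(header: List[str], identifier_column: str) -> str:
    if identifier_column in header:
        return "ok"
    # Differences in case and stray whitespace are the usual culprits.
    wanted = identifier_column.strip().lower()
    close = [name for name in header if name.strip().lower() == wanted]
    if close:
        return f"not found; did you mean {close[0]!r}?"
    return f"not found; available columns: {header}"


sample = io.StringIO("UniProt,gene_name,description\nP12345,ABC1,example\n")
header = read_header(sample)
print(check_identifier_column(header, "uniprot"))
```

For a real file, open it with open(path, newline="") and pass delimiter="\t" for TSV input.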

Timeout errors

Large datasets can take a long time to process. The default client timeout is 5 minutes; raise it when constructing the client:

client = BiomapperClient(timeout=3600)  # 1 hour

Validation errors

Ensure YAML syntax is correct and all required parameters are provided.

Debugging

Enable detailed logging:

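import asyncio
import logging
from biomapper_client import BiomapperClient

logging.basicConfig(level=logging.DEBUG)

async def main():
    # async with is only valid inside a coroutine
    async with BiomapperClient("http://localhost:8000") as client:
        return await client.execute_strategy_file("strategy.yaml")

result = asyncio.run(main())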

Check API server logs for detailed error messages and execution progress.

Performance Tips

  • Use environment variables for portable file paths

  • For large datasets (>100K rows), increase client timeout and consider chunking

  • Monitor API server resources during execution

  • Use the watch=True parameter to see real-time progress:

    result = client.run("large_strategy", watch=True)
    
  • Consider using CHUNK_PROCESSOR action for very large files

  • Enable job persistence for recovery from failures
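
As a rough illustration of the chunking tip, the following sketch reads a CSV in fixed-size batches so that only one batch is held in memory at a time. It is a generic standard-library pattern, not a description of how the CHUNK_PROCESSOR action works internally; iter_chunks is a hypothetical helper.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows from a CSV file with a header row."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Inlined sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("uniprot\nP1\nP2\nP3\nP4\nP5\n")
sizes = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
print(sizes)
```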

Advanced Features

Environment Variables

Strategies support variable substitution:

parameters:
  data_dir: "${DATA_DIR:-/default/path}"
steps:
  - action:
      params:
        file_path: "${parameters.data_dir}/file.csv"
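
The two placeholder forms behave differently: ${NAME:-default} reads an environment variable with a fallback, while ${parameters.name} refers to a value from the parameters section. The sketch below shows one way such substitution could be resolved. It is an assumption about the semantics (modelled on shell-style defaults), and substitute is a hypothetical function, not Biomapper's resolver.

```python
import re
from typing import Dict, Mapping

# Matches ${NAME}, ${NAME:-default}, and ${parameters.name}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)(?::-([^}]*))?\}")


def substitute(text: str, parameters: Mapping[str, str], env: Mapping[str, str]) -> str:
    """Replace every placeholder in text using parameters and env."""
    def replace(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name.startswith("parameters."):
            return parameters[name[len("parameters."):]]
        if env.get(name):            # like the shell, treat empty as unset
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"{name} is not set and has no default")
    return PLACEHOLDER.sub(replace, text)


raw_parameters = {"data_dir": "${DATA_DIR:-/default/path}"}
env: Dict[str, str] = {}  # pass os.environ here to read the real environment

# Resolve the parameters section first, then use it inside step params.
parameters = {key: substitute(value, {}, env) for key, value in raw_parameters.items()}
file_path = substitute("${parameters.data_dir}/file.csv", parameters, env)
print(file_path)
```

Setting DATA_DIR in the environment changes every path built from parameters.data_dir, which is what makes a strategy portable between machines.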

Progress Tracking

Use Server-Sent Events for real-time progress:

result = client.run_with_progress("my_strategy")
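
Server-Sent Events arrive as a text stream in which each event is a group of "field: value" lines terminated by a blank line. The sketch below parses such a stream from a list of lines so it can run offline. The event name and the JSON fields step and percent are assumptions made for the example; the actual payload sent by the Biomapper API may differ.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of text/event-stream lines."""
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                       # a blank line ends the current event
            if data:
                yield {**event, "data": "\n".join(data)}
            event, data = {}, []
        elif line.startswith(":"):         # comment line, often used as keep-alive
            continue
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "data":
                data.append(value)
            else:
                event[field] = value


# Hypothetical progress stream; the field names are illustrative.
stream = [
    ": keep-alive\n",
    "event: progress\n",
    'data: {"step": "load_data", "percent": 33}\n',
    "\n",
    "event: progress\n",
    'data: {"step": "normalize", "percent": 66}\n',
    "\n",
]

events = list(parse_sse(stream))
for item in events:
    payload = json.loads(item["data"])
    print(f"{payload['step']}: {payload['percent']}%")
```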

Job Recovery

Jobs are persisted to SQLite for recovery:

# Check job status
job = client.get_job(job_id)
if job.status == "failed":
    # Retry from last checkpoint
    result = client.retry_job(job_id)
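
To make the checkpoint idea concrete, here is a small sketch of how job state could be stored in SQLite and used to decide where a retry resumes. The table layout, column names, and helper functions are entirely hypothetical and do not reflect Biomapper's actual schema.

```python
import sqlite3
from typing import List

# In-memory database for the example; a file path would survive restarts.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE jobs ("
    "job_id TEXT PRIMARY KEY, status TEXT, last_completed_step INTEGER)"
)


def save_checkpoint(job_id: str, status: str, last_completed_step: int) -> None:
    """Insert or overwrite the row for job_id."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
            (job_id, status, last_completed_step),
        )


def remaining_steps(job_id: str, step_names: List[str]) -> List[str]:
    """Return the steps that still need to run after the last checkpoint."""
    row = connection.execute(
        "SELECT last_completed_step FROM jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    start = 0 if row is None else row[0] + 1
    return step_names[start:]


step_names = ["load_data", "normalize", "calculate_quality"]
save_checkpoint("job-1", "running", 0)  # load_data (index 0) finished
save_checkpoint("job-1", "failed", 0)   # normalize raised an error
print(remaining_steps("job-1", step_names))
```

Under this scheme a retry skips steps whose outputs were already checkpointed, which only works if the stored context for those steps is also recoverable.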

Next Steps

Verification Sources

Last verified: 2025-08-17

This documentation was verified against the following project resources:

  • /biomapper/CLAUDE.md (CLI commands and best practices)

  • /biomapper/README.md (Installation and quick start)

  • /biomapper/pyproject.toml (Project configuration)