Usage Guide
This guide demonstrates how to use Biomapper’s YAML strategy system for biological entity mapping through the REST API.
Installation
Install Biomapper using Poetry:
# Clone the repository
git clone https://github.com/arpanauts/biomapper.git
cd biomapper
# Install dependencies
poetry install --with dev,docs,api
# Activate the environment
poetry shell
Quick Start
Biomapper uses YAML strategies executed through a REST API. Here’s the basic workflow:
Start the API Server
cd biomapper-api
poetry run uvicorn app.main:app --reload --port 8000
Create or Use a Strategy YAML
Place the strategy in configs/strategies/ or create a standalone file such as my_strategy.yaml:
# Optional metadata for tracking
metadata:
  entity_type: "proteins"
  quality_tier: "experimental"
  expected_match_rate: 0.85
# Optional runtime parameters
parameters:
  data_dir: "${DATA_DIR:-/data}"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"
# Required strategy definition
name: "BASIC_PROTEIN_MAPPING"
description: "Map proteins between datasets"
steps:
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.data_dir}/proteins.csv"
        identifier_column: "uniprot"
        output_key: "proteins"
        additional_columns: ["gene_name", "description"]
  - name: normalize
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "proteins"
        output_key: "normalized_proteins"
  - name: calculate_quality
    action:
      type: CALCULATE_MAPPING_QUALITY
      params:
        dataset_key: "normalized_proteins"
        output_key: "quality_metrics"
Execute via Python Client
from biomapper_client import BiomapperClient
# Simple synchronous usage (recommended)
client = BiomapperClient("http://localhost:8000")
# Execute strategy by name (if in configs/strategies/)
result = client.run("BASIC_PROTEIN_MAPPING")
# Or execute with custom YAML file
result = client.run("/path/to/my_strategy.yaml")
# Check results
print(f"Status: {result['status']}")
if result['status'] == 'success':
    stats = result['results'].get('overlap_stats', {})
    print(f"Overlap: {stats.get('jaccard_similarity', 0):.2%}")
Execute via CLI
# Using the biomapper CLI
poetry run biomapper --help
poetry run biomapper health
poetry run biomapper metadata list
# Or use the client directly
poetry run python -c "from biomapper_client import BiomapperClient; print(BiomapperClient().run('test_metabolite_simple'))"
Core Concepts
Core Actions
Biomapper provides 30+ self-registering actions organized by category:
- Data Operations
  - LOAD_DATASET_IDENTIFIERS: Load identifiers from CSV/TSV files
  - MERGE_DATASETS: Combine multiple datasets
  - FILTER_DATASET: Apply filtering criteria
  - EXPORT_DATASET: Export to various formats
  - CUSTOM_TRANSFORM: Apply Python expressions
- Protein Actions
  - MERGE_WITH_UNIPROT_RESOLUTION: Historical UniProt ID resolution
  - PROTEIN_EXTRACT_UNIPROT_FROM_XREFS: Extract IDs from compound fields
  - PROTEIN_NORMALIZE_ACCESSIONS: Standardize protein identifiers
- Metabolite Actions
  - CTS_ENRICHED_MATCH: Chemical Translation Service matching
  - SEMANTIC_METABOLITE_MATCH: AI-powered semantic matching
  - NIGHTINGALE_NMR_MATCH: Nightingale reference matching
- Analysis Actions
  - CALCULATE_SET_OVERLAP: Jaccard similarity and Venn diagrams
  - CALCULATE_THREE_WAY_OVERLAP: Three-dataset comparison
  - GENERATE_METABOLOMICS_REPORT: Comprehensive reports
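One way to picture "self-registering" actions is a decorator that records each action class in a shared lookup table when its module is imported. The sketch below illustrates that general pattern only; it is not Biomapper's code, and ACTION_REGISTRY, register_action, and the toy action are hypothetical names.

```python
from typing import Callable, Dict, List, Type

# Hypothetical registry: maps an action type string to the class implementing it.
ACTION_REGISTRY: Dict[str, Type] = {}


def register_action(action_type: str) -> Callable[[Type], Type]:
    """Return a class decorator that records the class under action_type."""
    def decorator(cls: Type) -> Type:
        if action_type in ACTION_REGISTRY:
            raise ValueError(f"Duplicate action type: {action_type}")
        ACTION_REGISTRY[action_type] = cls
        return cls
    return decorator


@register_action("TOY_UPPERCASE_IDS")  # illustrative name, not a real Biomapper action
class ToyUppercaseIds:
    def execute(self, identifiers: List[str]) -> List[str]:
        return [identifier.upper() for identifier in identifiers]


# Look up an action by the type string that a strategy step would carry.
action_cls = ACTION_REGISTRY["TOY_UPPERCASE_IDS"]
print(action_cls().execute(["p12345", "q9y6k9"]))
```

With this pattern, adding an action means defining and decorating a class; no central list has to be edited by hand.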
Strategy Configuration
Strategies are defined in YAML files with these sections:
Required Fields:
- name: Strategy identifier (use UPPERCASE_WITH_UNDERSCORES)
- description: Human-readable description
- steps: Ordered list of actions to execute
Optional Fields:
- metadata: Tracking information (version, quality tier, expected match rates)
- parameters: Runtime parameters with environment variable support
Each step contains:
- name: Step identifier
- action.type: One of the registered action types
- action.params: Parameters specific to the action
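The field rules above can be checked before a strategy is submitted. The following is a minimal sketch that works on an already-parsed strategy (for example, the dict a YAML loader would return); validate_strategy is a hypothetical helper, and the server's own validation may be stricter or differ in detail.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = ("name", "description", "steps")


def validate_strategy(strategy: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the basic shape is fine."""
    problems: List[str] = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in strategy:
            problems.append(f"missing required field: {field}")
    steps = strategy.get("steps", [])
    if not isinstance(steps, list):
        return problems + ["steps must be a list"]
    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {index}: must be a mapping")
            continue
        label = step.get("name", f"#{index}")
        if "name" not in step:
            problems.append(f"step {index}: missing name")
        action = step.get("action")
        if not isinstance(action, dict):
            problems.append(f"step {label}: missing action")
            continue
        if "type" not in action:
            problems.append(f"step {label}: missing action.type")
        if not isinstance(action.get("params", {}), dict):
            problems.append(f"step {label}: action.params must be a mapping")
    return problems


strategy = {
    "name": "BASIC_PROTEIN_MAPPING",
    "description": "Map proteins between datasets",
    "steps": [
        {"name": "load_data",
         "action": {"type": "LOAD_DATASET_IDENTIFIERS",
                    "params": {"file_path": "/data/proteins.csv",
                               "identifier_column": "uniprot",
                               "output_key": "proteins"}}},
        {"name": "normalize", "action": {"params": {}}},  # type deliberately omitted
    ],
}
problems = validate_strategy(strategy)
print(problems)
```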
Data Flow
Data is loaded into a shared context dictionary
Each action reads from and writes to this context
Actions use output_key to store results
Subsequent actions reference data using these keys
Final results include all context data plus execution metadata
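The data flow described above can be sketched with a plain dictionary passed from one step to the next. This is a toy model under stated assumptions: the action functions, the TOY_* type names, and run_steps are hypothetical and only mimic how output_key and input_key connect steps.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


def load_identifiers(context: Context, params: Dict[str, Any]) -> None:
    # Toy stand-in for a loading action: stores a literal list under output_key.
    context[params["output_key"]] = list(params["identifiers"])


def normalize(context: Context, params: Dict[str, Any]) -> None:
    # Reads the dataset named by input_key, writes a cleaned copy to output_key.
    raw = context[params["input_key"]]
    context[params["output_key"]] = sorted({item.strip().upper() for item in raw})


# Hypothetical action table; these type names are illustrative only.
ACTIONS: Dict[str, Callable[[Context, Dict[str, Any]], None]] = {
    "TOY_LOAD": load_identifiers,
    "TOY_NORMALIZE": normalize,
}


def run_steps(steps: List[Dict[str, Any]]) -> Context:
    context: Context = {}
    for step in steps:
        action = step["action"]
        ACTIONS[action["type"]](context, action.get("params", {}))
    return context


steps = [
    {"name": "load_data",
     "action": {"type": "TOY_LOAD",
                "params": {"identifiers": [" p12345", "Q9Y6K9 ", "p12345"],
                           "output_key": "proteins"}}},
    {"name": "normalize",
     "action": {"type": "TOY_NORMALIZE",
                "params": {"input_key": "proteins",
                           "output_key": "normalized_proteins"}}},
]
context = run_steps(steps)
print(context["normalized_proteins"])
```

Because every step writes under its own key, earlier results such as "proteins" remain available in the context after later steps run.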
Working with Real Data
Protein Mapping Example
Here’s a complete example mapping UKBB proteins to HPA:
name: "UKBB_HPA_PROTEIN_MAPPING"
description: "Map UK Biobank proteins to Human Protein Atlas"
steps:
  - name: load_ukbb_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/UKBB_Protein_Meta.tsv"
        identifier_column: "UniProt"
        output_key: "ukbb_proteins"
  - name: load_hpa_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/hpa_osps.csv"
        identifier_column: "uniprot"
        output_key: "hpa_proteins"
  - name: merge_ukbb_uniprot
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        source_dataset_key: "ukbb_proteins"
        target_dataset_key: "hpa_proteins"
        source_id_column: "UniProt"
        target_id_column: "uniprot"
        output_key: "ukbb_merged"
  - name: calculate_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_a_key: "ukbb_merged"
        dataset_b_key: "hpa_proteins"
        output_key: "overlap_analysis"
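For readers unfamiliar with the metric reported by the overlap step, Jaccard similarity is the size of the intersection divided by the size of the union of two identifier sets. The sketch below computes it on small example inputs; set_overlap and all result keys other than jaccard_similarity are hypothetical, the accessions are illustrative rather than taken from the datasets above, and the actual action may report more or different statistics.

```python
from typing import Dict, Iterable, Union


def set_overlap(a: Iterable[str], b: Iterable[str]) -> Dict[str, Union[int, float]]:
    """Compute simple overlap statistics for two collections of identifiers."""
    set_a, set_b = set(a), set(b)
    intersection = set_a & set_b
    union = set_a | set_b
    return {
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "shared": len(intersection),
        # Jaccard similarity = |A and B| / |A or B|; taken as 0.0 when both sets are empty.
        "jaccard_similarity": len(intersection) / len(union) if union else 0.0,
    }


# Example inputs only; not real dataset contents.
ukbb = ["P04637", "P38398", "Q9Y6K9"]
hpa = ["P38398", "Q9Y6K9", "P00533", "P01308"]
stats = set_overlap(ukbb, hpa)
print(f"Overlap: {stats['jaccard_similarity']:.2%}")  # prints "Overlap: 40.00%"
```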
Multi-Dataset Analysis
Compare multiple datasets by loading each one and calculating pairwise overlaps:
name: "MULTI_DATASET_ANALYSIS"
description: "Compare proteins across multiple sources"
steps:
  # Load all datasets
  - name: load_arivale
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/arivale/proteomics_metadata.tsv"
        identifier_column: "uniprot"
        output_key: "arivale_proteins"
  - name: load_qin
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/qin_osps.csv"
        identifier_column: "uniprot"
        output_key: "qin_proteins"
  # Calculate overlaps
  - name: arivale_vs_qin
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_a_key: "arivale_proteins"
        dataset_b_key: "qin_proteins"
        output_key: "arivale_qin_overlap"
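With more than two datasets, writing every pairwise overlap step by hand becomes repetitive. As a hedged sketch, the steps could be generated programmatically and then dumped to YAML; pairwise_overlap_steps is a hypothetical helper, the "ukbb" entry is added only to show a third dataset, and the naming scheme simply mirrors the arivale_vs_qin step above.

```python
import json
from itertools import combinations
from typing import Any, Dict, List


def pairwise_overlap_steps(datasets: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build one CALCULATE_SET_OVERLAP step per unordered pair of datasets.

    `datasets` maps a short label to the context key holding that dataset.
    """
    steps: List[Dict[str, Any]] = []
    for (name_a, key_a), (name_b, key_b) in combinations(datasets.items(), 2):
        steps.append({
            "name": f"{name_a}_vs_{name_b}",
            "action": {
                "type": "CALCULATE_SET_OVERLAP",
                "params": {
                    "dataset_a_key": key_a,
                    "dataset_b_key": key_b,
                    "output_key": f"{name_a}_{name_b}_overlap",
                },
            },
        })
    return steps


overlap_steps = pairwise_overlap_steps({
    "arivale": "arivale_proteins",
    "qin": "qin_proteins",
    "ukbb": "ukbb_proteins",  # illustrative third dataset
})
print(json.dumps(overlap_steps, indent=2))
```

Three datasets yield three steps, and in general n datasets yield n(n-1)/2, so the number of overlap steps grows quickly as sources are added.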
Error Handling
Common Issues and Solutions
- File not found errors
Check that file paths are absolute and that the files exist.
- Column not found errors
Verify that identifier_column matches your CSV headers exactly.
- Timeout errors
Large datasets may take time. The default timeout is 5 minutes, but it can be increased:
client = BiomapperClient(timeout=3600) # 1 hour
- Validation errors
Ensure YAML syntax is correct and all required parameters are provided.
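The first two issues above can often be caught locally before a strategy is submitted. The following is a minimal sketch using only the standard library; preflight_check is a hypothetical helper, not part of Biomapper or its client, and it assumes a delimited text file with a header row.

```python
import csv
import os
import tempfile
from typing import List


def preflight_check(file_path: str, identifier_column: str, delimiter: str = ",") -> List[str]:
    """Return a list of problems found with the input file; empty means none found."""
    problems: List[str] = []
    if not os.path.isabs(file_path):
        problems.append(f"path is not absolute: {file_path}")
    if not os.path.isfile(file_path):
        problems.append(f"file not found: {file_path}")
        return problems
    with open(file_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    # The comparison is case-sensitive, so "uniprot" does not match "UniProt".
    if identifier_column not in header:
        problems.append(f"column {identifier_column!r} not in header {header}")
    return problems


# Demonstration on a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("UniProt,gene_name\nP04637,TP53\n")
demo_path = tmp.name
mismatch = preflight_check(demo_path, "uniprot")
matching = preflight_check(demo_path, "UniProt")
os.remove(demo_path)
print(mismatch)
print(matching)
```

For TSV inputs, pass delimiter="\t".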
Debugging
Enable detailed logging:
import asyncio
import logging
from biomapper_client import BiomapperClient
logging.basicConfig(level=logging.DEBUG)
async def main():
    async with BiomapperClient("http://localhost:8000") as client:
        result = await client.execute_strategy_file("strategy.yaml")
asyncio.run(main())
Check API server logs for detailed error messages and execution progress.
Performance Tips
Use environment variables for portable file paths
For large datasets (>100K rows), increase client timeout and consider chunking
Monitor API server resources during execution
Use the watch=True parameter to see real-time progress:
result = client.run("large_strategy", watch=True)
Consider using the CHUNK_PROCESSOR action for very large files
Enable job persistence for recovery from failures
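To make the chunking tip concrete, the sketch below reads a CSV in fixed-size batches so that only one batch is held in memory at a time. It illustrates the general idea only and is not how the CHUNK_PROCESSOR action is implemented; iter_chunks is a hypothetical helper, and the in-memory file stands in for a large dataset.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows from a CSV file with a header row."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large file: a header plus ten identifier rows.
rows = "uniprot\n" + "\n".join(f"P{n:05d}" for n in range(10)) + "\n"
chunk_sizes = [len(chunk) for chunk in iter_chunks(io.StringIO(rows), chunk_size=4)]
print(chunk_sizes)  # [4, 4, 2]
```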
Advanced Features
Environment Variables
Strategies support variable substitution:
parameters:
  data_dir: "${DATA_DIR:-/default/path}"
steps:
  - action:
      params:
        file_path: "${parameters.data_dir}/file.csv"
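The two placeholder forms shown above, ${VAR:-default} and ${parameters.name}, can be modeled with a small regular-expression substitution. This is a sketch under assumptions: it follows shell-style semantics for ":-" (use the default when the variable is unset or empty), and substitute is a hypothetical function; Biomapper's actual resolver may behave differently.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:-default}; NAME may contain dots (e.g. parameters.data_dir).
_PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::-([^}]*))?\}")


def substitute(text: str, parameters: Mapping[str, str],
               env: Optional[Mapping[str, str]] = None) -> str:
    """Replace placeholders in text using strategy parameters and environment variables."""
    env = os.environ if env is None else env

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name.startswith("parameters."):
            return parameters[name[len("parameters."):]]
        value = env.get(name)
        if value:
            return value
        if default is not None:
            return default
        raise KeyError(f"unset variable with no default: {name}")

    return _PLACEHOLDER.sub(replace, text)


# Resolve the parameter first, then the path that refers to it.
fake_env = {"DATA_DIR": "/mnt/study"}
params = {"data_dir": substitute("${DATA_DIR:-/default/path}", {}, fake_env)}
with_env = substitute("${parameters.data_dir}/file.csv", params, fake_env)

params_default = {"data_dir": substitute("${DATA_DIR:-/default/path}", {}, {})}
without_env = substitute("${parameters.data_dir}/file.csv", params_default, {})

print(with_env)     # /mnt/study/file.csv
print(without_env)  # /default/path/file.csv
```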
Progress Tracking
Use Server-Sent Events for real-time progress:
result = client.run_with_progress("my_strategy")
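Server-Sent Events are a line-oriented text format: each event is a group of "field: value" lines ended by a blank line, and lines beginning with a colon are comments. The sketch below parses that format from an iterable of lines. The event name and JSON payload fields in the sample stream are invented for illustration; the messages the Biomapper server actually emits are not documented here, and this parser ignores the id and retry fields.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event, with its event type and joined data payload."""
    event_type = "message"  # default event type when no "event:" line is present
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_lines:
                yield {"event": event_type, "data": "\n".join(data_lines)}
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)


# Hypothetical stream; the payload shape is an assumption.
stream = [
    'event: progress\n', 'data: {"step": "load_data", "percent": 33}\n', '\n',
    ': keep-alive\n',
    'event: progress\n', 'data: {"step": "normalize", "percent": 66}\n', '\n',
]
events = list(parse_sse(stream))
for event in events:
    payload = json.loads(event["data"])
    print(event["event"], payload["step"], payload["percent"])
```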
Job Recovery
Jobs are persisted to SQLite for recovery:
# Check job status
job = client.get_job(job_id)
if job.status == "failed":
    # Retry from last checkpoint
    result = client.retry_job(job_id)
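To illustrate what checkpoint-based recovery with SQLite can look like in principle, the sketch below stores the index of the last completed step for each job and uses it to decide where a retry resumes. The table layout, function names, and job ID are all hypothetical; this is not Biomapper's persistence schema.

```python
import sqlite3
from typing import List

# In-memory database for the example; a real deployment would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, last_completed_step INTEGER NOT NULL)"
)


def save_checkpoint(job_id: str, status: str, last_completed_step: int) -> None:
    """Insert or overwrite the job row; -1 means no step has completed yet."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
            (job_id, status, last_completed_step),
        )


def remaining_steps(job_id: str, steps: List[str]) -> List[str]:
    """Return the steps still to run, starting after the last completed one."""
    row = conn.execute(
        "SELECT last_completed_step FROM jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    start = 0 if row is None else row[0] + 1
    return steps[start:]


steps = ["load_data", "normalize", "calculate_quality"]
save_checkpoint("job-1", "running", -1)
save_checkpoint("job-1", "running", 0)  # load_data finished
save_checkpoint("job-1", "failed", 0)   # normalize failed; checkpoint unchanged
todo = remaining_steps("job-1", steps)
print(todo)  # ['normalize', 'calculate_quality']
```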
Next Steps
See Configuration Guide for advanced YAML strategy options
Check the API Reference for the complete API documentation
Review Actions Reference for all available actions
Explore templates in configs/strategies/templates/
Read Creating New Actions to add custom actions
—
Verification Sources
Last verified: 2025-08-17
This documentation was verified against the following project resources:
/biomapper/CLAUDE.md (CLI commands and best practices)
/biomapper/README.md (Installation and quick start)
/biomapper/pyproject.toml (Project configuration)