Configuration Guide
Biomapper uses YAML strategy files to define mapping workflows. Strategies can include metadata for tracking, runtime parameters with environment variable substitution, and a sequence of self-registering actions. This guide covers strategy configuration, action parameters, and best practices.
Strategy File Structure
Every strategy file follows this structure:
# Optional metadata for tracking and organization
metadata:
  id: "strategy_unique_id"
  name: "Human Readable Name"
  version: "1.0.0"
  entity_type: "proteins"       # or metabolites, chemistry
  quality_tier: "experimental"  # or production, deprecated
# Optional runtime parameters with defaults
parameters:
  output_dir: "${OUTPUT_DIR:-/tmp/outputs}"
  threshold: 0.85
  batch_size: 1000
# Required: strategy execution steps
name: "STRATEGY_NAME"
description: "What this strategy does"
steps:
  - name: step1
    action:
      type: ACTION_TYPE
      params:
        parameter1: "${parameters.threshold}"  # Use parameters
        parameter2: "/data/input.csv"
  - name: step2
    action:
      type: ACTION_TYPE
      params:
        input_key: step1_output  # Reference previous outputs
        output_key: final_result
Required Fields
- name
Unique identifier for the strategy. Use UPPERCASE_WITH_UNDERSCORES.
- description
Human-readable description of what the strategy accomplishes.
- steps
List of actions to execute in order.
Each step requires:
- name
Step identifier within the strategy.
- action.type
One of the 30+ registered action types (see Action Types section).
- action.params
Parameters specific to that action type.
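The required fields above can be checked before a strategy is run. The following is a minimal sketch, not Biomapper's own validator: it assumes the YAML file has already been parsed into a plain Python dict, and the function name `validate_strategy` is hypothetical.

```python
from typing import Any

# Top-level fields this guide lists as required.
REQUIRED_TOP_LEVEL = ("name", "description", "steps")


def validate_strategy(strategy: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the required fields are present."""
    problems = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in strategy:
            problems.append(f"missing top-level field: {field}")

    for index, step in enumerate(strategy.get("steps") or []):
        label = step.get("name", f"#{index}")
        if "name" not in step:
            problems.append(f"step {label}: missing name")
        action = step.get("action") or {}
        if "type" not in action:
            problems.append(f"step {label}: missing action.type")
        if "params" not in action:
            problems.append(f"step {label}: missing action.params")
    return problems


# A deliberately incomplete strategy: no description, and the step has no params.
strategy = {
    "name": "BASIC_PROTEIN_MAPPING",
    "steps": [
        {"name": "load_source", "action": {"type": "LOAD_DATASET_IDENTIFIERS"}},
    ],
}
problems = validate_strategy(strategy)
print(problems)
```

The check only covers the presence of fields. Whether `action.type` names a registered action, and whether the params suit that action, would need the action registry itself.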
Action Types
Biomapper includes 30+ self-registering actions organized by category:
Data Operations
* LOAD_DATASET_IDENTIFIERS - Load identifiers from CSV/TSV files
* MERGE_DATASETS - Combine multiple datasets
* FILTER_DATASET - Apply filtering criteria
* EXPORT_DATASET - Export to various formats
* CUSTOM_TRANSFORM - Apply Python expressions
Protein Actions
* MERGE_WITH_UNIPROT_RESOLUTION - Historical UniProt ID resolution
* PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract IDs from compound fields
* PROTEIN_NORMALIZE_ACCESSIONS - Standardize protein identifiers
* PROTEIN_MULTI_BRIDGE - Cross-dataset resolution
Metabolite Actions
* CTS_ENRICHED_MATCH - Chemical Translation Service matching
* SEMANTIC_METABOLITE_MATCH - AI-powered semantic matching
* VECTOR_ENHANCED_MATCH - Vector similarity matching
* NIGHTINGALE_NMR_MATCH - Nightingale reference matching
* COMBINE_METABOLITE_MATCHES - Merge multiple approaches
Chemistry Actions
* CHEMISTRY_EXTRACT_LOINC - Extract LOINC codes
* CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy test name matching
* CHEMISTRY_VENDOR_HARMONIZATION - Harmonize vendor data
Analysis Actions
* CALCULATE_SET_OVERLAP - Jaccard similarity analysis
* CALCULATE_THREE_WAY_OVERLAP - Three-dataset comparison
* CALCULATE_MAPPING_QUALITY - Quality metrics
* GENERATE_METABOLOMICS_REPORT - Comprehensive reports
Common Action Parameters
LOAD_DATASET_IDENTIFIERS
Loads identifiers from CSV/TSV files.
Required Parameters:
* file_path: Path to data file (supports environment variables)
* identifier_column: Column name containing identifiers
* output_key: Key to store results in context
Optional Parameters:
* dataset_name: Human-readable name for logging
* filter_empty: Remove empty identifiers (default: true)
* additional_columns: List of extra columns to preserve
- name: load_proteins
  action:
    type: LOAD_DATASET_IDENTIFIERS
    params:
      file_path: "${DATA_DIR:-/data}/proteins.csv"  # Environment variable
      identifier_column: "uniprot_id"
      output_key: "protein_list"
      dataset_name: "My Protein Dataset"
      additional_columns: ["gene_name", "description"]
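To make the parameters concrete, here is a rough sketch of what a loader with these options might do. It is an illustration built on the standard `csv` module, not the action's actual implementation; the function `load_identifiers` and the sample rows are hypothetical.

```python
import csv
import io


def load_identifiers(handle, identifier_column, delimiter=",",
                     filter_empty=True, additional_columns=()):
    """Read rows from an open CSV/TSV handle, keeping the identifier and any extra columns."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if identifier_column not in (reader.fieldnames or []):
        raise KeyError(f"column {identifier_column!r} not in header {reader.fieldnames}")
    rows = []
    for row in reader:
        identifier = (row[identifier_column] or "").strip()
        if filter_empty and not identifier:
            continue  # drop rows without an identifier, like filter_empty: true
        kept = {identifier_column: identifier}
        for column in additional_columns:
            kept[column] = row.get(column)
        rows.append(kept)
    return rows


# Made-up sample data standing in for proteins.csv.
sample = io.StringIO(
    "uniprot_id,gene_name,description\n"
    "P12345,GENE_A,first example row\n"
    ",GENE_B,row with no accession\n"
    "Q9Y6K9,GENE_C,third example row\n"
)
context = {}  # stands in for the shared execution context
context["protein_list"] = load_identifiers(
    sample, "uniprot_id", additional_columns=["gene_name"]
)
print(context["protein_list"])
```

For a TSV file the same call would pass `delimiter="\t"`. Storing the result under `output_key` in a dict mirrors how later steps refer to it by key.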
MERGE_WITH_UNIPROT_RESOLUTION
Merges two datasets with historical UniProt identifier resolution.
Required Parameters:
* source_dataset_key: Context key of source dataset
* target_dataset_key: Context key of target dataset
* source_id_column: Column name in source data
* target_id_column: Column name in target data
* output_key: Key to store merged results
- name: merge_data
  action:
    type: MERGE_WITH_UNIPROT_RESOLUTION
    params:
      source_dataset_key: "dataset_a"
      target_dataset_key: "dataset_b"
      source_id_column: "UniProt"
      target_id_column: "uniprot"
      output_key: "merged_dataset"
CALCULATE_SET_OVERLAP
Calculates Jaccard similarity and generates Venn diagrams.
Required Parameters:
* dataset_a_key: Context key of first dataset
* dataset_b_key: Context key of second dataset
* output_key: Key to store overlap results
Optional Parameters:
* generate_venn: Create Venn diagram (default: true)
* output_path: Path for diagram file
- name: find_overlap
  action:
    type: CALCULATE_SET_OVERLAP
    params:
      dataset_a_key: "proteins_a"
      dataset_b_key: "proteins_b"
      output_key: "overlap_stats"
      generate_venn: true
      output_path: "${parameters.output_dir}/venn_diagram.png"
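For reference, Jaccard similarity is the size of the intersection divided by the size of the union of the two identifier sets. The sketch below shows that calculation only. The function name and the shape of the returned dict are assumptions for illustration, not the action's documented output, and diagram generation is left out.

```python
def jaccard_overlap(ids_a, ids_b):
    """Compute set-overlap counts and the Jaccard index for two identifier collections."""
    set_a, set_b = set(ids_a), set(ids_b)
    intersection = set_a & set_b
    union = set_a | set_b
    return {
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "shared": len(intersection),
        # Define the index as 0.0 when both sets are empty to avoid dividing by zero.
        "jaccard": len(intersection) / len(union) if union else 0.0,
    }


overlap_stats = jaccard_overlap(["P1", "P2", "P3"], ["P2", "P3", "P4", "P5"])
print(overlap_stats)
# {'only_a': 1, 'only_b': 2, 'shared': 2, 'jaccard': 0.4}
```

The three counts are the numbers a two-set Venn diagram would display.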
Example Configurations
Basic Protein Mapping
name: "BASIC_PROTEIN_MAPPING"
description: "Load and analyze protein overlap"
steps:
  - name: load_source
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/source_proteins.csv"
        identifier_column: "protein_id"
        output_key: "source_proteins"
  - name: load_target
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/target_proteins.csv"
        identifier_column: "uniprot_ac"
        output_key: "target_proteins"
  - name: calculate_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_a_key: "source_proteins"
        dataset_b_key: "target_proteins"
        output_key: "analysis_results"
Multi-Dataset Comparison
name: "MULTI_DATASET_COMPARISON"
description: "Compare multiple protein datasets with UniProt resolution"
steps:
  - name: load_arivale
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/arivale/proteomics_metadata.tsv"
        identifier_column: "uniprot"
        output_key: "arivale_proteins"
        dataset_name: "Arivale Proteomics"
  - name: load_hpa
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/hpa_osps.csv"
        identifier_column: "uniprot"
        output_key: "hpa_proteins"
        dataset_name: "Human Protein Atlas"
  - name: merge_arivale_hpa
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        source_dataset_key: "arivale_proteins"
        target_dataset_key: "hpa_proteins"
        source_id_column: "uniprot"
        target_id_column: "uniprot"
        output_key: "arivale_hpa_merged"
  - name: analyze_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_a_key: "arivale_hpa_merged"
        dataset_b_key: "hpa_proteins"
        output_key: "final_analysis"
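In multi-step strategies like this one, each step reads keys written by earlier steps, so a misspelled or missing key breaks the chain. The sketch below walks the steps in order and reports reads that nothing has written yet. It rests on a heuristic that is an assumption of this example: any parameter whose name ends in `_key`, other than `output_key`, is treated as a read from the context. The function name `check_key_flow` is hypothetical.

```python
def check_key_flow(steps):
    """Report params ending in '_key' that read a key no earlier step has written."""
    available = set()
    problems = []
    for step in steps:
        params = step["action"].get("params", {})
        for name, value in params.items():
            is_read = name.endswith("_key") and name != "output_key"
            if is_read and value not in available:
                problems.append(
                    f"{step['name']}: {name}={value!r} is not produced by an earlier step"
                )
        # Register this step's output only after its own reads were checked.
        if "output_key" in params:
            available.add(params["output_key"])
    return problems


# The comparison strategy with the load_hpa step accidentally left out.
steps = [
    {"name": "load_arivale", "action": {"type": "LOAD_DATASET_IDENTIFIERS", "params": {
        "file_path": "/data/arivale/proteomics_metadata.tsv",
        "identifier_column": "uniprot",
        "output_key": "arivale_proteins"}}},
    {"name": "merge_arivale_hpa", "action": {"type": "MERGE_WITH_UNIPROT_RESOLUTION", "params": {
        "source_dataset_key": "arivale_proteins",
        "target_dataset_key": "hpa_proteins",
        "source_id_column": "uniprot",
        "target_id_column": "uniprot",
        "output_key": "arivale_hpa_merged"}}},
]
problems = check_key_flow(steps)
print(problems)
```

Actions whose read parameters follow a different naming pattern would need to be listed explicitly instead of relying on the suffix.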
Strategy Organization
File Naming
Use descriptive names that indicate the datasets and purpose:
* ukbb_hpa_mapping.yaml - Maps UKBB to HPA
* multi_protein_comparison.yaml - Compares multiple sources
* arivale_qin_overlap.yaml - Analyzes Arivale vs QIN overlap
Directory Structure
Organize strategies in the configs/strategies/ directory:
configs/strategies/
├── templates/ # Reusable templates
│ ├── protein_mapping_template.yaml
│ ├── metabolite_mapping_template.yaml
│ └── chemistry_mapping_template.yaml
├── experimental/ # In development
│ ├── prot_arv_to_kg2c_uniprot_v2.yaml
│ └── met_multi_to_unified_semantic.yaml
└── production/ # Validated strategies
└── (strategies promoted from experimental)
Data Requirements
File Formats
Strategies work with CSV and TSV files. Ensure your data files:
Have headers in the first row
Use a consistent delimiter (comma for CSV, tab for TSV)
Contain the identifier columns referenced in strategies
Use UTF-8 encoding
File Paths
Use absolute paths or environment variables in strategy files:
# Good - absolute path
file_path: "/data/proteins/ukbb_data.csv"
# Better - environment variable with default
file_path: "${DATA_DIR:-/data}/proteins/ukbb_data.csv"
# Best - use parameters section
parameters:
  data_dir: "${DATA_DIR:-/data}"
steps:
  - action:
      params:
        file_path: "${parameters.data_dir}/proteins/ukbb_data.csv"
Column Names
Ensure the identifier_column exactly matches your CSV headers:
# If your CSV header is "UniProt_ID"
identifier_column: "UniProt_ID"
# Not "uniprot_id" or "UniProt"
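Because the match is exact and case-sensitive, a quick check against the file's header row can catch mismatches early. The following sketch is illustrative only. The helper `check_identifier_column` is hypothetical and uses the standard `difflib` module to suggest the nearest header.

```python
import difflib


def check_identifier_column(header, identifier_column):
    """Return None if the column is present, otherwise a message describing the mismatch."""
    if identifier_column in header:
        return None
    by_lower = {name.lower(): name for name in header}
    if identifier_column.lower() in by_lower:
        actual = by_lower[identifier_column.lower()]
        return f"{identifier_column!r} differs only by case from {actual!r}"
    close = difflib.get_close_matches(identifier_column, header, n=1)
    hint = f"; closest header is {close[0]!r}" if close else ""
    return f"{identifier_column!r} not found{hint}"


header = ["UniProt_ID", "gene_name"]
for candidate in ("UniProt_ID", "uniprot_id", "UniProt"):
    print(candidate, "->", check_identifier_column(header, candidate))
```

In practice the header list would come from the first row of the data file, for example via `csv.DictReader(...).fieldnames`.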
Best Practices
Use descriptive names for steps and output keys
Test with small datasets before running on large files
Keep strategies focused on specific comparisons
Document with metadata including version, quality tier, and expected match rates
Use environment variables for portable file paths
Follow naming conventions:
* Strategy IDs: entity_source_to_target_bridge_version
* Output keys: entity_type_stage (e.g., proteins_normalized)
Track data lineage with source_files and target_files metadata
Set quality expectations with expected_match_rate
Troubleshooting
Common Configuration Errors
- YAML syntax errors
Validate YAML syntax with an online checker.
- Missing required parameters
Check that all required params are provided for each action.
- File path issues
Use absolute paths and verify files exist.
- Column name mismatches
Ensure identifier_column matches CSV headers exactly.
- Key conflicts
Use unique output_key names within each strategy.
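Key conflicts in particular are easy to detect mechanically. A minimal sketch, assuming the strategy's steps are available as a list of dicts shaped like the YAML above (the function name `find_output_key_conflicts` is hypothetical):

```python
from collections import defaultdict


def find_output_key_conflicts(steps):
    """Map each output_key written by more than one step to the names of those steps."""
    writers = defaultdict(list)
    for step in steps:
        key = step["action"].get("params", {}).get("output_key")
        if key is not None:
            writers[key].append(step["name"])
    return {key: names for key, names in writers.items() if len(names) > 1}


# Both load steps write to "proteins", so the second would shadow the first.
steps = [
    {"name": "load_source", "action": {"type": "LOAD_DATASET_IDENTIFIERS",
                                       "params": {"output_key": "proteins"}}},
    {"name": "load_target", "action": {"type": "LOAD_DATASET_IDENTIFIERS",
                                       "params": {"output_key": "proteins"}}},
    {"name": "calculate_overlap", "action": {"type": "CALCULATE_SET_OVERLAP",
                                             "params": {"output_key": "analysis_results"}}},
]
conflicts = find_output_key_conflicts(steps)
print(conflicts)
# {'proteins': ['load_source', 'load_target']}
```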
Validation
Before deploying strategies:
Check YAML syntax is valid
Verify all file paths exist and are readable
Confirm column names match data files
Test with small sample datasets first
Review logs for any warnings or errors
Environment Variables
Strategies support variable substitution:
* ${VAR} or ${env.VAR} - Environment variable
* ${VAR:-default} - With default value
* ${parameters.key} - Reference parameters section
* ${metadata.field} - Reference metadata fields
Common environment variables:
* DATA_DIR - Base data directory
* OUTPUT_DIR - Output directory
* BIOMAPPER_CONFIG - Configuration path
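The substitution forms listed above can be illustrated with a small resolver. This is a sketch of the syntax only, not Biomapper's resolver. The function `substitute` is hypothetical, and it assumes a default applies only when the variable is unset, which may differ from the real behavior for empty values.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}; NAME may carry a scope prefix such as "parameters.".
_PATTERN = re.compile(r"\$\{([^}:]+)(?::-([^}]*))?\}")


def substitute(text, parameters=None, metadata=None, env=None):
    """Expand ${...} references in text using parameters, metadata, and the environment."""
    env = os.environ if env is None else env
    scopes = {"parameters": parameters or {}, "metadata": metadata or {}, "env": env}

    def replace(match):
        name, default = match.group(1), match.group(2)
        scope, _, key = name.partition(".")
        if key and scope in scopes:
            value = scopes[scope].get(key)  # ${parameters.key}, ${metadata.field}, ${env.VAR}
        else:
            value = env.get(name)           # bare ${VAR}
        if value is None:
            if default is None:
                raise KeyError(f"unresolved variable: {name}")
            return default
        return str(value)

    return _PATTERN.sub(replace, text)


env = {"DATA_DIR": "/mnt/data"}  # explicit mapping so the example does not depend on the shell
params = {"output_dir": substitute("${OUTPUT_DIR:-/tmp/outputs}", env=env)}
print(substitute("${DATA_DIR:-/data}/proteins.csv", env=env))
print(substitute("${parameters.output_dir}/venn_diagram.png", parameters=params, env=env))
```

Resolving the parameters section first and then passing it in as `parameters` lets step params refer to values that themselves came from environment variables.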
Next Steps
See the Usage Guide for executing strategies
Check the Actions Reference for full details on each action
Review templates in configs/strategies/templates/
Learn about the REST API Reference for programmatic execution
—
Verification Sources
Last verified: 2025-08-17
This documentation was verified against the following project resources:
* /biomapper/CLAUDE.md (Best practices and conventions)
* /biomapper/README.md (Configuration overview)
* /biomapper/pyproject.toml (Project configuration)