YAML Strategy Schema Documentation
Overview
This document provides the complete schema reference for defining mapping strategies in YAML configuration files. The YAML strategy system allows users to create flexible, multi-step mapping workflows using the 37+ self-registering actions available in BioMapper.
Schema Structure
Top-Level Configuration
name: "STRATEGY_NAME"
description: "Brief description of what this strategy does"
metadata:
  id: "unique_strategy_identifier"
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"
  author: "author@institution.edu"
  tags: ["tag1", "tag2"]
parameters:
  param_name: "${ENV_VAR:-default_value}"
  # User-configurable parameters
steps:
  - name: "step_name"
    action:
      type: "ACTION_TYPE"
      params:
        # Parameters specific to the action type
Required Fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Strategy identifier (uppercase with underscores) |
| description | string | No | Human-readable strategy description |
| metadata | object | No | Strategy metadata including version, author, tags |
| parameters | object | No | User-configurable parameters with variable substitution |
| steps | array | Yes | List of steps to execute sequentially |
Step Structure
Each step in the steps array has this structure:
- name: "descriptive_step_name"
  action:
    type: "ACTION_TYPE"
    params:
      parameter1: value1
      parameter2: value2
Step Fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Descriptive name for the step |
| action.type | string | Yes | One of the 37+ registered action types |
| action.params | object | Yes | Parameters specific to the action type (validated by Pydantic) |
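The required fields in the two tables above can be checked before a strategy is run. The following is a minimal sketch of such a structural check on an already-parsed strategy (a plain Python dict); `validate_strategy` and its error messages are hypothetical names chosen for illustration, not part of BioMapper, which validates action parameters with Pydantic models.

```python
from typing import Any

# Top-level fields marked "Required: Yes" in the table above.
REQUIRED_TOP_LEVEL = ("name", "steps")


def validate_strategy(strategy: dict[str, Any]) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is valid."""
    errors: list[str] = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in strategy:
            errors.append(f"missing top-level field: {field}")

    steps = strategy.get("steps", [])
    if not isinstance(steps, list):
        return errors + ["steps must be a list"]

    for index, step in enumerate(steps):
        label = f"steps[{index}]"
        if not isinstance(step, dict):
            errors.append(f"{label}: must be a mapping")
            continue
        if not isinstance(step.get("name"), str):
            errors.append(f"{label}: missing string field 'name'")
        action = step.get("action")
        if not isinstance(action, dict):
            errors.append(f"{label}: missing mapping 'action'")
            continue
        if not isinstance(action.get("type"), str):
            errors.append(f"{label}: missing string field 'action.type'")
        if not isinstance(action.get("params"), dict):
            errors.append(f"{label}: missing mapping 'action.params'")
    return errors
```

Collecting every problem in a list, rather than raising on the first one, lets an author fix a strategy file in a single pass.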
Common Action Types
Data Loading Actions
LOAD_DATASET_IDENTIFIERS
Loads identifiers from CSV/TSV files with flexible column mapping.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_path | string | Yes | - | Absolute path to the data file |
| identifier_column | string | Yes | - | Column name containing identifiers |
| output_key | string | Yes | - | Key to store results in context |
| dataset_name | string | No | - | Human-readable name for logging |
|  | string | No | - | Prefix to remove from identifiers |
|  | string | No | - | Column to apply filtering on |
|  | array | No | - | Values/patterns to filter by |
|  | string | No | "include" | "include" or "exclude" |
| drop_empty_ids | boolean | No | true | Drop rows with empty identifiers |
Example:
- name: load_ukbb_proteins
  action:
    type: LOAD_DATASET_IDENTIFIERS
    params:
      file_path: "/data/ukbb_proteins.tsv"
      identifier_column: "UniProt"
      output_key: "ukbb_proteins"
      dataset_name: "UK Biobank Proteins"
      drop_empty_ids: true
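To make the loading behaviour concrete, here is a rough sketch of what reading identifiers from a delimited file might involve, using only the standard `csv` module. It is not the BioMapper implementation: the function name `load_identifiers`, the `strip_prefix` argument name, and the added `identifier` column are assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, Optional


def load_identifiers(
    lines: Iterable[str],
    identifier_column: str,
    delimiter: str = "\t",
    strip_prefix: Optional[str] = None,  # hypothetical name for the prefix option
    drop_empty_ids: bool = True,
) -> list[dict[str, str]]:
    """Read delimited rows and add a cleaned 'identifier' value to each row."""
    rows = []
    for row in csv.DictReader(lines, delimiter=delimiter):
        identifier = (row.get(identifier_column) or "").strip()
        if strip_prefix and identifier.startswith(strip_prefix):
            identifier = identifier[len(strip_prefix):]
        if drop_empty_ids and not identifier:
            continue  # skip rows without a usable identifier
        rows.append({**row, "identifier": identifier})
    return rows


# In-memory stand-in for a TSV file; the second data row has an empty identifier.
sample = io.StringIO("UniProt\tAssay\nUniProtKB:P12345\tA1\n\tA2\nQ9Y6K9\tA3\n")
loaded = load_identifiers(sample, "UniProt", strip_prefix="UniProtKB:")
print([row["identifier"] for row in loaded])  # ['P12345', 'Q9Y6K9']
```

For a real file, the same function can be given an open file handle, for example `open(path, newline="")`.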
Protein Actions
PROTEIN_NORMALIZE_ACCESSIONS
Normalizes and validates UniProt accessions.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| input_key | string | Yes | - | Context key of input dataset |
| output_key | string | Yes | - | Key to store normalized results |
| remove_isoforms | boolean | No | true | Remove isoform suffixes (-1, -2, etc.) |
| validate_format | boolean | No | true | Validate UniProt accession format |
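The two options above can be illustrated with a small sketch. It uses the accession pattern published by UniProt; the function name `normalize_accession`, the upper-casing step, and returning `None` for invalid input are assumptions for illustration and may differ from what the action actually does.

```python
import re
from typing import Optional

# Accession format as documented by UniProt (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def normalize_accession(
    raw: str, remove_isoforms: bool = True, validate_format: bool = True
) -> Optional[str]:
    """Return a cleaned accession, or None if it fails format validation."""
    accession = raw.strip().upper()
    if remove_isoforms:
        accession = ISOFORM_SUFFIX.sub("", accession)
    if validate_format:
        # Validate the base accession even when the isoform suffix is kept.
        base = ISOFORM_SUFFIX.sub("", accession)
        if not UNIPROT_ACCESSION.fullmatch(base):
            return None
    return accession


print(normalize_accession(" p12345-2 "))  # P12345
print(normalize_accession("NOT_AN_ID"))   # None
```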
MERGE_DATASETS
Merges two datasets on specified columns.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset1_key | string | Yes | - | Context key of first dataset |
| dataset2_key | string | Yes | - | Context key of second dataset |
| merge_column1 | string | Yes | - | Column name in first dataset |
| merge_column2 | string | Yes | - | Column name in second dataset |
| output_key | string | Yes | - | Key to store merged results |
Example:
- name: merge_datasets
  action:
    type: MERGE_DATASETS
    params:
      dataset1_key: "ukbb_proteins"
      dataset2_key: "hpa_proteins"
      merge_column1: "UniProt"
      merge_column2: "uniprot"
      output_key: "merged_dataset"
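The idea of merging on two differently named columns can be sketched in plain Python. The join type used by MERGE_DATASETS is not stated in this document, so the outer join below, the function name `outer_merge`, and the `key`/`first`/`second` result fields are all assumptions for illustration.

```python
from typing import Any, Optional

Row = dict[str, Any]


def outer_merge(
    rows1: list[Row], rows2: list[Row], merge_column1: str, merge_column2: str
) -> list[dict[str, Optional[Any]]]:
    """Pair rows whose merge columns are equal; unmatched rows get a None partner."""
    # Index the second dataset by its merge column for constant-time lookups.
    by_key2: dict[Any, list[Row]] = {}
    for row in rows2:
        by_key2.setdefault(row[merge_column2], []).append(row)

    merged = []
    matched = set()
    for row in rows1:
        key = row[merge_column1]
        partners = by_key2.get(key, [])
        if partners:
            matched.add(key)
            merged.extend({"key": key, "first": row, "second": p} for p in partners)
        else:
            merged.append({"key": key, "first": row, "second": None})

    # Append rows that occur only in the second dataset.
    for key, partners in by_key2.items():
        if key not in matched:
            merged.extend({"key": key, "first": None, "second": p} for p in partners)
    return merged


ukbb = [{"UniProt": "P1"}, {"UniProt": "P2"}]
hpa = [{"uniprot": "P2"}, {"uniprot": "P3"}]
merged = outer_merge(ukbb, hpa, "UniProt", "uniprot")
print([m["key"] for m in merged])  # ['P1', 'P2', 'P3']
```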
Analysis Actions
CALCULATE_SET_OVERLAP
Calculates overlap statistics between two datasets and generates Venn diagrams.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| merged_dataset_key | string | Yes | - | Context key of merged dataset |
| source_name | string | Yes | - | Display name for source dataset |
| target_name | string | Yes | - | Display name for target dataset |
| output_key | string | Yes | - | Key to store overlap results |
| mapping_combo_id | string | No | - | Unique identifier for this mapping |
| confidence_threshold | number | No | 0.0 | Minimum confidence for high-quality matches |
| output_directory | string | No | "data/results" | Directory for output files |
Example:
- name: calculate_overlap
  action:
    type: CALCULATE_SET_OVERLAP
    params:
      merged_dataset_key: "merged_dataset"
      source_name: "UKBB"
      target_name: "HPA"
      output_key: "overlap_statistics"
      mapping_combo_id: "UKBB_HPA_ANALYSIS"
      confidence_threshold: 0.7
      output_directory: "data/results/UKBB_HPA"
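As a rough illustration of the kind of statistics an overlap analysis produces, the sketch below computes counts and a Jaccard index from two identifier collections. It works on plain sets rather than a merged dataset, and the function name and result field names are hypothetical; the statistics the action actually reports are not listed in this document.

```python
from typing import Iterable


def overlap_statistics(source_ids: Iterable[str], target_ids: Iterable[str]) -> dict:
    """Summarize how two identifier collections overlap."""
    source, target = set(source_ids), set(target_ids)
    shared = source & target
    union = source | target
    return {
        "source_total": len(source),
        "target_total": len(target),
        "shared": len(shared),
        "source_only": len(source - target),
        "target_only": len(target - source),
        # Jaccard index: shared identifiers divided by all distinct identifiers.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


stats = overlap_statistics(["P1", "P2", "P3"], ["P2", "P3", "P4", "P5"])
print(stats["shared"], stats["source_only"], stats["target_only"], stats["jaccard"])
# 2 1 2 0.4
```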
Complete Example
Here’s a complete strategy that loads two protein datasets, normalizes them, merges them, calculates their overlap, and exports the results:
name: "UKBB_HPA_PROTEIN_COMPARISON"
description: "Compare protein coverage between UK Biobank and Human Protein Atlas"
metadata:
  id: "ukbb_hpa_protein_comparison_v1"
  entity_type: "proteins"
  quality_tier: "production"
  version: "1.0.0"
  author: "researcher@institution.edu"
  tags: ["ukbb", "hpa", "proteins", "overlap"]
parameters:
  ukbb_file: "${UKBB_FILE:-/data/ukbb_proteins.tsv}"
  hpa_file: "${HPA_FILE:-/data/hpa_proteins.csv}"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"
steps:
  # Step 1: Load UK Biobank protein data
  - name: load_ukbb_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.ukbb_file}"
        identifier_column: "UniProt"
        output_key: "ukbb_proteins_raw"
        dataset_name: "UK Biobank Proteins"
  # Step 2: Normalize UK Biobank proteins
  - name: normalize_ukbb
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "ukbb_proteins_raw"
        output_key: "ukbb_proteins"
        remove_isoforms: true
        validate_format: true
  # Step 3: Load Human Protein Atlas data
  - name: load_hpa_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.hpa_file}"
        identifier_column: "uniprot"
        output_key: "hpa_proteins_raw"
        dataset_name: "Human Protein Atlas"
  # Step 4: Normalize HPA proteins
  - name: normalize_hpa
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "hpa_proteins_raw"
        output_key: "hpa_proteins"
        remove_isoforms: true
        validate_format: true
  # Step 5: Merge datasets
  - name: merge_protein_data
    action:
      type: MERGE_DATASETS
      params:
        dataset1_key: "ukbb_proteins"
        dataset2_key: "hpa_proteins"
        merge_column1: "identifier"
        merge_column2: "identifier"
        output_key: "merged_proteins"
  # Step 6: Calculate overlap statistics
  - name: analyze_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        merged_dataset_key: "merged_proteins"
        source_name: "UKBB"
        target_name: "HPA"
        output_key: "overlap_analysis"
        mapping_combo_id: "UKBB_HPA_COMPARISON"
        confidence_threshold: 0.7
        output_directory: "${parameters.output_dir}/UKBB_HPA"
  # Step 7: Export results
  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: "overlap_analysis"
        output_file: "${parameters.output_dir}/overlap_results.csv"
        format: "csv"
Data Flow Between Steps
Steps share data through the context dictionary: the output_key written by one step is the key that later steps read as their input:
Step 1: LOAD_DATASET_IDENTIFIERS → context["datasets"]["ukbb_proteins_raw"]
Step 2: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["ukbb_proteins"]
Step 3: LOAD_DATASET_IDENTIFIERS → context["datasets"]["hpa_proteins_raw"]
Step 4: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["hpa_proteins"]
Step 5: MERGE_DATASETS → context["datasets"]["merged_proteins"]
Step 6: CALCULATE_SET_OVERLAP → context["datasets"]["overlap_analysis"]
Step 7: EXPORT_DATASET → context["output_files"].append("overlap_results.csv")
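The flow above can be mimicked with a few lines of Python. This is a toy sketch, not the BioMapper executor: the action names `TOY_LOAD` and `TOY_UPPER`, the `values` parameter, and the `run_steps` function are hypothetical; only the `context["datasets"]` and `context["output_files"]` layout is taken from the listing above.

```python
from typing import Any, Callable

Context = dict[str, Any]
Action = Callable[[dict[str, Any], Context], None]


def toy_load(params: dict[str, Any], context: Context) -> None:
    # Writes a dataset under the step's output_key.
    context["datasets"][params["output_key"]] = list(params["values"])


def toy_upper(params: dict[str, Any], context: Context) -> None:
    # Reads the dataset stored by an earlier step, writes a transformed copy.
    source = context["datasets"][params["input_key"]]
    context["datasets"][params["output_key"]] = [value.upper() for value in source]


REGISTRY: dict[str, Action] = {"TOY_LOAD": toy_load, "TOY_UPPER": toy_upper}


def run_steps(steps: list[dict[str, Any]], registry: dict[str, Action]) -> Context:
    """Execute steps in order, sharing one context dictionary between them."""
    context: Context = {"datasets": {}, "output_files": []}
    for step in steps:
        action = registry[step["action"]["type"]]
        action(step["action"]["params"], context)
    return context


steps = [
    {"name": "load", "action": {"type": "TOY_LOAD",
                                "params": {"values": ["p12345"], "output_key": "raw"}}},
    {"name": "clean", "action": {"type": "TOY_UPPER",
                                 "params": {"input_key": "raw", "output_key": "clean"}}},
]
context = run_steps(steps, REGISTRY)
print(context["datasets"])  # {'raw': ['p12345'], 'clean': ['P12345']}
```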
Variable Substitution
The strategy system supports multiple variable substitution patterns:
${parameters.key}: Access strategy parameters
${env.VAR_NAME}: Access environment variables explicitly
${VAR_NAME}: Shorthand for environment variables
${metadata.field}: Access metadata fields
${VAR:-default}: Provide default value if variable not set
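The substitution patterns listed above can be resolved with a single regular expression. The sketch below is one possible way to do it and is not taken from BioMapper's substitution code; the `substitute` function, its explicit `env` argument, and the choice to raise `KeyError` for an unresolved placeholder are assumptions.

```python
import re
from typing import Any, Mapping

# Matches ${name} and ${name:-default}; the default part is optional.
PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::-([^}]*))?\}")


def substitute(
    text: str,
    parameters: Mapping[str, Any],
    metadata: Mapping[str, Any],
    env: Mapping[str, str],
) -> str:
    """Replace every placeholder in text using parameters, metadata, and env."""

    def resolve(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name.startswith("parameters."):
            value = parameters.get(name[len("parameters."):])
        elif name.startswith("metadata."):
            value = metadata.get(name[len("metadata."):])
        elif name.startswith("env."):
            value = env.get(name[len("env."):])
        else:
            value = env.get(name)  # bare ${VAR_NAME} shorthand
        if value is None:
            if default is None:
                raise KeyError(f"unresolved placeholder: {match.group(0)}")
            return default
        return str(value)

    return PLACEHOLDER.sub(resolve, text)


print(substitute("${UKBB_FILE:-/data/ukbb_proteins.tsv}", {}, {}, {}))
# /data/ukbb_proteins.tsv
print(substitute("${parameters.output_dir}/UKBB_HPA", {"output_dir": "/tmp/results"}, {}, {}))
# /tmp/results/UKBB_HPA
```

Passing the environment as an argument (for example `os.environ`) keeps the function easy to test.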
File Path Considerations
Absolute paths recommended: Use full paths like /data/proteins.csv
Relative paths supported: Relative to the working directory where the strategy is executed
Variable substitution: Use ${parameters.file_path} for configurable paths
Output directories: Created automatically if they don’t exist
Validation
The YAML strategy is validated at multiple levels:
Schema validation: Ensures all required fields are present
Parameter validation: Uses Pydantic models for type checking and constraints
Action validation: Verifies action type exists in ACTION_REGISTRY
Reference validation: Checks that referenced context keys exist during execution
File path validation: Verifies input files exist at execution time
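Reference validation is described above as happening during execution, but a similar check can be approximated before running anything. The sketch below walks the steps in order and reports any parameter that refers to a context key no earlier step has produced. Treating every parameter ending in `_key` (other than `output_key`) as a reference is a convention inferred from the examples in this document, and `find_unresolved_references` is a hypothetical helper.

```python
from typing import Any


def find_unresolved_references(steps: list[dict[str, Any]]) -> list[tuple[str, str, str]]:
    """Return (step name, parameter, key) for references to keys not yet produced."""
    produced: set[str] = set()
    problems = []
    for step in steps:
        params = step["action"]["params"]
        for name, value in params.items():
            is_reference = name.endswith("_key") and name != "output_key"
            if is_reference and value not in produced:
                problems.append((step["name"], name, value))
        # A step's output only becomes available to the steps after it.
        if "output_key" in params:
            produced.add(params["output_key"])
    return problems


steps = [
    {"name": "load", "action": {"type": "LOAD_DATASET_IDENTIFIERS",
                                "params": {"output_key": "raw"}}},
    {"name": "normalize", "action": {"type": "PROTEIN_NORMALIZE_ACCESSIONS",
                                     "params": {"input_key": "raw", "output_key": "clean"}}},
    {"name": "merge", "action": {"type": "MERGE_DATASETS",
                                 "params": {"dataset1_key": "clean", "dataset2_key": "hpa",
                                            "output_key": "merged"}}},
]
print(find_unresolved_references(steps))  # [('merge', 'dataset2_key', 'hpa')]
```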
Error Handling
When a step fails:
Execution stops immediately
Error details are logged
Previous steps’ results are preserved in context
API returns error information with context state
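The failure behaviour listed above can be sketched as a loop that stops at the first exception and returns the context accumulated so far. This is an illustration only: the result fields (`status`, `failed_step`, `error`, `context`) and the toy actions are hypothetical and do not describe the actual API response.

```python
import logging
from typing import Any, Callable

Context = dict[str, Any]


def run_with_error_capture(
    steps: list[dict[str, Any]],
    registry: dict[str, Callable[[dict[str, Any], Context], None]],
) -> dict[str, Any]:
    """Run steps in order; on the first failure, stop and report what was kept."""
    context: Context = {"datasets": {}}
    for step in steps:
        try:
            action = registry[step["action"]["type"]]
            action(step["action"]["params"], context)
        except Exception as exc:
            logging.error("step %r failed: %s", step["name"], exc)
            # Results of earlier steps remain available in the returned context.
            return {"status": "failed", "failed_step": step["name"],
                    "error": str(exc), "context": context}
    return {"status": "completed", "context": context}


def store(params: dict[str, Any], context: Context) -> None:
    context["datasets"][params["output_key"]] = params["value"]


def explode(params: dict[str, Any], context: Context) -> None:
    raise ValueError("input file not found")


steps = [
    {"name": "first", "action": {"type": "STORE", "params": {"output_key": "a", "value": [1]}}},
    {"name": "second", "action": {"type": "EXPLODE", "params": {}}},
    {"name": "third", "action": {"type": "STORE", "params": {"output_key": "c", "value": [3]}}},
]
outcome = run_with_error_capture(steps, {"STORE": store, "EXPLODE": explode})
print(outcome["status"], outcome["failed_step"], outcome["context"]["datasets"])
# failed second {'a': [1]}
```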
Best Practices
Naming Conventions
Strategy names: UPPERCASE_WITH_UNDERSCORES
Step names: lowercase_with_underscores, descriptive
Output keys: descriptive, reflect data content
Dataset names: Human-readable for logging
Strategy Design
Sequential steps: Each step builds on previous results
Descriptive names: Make the workflow self-documenting
Logical grouping: Group related operations
Error consideration: Plan for missing files or empty datasets
File Organization
configs/
├── simple_strategies/
│   ├── load_single_dataset.yaml
│   └── basic_comparison.yaml
├── protein_strategies/
│   ├── ukbb_hpa_comparison.yaml
│   └── multi_source_analysis.yaml
└── production_strategies/
    └── comprehensive_protein_mapping.yaml
Performance Considerations
File sizes: Large files (>1M rows) may require increased timeouts
API calls: UniProt resolution adds significant time for unmatched IDs
Memory usage: Large datasets are processed in memory
Output files: Venn diagrams and CSV files are generated for each analysis
Integration with API
Strategies are executed via the REST API or Python client:
Using Python Client (Synchronous)
from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

# Execute with custom parameters
result = client.run(
    strategy_name="UKBB_HPA_PROTEIN_COMPARISON",
    parameters={
        "ukbb_file": "/custom/path/ukbb.tsv",
        "hpa_file": "/custom/path/hpa.csv",
        "output_dir": "/custom/output",
    },
)

print(f"Job ID: {result['job_id']}")
print(f"Status: {result['status']}")
print(f"Results: {result['results']}")
Using REST API Directly
curl -X POST "http://localhost:8000/api/strategies/v2/" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy_name": "UKBB_HPA_PROTEIN_COMPARISON",
    "parameters": {
      "ukbb_file": "/data/ukbb.tsv",
      "hpa_file": "/data/hpa.csv"
    }
  }'
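The same request can be prepared from Python without the BioMapper client, using only the standard library. The sketch below builds the request object and leaves sending it to the caller; the endpoint path and payload shape are copied from the curl command above, while `build_strategy_request` is a hypothetical helper and the response format is not assumed.

```python
import json
import urllib.request
from typing import Any


def build_strategy_request(
    base_url: str, strategy_name: str, parameters: dict[str, Any]
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = json.dumps(
        {"strategy_name": strategy_name, "parameters": parameters}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/strategies/v2/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_strategy_request(
    "http://localhost:8000",
    "UKBB_HPA_PROTEIN_COMPARISON",
    {"ukbb_file": "/data/ukbb.tsv", "hpa_file": "/data/hpa.csv"},
)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```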
Available Actions Reference
BioMapper provides 37+ self-registering actions organized by entity type:
Protein Actions
PROTEIN_NORMALIZE_ACCESSIONS - Standardize UniProt identifiers
PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract UniProt IDs from compound fields
Metabolite Actions
NIGHTINGALE_NMR_MATCH - Nightingale platform matching
SEMANTIC_METABOLITE_MATCH - AI-powered matching
Chemistry Actions
CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy clinical test matching
Data Processing Actions
LOAD_DATASET_IDENTIFIERS - Load identifiers from files
MERGE_DATASETS - Merge datasets on common columns
EXPORT_DATASET - Export results to files
FILTER_DATASET - Apply filtering criteria
CUSTOM_TRANSFORM - Apply custom transformations
PARSE_COMPOSITE_IDENTIFIERS - Parse compound identifier fields
Reporting Actions
GENERATE_MAPPING_VISUALIZATIONS - Create mapping visualizations
GENERATE_LLM_ANALYSIS - Generate AI-powered analysis reports
I/O Actions
SYNC_TO_GOOGLE_DRIVE_V2 - Sync results to Google Drive
Verification Sources
Last verified: 2025-01-18
This documentation was verified against the following project resources:
/home/ubuntu/biomapper/src/configs/strategies/ (YAML strategy organization by entity type)
/home/ubuntu/biomapper/src/core/minimal_strategy_service.py (Parameter substitution logic and context management)
/home/ubuntu/biomapper/src/actions/load_dataset_identifiers.py (LOAD_DATASET_IDENTIFIERS action parameters)
/home/ubuntu/biomapper/src/actions/merge_datasets.py (MERGE_DATASETS action parameters)
/home/ubuntu/biomapper/src/actions/export_dataset.py (EXPORT_DATASET action parameters)
/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/normalize_accessions.py (PROTEIN_NORMALIZE_ACCESSIONS action)
/home/ubuntu/biomapper/src/client/client_v2.py (BiomapperClient.run() method and parameter passing)
/home/ubuntu/biomapper/src/actions/ (Action registry and available actions)