YAML Strategy Schema Documentation

Overview

This document provides the complete schema reference for defining mapping strategies in YAML configuration files. The YAML strategy system allows users to create flexible, multi-step mapping workflows using the 37+ self-registering actions available in BioMapper.

Schema Structure

Top-Level Configuration

name: "STRATEGY_NAME"
description: "Brief description of what this strategy does"

metadata:
  id: "unique_strategy_identifier"
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"
  author: "author@institution.edu"
  tags: ["tag1", "tag2"]

parameters:
  param_name: "${ENV_VAR:-default_value}"
  # User-configurable parameters

steps:
  - name: "step_name"
    action:
      type: "ACTION_TYPE"
      params:
        # Parameters specific to the action type

Top-Level Fields

Field	Type	Required	Description
`name`	string	Yes	Strategy identifier (uppercase with underscores)
`description`	string	No	Human-readable strategy description
`metadata`	object	No	Strategy metadata including version, author, tags
`parameters`	object	No	User-configurable parameters with variable substitution
`steps`	array	Yes	List of steps to execute sequentially

Step Structure

Each step in the steps array has this structure:

- name: "descriptive_step_name"
  action:
    type: "ACTION_TYPE"
    params:
      parameter1: value1
      parameter2: value2

Step Fields

Field	Type	Required	Description
`name`	string	Yes	Descriptive name for the step
`action.type`	string	Yes	One of the 37+ registered action types
`action.params`	object	Yes	Parameters specific to the action type (validated by Pydantic)

Common Action Types

Data Loading Actions

LOAD_DATASET_IDENTIFIERS

Loads identifiers from CSV/TSV files with flexible column mapping.

Parameters:

Parameter	Type	Required	Default	Description
`file_path`	string	Yes	-	Path to the data file (absolute paths recommended; relative paths resolve against the working directory)
`identifier_column`	string	Yes	-	Column name containing identifiers
`output_key`	string	Yes	-	Key to store results in context
`dataset_name`	string	No	-	Human-readable name for logging
`strip_prefix`	string	No	-	Prefix to remove from identifiers
`filter_column`	string	No	-	Column to apply filtering on
`filter_values`	array	No	-	Values/patterns to filter by
`filter_mode`	string	No	"include"	Either "include" or "exclude"
`drop_empty_ids`	boolean	No	true	Drop rows with empty identifiers

Example:

- name: load_ukbb_proteins
  action:
    type: LOAD_DATASET_IDENTIFIERS
    params:
      file_path: "/data/ukbb_proteins.tsv"
      identifier_column: "UniProt"
      output_key: "ukbb_proteins"
      dataset_name: "UK Biobank Proteins"
      drop_empty_ids: true
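
To make the optional parameters concrete, here is a minimal sketch of how they could interact. It is not the BioMapper implementation: the function name `load_identifiers` is hypothetical, the filter is assumed to be an exact-value match (the action also accepts patterns, which this sketch ignores), and the order in which the options are applied is an assumption.

```python
import csv
import io


def load_identifiers(handle, identifier_column, delimiter="\t",
                     strip_prefix=None, filter_column=None,
                     filter_values=None, filter_mode="include",
                     drop_empty_ids=True):
    """Return rows of a delimited file with cleaned identifiers.

    Illustrative only: exact-match filtering is an assumption.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        identifier = (row.get(identifier_column) or "").strip()
        # strip_prefix: remove a leading prefix such as "UniProtKB:"
        if strip_prefix and identifier.startswith(strip_prefix):
            identifier = identifier[len(strip_prefix):]
        # drop_empty_ids: skip rows without an identifier
        if drop_empty_ids and not identifier:
            continue
        # filter_mode: keep matching rows ("include") or drop them ("exclude")
        if filter_column is not None and filter_values is not None:
            matched = row.get(filter_column) in filter_values
            if matched != (filter_mode == "include"):
                continue
        rows.append({**row, identifier_column: identifier})
    return rows


# Hypothetical sample data standing in for a TSV file on disk.
sample = io.StringIO(
    "UniProt\tpanel\n"
    "UniProtKB:P12345\tcardio\n"
    "\tcardio\n"
    "UniProtKB:Q67890\tneuro\n"
)
rows = load_identifiers(sample, "UniProt", strip_prefix="UniProtKB:",
                        filter_column="panel", filter_values=["cardio"])
print([r["UniProt"] for r in rows])
```

With the sample above, the empty row is dropped and the "neuro" row is filtered out, so only `P12345` remains.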

Protein Actions

PROTEIN_NORMALIZE_ACCESSIONS

Normalizes and validates UniProt accessions.

Parameters:

Parameter	Type	Required	Default	Description
`input_key`	string	Yes	-	Context key of input dataset
`output_key`	string	Yes	-	Key to store normalized results
`remove_isoforms`	boolean	No	true	Remove isoform suffixes (-1, -2, etc.)
`validate_format`	boolean	No	true	Validate UniProt accession format
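
The two flags can be pictured with the following sketch. It is an illustration under stated assumptions, not the action's code: `normalize_accession` is a hypothetical helper, uppercasing the input and returning `None` for invalid accessions are assumptions, and the regular expression is the accession pattern published in the UniProt help pages.

```python
import re

# Accession pattern as published in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Isoform suffix such as "-1" or "-2" at the end of an accession.
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def normalize_accession(raw, remove_isoforms=True, validate_format=True):
    """Return a cleaned accession, or None if it fails validation."""
    accession = raw.strip().upper()  # uppercasing is an assumption
    if remove_isoforms:
        accession = ISOFORM_SUFFIX.sub("", accession)
    if validate_format:
        # Validate the base accession even when the isoform suffix is kept.
        base = ISOFORM_SUFFIX.sub("", accession)
        if not UNIPROT_ACCESSION.fullmatch(base):
            return None
    return accession


print(normalize_accession(" p12345-2 "))
print(normalize_accession("P12345-2", remove_isoforms=False))
print(normalize_accession("not_an_id"))
```

The three calls print `P12345`, `P12345-2`, and `None` respectively.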

Data Processing Actions

MERGE_DATASETS

Merges two datasets on specified columns.

Parameters:

Parameter	Type	Required	Default	Description
`dataset1_key`	string	Yes	-	Context key of first dataset
`dataset2_key`	string	Yes	-	Context key of second dataset
`merge_column1`	string	Yes	-	Column name in first dataset
`merge_column2`	string	Yes	-	Column name in second dataset
`output_key`	string	Yes	-	Key to store merged results

Example:

- name: merge_datasets
  action:
    type: MERGE_DATASETS
    params:
      dataset1_key: "ukbb_proteins"
      dataset2_key: "hpa_proteins"
      merge_column1: "UniProt"
      merge_column2: "uniprot"
      output_key: "merged_dataset"

Analysis Actions

CALCULATE_SET_OVERLAP

Calculates overlap statistics between two datasets and generates Venn diagrams.

Parameters:

Parameter	Type	Required	Default	Description
`merged_dataset_key`	string	Yes	-	Context key of merged dataset
`source_name`	string	Yes	-	Display name for source dataset
`target_name`	string	Yes	-	Display name for target dataset
`output_key`	string	Yes	-	Key to store overlap results
`mapping_combo_id`	string	No	-	Unique identifier for this mapping
`confidence_threshold`	number	No	0.0	Minimum confidence for high-quality matches
`output_directory`	string	No	"data/results"	Directory for output files

Example:

- name: calculate_overlap
  action:
    type: CALCULATE_SET_OVERLAP
    params:
      merged_dataset_key: "merged_dataset"
      source_name: "UKBB"
      target_name: "HPA"
      output_key: "overlap_statistics"
      mapping_combo_id: "UKBB_HPA_ANALYSIS"
      confidence_threshold: 0.7
      output_directory: "data/results/UKBB_HPA"
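
As a rough picture of what overlap statistics between two identifier sets look like, consider the sketch below. The function name and the specific statistics (shared count, source-only and target-only counts, Jaccard index) are assumptions chosen for illustration; the fields that CALCULATE_SET_OVERLAP actually reports may differ, and the Venn diagram output is not covered here.

```python
def overlap_statistics(source_ids, target_ids):
    """Compute simple set-overlap counts for two identifier collections."""
    source, target = set(source_ids), set(target_ids)
    shared = source & target
    union = source | target
    return {
        "source_only": len(source - target),
        "target_only": len(target - source),
        "shared": len(shared),
        # Jaccard index: shared identifiers over all distinct identifiers.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical identifiers standing in for the UKBB and HPA datasets.
stats = overlap_statistics(["P1", "P2", "P3"], ["P2", "P3", "P4", "P5"])
print(stats)
```

For this input, two identifiers are shared, one is unique to the source, two are unique to the target, and the Jaccard index is 0.4.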

Complete Example

Here’s a complete strategy that loads two protein datasets, normalizes them, merges them, calculates their overlap, and exports the results:

name: "UKBB_HPA_PROTEIN_COMPARISON"
description: "Compare protein coverage between UK Biobank and Human Protein Atlas"

metadata:
  id: "ukbb_hpa_protein_comparison_v1"
  entity_type: "proteins"
  quality_tier: "production"
  version: "1.0.0"
  author: "researcher@institution.edu"
  tags: ["ukbb", "hpa", "proteins", "overlap"]

parameters:
  ukbb_file: "${UKBB_FILE:-/data/ukbb_proteins.tsv}"
  hpa_file: "${HPA_FILE:-/data/hpa_proteins.csv}"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"

steps:
  # Step 1: Load UK Biobank protein data
  - name: load_ukbb_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.ukbb_file}"
        identifier_column: "UniProt"
        output_key: "ukbb_proteins_raw"
        dataset_name: "UK Biobank Proteins"

  # Step 2: Normalize UK Biobank proteins
  - name: normalize_ukbb
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "ukbb_proteins_raw"
        output_key: "ukbb_proteins"
        remove_isoforms: true
        validate_format: true

  # Step 3: Load Human Protein Atlas data
  - name: load_hpa_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.hpa_file}"
        identifier_column: "uniprot"
        output_key: "hpa_proteins_raw"
        dataset_name: "Human Protein Atlas"

  # Step 4: Normalize HPA proteins
  - name: normalize_hpa
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "hpa_proteins_raw"
        output_key: "hpa_proteins"
        remove_isoforms: true
        validate_format: true

  # Step 5: Merge datasets
  - name: merge_protein_data
    action:
      type: MERGE_DATASETS
      params:
        dataset1_key: "ukbb_proteins"
        dataset2_key: "hpa_proteins"
        merge_column1: "identifier"
        merge_column2: "identifier"
        output_key: "merged_proteins"

  # Step 6: Calculate overlap statistics
  - name: analyze_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        merged_dataset_key: "merged_proteins"
        source_name: "UKBB"
        target_name: "HPA"
        output_key: "overlap_analysis"
        mapping_combo_id: "UKBB_HPA_COMPARISON"
        confidence_threshold: 0.7
        output_directory: "${parameters.output_dir}/UKBB_HPA"

  # Step 7: Export results
  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: "overlap_analysis"
        output_file: "${parameters.output_dir}/overlap_results.csv"
        format: "csv"

Data Flow Between Steps

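Steps communicate through a shared context dictionary. Each step stores its result under its output_key, and later steps read that result by naming the same key in their input parameters (for example input_key, dataset1_key, or merged_dataset_key):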

Step 1: LOAD_DATASET_IDENTIFIERS → context["datasets"]["ukbb_proteins_raw"]
Step 2: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["ukbb_proteins"]
Step 3: LOAD_DATASET_IDENTIFIERS → context["datasets"]["hpa_proteins_raw"]
Step 4: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["hpa_proteins"]
Step 5: MERGE_DATASETS → context["datasets"]["merged_proteins"]
Step 6: CALCULATE_SET_OVERLAP → context["datasets"]["overlap_analysis"]
Step 7: EXPORT_DATASET → context["output_files"].append("overlap_results.csv")
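
The key-passing pattern above can be sketched in a few lines. The `load` and `normalize` functions are hypothetical stand-ins, not BioMapper actions; only the `context["datasets"]` and `context["output_files"]` layout is taken from the listing above.

```python
def load(context, output_key, values):
    """Toy loader: store a list of identifiers under output_key."""
    context["datasets"][output_key] = list(values)


def normalize(context, input_key, output_key):
    """Toy normalizer: read input_key, write isoform-free IDs to output_key."""
    context["datasets"][output_key] = [
        value.split("-")[0] for value in context["datasets"][input_key]
    ]


context = {"datasets": {}, "output_files": []}
load(context, "ukbb_proteins_raw", ["P12345-1", "Q67890"])
normalize(context, "ukbb_proteins_raw", "ukbb_proteins")
print(sorted(context["datasets"]))
print(context["datasets"]["ukbb_proteins"])
```

Both the raw and the normalized dataset remain in the context, so a later step can refer to either key.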

Variable Substitution

The strategy system supports multiple variable substitution patterns:

${parameters.key}: Access strategy parameters
${env.VAR_NAME}: Access environment variables explicitly
${VAR_NAME}: Shorthand for environment variables
${metadata.field}: Access metadata fields
${VAR:-default}: Provide default value if variable not set
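
One way to read these patterns is as a single regular-expression pass over each string value. The sketch below is an assumption-laden illustration, not the substitution logic in BioMapper: the `substitute` function is hypothetical, raising `KeyError` for unresolved variables is an assumption, and defaults are applied only when a variable is unset (not when it is empty).

```python
import os
import re

# Matches ${name} and ${name:-default}.
_PATTERN = re.compile(r"\$\{([^}:]+)(?::-([^}]*))?\}")


def substitute(text, parameters=None, metadata=None, env=None):
    """Resolve ${...} placeholders in text (illustrative sketch)."""
    parameters = parameters or {}
    metadata = metadata or {}
    env = os.environ if env is None else env

    def lookup(name):
        scope, _, key = name.partition(".")
        if key and scope == "parameters":
            return parameters.get(key)
        if key and scope == "metadata":
            return metadata.get(key)
        if key and scope == "env":
            return env.get(key)
        return env.get(name)  # bare ${VAR_NAME} shorthand

    def replace(match):
        value = lookup(match.group(1))
        if value is None:
            default = match.group(2)
            if default is None:
                raise KeyError(f"unresolved variable: {match.group(1)}")
            return default
        return str(value)

    return _PATTERN.sub(replace, text)


print(substitute("${parameters.output_dir}/UKBB_HPA",
                 parameters={"output_dir": "/tmp/results"}, env={}))
print(substitute("${UKBB_FILE:-/data/ukbb_proteins.tsv}", env={}))
```

The first call prints `/tmp/results/UKBB_HPA`; the second falls back to the default `/data/ukbb_proteins.tsv` because `UKBB_FILE` is not set in the supplied environment.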

File Path Considerations

Absolute paths recommended: Use full paths like /data/proteins.csv
Relative paths supported: Relative to the working directory where the strategy is executed
Variable substitution: Use ${parameters.file_path} for configurable paths
Output directories: Created automatically if they don’t exist
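
The path rules above amount to the following behavior, shown here as a sketch using only the standard library. The `prepare_paths` helper and its `base_dir` argument are hypothetical and exist only to make the example testable; the strategy system itself resolves relative paths against the working directory.

```python
import tempfile
from pathlib import Path


def prepare_paths(input_file, output_directory, base_dir=None):
    """Resolve an input file and ensure the output directory exists."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()

    input_path = Path(input_file)
    if not input_path.is_absolute():
        input_path = base / input_path  # relative to the working directory
    if not input_path.is_file():
        raise FileNotFoundError(f"input file not found: {input_path}")

    output_path = Path(output_directory)
    if not output_path.is_absolute():
        output_path = base / output_path
    # Create the output directory (and parents) if it does not exist yet.
    output_path.mkdir(parents=True, exist_ok=True)
    return input_path, output_path


with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "proteins.csv").write_text("uniprot\nP12345\n")
    input_path, output_path = prepare_paths(
        "proteins.csv", "results/UKBB_HPA", base_dir=workdir
    )
    input_is_absolute = input_path.is_absolute()
    output_created = output_path.is_dir()

print(input_is_absolute, output_created)
```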

Validation

The YAML strategy is validated at multiple levels:

Schema validation: Ensures all required fields are present
Parameter validation: Uses Pydantic models for type checking and constraints
Action validation: Verifies action type exists in ACTION_REGISTRY
Reference validation: Checks that referenced context keys exist during execution
File path validation: Verifies input files exist at execution time
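
The first and third levels (schema and action checks) can be approximated without any dependencies, as in the sketch below. This is not the validator BioMapper uses: `validate_strategy` is a hypothetical function, `KNOWN_ACTIONS` is a small stand-in for ACTION_REGISTRY, and the Pydantic parameter models are not reproduced.

```python
# Hypothetical stand-in for ACTION_REGISTRY, limited to two action names.
KNOWN_ACTIONS = {"LOAD_DATASET_IDENTIFIERS", "MERGE_DATASETS"}


def validate_strategy(strategy, known_actions=KNOWN_ACTIONS):
    """Return a list of structural problems found in a parsed strategy."""
    errors = []
    for field in ("name", "steps"):
        if field not in strategy:
            errors.append(f"missing required field: {field}")
    for index, step in enumerate(strategy.get("steps") or []):
        label = step.get("name", f"steps[{index}]")
        if "name" not in step:
            errors.append(f"steps[{index}]: missing name")
        action = step.get("action") or {}
        action_type = action.get("type")
        if action_type is None:
            errors.append(f"{label}: missing action.type")
        elif action_type not in known_actions:
            errors.append(f"{label}: unknown action type {action_type}")
        if not isinstance(action.get("params"), dict):
            errors.append(f"{label}: action.params must be a mapping")
    return errors


strategy = {
    "name": "EXAMPLE",
    "steps": [
        {"name": "load", "action": {"type": "LOAD_DATASET_IDENTIFIERS",
                                    "params": {"file_path": "/data/a.tsv"}}},
        {"name": "typo", "action": {"type": "MERGE_DATASET", "params": {}}},
    ],
}
print(validate_strategy(strategy))
```

The misspelled action type in the second step is reported, while the first step passes.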

Error Handling

When a step fails:

Execution stops immediately
Error details are logged
Previous steps’ results are preserved in context
API returns error information with context state
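
The stop-on-first-failure behavior can be sketched as a simple loop. Everything named here is hypothetical: `run_steps`, the toy actions, and the shape of the returned dictionary are assumptions for illustration and do not describe the actual API response format.

```python
def run_steps(steps, registry):
    """Run steps in order; stop at the first failure and keep the context."""
    context = {"datasets": {}, "output_files": []}
    for step in steps:
        try:
            action = registry[step["action"]["type"]]
            action(context, **step["action"]["params"])
        except Exception as exc:
            # Results of earlier steps stay in the context.
            return {"status": "failed", "failed_step": step["name"],
                    "error": f"{type(exc).__name__}: {exc}",
                    "context": context}
    return {"status": "completed", "failed_step": None,
            "error": None, "context": context}


# Hypothetical toy actions, used only to exercise the loop.
def toy_load(context, output_key, values):
    context["datasets"][output_key] = list(values)


def toy_require(context, input_key):
    if input_key not in context["datasets"]:
        raise KeyError(f"missing context key: {input_key}")


registry = {"TOY_LOAD": toy_load, "TOY_REQUIRE": toy_require}
steps = [
    {"name": "load_a", "action": {"type": "TOY_LOAD",
                                  "params": {"output_key": "a", "values": ["P12345"]}}},
    {"name": "check_b", "action": {"type": "TOY_REQUIRE",
                                   "params": {"input_key": "b"}}},
    {"name": "load_c", "action": {"type": "TOY_LOAD",
                                  "params": {"output_key": "c", "values": []}}},
]
result = run_steps(steps, registry)
print(result["status"], result["failed_step"], sorted(result["context"]["datasets"]))
```

The second step fails, the third never runs, and the dataset written by the first step is still present in the returned context.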

Best Practices

Naming Conventions

Strategy names: UPPERCASE_WITH_UNDERSCORES
Step names: lowercase_with_underscores, descriptive
Output keys: descriptive, reflect data content
Dataset names: Human-readable for logging

Strategy Design

Sequential steps: Each step builds on previous results
Descriptive names: Make the workflow self-documenting
Logical grouping: Group related operations
Error consideration: Plan for missing files or empty datasets

File Organization

configs/
├── simple_strategies/
│   ├── load_single_dataset.yaml
│   └── basic_comparison.yaml
├── protein_strategies/
│   ├── ukbb_hpa_comparison.yaml
│   └── multi_source_analysis.yaml
└── production_strategies/
    └── comprehensive_protein_mapping.yaml

Performance Considerations

File sizes: Large files (>1M rows) may require increased timeouts
API calls: UniProt resolution adds significant time for unmatched IDs
Memory usage: Large datasets are processed in memory
Output files: Venn diagrams and CSV files are generated for each analysis

Integration with API

Strategies are executed via the REST API or Python client:

Using Python Client (Synchronous)

from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

# Execute with custom parameters
result = client.run(
    strategy_name="UKBB_HPA_PROTEIN_COMPARISON",
    parameters={
        "ukbb_file": "/custom/path/ukbb.tsv",
        "hpa_file": "/custom/path/hpa.csv",
        "output_dir": "/custom/output"
    }
)

print(f"Job ID: {result['job_id']}")
print(f"Status: {result['status']}")
print(f"Results: {result['results']}")

Using REST API Directly

curl -X POST "http://localhost:8000/api/strategies/v2/" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy_name": "UKBB_HPA_PROTEIN_COMPARISON",
    "parameters": {
      "ukbb_file": "/data/ukbb.tsv",
      "hpa_file": "/data/hpa.csv"
    }
  }'
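
For environments without curl, the same request can be built with Python's standard library. This sketch only constructs the request; `build_strategy_request` is a hypothetical helper, the URL and payload are copied from the curl example above, and the response format is not assumed.

```python
import json
import urllib.request


def build_strategy_request(strategy_name, parameters,
                           base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl example."""
    payload = json.dumps(
        {"strategy_name": strategy_name, "parameters": parameters}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/strategies/v2/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_strategy_request(
    "UKBB_HPA_PROTEIN_COMPARISON",
    {"ukbb_file": "/data/ukbb.tsv", "hpa_file": "/data/hpa.csv"},
)
print(request.get_method(), request.full_url)
# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         body = response.read()
```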

Available Actions Reference

BioMapper provides 37+ self-registering actions organized by entity type:

Protein Actions

PROTEIN_NORMALIZE_ACCESSIONS - Standardize UniProt identifiers
PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract UniProt IDs from compound fields

Metabolite Actions

NIGHTINGALE_NMR_MATCH - Nightingale platform matching
SEMANTIC_METABOLITE_MATCH - AI-powered matching

Chemistry Actions

CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy clinical test matching

Data Processing Actions

LOAD_DATASET_IDENTIFIERS - Load identifiers from files
MERGE_DATASETS - Merge datasets on common columns
EXPORT_DATASET - Export results to files
FILTER_DATASET - Apply filtering criteria
CUSTOM_TRANSFORM - Apply custom transformations
PARSE_COMPOSITE_IDENTIFIERS - Parse compound identifier fields

Reporting Actions

GENERATE_MAPPING_VISUALIZATIONS - Create mapping visualizations
GENERATE_LLM_ANALYSIS - Generate AI-powered analysis reports

I/O Actions

SYNC_TO_GOOGLE_DRIVE_V2 - Sync results to Google Drive

Verification Sources

Last verified: 2025-01-18

This documentation was verified against the following project resources:

/home/ubuntu/biomapper/src/configs/strategies/ (YAML strategy organization by entity type)
/home/ubuntu/biomapper/src/core/minimal_strategy_service.py (Parameter substitution logic and context management)
/home/ubuntu/biomapper/src/actions/load_dataset_identifiers.py (LOAD_DATASET_IDENTIFIERS action parameters)
/home/ubuntu/biomapper/src/actions/merge_datasets.py (MERGE_DATASETS action parameters)
/home/ubuntu/biomapper/src/actions/export_dataset.py (EXPORT_DATASET action parameters)
/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/normalize_accessions.py (PROTEIN_NORMALIZE_ACCESSIONS action)
/home/ubuntu/biomapper/src/client/client_v2.py (BiomapperClient.run() method and parameter passing)
/home/ubuntu/biomapper/src/actions/ (Action registry and available actions)
