YAML Strategy Schema Documentation

Overview

This document provides the complete schema reference for defining mapping strategies in YAML configuration files. The YAML strategy system allows users to create flexible, multi-step mapping workflows using the 37+ self-registering actions available in BioMapper.

Schema Structure

Top-Level Configuration

name: "STRATEGY_NAME"
description: "Brief description of what this strategy does"

metadata:
  id: "unique_strategy_identifier"
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"
  author: "author@institution.edu"
  tags: ["tag1", "tag2"]

parameters:
  param_name: "${ENV_VAR:-default_value}"
  # User-configurable parameters

steps:
  - name: "step_name"
    action:
      type: "ACTION_TYPE"
      params:
        # Parameters specific to the action type

Required Fields

Field        Type    Required  Description
-----------  ------  --------  -------------------------------------------------------
name         string  Yes       Strategy identifier (uppercase with underscores)
description  string  No        Human-readable strategy description
metadata     object  No        Strategy metadata including version, author, tags
parameters   object  No        User-configurable parameters with variable substitution
steps        array   Yes       List of steps to execute sequentially
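
The field rules in the table above can be expressed as a small pre-flight check. The following is a minimal sketch, not BioMapper's own validator: the function name check_top_level and the regular expression used for "uppercase with underscores" are assumptions made for illustration, and the input is assumed to be a strategy file already parsed into a Python dict.

```python
import re
from typing import Any, Dict, List

# Required and optional top-level fields, taken from the table above.
REQUIRED_TOP_LEVEL = {"name": str, "steps": list}
OPTIONAL_TOP_LEVEL = {"description": str, "metadata": dict, "parameters": dict}

# Assumed reading of "uppercase with underscores"; the real rule may differ.
NAME_PATTERN = re.compile(r"^[A-Z][A-Z0-9_]*$")


def check_top_level(strategy: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed strategy mapping."""
    problems = []
    for field, expected in REQUIRED_TOP_LEVEL.items():
        if field not in strategy:
            problems.append(f"missing required field: {field}")
        elif not isinstance(strategy[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    for field, expected in OPTIONAL_TOP_LEVEL.items():
        # Optional fields are only type-checked when present.
        if field in strategy and not isinstance(strategy[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    name = strategy.get("name")
    if isinstance(name, str) and not NAME_PATTERN.match(name):
        problems.append("name should be uppercase with underscores")
    return problems
```

An empty list means the top-level shape looks right; it says nothing about whether the individual steps are valid.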

Step Structure

Each step in the steps array has this structure:

- name: "descriptive_step_name"
  action:
    type: "ACTION_TYPE"
    params:
      parameter1: value1
      parameter2: value2

Step Fields

Field          Type    Required  Description
-------------  ------  --------  -------------------------------------------------------------
name           string  Yes       Descriptive name for the step
action.type    string  Yes       One of the 37+ registered action types
action.params  object  Yes       Parameters specific to the action type (validated by Pydantic)
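
A structural check for a single step might look like the sketch below. KNOWN_ACTIONS is a hypothetical stand-in for the real action registry (only three names from this document are listed), and check_step is an illustrative helper, not part of BioMapper. Per-action parameter validation, which the document says is handled by Pydantic models, is deliberately left out.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the real registry; only the names matter here.
KNOWN_ACTIONS = {
    "LOAD_DATASET_IDENTIFIERS",
    "MERGE_DATASETS",
    "CALCULATE_SET_OVERLAP",
}


def check_step(step: Dict[str, Any]) -> List[str]:
    """Check the three required step fields: name, action.type, action.params."""
    problems = []
    name = step.get("name")
    if not isinstance(name, str) or not name:
        problems.append("step needs a non-empty string name")
    action = step.get("action")
    if not isinstance(action, dict):
        # Without an action mapping there is nothing further to inspect.
        return problems + ["step needs an action mapping"]
    if action.get("type") not in KNOWN_ACTIONS:
        problems.append(f"unknown action type: {action.get('type')!r}")
    if not isinstance(action.get("params"), dict):
        problems.append("action.params must be a mapping")
    return problems
```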

Common Action Types

Data Loading Actions

LOAD_DATASET_IDENTIFIERS

Loads identifiers from CSV/TSV files with flexible column mapping.

Parameters:

Parameter          Type     Required  Default    Description
-----------------  -------  --------  ---------  ----------------------------------
file_path          string   Yes       -          Absolute path to the data file
identifier_column  string   Yes       -          Column name containing identifiers
output_key         string   Yes       -          Key to store results in context
dataset_name       string   No        -          Human-readable name for logging
strip_prefix       string   No        -          Prefix to remove from identifiers
filter_column      string   No        -          Column to apply filtering on
filter_values      array    No        -          Values/patterns to filter by
filter_mode        string   No        "include"  "include" or "exclude"
drop_empty_ids     boolean  No        true       Drop rows with empty identifiers

Example:

- name: load_ukbb_proteins
  action:
    type: LOAD_DATASET_IDENTIFIERS
    params:
      file_path: "/data/ukbb_proteins.tsv"
      identifier_column: "UniProt"
      output_key: "ukbb_proteins"
      dataset_name: "UK Biobank Proteins"
      drop_empty_ids: true
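
To make the optional parameters concrete, here is a rough sketch of how strip_prefix, filter_column, filter_values, filter_mode, and drop_empty_ids could interact. It is an assumption-laden illustration rather than the action's implementation: load_identifiers is a hypothetical helper, filter_values are compared by exact match (the real action may also accept patterns), and the order of operations is a guess.

```python
import csv
import io
from typing import Iterable, List, Optional


def load_identifiers(
    handle: Iterable[str],
    identifier_column: str,
    delimiter: str = "\t",
    strip_prefix: Optional[str] = None,
    filter_column: Optional[str] = None,
    filter_values: Optional[List[str]] = None,
    filter_mode: str = "include",
    drop_empty_ids: bool = True,
) -> List[str]:
    """Read identifiers from delimited text, mimicking the documented options."""
    identifiers = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        if filter_column and filter_values is not None:
            matched = row.get(filter_column) in filter_values  # exact match (assumed)
            # "include" keeps matching rows, "exclude" keeps the others.
            if matched != (filter_mode == "include"):
                continue
        value = (row.get(identifier_column) or "").strip()
        if strip_prefix and value.startswith(strip_prefix):
            value = value[len(strip_prefix):]
        if drop_empty_ids and not value:
            continue
        identifiers.append(value)
    return identifiers


# Usage with an in-memory TSV; the column names and values are made up.
sample = "UniProt\tpanel\nUniProtKB:P12345\tcardio\n\tcardio\nQ9Y6K9\tneuro\n"
print(load_identifiers(io.StringIO(sample), "UniProt", strip_prefix="UniProtKB:"))
```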

Protein Actions

PROTEIN_NORMALIZE_ACCESSIONS

Normalizes and validates UniProt accessions.

Parameters:

Parameter        Type     Required  Default  Description
---------------  -------  --------  -------  --------------------------------------
input_key        string   Yes       -        Context key of input dataset
output_key       string   Yes       -        Key to store normalized results
remove_isoforms  boolean  No        true     Remove isoform suffixes (-1, -2, etc.)
validate_format  boolean  No        true     Validate UniProt accession format
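
The two boolean options can be illustrated with a short sketch. The accession pattern is the one published in the UniProt help pages; everything else is assumed for illustration, including the helper name normalize_accessions, the decision to drop (rather than flag) invalid values, and the uppercasing and de-duplication steps. The real action may behave differently on each of these points.

```python
import re
from typing import Iterable, List

# Accession format as documented by UniProt (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Isoform suffix such as "-1" or "-12" at the end of an accession.
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def normalize_accessions(
    values: Iterable[str],
    remove_isoforms: bool = True,
    validate_format: bool = True,
) -> List[str]:
    """Trim, uppercase, optionally strip isoforms, and optionally validate."""
    normalized = []
    seen = set()
    for raw in values:
        accession = raw.strip().upper()
        if remove_isoforms:
            accession = ISOFORM_SUFFIX.sub("", accession)
        if validate_format:
            # Validate the base accession even when the isoform suffix is kept.
            core = ISOFORM_SUFFIX.sub("", accession)
            if not UNIPROT_ACCESSION.match(core):
                continue  # assumed behavior: silently drop invalid values
        if accession not in seen:
            seen.add(accession)
            normalized.append(accession)
    return normalized
```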

Data Processing Actions

MERGE_DATASETS

Merges two datasets on specified columns.

Parameters:

Parameter      Type    Required  Default  Description
-------------  ------  --------  -------  -----------------------------
dataset1_key   string  Yes       -        Context key of first dataset
dataset2_key   string  Yes       -        Context key of second dataset
merge_column1  string  Yes       -        Column name in first dataset
merge_column2  string  Yes       -        Column name in second dataset
output_key     string  Yes       -        Key to store merged results

Example:

- name: merge_datasets
  action:
    type: MERGE_DATASETS
    params:
      dataset1_key: "ukbb_proteins"
      dataset2_key: "hpa_proteins"
      merge_column1: "UniProt"
      merge_column2: "uniprot"
      output_key: "merged_dataset"
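
Because merge_column1 and merge_column2 can name different columns, it may help to see the idea on plain row dictionaries. The sketch below assumes an outer join (rows from either side are kept even without a match); the document does not state which join type MERGE_DATASETS uses, and merge_records is a hypothetical helper, not the action itself.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_records(left: List[Row], right: List[Row], left_on: str, right_on: str) -> List[Row]:
    """Outer-join two lists of row dicts on differently named key columns."""
    right_by_key: Dict[Any, List[Row]] = {}
    for row in right:
        right_by_key.setdefault(row[right_on], []).append(row)

    merged: List[Row] = []
    matched_keys = set()
    for row in left:
        matches = right_by_key.get(row[left_on])
        if matches:
            matched_keys.add(row[left_on])
            for other in matches:
                # On a column-name collision the right-hand value wins here.
                merged.append({**row, **other})
        else:
            merged.append(dict(row))
    # Keep right-hand rows that never matched anything on the left.
    for key, rows in right_by_key.items():
        if key not in matched_keys:
            merged.extend(dict(r) for r in rows)
    return merged
```

With the column names from the example above, the call would be merge_records(ukbb_rows, hpa_rows, "UniProt", "uniprot"), where ukbb_rows and hpa_rows are placeholder lists of row dicts.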

Analysis Actions

CALCULATE_SET_OVERLAP

Calculates overlap statistics between two datasets and generates Venn diagrams.

Parameters:

Parameter             Type    Required  Default         Description
--------------------  ------  --------  --------------  -------------------------------------------
merged_dataset_key    string  Yes       -               Context key of merged dataset
source_name           string  Yes       -               Display name for source dataset
target_name           string  Yes       -               Display name for target dataset
output_key            string  Yes       -               Key to store overlap results
mapping_combo_id      string  No        -               Unique identifier for this mapping
confidence_threshold  number  No        0.0             Minimum confidence for high-quality matches
output_directory      string  No        "data/results"  Directory for output files

Example:

- name: calculate_overlap
  action:
    type: CALCULATE_SET_OVERLAP
    params:
      merged_dataset_key: "merged_dataset"
      source_name: "UKBB"
      target_name: "HPA"
      output_key: "overlap_statistics"
      mapping_combo_id: "UKBB_HPA_ANALYSIS"
      confidence_threshold: 0.7
      output_directory: "data/results/UKBB_HPA"
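
The arithmetic behind an overlap report is small enough to sketch directly. This is only the underlying set calculation under assumed names (overlap_statistics and its result keys are hypothetical); the actual action works from a merged dataset, applies confidence_threshold, and writes Venn diagrams, none of which is reproduced here.

```python
from typing import Dict, Iterable, Union


def overlap_statistics(
    source_ids: Iterable[str], target_ids: Iterable[str]
) -> Dict[str, Union[int, float]]:
    """Count shared and unique identifiers and compute the Jaccard index."""
    source, target = set(source_ids), set(target_ids)
    shared = source & target
    union = source | target
    return {
        "source_only": len(source - target),
        "target_only": len(target - source),
        "shared": len(shared),
        # Jaccard index: shared / union, defined as 0.0 for two empty sets.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

For example, a source of {P1, P2, P3} and a target of {P2, P3, P4, P5} share two identifiers out of five distinct ones, giving a Jaccard index of 0.4.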

Complete Example

Here’s a complete strategy that loads two protein datasets, normalizes them, merges them, and calculates overlap:

name: "UKBB_HPA_PROTEIN_COMPARISON"
description: "Compare protein coverage between UK Biobank and Human Protein Atlas"

metadata:
  id: "ukbb_hpa_protein_comparison_v1"
  entity_type: "proteins"
  quality_tier: "production"
  version: "1.0.0"
  author: "researcher@institution.edu"
  tags: ["ukbb", "hpa", "proteins", "overlap"]

parameters:
  ukbb_file: "${UKBB_FILE:-/data/ukbb_proteins.tsv}"
  hpa_file: "${HPA_FILE:-/data/hpa_proteins.csv}"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"

steps:
  # Step 1: Load UK Biobank protein data
  - name: load_ukbb_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.ukbb_file}"
        identifier_column: "UniProt"
        output_key: "ukbb_proteins_raw"
        dataset_name: "UK Biobank Proteins"

  # Step 2: Normalize UK Biobank proteins
  - name: normalize_ukbb
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "ukbb_proteins_raw"
        output_key: "ukbb_proteins"
        remove_isoforms: true
        validate_format: true

  # Step 3: Load Human Protein Atlas data
  - name: load_hpa_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.hpa_file}"
        identifier_column: "uniprot"
        output_key: "hpa_proteins_raw"
        dataset_name: "Human Protein Atlas"

  # Step 4: Normalize HPA proteins
  - name: normalize_hpa
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "hpa_proteins_raw"
        output_key: "hpa_proteins"
        remove_isoforms: true
        validate_format: true

  # Step 5: Merge datasets
  - name: merge_protein_data
    action:
      type: MERGE_DATASETS
      params:
        dataset1_key: "ukbb_proteins"
        dataset2_key: "hpa_proteins"
        merge_column1: "identifier"
        merge_column2: "identifier"
        output_key: "merged_proteins"

  # Step 6: Calculate overlap statistics
  - name: analyze_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        merged_dataset_key: "merged_proteins"
        source_name: "UKBB"
        target_name: "HPA"
        output_key: "overlap_analysis"
        mapping_combo_id: "UKBB_HPA_COMPARISON"
        confidence_threshold: 0.7
        output_directory: "${parameters.output_dir}/UKBB_HPA"

  # Step 7: Export results
  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: "overlap_analysis"
        output_file: "${parameters.output_dir}/overlap_results.csv"
        format: "csv"

Data Flow Between Steps

Steps exchange data through a shared context dictionary: the output_key written by one step is the key that later steps read as their input.

Step 1: LOAD_DATASET_IDENTIFIERS → context["datasets"]["ukbb_proteins_raw"]
Step 2: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["ukbb_proteins"]
Step 3: LOAD_DATASET_IDENTIFIERS → context["datasets"]["hpa_proteins_raw"]
Step 4: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["hpa_proteins"]
Step 5: MERGE_DATASETS → context["datasets"]["merged_proteins"]
Step 6: CALCULATE_SET_OVERLAP → context["datasets"]["overlap_analysis"]
Step 7: EXPORT_DATASET → context["output_files"].append("overlap_results.csv")
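
The flow listed above can be mimicked with a toy registry and runner. Everything in this sketch is hypothetical: the register decorator, the LOAD_LIST and UPPERCASE actions, and run_steps exist only to show one step writing context["datasets"][output_key] and a later step reading it by that key. BioMapper's own registry and action signatures may look quite different.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Action = Callable[[Dict[str, Any], Context], None]

# Toy registry mapping action type names to callables (hypothetical).
REGISTRY: Dict[str, Action] = {}


def register(action_type: str) -> Callable[[Action], Action]:
    """Decorator that adds a function to the toy registry."""
    def decorator(func: Action) -> Action:
        REGISTRY[action_type] = func
        return func
    return decorator


@register("LOAD_LIST")  # illustrative action, not a BioMapper action
def load_list(params: Dict[str, Any], context: Context) -> None:
    context["datasets"][params["output_key"]] = list(params["values"])


@register("UPPERCASE")  # illustrative action, not a BioMapper action
def uppercase(params: Dict[str, Any], context: Context) -> None:
    source = context["datasets"][params["input_key"]]
    context["datasets"][params["output_key"]] = [v.upper() for v in source]


def run_steps(steps: List[Dict[str, Any]]) -> Context:
    """Run steps in order, sharing one context dictionary between them."""
    context: Context = {"datasets": {}, "output_files": []}
    for step in steps:
        action = step["action"]
        REGISTRY[action["type"]](action["params"], context)
    return context
```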

Variable Substitution

The strategy system supports multiple variable substitution patterns:

  • ${parameters.key}: Access strategy parameters

  • ${env.VAR_NAME}: Access environment variables explicitly

  • ${VAR_NAME}: Shorthand for environment variables

  • ${metadata.field}: Access metadata fields

  • ${VAR:-default}: Provide default value if variable not set
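
One way to resolve the patterns listed above is a single regular-expression pass, sketched below. This is an illustration and not the substitution logic in minimal_strategy_service.py: the substitute function is hypothetical, unresolved placeholders are left untouched by choice, a default applies only when the variable is missing (not when it is empty), and nested placeholders inside a default are not handled.

```python
import os
import re
from typing import Any, Mapping, Optional

# Matches ${name} and ${name:-default}.
PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::-([^}]*))?\}")


def substitute(
    text: str,
    parameters: Mapping[str, Any],
    metadata: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Replace ${...} placeholders using parameters, metadata, and environment."""
    env = os.environ if env is None else env

    def lookup(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name.startswith("parameters."):
            value = parameters.get(name[len("parameters."):])
        elif name.startswith("metadata."):
            value = metadata.get(name[len("metadata."):])
        elif name.startswith("env."):
            value = env.get(name[len("env."):])
        else:
            value = env.get(name)  # ${VAR_NAME} shorthand
        if value is None:
            # Fall back to the default, or leave the placeholder as written.
            return match.group(0) if default is None else default
        return str(value)

    return PLACEHOLDER.sub(lookup, text)
```

Passing env explicitly, as the optional fourth argument allows, keeps the function easy to test without touching the real process environment.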

File Path Considerations

  • Absolute paths recommended: Use full paths like /data/proteins.csv

  • Relative paths supported: Relative to the working directory where the strategy is executed

  • Variable substitution: Use ${parameters.file_path} for configurable paths

  • Output directories: Created automatically if they don’t exist

Validation

The YAML strategy is validated at multiple levels:

  • Schema validation: Ensures all required fields are present

  • Parameter validation: Uses Pydantic models for type checking and constraints

  • Action validation: Verifies action type exists in ACTION_REGISTRY

  • Reference validation: Checks that referenced context keys exist during execution

  • File path validation: Verifies input files exist at execution time
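
Reference validation happens during execution, but the same idea can be approximated before running anything. The sketch below relies on a naming convention that is purely an assumption: any parameter whose name ends in "_key", other than output_key, is treated as a reference to a key produced by an earlier step. find_dangling_references is a hypothetical helper.

```python
from typing import Any, Dict, List


def find_dangling_references(steps: List[Dict[str, Any]]) -> List[str]:
    """Report *_key parameters that no earlier step's output_key provides."""
    available = set()
    problems = []
    for step in steps:
        params = step["action"]["params"]
        for name, value in params.items():
            # Assumed convention: "*_key" parameters (except output_key) are inputs.
            is_input = name.endswith("_key") and name != "output_key"
            if is_input and value not in available:
                problems.append(
                    f"{step['name']}: {name}={value!r} is not produced by an earlier step"
                )
        if "output_key" in params:
            available.add(params["output_key"])
    return problems
```

A check like this catches typos such as reading "ukbb_proteins_raw" after a step that wrote "ukbb_raw", without waiting for the run to reach the failing step.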

Error Handling

When a step fails:

  • Execution stops immediately

  • Error details are logged

  • Previous steps’ results are preserved in context

  • API returns error information with context state
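
The stop-on-first-failure behavior described above can be sketched as a loop with a single try/except. The names here are hypothetical (run_with_error_capture, the actions mapping, and the keys of the returned dictionary), and the shape of the real API's error response is not documented in this section, so treat the return value as illustrative only.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("strategy_sketch")

Context = Dict[str, Any]
Action = Callable[[Dict[str, Any], Context], None]


def run_with_error_capture(steps: List[Dict[str, Any]], actions: Dict[str, Action]) -> Dict[str, Any]:
    """Run steps in order; on the first failure, stop and report what survived."""
    context: Context = {"datasets": {}}
    for index, step in enumerate(steps):
        action = step["action"]
        try:
            actions[action["type"]](action["params"], context)
        except Exception as exc:
            logger.error("step %s failed: %s", step["name"], exc)
            return {
                "status": "failed",
                "failed_step": step["name"],
                "error": str(exc),
                "completed_steps": index,
                # Results of earlier steps are still available here.
                "context": context,
            }
    return {"status": "completed", "completed_steps": len(steps), "context": context}
```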

Best Practices

Naming Conventions

  • Strategy names: UPPERCASE_WITH_UNDERSCORES

  • Step names: lowercase_with_underscores, descriptive

  • Output keys: descriptive, reflect data content

  • Dataset names: Human-readable for logging

Strategy Design

  • Sequential steps: Each step builds on previous results

  • Descriptive names: Make the workflow self-documenting

  • Logical grouping: Group related operations

  • Error consideration: Plan for missing files or empty datasets

File Organization

configs/
├── simple_strategies/
│   ├── load_single_dataset.yaml
│   └── basic_comparison.yaml
├── protein_strategies/
│   ├── ukbb_hpa_comparison.yaml
│   └── multi_source_analysis.yaml
└── production_strategies/
    └── comprehensive_protein_mapping.yaml

Performance Considerations

  • File sizes: Large files (>1M rows) may require increased timeouts

  • API calls: UniProt resolution adds significant time for unmatched IDs

  • Memory usage: Large datasets are processed in memory

  • Output files: Venn diagrams and CSV files are generated for each analysis

Integration with API

Strategies are executed via the REST API or Python client:

Using Python Client (Synchronous)

from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

# Execute with custom parameters
result = client.run(
    strategy_name="UKBB_HPA_PROTEIN_COMPARISON",
    parameters={
        "ukbb_file": "/custom/path/ukbb.tsv",
        "hpa_file": "/custom/path/hpa.csv",
        "output_dir": "/custom/output"
    }
)

print(f"Job ID: {result['job_id']}")
print(f"Status: {result['status']}")
print(f"Results: {result['results']}")

Using REST API Directly

curl -X POST "http://localhost:8000/api/strategies/v2/" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy_name": "UKBB_HPA_PROTEIN_COMPARISON",
    "parameters": {
      "ukbb_file": "/data/ukbb.tsv",
      "hpa_file": "/data/hpa.csv"
    }
  }'

Available Actions Reference

BioMapper provides 37+ self-registering actions organized by entity type:

Protein Actions

  • PROTEIN_NORMALIZE_ACCESSIONS - Standardize UniProt identifiers

  • PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - Extract UniProt IDs from compound fields

Metabolite Actions

  • NIGHTINGALE_NMR_MATCH - Nightingale platform matching

  • SEMANTIC_METABOLITE_MATCH - AI-powered matching

Chemistry Actions

  • CHEMISTRY_FUZZY_TEST_MATCH - Fuzzy clinical test matching

Data Processing Actions

  • LOAD_DATASET_IDENTIFIERS - Load identifiers from files

  • MERGE_DATASETS - Merge datasets on common columns

  • EXPORT_DATASET - Export results to files

  • FILTER_DATASET - Apply filtering criteria

  • CUSTOM_TRANSFORM - Apply custom transformations

  • PARSE_COMPOSITE_IDENTIFIERS - Parse compound identifier fields

Reporting Actions

  • GENERATE_MAPPING_VISUALIZATIONS - Create mapping visualizations

  • GENERATE_LLM_ANALYSIS - Generate AI-powered analysis reports

I/O Actions

  • SYNC_TO_GOOGLE_DRIVE_V2 - Sync results to Google Drive


Verification Sources

Last verified: 2025-01-18

This documentation was verified against the following project resources:

  • /home/ubuntu/biomapper/src/configs/strategies/ (YAML strategy organization by entity type)

  • /home/ubuntu/biomapper/src/core/minimal_strategy_service.py (Parameter substitution logic and context management)

  • /home/ubuntu/biomapper/src/actions/load_dataset_identifiers.py (LOAD_DATASET_IDENTIFIERS action parameters)

  • /home/ubuntu/biomapper/src/actions/merge_datasets.py (MERGE_DATASETS action parameters)

  • /home/ubuntu/biomapper/src/actions/export_dataset.py (EXPORT_DATASET action parameters)

  • /home/ubuntu/biomapper/src/actions/entities/proteins/annotation/normalize_accessions.py (PROTEIN_NORMALIZE_ACCESSIONS action)

  • /home/ubuntu/biomapper/src/client/client_v2.py (BiomapperClient.run() method and parameter passing)

  • /home/ubuntu/biomapper/src/actions/ (Action registry and available actions)

See Also