# YAML Strategy Schema Documentation

## Overview

This document provides the complete schema reference for defining mapping strategies in YAML configuration files. The YAML strategy system allows users to create flexible, multi-step mapping workflows using the 37+ self-registering actions available in BioMapper.

## Schema Structure

### Top-Level Configuration

```yaml
name: "STRATEGY_NAME"
description: "Brief description of what this strategy does"

metadata:
  id: "unique_strategy_identifier"
  entity_type: "proteins|metabolites|chemistry"
  quality_tier: "experimental|production|test"
  version: "1.0.0"
  author: "author@institution.edu"
  tags: ["tag1", "tag2"]

parameters:
  param_name: "${ENV_VAR:-default_value}"
  # User-configurable parameters

steps:
  - name: "step_name"
    action:
      type: "ACTION_TYPE"
      params:
        # Parameters specific to the action type
```

### Required Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | Yes | Strategy identifier (uppercase with underscores) |
| `description` | string | No | Human-readable strategy description |
| `metadata` | object | No | Strategy metadata including version, author, tags |
| `parameters` | object | No | User-configurable parameters with variable substitution |
| `steps` | array | Yes | List of steps to execute sequentially |

### Step Structure

Each step in the `steps` array has this structure:

```yaml
- name: "descriptive_step_name"
  action:
    type: "ACTION_TYPE"
    params:
      parameter1: value1
      parameter2: value2
```

#### Step Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | Yes | Descriptive name for the step |
| `action.type` | string | Yes | One of the 37+ registered action types |
| `action.params` | object | Yes | Parameters specific to the action type (validated by Pydantic) |

## Common Action Types

### Data Loading Actions

#### LOAD_DATASET_IDENTIFIERS

Loads identifiers from CSV/TSV files with flexible column mapping.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `file_path` | string | Yes | - | Absolute path to the data file |
| `identifier_column` | string | Yes | - | Column name containing identifiers |
| `output_key` | string | Yes | - | Key to store results in context |
| `dataset_name` | string | No | - | Human-readable name for logging |
| `strip_prefix` | string | No | - | Prefix to remove from identifiers |
| `filter_column` | string | No | - | Column to apply filtering on |
| `filter_values` | array | No | - | Values/patterns to filter by |
| `filter_mode` | string | No | "include" | "include" or "exclude" |
| `drop_empty_ids` | boolean | No | true | Drop rows with empty identifiers |

**Example:**

```yaml
- name: load_ukbb_proteins
  action:
    type: LOAD_DATASET_IDENTIFIERS
    params:
      file_path: "/data/ukbb_proteins.tsv"
      identifier_column: "UniProt"
      output_key: "ukbb_proteins"
      dataset_name: "UK Biobank Proteins"
      drop_empty_ids: true
```

### Protein Actions

#### PROTEIN_NORMALIZE_ACCESSIONS

Normalizes and validates UniProt accessions.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `input_key` | string | Yes | - | Context key of input dataset |
| `output_key` | string | Yes | - | Key to store normalized results |
| `remove_isoforms` | boolean | No | true | Remove isoform suffixes (-1, -2, etc.) |
| `validate_format` | boolean | No | true | Validate UniProt accession format |

#### MERGE_DATASETS

Merges two datasets on specified columns.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset1_key` | string | Yes | - | Context key of first dataset |
| `dataset2_key` | string | Yes | - | Context key of second dataset |
| `merge_column1` | string | Yes | - | Column name in first dataset |
| `merge_column2` | string | Yes | - | Column name in second dataset |
| `output_key` | string | Yes | - | Key to store merged results |

**Example:**

```yaml
- name: merge_datasets
  action:
    type: MERGE_DATASETS
    params:
      dataset1_key: "ukbb_proteins"
      dataset2_key: "hpa_proteins"
      merge_column1: "UniProt"
      merge_column2: "uniprot"
      output_key: "merged_dataset"
```

### Analysis Actions

#### CALCULATE_SET_OVERLAP

Calculates overlap statistics between two datasets and generates Venn diagrams.

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `merged_dataset_key` | string | Yes | - | Context key of merged dataset |
| `source_name` | string | Yes | - | Display name for source dataset |
| `target_name` | string | Yes | - | Display name for target dataset |
| `output_key` | string | Yes | - | Key to store overlap results |
| `mapping_combo_id` | string | No | - | Unique identifier for this mapping |
| `confidence_threshold` | number | No | 0.0 | Minimum confidence for high-quality matches |
| `output_directory` | string | No | "data/results" | Directory for output files |

**Example:**

```yaml
- name: calculate_overlap
  action:
    type: CALCULATE_SET_OVERLAP
    params:
      merged_dataset_key: "merged_dataset"
      source_name: "UKBB"
      target_name: "HPA"
      output_key: "overlap_statistics"
      mapping_combo_id: "UKBB_HPA_ANALYSIS"
      confidence_threshold: 0.7
      output_directory: "data/results/UKBB_HPA"
```

## Complete Example

Here's a complete strategy that loads two protein datasets, normalizes them, merges them, and calculates overlap:

```yaml
name: "UKBB_HPA_PROTEIN_COMPARISON"
description: "Compare protein coverage between UK Biobank and Human Protein Atlas"

metadata:
  id: "ukbb_hpa_protein_comparison_v1"
  entity_type: "proteins"
  quality_tier: "production"
  version: "1.0.0"
  author: "researcher@institution.edu"
  tags: ["ukbb", "hpa", "proteins", "overlap"]

parameters:
  ukbb_file: "${UKBB_FILE:-/data/ukbb_proteins.tsv}"
  hpa_file: "${HPA_FILE:-/data/hpa_proteins.csv}"
  output_dir: "${OUTPUT_DIR:-/tmp/results}"

steps:
  # Step 1: Load UK Biobank protein data
  - name: load_ukbb_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.ukbb_file}"
        identifier_column: "UniProt"
        output_key: "ukbb_proteins_raw"
        dataset_name: "UK Biobank Proteins"

  # Step 2: Normalize UK Biobank proteins
  - name: normalize_ukbb
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "ukbb_proteins_raw"
        output_key: "ukbb_proteins"
        remove_isoforms: true
        validate_format: true

  # Step 3: Load Human Protein Atlas data  
  - name: load_hpa_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.hpa_file}"
        identifier_column: "uniprot"
        output_key: "hpa_proteins_raw" 
        dataset_name: "Human Protein Atlas"

  # Step 4: Normalize HPA proteins
  - name: normalize_hpa
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "hpa_proteins_raw"
        output_key: "hpa_proteins"
        remove_isoforms: true
        validate_format: true

  # Step 5: Merge datasets
  - name: merge_protein_data
    action:
      type: MERGE_DATASETS
      params:
        dataset1_key: "ukbb_proteins"
        dataset2_key: "hpa_proteins"
        merge_column1: "identifier"
        merge_column2: "identifier"
        output_key: "merged_proteins"

  # Step 6: Calculate overlap statistics
  - name: analyze_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        merged_dataset_key: "merged_proteins"
        source_name: "UKBB"
        target_name: "HPA" 
        output_key: "overlap_analysis"
        mapping_combo_id: "UKBB_HPA_COMPARISON"
        confidence_threshold: 0.7
        output_directory: "${parameters.output_dir}/UKBB_HPA"

  # Step 7: Export results
  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: "overlap_analysis"
        output_file: "${parameters.output_dir}/overlap_results.csv"
        format: "csv"
```

## Data Flow Between Steps

The context dictionary passes data between steps using the `output_key` from one step as input keys for subsequent steps:

```
Step 1: LOAD_DATASET_IDENTIFIERS → context["datasets"]["ukbb_proteins_raw"]
Step 2: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["ukbb_proteins"]
Step 3: LOAD_DATASET_IDENTIFIERS → context["datasets"]["hpa_proteins_raw"]
Step 4: PROTEIN_NORMALIZE_ACCESSIONS → context["datasets"]["hpa_proteins"]
Step 5: MERGE_DATASETS → context["datasets"]["merged_proteins"]
Step 6: CALCULATE_SET_OVERLAP → context["datasets"]["overlap_analysis"]
Step 7: EXPORT_DATASET → context["output_files"].append("overlap_results.csv")
```

## Variable Substitution

The strategy system supports multiple variable substitution patterns:

- **`${parameters.key}`**: Access strategy parameters
- **`${env.VAR_NAME}`**: Access environment variables explicitly
- **`${VAR_NAME}`**: Shorthand for environment variables
- **`${metadata.field}`**: Access metadata fields
- **`${VAR:-default}`**: Provide default value if variable not set

## File Path Considerations

- **Absolute paths recommended**: Use full paths like `/data/proteins.csv`
- **Relative paths supported**: Relative to the working directory where the strategy is executed
- **Variable substitution**: Use `${parameters.file_path}` for configurable paths
- **Output directories**: Created automatically if they don't exist

## Validation

The YAML strategy is validated at multiple levels:

- **Schema validation**: Ensures all required fields are present
- **Parameter validation**: Uses Pydantic models for type checking and constraints
- **Action validation**: Verifies action type exists in ACTION_REGISTRY
- **Reference validation**: Checks that referenced context keys exist during execution
- **File path validation**: Verifies input files exist at execution time

## Error Handling

When a step fails:
- Execution stops immediately
- Error details are logged
- Previous steps' results are preserved in context
- API returns error information with context state

## Best Practices

### Naming Conventions
- **Strategy names**: UPPERCASE_WITH_UNDERSCORES
- **Step names**: lowercase_with_underscores, descriptive
- **Output keys**: descriptive, reflect data content
- **Dataset names**: Human-readable for logging

### Strategy Design
- **Sequential steps**: Each step builds on previous results
- **Descriptive names**: Make the workflow self-documenting
- **Logical grouping**: Group related operations
- **Error consideration**: Plan for missing files or empty datasets

### File Organization
```
configs/
├── simple_strategies/
│   ├── load_single_dataset.yaml
│   └── basic_comparison.yaml
├── protein_strategies/
│   ├── ukbb_hpa_comparison.yaml
│   └── multi_source_analysis.yaml
└── production_strategies/
    └── comprehensive_protein_mapping.yaml
```

## Performance Considerations

- **File sizes**: Large files (>1M rows) may require increased timeouts
- **API calls**: UniProt resolution adds significant time for unmatched IDs
- **Memory usage**: Large datasets are processed in memory
- **Output files**: Venn diagrams and CSV files are generated for each analysis

## Integration with API

Strategies are executed via the REST API or Python client:

### Using Python Client (Synchronous)

```python
from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

# Execute with custom parameters
result = client.run(
    strategy_name="UKBB_HPA_PROTEIN_COMPARISON",
    parameters={
        "ukbb_file": "/custom/path/ukbb.tsv",
        "hpa_file": "/custom/path/hpa.csv",
        "output_dir": "/custom/output"
    }
)

print(f"Job ID: {result['job_id']}")
print(f"Status: {result['status']}")
print(f"Results: {result['results']}")
```

### Using REST API Directly

```bash
curl -X POST "http://localhost:8000/api/strategies/v2/" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy_name": "UKBB_HPA_PROTEIN_COMPARISON",
    "parameters": {
      "ukbb_file": "/data/ukbb.tsv",
      "hpa_file": "/data/hpa.csv"
    }
  }'
```

## Available Actions Reference

BioMapper provides 37+ self-registering actions organized by entity type:

### Protein Actions
- `PROTEIN_NORMALIZE_ACCESSIONS` - Standardize UniProt identifiers
- `PROTEIN_EXTRACT_UNIPROT_FROM_XREFS` - Extract UniProt IDs from compound fields

### Metabolite Actions
- `NIGHTINGALE_NMR_MATCH` - Nightingale platform matching
- `SEMANTIC_METABOLITE_MATCH` - AI-powered matching

### Chemistry Actions
- `CHEMISTRY_FUZZY_TEST_MATCH` - Fuzzy clinical test matching

### Data Processing Actions
- `LOAD_DATASET_IDENTIFIERS` - Load identifiers from files
- `MERGE_DATASETS` - Merge datasets on common columns
- `EXPORT_DATASET` - Export results to files
- `FILTER_DATASET` - Apply filtering criteria
- `CUSTOM_TRANSFORM` - Apply custom transformations
- `PARSE_COMPOSITE_IDENTIFIERS` - Parse compound identifier fields

### Reporting Actions
- `GENERATE_MAPPING_VISUALIZATIONS` - Create mapping visualizations
- `GENERATE_LLM_ANALYSIS` - Generate AI-powered analysis reports

### I/O Actions
- `SYNC_TO_GOOGLE_DRIVE_V2` - Sync results to Google Drive

---

## Verification Sources
*Last verified: 2025-01-18*

This documentation was verified against the following project resources:

- `/home/ubuntu/biomapper/src/configs/strategies/` (YAML strategy organization by entity type)
- `/home/ubuntu/biomapper/src/core/minimal_strategy_service.py` (Parameter substitution logic and context management)
- `/home/ubuntu/biomapper/src/actions/load_dataset_identifiers.py` (LOAD_DATASET_IDENTIFIERS action parameters)
- `/home/ubuntu/biomapper/src/actions/merge_datasets.py` (MERGE_DATASETS action parameters)
- `/home/ubuntu/biomapper/src/actions/export_dataset.py` (EXPORT_DATASET action parameters)
- `/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/normalize_accessions.py` (PROTEIN_NORMALIZE_ACCESSIONS action)
- `/home/ubuntu/biomapper/src/client/client_v2.py` (BiomapperClient.run() method and parameter passing)
- `/home/ubuntu/biomapper/src/actions/` (Action registry and available actions)

## See Also

- [Action System Architecture](action_system.rst)
- [API Documentation](../api/)
- [Usage Examples](../usage.rst)