Configuration Guide
===================

Biomapper uses YAML strategy files to define mapping workflows. Strategies can include metadata for tracking, runtime parameters with environment variable substitution, and a sequence of self-registering actions. This guide covers strategy configuration, action parameters, and best practices.

Strategy File Structure
-----------------------

Every strategy file follows this structure:

.. code-block:: yaml

    # Optional metadata for tracking and organization
    metadata:
      id: "strategy_unique_id"
      name: "Human Readable Name"
      version: "1.0.0"
      entity_type: "proteins"  # or metabolites, chemistry
      quality_tier: "experimental"  # or production, deprecated
    
    # Optional runtime parameters with defaults
    parameters:
      output_dir: "${OUTPUT_DIR:-/tmp/outputs}"
      threshold: 0.85
      batch_size: 1000
    
    # Required: strategy name, description, and execution steps
    name: "STRATEGY_NAME"
    description: "What this strategy does"
    
    steps:
      - name: step1
        action:
          type: ACTION_TYPE
          params:
            parameter1: "${parameters.threshold}"  # Use parameters
            parameter2: "/data/input.csv"
      
      - name: step2  
        action:
          type: ACTION_TYPE
          params:
            input_key: step1_output  # Reference previous outputs
            output_key: final_result

Required Fields
~~~~~~~~~~~~~~~

**name**
  Unique identifier for the strategy. Use UPPERCASE_WITH_UNDERSCORES.

**description** 
  Human-readable description of what the strategy accomplishes.

**steps**
  List of actions to execute in order.

Each step requires:

**name**
  Step identifier within the strategy.

**action.type**
  One of the 30+ registered action types (see Action Types section).

**action.params**
  Parameters specific to that action type.
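
The required fields above can be checked mechanically once a strategy file has been parsed into a dictionary. The sketch below is illustrative only: ``check_strategy_structure`` is a hypothetical helper, not part of Biomapper, and it assumes the YAML has already been loaded by a parser of your choice:

.. code-block:: python

    def check_strategy_structure(strategy):
        """Return a list of problems; an empty list means the layout looks valid."""
        problems = []
        # Top-level fields this guide lists as required.
        for field in ("name", "description", "steps"):
            if field not in strategy:
                problems.append(f"missing top-level field: {field}")

        steps = strategy.get("steps") or []
        if not isinstance(steps, list):
            return problems + ["steps must be a list"]

        for index, step in enumerate(steps):
            if not isinstance(step, dict):
                problems.append(f"step #{index}: must be a mapping")
                continue
            label = step.get("name", f"#{index}")
            if "name" not in step:
                problems.append(f"step {label}: missing name")
            action = step.get("action")
            if not isinstance(action, dict):
                problems.append(f"step {label}: missing action")
                continue
            # Each action needs both a type and a params mapping.
            for key in ("type", "params"):
                if key not in action:
                    problems.append(f"step {label}: missing action.{key}")
        return problems

The function only inspects the layout; it does not know which parameters a given action type requires.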

Action Types
------------

Biomapper includes 30+ self-registering actions organized by category:

**Data Operations**

* ``LOAD_DATASET_IDENTIFIERS`` - Load identifiers from CSV/TSV files
* ``MERGE_DATASETS`` - Combine multiple datasets
* ``FILTER_DATASET`` - Apply filtering criteria
* ``EXPORT_DATASET`` - Export to various formats
* ``CUSTOM_TRANSFORM`` - Apply Python expressions

**Protein Actions**

* ``MERGE_WITH_UNIPROT_RESOLUTION`` - Historical UniProt ID resolution
* ``PROTEIN_EXTRACT_UNIPROT_FROM_XREFS`` - Extract IDs from compound fields
* ``PROTEIN_NORMALIZE_ACCESSIONS`` - Standardize protein identifiers
* ``PROTEIN_MULTI_BRIDGE`` - Cross-dataset resolution

**Metabolite Actions**

* ``CTS_ENRICHED_MATCH`` - Chemical Translation Service matching
* ``SEMANTIC_METABOLITE_MATCH`` - AI-powered semantic matching
* ``VECTOR_ENHANCED_MATCH`` - Vector similarity matching
* ``NIGHTINGALE_NMR_MATCH`` - Nightingale reference matching
* ``COMBINE_METABOLITE_MATCHES`` - Merge multiple approaches

**Chemistry Actions**

* ``CHEMISTRY_EXTRACT_LOINC`` - Extract LOINC codes
* ``CHEMISTRY_FUZZY_TEST_MATCH`` - Fuzzy test name matching
* ``CHEMISTRY_VENDOR_HARMONIZATION`` - Harmonize vendor data

**Analysis Actions**

* ``CALCULATE_SET_OVERLAP`` - Jaccard similarity analysis
* ``CALCULATE_THREE_WAY_OVERLAP`` - Three-dataset comparison
* ``CALCULATE_MAPPING_QUALITY`` - Quality metrics
* ``GENERATE_METABOLOMICS_REPORT`` - Comprehensive reports

Common Action Parameters
~~~~~~~~~~~~~~~~~~~~~~~~

**LOAD_DATASET_IDENTIFIERS**

Loads identifiers from CSV/TSV files.

Required Parameters:

* ``file_path``: Path to data file (supports environment variables)
* ``identifier_column``: Column name containing identifiers
* ``output_key``: Key to store results in context

Optional Parameters:

* ``dataset_name``: Human-readable name for logging
* ``filter_empty``: Remove empty identifiers (default: true)
* ``additional_columns``: List of extra columns to preserve

.. code-block:: yaml

    - name: load_proteins
      action:
        type: LOAD_DATASET_IDENTIFIERS
        params:
          file_path: "${DATA_DIR:-/data}/proteins.csv"  # Environment variable
          identifier_column: "uniprot_id"
          output_key: "protein_list"
          dataset_name: "My Protein Dataset"
          additional_columns: ["gene_name", "description"]

**MERGE_WITH_UNIPROT_RESOLUTION**

Merges two datasets with historical UniProt identifier resolution.

Required Parameters:

* ``source_dataset_key``: Context key of source dataset
* ``target_dataset_key``: Context key of target dataset
* ``source_id_column``: Column name in source data
* ``target_id_column``: Column name in target data
* ``output_key``: Key to store merged results

.. code-block:: yaml

    - name: merge_data
      action:
        type: MERGE_WITH_UNIPROT_RESOLUTION  
        params:
          source_dataset_key: "dataset_a"
          target_dataset_key: "dataset_b"
          source_id_column: "UniProt"
          target_id_column: "uniprot"
          output_key: "merged_dataset"

**CALCULATE_SET_OVERLAP**

Calculates Jaccard similarity and generates Venn diagrams.

Required Parameters:

* ``dataset_a_key``: Context key of first dataset
* ``dataset_b_key``: Context key of second dataset
* ``output_key``: Key to store overlap results

Optional Parameters:

* ``generate_venn``: Create Venn diagram (default: true)
* ``output_path``: Path for diagram file

.. code-block:: yaml

    - name: find_overlap
      action:
        type: CALCULATE_SET_OVERLAP
        params:
          dataset_a_key: "proteins_a"
          dataset_b_key: "proteins_b" 
          output_key: "overlap_stats"
          generate_venn: true
          output_path: "${parameters.output_dir}/venn_diagram.png"
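
For reference, the Jaccard similarity of two identifier sets is the size of their intersection divided by the size of their union. The sketch below shows that definition in plain Python. It is not Biomapper's implementation, and returning ``0.0`` for two empty sets is an assumption made here for convenience:

.. code-block:: python

    def jaccard_similarity(identifiers_a, identifiers_b):
        """Return |A ∩ B| / |A ∪ B| for two collections of identifiers."""
        set_a, set_b = set(identifiers_a), set(identifiers_b)
        union = set_a | set_b
        if not union:
            # Assumption: two empty datasets are reported as 0.0.
            return 0.0
        return len(set_a & set_b) / len(union)

For example, ``["P1", "P2", "P3"]`` and ``["P2", "P3", "P4"]`` share two of four distinct identifiers, so the similarity is 0.5.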

Example Configurations
----------------------

Basic Protein Mapping
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    name: "BASIC_PROTEIN_MAPPING"
    description: "Load and analyze protein overlap"
    
    steps:
      - name: load_source
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/source_proteins.csv"
            identifier_column: "protein_id"
            output_key: "source_proteins"
      
      - name: load_target
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/target_proteins.csv"  
            identifier_column: "uniprot_ac"
            output_key: "target_proteins"
      
      - name: calculate_overlap
        action:
          type: CALCULATE_SET_OVERLAP
          params:
            dataset_a_key: "source_proteins"
            dataset_b_key: "target_proteins"
            output_key: "analysis_results"

Multi-Dataset Comparison
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    name: "MULTI_DATASET_COMPARISON"
    description: "Compare multiple protein datasets with UniProt resolution"
    
    steps:
      - name: load_arivale
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/arivale/proteomics_metadata.tsv"
            identifier_column: "uniprot"
            output_key: "arivale_proteins"
            dataset_name: "Arivale Proteomics"
      
      - name: load_hpa
        action:
          type: LOAD_DATASET_IDENTIFIERS  
          params:
            file_path: "/data/hpa_osps.csv"
            identifier_column: "uniprot"
            output_key: "hpa_proteins"
            dataset_name: "Human Protein Atlas"
      
      - name: merge_arivale_hpa
        action:
          type: MERGE_WITH_UNIPROT_RESOLUTION
          params:
            source_dataset_key: "arivale_proteins"
            target_dataset_key: "hpa_proteins"
            source_id_column: "uniprot" 
            target_id_column: "uniprot"
            output_key: "arivale_hpa_merged"
      
      - name: analyze_overlap
        action:
          type: CALCULATE_SET_OVERLAP
          params:
            dataset_a_key: "arivale_hpa_merged"
            dataset_b_key: "hpa_proteins"
            output_key: "final_analysis"

Strategy Organization
---------------------

File Naming
~~~~~~~~~~~

Use descriptive names that indicate the datasets and purpose:

* ``ukbb_hpa_mapping.yaml`` - Maps UKBB to HPA
* ``multi_protein_comparison.yaml`` - Compares multiple sources  
* ``arivale_qin_overlap.yaml`` - Analyzes Arivale vs QIN overlap

Directory Structure
~~~~~~~~~~~~~~~~~~~

Organize strategies in the ``configs/strategies/`` directory:

.. code-block:: text

    configs/strategies/
    ├── templates/                 # Reusable templates
    │   ├── protein_mapping_template.yaml
    │   ├── metabolite_mapping_template.yaml
    │   └── chemistry_mapping_template.yaml
    ├── experimental/              # In development
    │   ├── prot_arv_to_kg2c_uniprot_v2.yaml
    │   └── met_multi_to_unified_semantic.yaml
    └── production/               # Validated strategies
        └── (strategies promoted from experimental)

Data Requirements
-----------------

File Formats
~~~~~~~~~~~~

Strategies work with CSV and TSV files. Ensure your data files:

* Have headers in the first row
* Use a consistent delimiter (comma for CSV, tab for TSV)
* Contain the identifier columns referenced in strategies
* Use UTF-8 encoding
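
The requirements above can be spot-checked before a file is referenced in a strategy. The following is a minimal sketch using only the Python standard library. ``check_data_file`` is a hypothetical helper, and choosing the delimiter from the file extension is an assumption of this example:

.. code-block:: python

    import csv
    from pathlib import Path


    def check_data_file(path, identifier_column):
        """Return a list of problems found in a CSV/TSV file (empty if none)."""
        # Assumption: the delimiter follows the file extension.
        delimiter = "\t" if Path(path).suffix.lower() == ".tsv" else ","
        problems = []
        try:
            with open(path, newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle, delimiter=delimiter)
                header = next(reader, None)
                if not header:
                    return ["file is empty or has no header row"]
                # The comparison is exact, including upper and lower case.
                if identifier_column not in header:
                    problems.append(
                        f"column {identifier_column!r} not in header {header}"
                    )
                for row_number, row in enumerate(reader, start=2):
                    if row and len(row) != len(header):
                        problems.append(
                            f"row {row_number}: expected {len(header)} fields, "
                            f"found {len(row)}"
                        )
                        break  # one example is enough to flag the file
        except UnicodeDecodeError:
            problems.append("file is not valid UTF-8")
        return problems

Reading every row makes the check slow on very large files, so it is best run on a sample first.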

File Paths
~~~~~~~~~~

Use **absolute paths** or **environment variables** in strategy files:

.. code-block:: yaml

    # Good - absolute path
    file_path: "/data/proteins/ukbb_data.csv"
    
    # Better - environment variable with default
    file_path: "${DATA_DIR:-/data}/proteins/ukbb_data.csv"
    
    # Best - use parameters section
    parameters:
      data_dir: "${DATA_DIR:-/data}"
    steps:
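      - name: load_ukbb  # illustrative step name
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            # identifier_column and output_key omitted for brevity
            file_path: "${parameters.data_dir}/proteins/ukbb_data.csv"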

Column Names  
~~~~~~~~~~~~

Ensure the ``identifier_column`` exactly matches your CSV headers:

.. code-block:: yaml

    # If your CSV header is "UniProt_ID"
    identifier_column: "UniProt_ID"
    
    # Not "uniprot_id" or "UniProt"

Best Practices
--------------

1. **Use descriptive names** for steps and output keys
2. **Test with small datasets** before running on large files  
3. **Keep strategies focused** on specific comparisons
4. **Document with metadata** including version, quality tier, and expected match rates
5. **Use environment variables** for portable file paths
6. **Follow naming conventions**:

   - Strategy IDs: ``entity_source_to_target_bridge_version``
   - Output keys: ``entity_type_stage`` (e.g., ``proteins_normalized``)

7. **Track data lineage** with ``source_files`` and ``target_files`` metadata
8. **Set quality expectations** with ``expected_match_rate``

Troubleshooting
---------------

Common Configuration Errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**YAML syntax errors**
  Validate the file with a YAML linter or online checker before running the strategy.

**Missing required parameters**  
  Check that all required params are provided for each action.

**File path issues**
  Use absolute paths and verify files exist.

**Column name mismatches**
  Ensure ``identifier_column`` matches CSV headers exactly.

**Key conflicts**
  Use unique ``output_key`` names within each strategy.
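
Key conflicts are easy to miss in long strategies. As a rough sketch, assuming the strategy has already been parsed into a dictionary, the hypothetical helper below lists every ``output_key`` that more than one step writes to:

.. code-block:: python

    from collections import defaultdict


    def duplicate_output_keys(strategy):
        """Map each reused output_key to the names of the steps that write it."""
        writers = defaultdict(list)
        for step in strategy.get("steps", []):
            params = step.get("action", {}).get("params", {})
            key = params.get("output_key")
            if key is not None:
                writers[key].append(step.get("name"))
        # Keep only the keys written by two or more steps.
        return {key: names for key, names in writers.items() if len(names) > 1}

An empty result means every step stores its output under a distinct key.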

Validation
~~~~~~~~~~

Before deploying strategies:

1. Check YAML syntax is valid
2. Verify all file paths exist and are readable
3. Confirm column names match data files  
4. Test with small sample datasets first
5. Review logs for any warnings or errors
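
The file path check in this list can be scripted. The sketch below walks a parsed strategy dictionary and reports every ``file_path`` that is not a readable file. ``unreadable_input_files`` is a hypothetical name, and paths that still contain ``${...}`` placeholders are skipped on the assumption that they are resolved at run time:

.. code-block:: python

    import os


    def unreadable_input_files(strategy):
        """Return (step name, path) pairs whose file_path is not a readable file."""
        failures = []
        for step in strategy.get("steps", []):
            params = step.get("action", {}).get("params", {})
            path = params.get("file_path")
            # Skip steps without a file_path and paths with unresolved placeholders.
            if path is None or "${" in path:
                continue
            if not (os.path.isfile(path) and os.access(path, os.R_OK)):
                failures.append((step.get("name"), path))
        return failures
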

Environment Variables
---------------------

Strategies support variable substitution:

* ``${VAR}`` or ``${env.VAR}`` - Environment variable
* ``${VAR:-default}`` - With default value
* ``${parameters.key}`` - Reference parameters section
* ``${metadata.field}`` - Reference metadata fields

Common environment variables:

* ``DATA_DIR`` - Base data directory
* ``OUTPUT_DIR`` - Output directory
* ``BIOMAPPER_CONFIG`` - Configuration path
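
To make the substitution forms listed in this section concrete, here is a minimal sketch of how such placeholders could be resolved. It is not Biomapper's resolver. The function name, the shell-style rule that ``:-`` also applies to empty values, and raising ``KeyError`` for unresolved names are all assumptions of this example:

.. code-block:: python

    import re

    _PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


    def substitute(text, env, parameters=None, metadata=None):
        """Replace ${...} placeholders in text using env, parameters, and metadata."""
        parameters = parameters or {}
        metadata = metadata or {}

        def resolve(match):
            # Split "NAME:-default" into the name and the optional default.
            name, has_default, default = match.group(1).partition(":-")
            if name.startswith("parameters."):
                value = parameters.get(name[len("parameters."):])
            elif name.startswith("metadata."):
                value = metadata.get(name[len("metadata."):])
            else:
                if name.startswith("env."):
                    name = name[len("env."):]
                value = env.get(name)
            # Assumption: like the shell, ":-" covers both unset and empty values.
            if has_default and value in (None, ""):
                return default
            if value is None:
                raise KeyError(f"unresolved placeholder: {name}")
            return str(value)

        return _PLACEHOLDER.sub(resolve, text)


    env = {"DATA_DIR": "/mnt/data"}
    print(substitute("${DATA_DIR:-/data}/proteins.csv", env))
    print(substitute("${OUTPUT_DIR:-/tmp/outputs}", env))

With ``DATA_DIR`` set and ``OUTPUT_DIR`` unset, the first call uses the environment value and the second falls back to its default. Note that this sketch always returns strings, so a numeric parameter such as ``threshold`` would need converting back to a number.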

Next Steps
----------

* See :doc:`usage` for executing strategies
* Check :doc:`actions/index` for complete action reference
* Review templates in ``configs/strategies/templates/``
* Learn about the :doc:`api/rest_endpoints` for programmatic execution


Verification Sources
--------------------
*Last verified: 2025-08-17*

This documentation was verified against the following project resources:

- ``/biomapper/CLAUDE.md`` (Best practices and conventions)
- ``/biomapper/README.md`` (Configuration overview)
- ``/biomapper/pyproject.toml`` (Project configuration)