Usage Guide
===========

This guide demonstrates how to use Biomapper's YAML strategy system for biological entity mapping through the REST API.

Installation
------------

Install Biomapper using Poetry:

.. code-block:: bash

    # Clone the repository
    git clone https://github.com/arpanauts/biomapper.git
    cd biomapper
    
    # Install dependencies
    poetry install --with dev,docs,api
    
    # Activate the environment
    poetry shell

Quick Start
-----------

Biomapper uses YAML strategies executed through a REST API. Here's the basic workflow:

1. **Start the API Server**

.. code-block:: bash

    cd biomapper-api
    poetry run uvicorn app.main:app --reload --port 8000

2. **Create or Use a Strategy YAML**

Place strategy in ``configs/strategies/`` or create ``my_strategy.yaml``:

.. code-block:: yaml

    # Optional metadata for tracking
    metadata:
      entity_type: "proteins"
      quality_tier: "experimental"
      expected_match_rate: 0.85
    
    # Optional runtime parameters
    parameters:
      data_dir: "${DATA_DIR:-/data}"
      output_dir: "${OUTPUT_DIR:-/tmp/results}"
    
    # Required strategy definition
    name: "BASIC_PROTEIN_MAPPING"
    description: "Map proteins between datasets"
    
    steps:
      - name: load_data
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "${parameters.data_dir}/proteins.csv"
            identifier_column: "uniprot"
            output_key: "proteins"
            additional_columns: ["gene_name", "description"]
      
      - name: normalize
        action:
          type: PROTEIN_NORMALIZE_ACCESSIONS
          params:
            input_key: "proteins"
            output_key: "normalized_proteins"
      
      - name: calculate_quality
        action:
          type: CALCULATE_MAPPING_QUALITY
          params:
            dataset_key: "normalized_proteins"
            output_key: "quality_metrics"

3. **Execute via Python Client**

.. code-block:: python

    from biomapper_client import BiomapperClient
    
    # Simple synchronous usage (recommended)
    client = BiomapperClient("http://localhost:8000")
    
    # Execute strategy by name (if in configs/strategies/)
    result = client.run("BASIC_PROTEIN_MAPPING")
    
    # Or execute with custom YAML file
    result = client.run("/path/to/my_strategy.yaml")
    
    # Check results
    print(f"Status: {result['status']}")
    if result['status'] == 'success':
        stats = result['results'].get('overlap_stats', {})
        print(f"Overlap: {stats.get('jaccard_similarity', 0):.2%}")

4. **Execute via CLI**

.. code-block:: bash

    # Using the biomapper CLI
    poetry run biomapper --help
    poetry run biomapper health
    poetry run biomapper metadata list
    
    # Or use the client directly
    poetry run python -c "from biomapper_client import BiomapperClient; print(BiomapperClient().run('test_metabolite_simple'))"

Core Concepts
-------------

Core Actions
~~~~~~~~~~~~

Biomapper provides 30+ self-registering actions organized by category:

**Data Operations**
  - ``LOAD_DATASET_IDENTIFIERS``: Load identifiers from CSV/TSV files
  - ``MERGE_DATASETS``: Combine multiple datasets
  - ``FILTER_DATASET``: Apply filtering criteria
  - ``EXPORT_DATASET``: Export to various formats
  - ``CUSTOM_TRANSFORM``: Apply Python expressions

**Protein Actions**
  - ``MERGE_WITH_UNIPROT_RESOLUTION``: Historical UniProt ID resolution
  - ``PROTEIN_EXTRACT_UNIPROT_FROM_XREFS``: Extract IDs from compound fields
  - ``PROTEIN_NORMALIZE_ACCESSIONS``: Standardize protein identifiers

**Metabolite Actions**
  - ``CTS_ENRICHED_MATCH``: Chemical Translation Service matching
  - ``SEMANTIC_METABOLITE_MATCH``: AI-powered semantic matching
  - ``NIGHTINGALE_NMR_MATCH``: Nightingale reference matching

**Analysis Actions**
  - ``CALCULATE_SET_OVERLAP``: Jaccard similarity and Venn diagrams
  - ``CALCULATE_THREE_WAY_OVERLAP``: Three-dataset comparison
  - ``GENERATE_METABOLOMICS_REPORT``: Comprehensive reports

Strategy Configuration
~~~~~~~~~~~~~~~~~~~~~~

Strategies are defined in YAML files with these sections:

**Required Fields:**

* ``name``: Strategy identifier (use UPPERCASE_WITH_UNDERSCORES)
* ``description``: Human-readable description  
* ``steps``: Ordered list of actions to execute

**Optional Fields:**

* ``metadata``: Tracking information (version, quality tier, expected match rates)
* ``parameters``: Runtime parameters with environment variable support

Each step contains:

* ``name``: Step identifier
* ``action.type``: One of the registered action types
* ``action.params``: Parameters specific to the action

Data Flow
~~~~~~~~~

1. Data is loaded into a shared context dictionary
2. Each action reads from and writes to this context
3. Actions use ``output_key`` to store results
4. Subsequent actions reference data using these keys
5. Final results include all context data plus execution metadata

Working with Real Data
----------------------

Protein Mapping Example
~~~~~~~~~~~~~~~~~~~~~~~

Here's a complete example mapping UKBB proteins to HPA:

.. code-block:: yaml

    name: "UKBB_HPA_PROTEIN_MAPPING"
    description: "Map UK Biobank proteins to Human Protein Atlas"
    
    steps:
      - name: load_ukbb_data
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/UKBB_Protein_Meta.tsv"
            identifier_column: "UniProt"
            output_key: "ukbb_proteins"
      
      - name: load_hpa_data  
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/hpa_osps.csv"
            identifier_column: "uniprot"
            output_key: "hpa_proteins"
      
      - name: merge_ukbb_uniprot
        action:
          type: MERGE_WITH_UNIPROT_RESOLUTION
          params:
            source_dataset_key: "ukbb_proteins"
            target_dataset_key: "hpa_proteins" 
            source_id_column: "UniProt"
            target_id_column: "uniprot"
            output_key: "ukbb_merged"
      
      - name: calculate_overlap
        action:
          type: CALCULATE_SET_OVERLAP
          params:
            dataset_a_key: "ukbb_merged"
            dataset_b_key: "hpa_proteins"
            output_key: "overlap_analysis"

Multi-Dataset Analysis
~~~~~~~~~~~~~~~~~~~~~~

Compare multiple datasets by loading each one and calculating pairwise overlaps:

.. code-block:: yaml

    name: "MULTI_DATASET_ANALYSIS"
    description: "Compare proteins across multiple sources"
    
    steps:
      # Load all datasets
      - name: load_arivale
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/arivale/proteomics_metadata.tsv"
            identifier_column: "uniprot"
            output_key: "arivale_proteins"
      
      - name: load_qin
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/qin_osps.csv"
            identifier_column: "uniprot"
            output_key: "qin_proteins"
            
      # Calculate overlaps
      - name: arivale_vs_qin
        action:
          type: CALCULATE_SET_OVERLAP
          params:
            dataset_a_key: "arivale_proteins"
            dataset_b_key: "qin_proteins"
            output_key: "arivale_qin_overlap"

Error Handling
--------------

Common Issues and Solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**File not found errors**
  Check file paths are absolute and files exist.

**Column not found errors**
  Verify the ``identifier_column`` matches your CSV headers exactly.

**Timeout errors**
  Large datasets may take time. Default timeout is 5 minutes, but can be increased:
  
  .. code-block:: python
  
      client = BiomapperClient(timeout=3600)  # 1 hour

**Validation errors**
  Ensure YAML syntax is correct and all required parameters are provided.

Debugging
~~~~~~~~~

Enable detailed logging:

.. code-block:: python

    import logging
    logging.basicConfig(level=logging.DEBUG)
    
    async with BiomapperClient("http://localhost:8000") as client:
        result = await client.execute_strategy_file("strategy.yaml")

Check API server logs for detailed error messages and execution progress.

Performance Tips
----------------

* Use environment variables for portable file paths
* For large datasets (>100K rows), increase client timeout and consider chunking
* Monitor API server resources during execution
* Use the ``watch=True`` parameter to see real-time progress:
  
  .. code-block:: python
  
      result = client.run("large_strategy", watch=True)

* Consider using ``CHUNK_PROCESSOR`` action for very large files
* Enable job persistence for recovery from failures

Advanced Features
-----------------

**Environment Variables**

Strategies support variable substitution:

.. code-block:: yaml

    parameters:
      data_dir: "${DATA_DIR:-/default/path}"
    steps:
      - action:
          params:
            file_path: "${parameters.data_dir}/file.csv"

**Progress Tracking**

Use Server-Sent Events for real-time progress:

.. code-block:: python

    result = client.run_with_progress("my_strategy")

**Job Recovery**

Jobs are persisted to SQLite for recovery:

.. code-block:: python

    # Check job status
    job = client.get_job(job_id)
    if job.status == "failed":
        # Retry from last checkpoint
        result = client.retry_job(job_id)

Next Steps
----------

* See :doc:`configuration` for advanced YAML strategy options
* Check :doc:`api/index` for complete API reference  
* Review :doc:`actions/index` for all available actions
* Explore templates in ``configs/strategies/templates/``
* Read :doc:`development/creating_actions` to add custom actions

---

Verification Sources
--------------------
*Last verified: 2025-08-17*

This documentation was verified against the following project resources:

- ``/biomapper/CLAUDE.md`` (CLI commands and best practices)
- ``/biomapper/README.md`` (Installation and quick start)
- ``/biomapper/pyproject.toml`` (Project configuration)