PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
===================================

Extract UniProt accession IDs from compound xrefs fields in protein datasets.

Purpose
-------

This action extracts UniProt accession IDs from xrefs fields commonly found in KG2c and SPOKE protein datasets. It provides:

* Pattern-based extraction using regex matching
* Multiple output format options
* Isoform handling (keep or strip -1, -2 suffixes)
* Validation of extracted UniProt IDs
* Row expansion for multiple matches
* Comprehensive statistics and metadata

Parameters
----------

Required Parameters
~~~~~~~~~~~~~~~~~~~

**input_key** (string)
  Key of the dataset in context to process.

**xrefs_column** (string)
  Name of the column containing xrefs data with UniProt references.

Optional Parameters
~~~~~~~~~~~~~~~~~~~

**output_column** (string)
  Name of the output column for extracted UniProt IDs.
  Default: "uniprot_id"

**output_key** (string)
  Key under which to store the result. If omitted, the input dataset is modified in place.
  Default: None

**handle_multiple** (string)
  How to handle multiple UniProt IDs: 'list', 'first', or 'expand_rows'.
  Default: 'list'

**keep_isoforms** (boolean)
  Whether to keep isoform suffixes (e.g., P12345-1, P12345-2).
  Default: false

**drop_na** (boolean)
  Whether to drop rows with no UniProt IDs found.
  Default: true

UniProt Extraction Pattern
--------------------------

The action uses the regex pattern: ``UniProtKB:([A-Z0-9]+(?:-\d+)?)``

This pattern matches:

* Standard UniProt format: ``UniProtKB:P12345``
* Isoform variants: ``UniProtKB:P12345-1``
* Newer formats: ``UniProtKB:A0A123B4C5``
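
As a quick illustration of the pattern in isolation, the sketch below applies it with Python's ``re`` module. The pattern is the one quoted on this page; the sample string and variable names are illustrative and not taken from the implementation.

.. code-block:: python

    import re

    # Pattern quoted from this page; everything else here is illustrative.
    UNIPROT_PATTERN = re.compile(r"UniProtKB:([A-Z0-9]+(?:-\d+)?)")

    xrefs = "NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1"

    # findall returns only the capture group, so the "UniProtKB:" prefix
    # is dropped and any isoform suffix is kept at this stage.
    matches = UNIPROT_PATTERN.findall(xrefs)
    print(matches)  # ['P12345', 'P12345-1']

Entries with other prefixes (``NCBIGene``, ``HGNC``) are ignored because the pattern requires the literal ``UniProtKB:`` prefix.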

Handle Multiple Options
-----------------------

**list** (default)
  Keep all extracted UniProt IDs as a list in the output column.

**first** 
  Take only the first UniProt ID found and store as a single value.

**expand_rows**
  Create separate rows for each UniProt ID found.
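
The three modes can be sketched in plain Python as follows. This is a minimal approximation, not the action's code: ``extract_ids`` and ``apply_handle_multiple`` are hypothetical helpers, rows are modelled as dictionaries, and the de-duplication and handling of rows without matches are assumptions.

.. code-block:: python

    import re

    UNIPROT_PATTERN = re.compile(r"UniProtKB:([A-Z0-9]+(?:-\d+)?)")


    def extract_ids(xrefs, keep_isoforms=False):
        """Return UniProt IDs in order of appearance, without duplicates."""
        ids = UNIPROT_PATTERN.findall(xrefs)
        if not keep_isoforms:
            # Strip a trailing "-<number>" suffix, e.g. P12345-1 -> P12345.
            ids = [re.sub(r"-\d+$", "", i) for i in ids]
        # dict.fromkeys removes duplicates while preserving order.
        return list(dict.fromkeys(ids))


    def apply_handle_multiple(rows, xrefs_column, output_column,
                              handle_multiple="list", keep_isoforms=False):
        """Hypothetical helper mirroring the three documented modes."""
        result = []
        for row in rows:
            ids = extract_ids(row[xrefs_column], keep_isoforms)
            if handle_multiple == "list":
                result.append({**row, output_column: ids})
            elif handle_multiple == "first":
                # Assumption: rows without a match get None.
                result.append({**row, output_column: ids[0] if ids else None})
            elif handle_multiple == "expand_rows":
                # Assumption: rows without a match produce no output rows.
                result.extend({**row, output_column: i} for i in ids)
            else:
                raise ValueError(f"Unknown handle_multiple: {handle_multiple!r}")
        return result


    rows = [{"gene_name": "EXAMPLE2",
             "xrefs": "UniProtKB:Q67890|UniProtKB:O11111-2"}]
    as_list = apply_handle_multiple(rows, "xrefs", "uniprot_id", "list")
    first = apply_handle_multiple(rows, "xrefs", "uniprot_id", "first")
    expanded = apply_handle_multiple(rows, "xrefs", "uniprot_id", "expand_rows")

With this input, ``list`` yields one row holding ``["Q67890", "O11111"]``, ``first`` yields one row holding ``"Q67890"``, and ``expand_rows`` yields two rows, one per ID.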

Example Usage
-------------

Basic UniProt Extraction
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: extract_uniprot_ids
      action:
        type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
        params:
          input_key: "kg2c_proteins"
          xrefs_column: "all_node_curie"
          output_column: "uniprot_id"
          handle_multiple: "list"
          keep_isoforms: false

First Match Only
~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: extract_primary_uniprot
      action:
        type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
        params:
          input_key: "spoke_proteins"
          xrefs_column: "xrefs"
          output_column: "primary_uniprot"
          handle_multiple: "first"
          drop_na: true

Expand Rows for Each UniProt ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: expand_uniprot_matches
      action:
        type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
        params:
          input_key: "protein_data"
          xrefs_column: "external_refs"
          output_column: "uniprot_id"
          handle_multiple: "expand_rows"
          keep_isoforms: true

Keep Isoform Information
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: extract_with_isoforms
      action:
        type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
        params:
          input_key: "detailed_proteins"
          xrefs_column: "cross_references"
          output_column: "uniprot_accession"
          handle_multiple: "list"
          keep_isoforms: true
          drop_na: false

Input Data Format
-----------------

**Typical xrefs format:**

.. code-block:: text

    # Example xrefs content
    "NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1"
    
    # Multiple references separated by pipes
    "ENSEMBL:ENSG123|UniProtKB:Q67890|RefSeq:NP_001234"
    
    # Complex format with various databases
    "MONDO:0001234|HP:5678901|UniProtKB:O11111|UniProtKB:O11111-2|KEGG:hsa:999"

**Expected input dataset structure:**

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "description": "Example protein 1"
        },
        {
            "gene_name": "EXAMPLE2", 
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
            "description": "Example protein 2"
        }
    ]

Output Formats
--------------

**List Output (handle_multiple='list')**

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "uniprot_id": ["P12345"],
            "description": "Example protein 1"
        },
        {
            "gene_name": "EXAMPLE2",
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1", 
            "uniprot_id": ["Q67890"],  # Isoforms stripped if keep_isoforms=false
            "description": "Example protein 2"
        }
    ]

**First Match Output (handle_multiple='first')**

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "uniprot_id": "P12345",
            "description": "Example protein 1"
        }
    ]

**Expanded Rows Output (handle_multiple='expand_rows')**

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "uniprot_id": "P12345",
            "description": "Example protein 1"
        },
        {
            "gene_name": "EXAMPLE2",
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
            "uniprot_id": "Q67890",
            "description": "Example protein 2"
        },
        {
            "gene_name": "EXAMPLE2",
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
            "uniprot_id": "Q67890-1",  # Shown with keep_isoforms=true; with false, this row duplicates Q67890 and is removed
            "description": "Example protein 2"
        }
    ]

Statistics and Metadata
------------------------

The action provides detailed statistics in the context:

.. code-block:: python

    {
        "statistics": {
            "uniprot_extraction": {
                "total_rows_processed": 1000,
                "rows_with_uniprot_ids": 847,
                "extraction_rate": 0.847
            }
        }
    }
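
The relationship between these fields is not spelled out here. The sketch below assumes ``extraction_rate`` is ``rows_with_uniprot_ids`` divided by ``total_rows_processed``, which is consistent with the sample values (847 / 1000 = 0.847). The function ``summarize_extraction`` is a hypothetical helper, not part of the action.

.. code-block:: python

    def summarize_extraction(id_lists):
        """Build a statistics dict from per-row lists of extracted IDs."""
        total = len(id_lists)
        with_ids = sum(1 for ids in id_lists if ids)
        return {
            "total_rows_processed": total,
            "rows_with_uniprot_ids": with_ids,
            # Guard against an empty dataset to avoid division by zero.
            "extraction_rate": with_ids / total if total else 0.0,
        }


    # Four rows, one of which yielded no UniProt IDs.
    stats = summarize_extraction([["P12345"], [], ["Q67890", "O11111"], ["P12345"]])
    print(stats["extraction_rate"])  # 0.75

A low rate is usually a prompt to re-check the ``xrefs_column`` name and the prefix used in the source data.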

UniProt ID Validation
---------------------

**Valid Format Patterns:**

* Length: 6-10 alphanumeric characters; standard accessions have 6 (e.g., P12345, Q9Y6K1)
* Newer format: 10 characters (e.g., A0A123B4C5)
* Isoforms: Base ID + dash + number (e.g., P12345-1)

**Invalid IDs are filtered out:**

* Too short: < 6 characters
* Too long: > 10 characters (excluding isoform suffix)
* Invalid characters: Only A-Z and 0-9 allowed
* Malformed isoforms: Invalid suffix patterns
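
Read literally, these rules amount to a length-and-character check. The sketch below encodes that reading as a single regular expression; both the expression and the ``is_valid_uniprot_id`` helper are assumptions for illustration, and the action's own validation may differ.

.. code-block:: python

    import re

    # Assumed reading of the rules above: 6-10 characters from A-Z and 0-9,
    # optionally followed by "-<digits>" for an isoform.
    VALID_ID = re.compile(r"[A-Z0-9]{6,10}(?:-\d+)?")


    def is_valid_uniprot_id(candidate):
        """Return True if the whole string satisfies the assumed rules."""
        return VALID_ID.fullmatch(candidate) is not None


    for candidate in ["P12345", "A0A123B4C5", "P12345-1", "P123", "P12345-x"]:
        print(candidate, is_valid_uniprot_id(candidate))

This is a structural check only: it does not confirm that an accession actually exists in UniProt.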

Error Handling
--------------

**Column not found**
  .. code-block::
  
      Error: Column 'missing_xrefs' not found in dataset
      
  Solution: Verify the xrefs_column name matches exactly.

**Dataset not found**
  .. code-block::
  
      Error: Dataset key 'missing_data' not found in context
      
  Solution: Ensure dataset exists in context from previous actions.

**No UniProt IDs found**
  .. code-block::
  
      Warning: No valid UniProt IDs extracted from dataset
      
  Solution: Check xrefs format and UniProt reference patterns.

Best Practices
--------------

1. **Inspect xrefs format** before extraction to understand data structure
2. **Choose appropriate handling** for multiple IDs based on downstream needs
3. **Consider isoform requirements** - biological significance vs. analysis complexity
4. **Validate extraction results** by checking statistics and sample outputs
5. **Use expand_rows carefully** - can significantly increase dataset size
6. **Filter empty results** appropriately with drop_na parameter
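
For the first practice, one lightweight way to inspect an xrefs column is to count the prefixes it contains. The sketch below assumes pipe-separated ``prefix:identifier`` entries, as in the examples on this page; ``count_prefixes`` is a hypothetical helper.

.. code-block:: python

    from collections import Counter


    def count_prefixes(xrefs_values, separator="|"):
        """Count the text before the first ':' across all xrefs entries."""
        counts = Counter()
        for value in xrefs_values:
            for ref in value.split(separator):
                prefix, sep, _ = ref.partition(":")
                if sep:  # skip entries that contain no ':' at all
                    counts[prefix] += 1
        return counts


    sample = [
        "NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1",
        "ENSEMBL:ENSG123|UniProtKB:Q67890|RefSeq:NP_001234",
    ]
    prefix_counts = count_prefixes(sample)
    print(prefix_counts.most_common())

If ``UniProtKB`` is missing from the counts, or appears under a different spelling, the extraction pattern will not match and the action will return no IDs.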

Performance Notes
-----------------

* Regex extraction remains efficient on datasets of 100K rows or more
* Row expansion can significantly increase memory usage
* Validation adds minimal overhead
* Processing time scales linearly with dataset size and xrefs complexity

Common Use Cases
----------------

**Knowledge Graph Integration**
  Extract UniProt IDs from KG2c or SPOKE protein nodes for mapping

**Data Standardization**
  Convert complex xrefs to standardized UniProt identifiers

**Multi-Database Reconciliation**
  Extract UniProt IDs as primary keys for cross-database mapping

**Protein Network Analysis**
  Prepare protein datasets with clean UniProt identifiers

Integration
-----------

This action typically follows data loading and precedes mapping operations:

.. code-block:: yaml

    steps:
      # 1. Load protein data with xrefs
      - name: load_kg2c_proteins
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/kg2c_proteins.csv"
            identifier_column: "node_id"
            output_key: "kg2c_raw"
      
      # 2. Extract UniProt IDs
      - name: extract_uniprot
        action:
          type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
          params:
            input_key: "kg2c_raw"
            xrefs_column: "all_node_curie"
            output_column: "uniprot_id"
            handle_multiple: "first"
            keep_isoforms: false
            drop_na: true
      
      # 3. Continue with protein mapping
      - name: map_to_reference
        action:
          type: MERGE_WITH_UNIPROT_RESOLUTION
          params:
            source_dataset_key: "kg2c_raw"
            target_dataset_key: "reference_proteins"
            output_key: "mapped_proteins"

Verification Sources
--------------------

*Last verified: 2025-08-22*

This documentation was verified against the following project resources:

* ``/biomapper/src/actions/entities/proteins/annotation/extract_uniprot_from_xrefs.py`` (actual implementation with regex pattern and multiple handling modes)
* ``/biomapper/src/actions/typed_base.py`` (TypedStrategyAction base class)
* ``/biomapper/src/actions/registry.py`` (self-registration via @register_action decorator)
* ``/biomapper/CLAUDE.md`` (2025 standardization requirements for parameter naming)
* ``/biomapper/pyproject.toml`` (pandas dependency for DataFrame operations)