PROTEIN_EXTRACT_UNIPROT_FROM_XREFS

Extract UniProt accession IDs from compound xrefs fields in protein datasets.

Purpose

This action extracts UniProt accession IDs from xrefs fields commonly found in KG2c and SPOKE protein datasets. It provides:

  • Pattern-based extraction using regex matching

  • Multiple output format options

  • Isoform handling (keep or strip -1, -2 suffixes)

  • Validation of extracted UniProt IDs

  • Row expansion for multiple matches

  • Comprehensive statistics and metadata

Parameters

Required Parameters

input_key (string)

Key of the dataset in context to process.

xrefs_column (string)

Name of the column containing xrefs data with UniProt references.

Optional Parameters

output_column (string)

Name of the output column for extracted UniProt IDs. Default: "uniprot_id"

output_key (string)

Key under which to store the output dataset. If not provided, the input dataset is modified in place. Default: None

handle_multiple (string)

How to handle multiple UniProt IDs: 'list', 'first', or 'expand_rows'. Default: 'list'

keep_isoforms (boolean)

Whether to keep isoform suffixes (e.g., P12345-1, P12345-2). Default: false

drop_na (boolean)

Whether to drop rows with no UniProt IDs found. Default: true

UniProt Extraction Pattern

The action uses the regex pattern: UniProtKB:([A-Z0-9]+(?:-\d+)?)

This pattern matches:

  • Standard UniProt format: UniProtKB:P12345

  • Isoform variants: UniProtKB:P12345-1

  • Newer formats: UniProtKB:A0A123B4C5
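The pattern above can be tried out with Python's standard `re` module. This is a minimal sketch, not the action's actual code: the helper name `extract_uniprot_ids`, the way the isoform suffix is stripped, and the order-preserving de-duplication are assumptions made for illustration.

```python
import re
from typing import List

# Pattern quoted from the documentation above.
UNIPROT_PATTERN = re.compile(r"UniProtKB:([A-Z0-9]+(?:-\d+)?)")


def extract_uniprot_ids(xrefs: str, keep_isoforms: bool = False) -> List[str]:
    """Return UniProt accessions found in an xrefs string (hypothetical helper)."""
    ids = UNIPROT_PATTERN.findall(xrefs)
    if not keep_isoforms:
        # Strip a trailing "-<digits>" isoform suffix, e.g. P12345-1 -> P12345.
        ids = [re.sub(r"-\d+$", "", acc) for acc in ids]
    # Assumption: duplicates are dropped while keeping first-seen order.
    return list(dict.fromkeys(ids))


xrefs = "NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1"
print(extract_uniprot_ids(xrefs))                      # ['P12345']
print(extract_uniprot_ids(xrefs, keep_isoforms=True))  # ['P12345', 'P12345-1']
```

Note that the pattern itself does not limit the length of the accession; length checks belong to the validation step.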

Handle Multiple Options

list (default)

Keep all extracted UniProt IDs as a list in the output column.

first

Take only the first UniProt ID found and store as a single value.

expand_rows

Create separate rows for each UniProt ID found.
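The three modes can be illustrated on plain Python dictionaries. This is a sketch under stated assumptions: the function `apply_handle_multiple` is hypothetical, rows are modelled as dicts rather than DataFrame rows, and `drop_na` handling is deliberately left out, so rows without IDs receive an empty list or `None`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_handle_multiple(
    rows: List[Row],
    ids_per_row: List[List[str]],
    output_column: str = "uniprot_id",
    handle_multiple: str = "list",
) -> List[Row]:
    """Attach extracted IDs to rows according to the chosen mode (hypothetical helper)."""
    out: List[Row] = []
    for row, ids in zip(rows, ids_per_row):
        if handle_multiple == "list":
            # Keep every ID as a list in a single row.
            out.append({**row, output_column: list(ids)})
        elif handle_multiple == "first":
            # Keep only the first ID as a scalar value.
            out.append({**row, output_column: ids[0] if ids else None})
        elif handle_multiple == "expand_rows":
            # Emit one copy of the row per ID; a row without IDs keeps a None value.
            for acc in ids or [None]:
                out.append({**row, output_column: acc})
        else:
            raise ValueError(f"Unknown handle_multiple value: {handle_multiple!r}")
    return out


rows = [{"gene_name": "EXAMPLE1"}, {"gene_name": "EXAMPLE2"}]
ids = [["P12345"], ["Q67890", "Q67890-1"]]

print(len(apply_handle_multiple(rows, ids, handle_multiple="list")))         # 2
print(len(apply_handle_multiple(rows, ids, handle_multiple="expand_rows")))  # 3
```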

Example Usage

Basic UniProt Extraction

- name: extract_uniprot_ids
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "kg2c_proteins"
      xrefs_column: "all_node_curie"
      output_column: "uniprot_id"
      handle_multiple: "list"
      keep_isoforms: false

First Match Only

- name: extract_primary_uniprot
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "spoke_proteins"
      xrefs_column: "xrefs"
      output_column: "primary_uniprot"
      handle_multiple: "first"
      drop_na: true

Expand Rows for Each UniProt ID

- name: expand_uniprot_matches
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "protein_data"
      xrefs_column: "external_refs"
      output_column: "uniprot_id"
      handle_multiple: "expand_rows"
      keep_isoforms: true

Keep Isoform Information

- name: extract_with_isoforms
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "detailed_proteins"
      xrefs_column: "cross_references"
      output_column: "uniprot_accession"
      handle_multiple: "list"
      keep_isoforms: true
      drop_na: false

Input Data Format

Typical xrefs format:

# Example xrefs content
"NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1"

# Multiple references separated by pipes
"ENSEMBL:ENSG123|UniProtKB:Q67890|RefSeq:NP_001234"

# Complex format with various databases
"MONDO:0001234|HP:5678901|UniProtKB:O11111|UniProtKB:O11111-2|KEGG:hsa:999"

Expected input dataset structure:

[
    {
        "gene_name": "EXAMPLE1",
        "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
        "description": "Example protein 1"
    },
    {
        "gene_name": "EXAMPLE2",
        "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
        "description": "Example protein 2"
    }
]

Output Formats

List Output (handle_multiple='list')

[
    {
        "gene_name": "EXAMPLE1",
        "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
        "uniprot_id": ["P12345"],
        "description": "Example protein 1"
    },
    {
        "gene_name": "EXAMPLE2",
        "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
        "uniprot_id": ["Q67890"],  # Isoforms stripped if keep_isoforms=false
        "description": "Example protein 2"
    }
]

First Match Output (handle_multiple='first')

[
    {
        "gene_name": "EXAMPLE1",
        "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
        "uniprot_id": "P12345",
        "description": "Example protein 1"
    }
]

Expanded Rows Output (handle_multiple='expand_rows')

[
    {
        "gene_name": "EXAMPLE1",
        "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
        "uniprot_id": "P12345",
        "description": "Example protein 1"
    },
    {
        "gene_name": "EXAMPLE2",
        "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
        "uniprot_id": "Q67890",
        "description": "Example protein 2"
    },
    {
        "gene_name": "EXAMPLE2",
        "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
        "uniprot_id": "Q67890",  # If keep_isoforms=false, duplicates are removed
        "description": "Example protein 2"
    }
]

Statistics and Metadata

The action provides detailed statistics in the context:

{
    "statistics": {
        "uniprot_extraction": {
            "total_rows_processed": 1000,
            "rows_with_uniprot_ids": 847,
            "extraction_rate": 0.847
        }
    }
}
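The figures in this structure are related by a simple ratio: `extraction_rate` is `rows_with_uniprot_ids` divided by `total_rows_processed`. The sketch below shows one way such a block could be computed; the function name `extraction_stats` and its input (one list of extracted IDs per row) are assumptions for illustration, not the action's actual code.

```python
from typing import Any, Dict, List


def extraction_stats(ids_per_row: List[List[str]]) -> Dict[str, Any]:
    """Build a statistics block shaped like the one documented above (hypothetical helper)."""
    total = len(ids_per_row)
    # A row counts as a hit when at least one UniProt ID was extracted.
    with_ids = sum(1 for ids in ids_per_row if ids)
    # Guard against division by zero for an empty dataset.
    rate = with_ids / total if total else 0.0
    return {
        "statistics": {
            "uniprot_extraction": {
                "total_rows_processed": total,
                "rows_with_uniprot_ids": with_ids,
                "extraction_rate": rate,
            }
        }
    }


# 847 rows with one ID each and 153 rows without any ID.
sample = [["P12345"]] * 847 + [[]] * 153
print(extraction_stats(sample)["statistics"]["uniprot_extraction"]["extraction_rate"])  # 0.847
```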

UniProt ID Validation

Valid Format Patterns:

  • Standard: 6-10 alphanumeric characters (e.g., P12345, Q9Y6K1)

  • Newer format: Up to 10 characters (e.g., A0A123B4C5)

  • Isoforms: Base ID + dash + number (e.g., P12345-1)

Invalid IDs are filtered out:

  • Too short: < 6 characters

  • Too long: > 10 characters (excluding isoform suffix)

  • Invalid characters: Only A-Z and 0-9 allowed

  • Malformed isoforms: Invalid suffix patterns
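The rules listed above can be expressed as a single regular expression. This is a minimal sketch that encodes only the length, character, and suffix rules stated in this section; the name `is_valid_uniprot_id` is hypothetical, and the check is looser than the official UniProt accession format.

```python
import re

# 6-10 characters from A-Z and 0-9, optionally followed by "-<digits>" for an isoform.
VALID_ID = re.compile(r"[A-Z0-9]{6,10}(?:-\d+)?")


def is_valid_uniprot_id(accession: str) -> bool:
    """Return True when the accession satisfies the documented rules (hypothetical helper)."""
    # fullmatch requires the whole string to match, so stray characters are rejected.
    return VALID_ID.fullmatch(accession) is not None


candidates = ["P12345", "A0A123B4C5", "P12345-1", "P123", "A0A123B4C5X", "p12345", "P12345-a"]
print([acc for acc in candidates if is_valid_uniprot_id(acc)])
# ['P12345', 'A0A123B4C5', 'P12345-1']
```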

Error Handling

Column not found
Error: Column 'missing_xrefs' not found in dataset

Solution: Verify the xrefs_column name matches exactly.

Dataset not found
Error: Dataset key 'missing_data' not found in context

Solution: Ensure dataset exists in context from previous actions.

No UniProt IDs found
Warning: No valid UniProt IDs extracted from dataset

Solution: Check xrefs format and UniProt reference patterns.
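The first two errors can be reproduced with simple precondition checks before extraction. The sketch below is illustrative only: it models the context as a plain dictionary mapping dataset keys to lists of row dicts, and the function name `check_inputs` and the use of `ValueError` are assumptions rather than the action's real interface.

```python
from typing import Any, Dict, List

Rows = List[Dict[str, Any]]


def check_inputs(datasets: Dict[str, Rows], input_key: str, xrefs_column: str) -> Rows:
    """Return the dataset if the key and column exist (hypothetical helper)."""
    if input_key not in datasets:
        raise ValueError(f"Dataset key '{input_key}' not found in context")
    rows = datasets[input_key]
    # Assumption: every row shares the same columns, so the first row is representative.
    if rows and xrefs_column not in rows[0]:
        raise ValueError(f"Column '{xrefs_column}' not found in dataset")
    return rows


datasets = {"kg2c_proteins": [{"all_node_curie": "UniProtKB:P12345"}]}

try:
    check_inputs(datasets, "kg2c_proteins", "missing_xrefs")
except ValueError as err:
    print(f"Error: {err}")  # Error: Column 'missing_xrefs' not found in dataset
```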

Best Practices

  1. Inspect xrefs format before extraction to understand data structure

  2. Choose appropriate handling for multiple IDs based on downstream needs

  3. Consider isoform requirements - biological significance vs. analysis complexity

  4. Validate extraction results by checking statistics and sample outputs

  5. Use expand_rows carefully - can significantly increase dataset size

  6. Filter empty results appropriately with drop_na parameter

Performance Notes

  • Regex extraction remains efficient for datasets of 100K+ rows

  • Row expansion can significantly increase memory usage

  • Validation adds minimal overhead

  • Processing time scales linearly with dataset size and xrefs complexity

Common Use Cases

Knowledge Graph Integration

Extract UniProt IDs from KG2c or SPOKE protein nodes for mapping

Data Standardization

Convert complex xrefs to standardized UniProt identifiers

Multi-Database Reconciliation

Extract UniProt IDs as primary keys for cross-database mapping

Protein Network Analysis

Prepare protein datasets with clean UniProt identifiers

Integration

This action typically follows data loading and precedes mapping operations:

steps:
  # 1. Load protein data with xrefs
  - name: load_kg2c_proteins
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/kg2c_proteins.csv"
        identifier_column: "node_id"
        output_key: "kg2c_raw"

  # 2. Extract UniProt IDs
  - name: extract_uniprot
    action:
      type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
      params:
        input_key: "kg2c_raw"
        xrefs_column: "all_node_curie"
        output_column: "uniprot_id"
        handle_multiple: "first"
        keep_isoforms: false
        drop_na: true

  # 3. Continue with protein mapping
  - name: map_to_reference
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        source_dataset_key: "kg2c_raw"
        target_dataset_key: "reference_proteins"
        output_key: "mapped_proteins"

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

  • /biomapper/src/actions/entities/proteins/annotation/extract_uniprot_from_xrefs.py (actual implementation with regex pattern and multiple handling modes)

  • /biomapper/src/actions/typed_base.py (TypedStrategyAction base class)

  • /biomapper/src/actions/registry.py (self-registration via @register_action decorator)

  • /biomapper/CLAUDE.md (2025 standardization requirements for parameter naming)

  • /biomapper/pyproject.toml (pandas dependency for DataFrame operations)