PROTEIN_EXTRACT_UNIPROT_FROM_XREFS

Extract UniProt accession IDs from compound xrefs fields in protein datasets.

Purpose

This action extracts UniProt accession IDs from xrefs fields commonly found in KG2c and SPOKE protein datasets. It provides:

Pattern-based extraction using regex matching
Multiple output format options
Isoform handling (keep or strip -1, -2 suffixes)
Validation of extracted UniProt IDs
Row expansion for multiple matches
Comprehensive statistics and metadata

Parameters

Required Parameters

input_key (string): Key of the dataset in context to process.
xrefs_column (string): Name of the column containing xrefs data with UniProt references.

Optional Parameters

output_column (string): Name of the output column for extracted UniProt IDs. Default: "uniprot_id"
output_key (string): Key under which to store the output dataset. If not provided, the input dataset is modified in place. Default: None
handle_multiple (string): How to handle multiple UniProt IDs: 'list', 'first', or 'expand_rows'. Default: 'list'
keep_isoforms (boolean): Whether to keep isoform suffixes (e.g., P12345-1, P12345-2). Default: false
drop_na (boolean): Whether to drop rows with no UniProt IDs found. Default: true
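How keep_isoforms and drop_na interact can be sketched in plain Python. This is a minimal illustration, not the action's implementation: the helper names (`strip_isoform`, `normalize_ids`, `filter_rows`) are hypothetical, and the order-preserving de-duplication after stripping is an assumption.

```python
def strip_isoform(accession: str) -> str:
    """Drop a trailing isoform suffix such as '-1' from an accession."""
    base, _, _ = accession.partition("-")
    return base


def normalize_ids(ids: list[str], keep_isoforms: bool = False) -> list[str]:
    """Return IDs unchanged, or stripped of isoform suffixes and de-duplicated."""
    if keep_isoforms:
        return list(ids)
    # dict.fromkeys removes duplicates while preserving first-seen order
    # (assumed behaviour, not confirmed by the action's source)
    return list(dict.fromkeys(strip_isoform(uid) for uid in ids))


def filter_rows(rows: list[dict], column: str = "uniprot_id", drop_na: bool = True) -> list[dict]:
    """Drop rows whose output column is empty when drop_na is set."""
    if not drop_na:
        return list(rows)
    return [row for row in rows if row.get(column)]


rows = [
    {"gene_name": "EXAMPLE1", "uniprot_id": normalize_ids(["Q67890", "Q67890-1"])},
    {"gene_name": "EXAMPLE2", "uniprot_id": normalize_ids([])},
]
kept = filter_rows(rows)
print(kept)  # [{'gene_name': 'EXAMPLE1', 'uniprot_id': ['Q67890']}]
```

With the defaults (keep_isoforms false, drop_na true), a row whose xrefs contain only isoforms of one protein collapses to a single base accession, and a row with no match disappears from the output.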

UniProt Extraction Pattern

The action uses the regex pattern: UniProtKB:([A-Z0-9]+(?:-\d+)?)

This pattern matches:

* Standard UniProt format: UniProtKB:P12345
* Isoform variants: UniProtKB:P12345-1
* Newer formats: UniProtKB:A0A123B4C5
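To see how this pattern behaves on a real string, the sketch below applies it with Python's standard `re` module. The helper name `find_uniprot_ids` is hypothetical and only illustrates the regex; the action's internals may differ.

```python
import re

# Pattern quoted from this page; the capture group holds the accession
# plus an optional isoform suffix.
UNIPROT_PATTERN = re.compile(r"UniProtKB:([A-Z0-9]+(?:-\d+)?)")


def find_uniprot_ids(xrefs: str) -> list[str]:
    """Return every accession captured by the pattern, in order of appearance."""
    return UNIPROT_PATTERN.findall(xrefs)


matches = find_uniprot_ids("NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1")
print(matches)  # ['P12345', 'P12345-1']
```

Because the pattern has a single capture group, `findall` returns only the accession part, without the `UniProtKB:` prefix.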

Handle Multiple Options

list (default): Keep all extracted UniProt IDs as a list in the output column.
first: Take only the first UniProt ID found and store as a single value.
expand_rows: Create separate rows for each UniProt ID found.
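The three modes above can be sketched on a list of row dictionaries. This is an illustrative sketch under stated assumptions: `apply_handle_multiple` is a hypothetical helper, and keeping an ID-less row as a single row with `None` in expand_rows mode is an assumption rather than documented behaviour.

```python
def apply_handle_multiple(rows, ids_per_row, output_column="uniprot_id", handle_multiple="list"):
    """Attach extracted IDs to each row dict according to the chosen mode."""
    out = []
    for row, ids in zip(rows, ids_per_row):
        if handle_multiple == "list":
            # keep every ID as a list value
            out.append({**row, output_column: list(ids)})
        elif handle_multiple == "first":
            # keep only the first ID as a scalar value
            out.append({**row, output_column: ids[0] if ids else None})
        elif handle_multiple == "expand_rows":
            # one output row per ID; an ID-less row stays as one row with None (assumption)
            for uid in (ids or [None]):
                out.append({**row, output_column: uid})
        else:
            raise ValueError(f"unknown handle_multiple: {handle_multiple!r}")
    return out


rows = [{"gene_name": "EXAMPLE1"}, {"gene_name": "EXAMPLE2"}]
ids_per_row = [["P12345"], ["Q67890", "Q67890-1"]]

expanded = apply_handle_multiple(rows, ids_per_row, handle_multiple="expand_rows")
print(len(expanded))  # 3
```

The choice matters downstream: 'list' preserves the row count but yields a list-valued column, 'first' yields a scalar column at the cost of discarding later IDs, and 'expand_rows' yields a scalar column at the cost of duplicating the other fields.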

Example Usage

Basic UniProt Extraction

- name: extract_uniprot_ids
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "kg2c_proteins"
      xrefs_column: "all_node_curie"
      output_column: "uniprot_id"
      handle_multiple: "list"
      keep_isoforms: false

First Match Only

- name: extract_primary_uniprot
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "spoke_proteins"
      xrefs_column: "xrefs"
      output_column: "primary_uniprot"
      handle_multiple: "first"
      drop_na: true

Expand Rows for Each UniProt ID

- name: expand_uniprot_matches
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "protein_data"
      xrefs_column: "external_refs"
      output_column: "uniprot_id"
      handle_multiple: "expand_rows"
      keep_isoforms: true

Keep Isoform Information

- name: extract_with_isoforms
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "detailed_proteins"
      xrefs_column: "cross_references"
      output_column: "uniprot_accession"
      handle_multiple: "list"
      keep_isoforms: true
      drop_na: false

Input Data Format

Typical xrefs format:

.. code-block:: text

    # Example xrefs content
    "NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1"

    # Multiple references separated by pipes
    "ENSEMBL:ENSG123|UniProtKB:Q67890|RefSeq:NP_001234"

    # Complex format with various databases
    "MONDO:0001234|HP:5678901|UniProtKB:O11111|UniProtKB:O11111-2|KEGG:hsa:999"

Expected input dataset structure:

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "description": "Example protein 1"
        },
        {
            "gene_name": "EXAMPLE2",
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
            "description": "Example protein 2"
        }
    ]

Output Formats

List Output (handle_multiple='list')

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "uniprot_id": ["P12345"],
            "description": "Example protein 1"
        },
        {
            "gene_name": "EXAMPLE2",
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
            "uniprot_id": ["Q67890"],  # Isoforms stripped if keep_isoforms=false
            "description": "Example protein 2"
        }
    ]

First Match Output (handle_multiple='first')

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "uniprot_id": "P12345",
            "description": "Example protein 1"
        }
    ]

Expanded Rows Output (handle_multiple='expand_rows')

.. code-block:: python

    [
        {
            "gene_name": "EXAMPLE1",
            "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
            "uniprot_id": "P12345",
            "description": "Example protein 1"
        },
        {
            "gene_name": "EXAMPLE2",
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
            "uniprot_id": "Q67890",
            "description": "Example protein 2"
        },
        {
            "gene_name": "EXAMPLE2",
            "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
            "uniprot_id": "Q67890",  # If keep_isoforms=false, duplicates are removed
            "description": "Example protein 2"
        }
    ]

Statistics and Metadata

The action provides detailed statistics in the context:

{
    "statistics": {
        "uniprot_extraction": {
            "total_rows_processed": 1000,
            "rows_with_uniprot_ids": 847,
            "extraction_rate": 0.847
        }
    }
}
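The figures in this structure are consistent with extraction_rate being rows_with_uniprot_ids divided by total_rows_processed (847 / 1000 = 0.847). A minimal sketch of that calculation follows; `summarize_extraction` is a hypothetical helper, and counting rows before any drop_na filtering is an assumption.

```python
def summarize_extraction(ids_per_row: list[list[str]]) -> dict:
    """Build a statistics dict shaped like the one shown on this page."""
    total = len(ids_per_row)
    with_ids = sum(1 for ids in ids_per_row if ids)
    # guard against division by zero on an empty dataset
    rate = with_ids / total if total else 0.0
    return {
        "statistics": {
            "uniprot_extraction": {
                "total_rows_processed": total,
                "rows_with_uniprot_ids": with_ids,
                "extraction_rate": rate,
            }
        }
    }


stats = summarize_extraction([["P12345"], [], ["Q67890", "Q67890-1"], []])
print(stats["statistics"]["uniprot_extraction"]["extraction_rate"])  # 0.5
```

A low extraction rate is usually the first sign that the xrefs column uses a different prefix or delimiter than expected.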

UniProt ID Validation

Valid Format Patterns:

* Standard: 6-10 alphanumeric characters (e.g., P12345, Q9Y6K1)
* Newer format: Up to 10 characters (e.g., A0A123B4C5)
* Isoforms: Base ID + dash + number (e.g., P12345-1)

Invalid IDs are filtered out:

* Too short: < 6 characters
* Too long: > 10 characters (excluding isoform suffix)
* Invalid characters: Only A-Z and 0-9 allowed
* Malformed isoforms: Invalid suffix patterns
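The rules above can be expressed as a single regular expression. The sketch below is one way to encode them, assuming the length limits apply to the base accession and that an isoform suffix is a dash followed by digits; `is_valid_uniprot_id` is a hypothetical name, not the action's API.

```python
import re

# Assumed rule set, read off the bullets above: 6 to 10 characters from
# A-Z and 0-9, optionally followed by '-<digits>' as an isoform suffix.
VALID_ID = re.compile(r"[A-Z0-9]{6,10}(?:-\d+)?")


def is_valid_uniprot_id(candidate: str) -> bool:
    """Return True when the whole string satisfies the assumed rules."""
    return VALID_ID.fullmatch(candidate) is not None


candidates = ["P12345", "A0A123B4C5", "P12345-1", "P123", "A0A123B4C5X", "P12345-a"]
valid = [c for c in candidates if is_valid_uniprot_id(c)]
print(valid)  # ['P12345', 'A0A123B4C5', 'P12345-1']
```

This is a length-and-charset check only. UniProt's own accession format is stricter, so a string can pass this check without being a real accession.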

Error Handling

Column not found

Error: Column 'missing_xrefs' not found in dataset

Solution: Verify the xrefs_column name matches exactly.

Dataset not found

Error: Dataset key 'missing_data' not found in context

Solution: Ensure dataset exists in context from previous actions.

No UniProt IDs found

Warning: No valid UniProt IDs extracted from dataset

Solution: Check xrefs format and UniProt reference patterns.

Best Practices

Inspect xrefs format before extraction to understand data structure
Choose appropriate handling for multiple IDs based on downstream needs
Consider isoform requirements - biological significance vs. analysis complexity
Validate extraction results by checking statistics and sample outputs
Use expand_rows carefully - can significantly increase dataset size
Filter empty results appropriately with drop_na parameter
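For the first practice, a quick prefix count shows whether the xrefs column actually contains `UniProtKB:` entries before the action runs. The sketch assumes pipe-delimited CURIEs as in the examples on this page; `count_prefixes` is a hypothetical helper.

```python
from collections import Counter


def count_prefixes(rows: list[dict], xrefs_column: str) -> Counter:
    """Count how often each CURIE prefix appears in the xrefs column."""
    counts = Counter()
    for row in rows:
        value = row.get(xrefs_column) or ""
        for curie in value.split("|"):  # pipe delimiter is an assumption
            prefix, sep, _ = curie.partition(":")
            if sep:  # skip empty or prefix-less entries
                counts[prefix] += 1
    return counts


sample = [
    {"all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678"},
    {"all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1"},
]
prefix_counts = count_prefixes(sample, "all_node_curie")
print(prefix_counts["UniProtKB"])  # 3
```

If the expected prefix is missing or spelled differently (for example a lowercase variant), the extraction pattern will not match and the statistics will show a near-zero extraction rate.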

Performance Notes

Regex extraction is efficient for datasets of 100K+ rows
Row expansion can significantly increase memory usage
Validation adds minimal overhead
Processing time scales linearly with dataset size and xrefs complexity
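Since row expansion drives memory use, it can help to estimate the expanded row count before choosing expand_rows. The sketch below assumes one output row per extracted ID and that an ID-less row survives as a single row only when drop_na is false; `expanded_row_count` is a hypothetical helper.

```python
def expanded_row_count(ids_per_row: list[list[str]], drop_na: bool = True) -> int:
    """Estimate how many rows expand_rows would produce."""
    total = 0
    for ids in ids_per_row:
        if ids:
            total += len(ids)  # one row per extracted ID
        elif not drop_na:
            total += 1  # ID-less row kept as a single row (assumption)
    return total


ids_per_row = [["P12345"], ["Q67890", "Q67890-1"], []]
print(expanded_row_count(ids_per_row))                 # 3
print(expanded_row_count(ids_per_row, drop_na=False))  # 4
```

Comparing this estimate with the input row count gives a rough growth factor for the dataset.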

Common Use Cases

Knowledge Graph Integration: Extract UniProt IDs from KG2c or SPOKE protein nodes for mapping
Data Standardization: Convert complex xrefs to standardized UniProt identifiers
Multi-Database Reconciliation: Extract UniProt IDs as primary keys for cross-database mapping
Protein Network Analysis: Prepare protein datasets with clean UniProt identifiers

Integration

This action typically follows data loading and precedes mapping operations:

steps:
  # 1. Load protein data with xrefs
  - name: load_kg2c_proteins
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/kg2c_proteins.csv"
        identifier_column: "node_id"
        output_key: "kg2c_raw"

  # 2. Extract UniProt IDs
  - name: extract_uniprot
    action:
      type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
      params:
        input_key: "kg2c_raw"
        xrefs_column: "all_node_curie"
        output_column: "uniprot_id"
        handle_multiple: "first"
        keep_isoforms: false
        drop_na: true

  # 3. Continue with protein mapping
  - name: map_to_reference
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        source_dataset_key: "kg2c_raw"
        target_dataset_key: "reference_proteins"
        output_key: "mapped_proteins"

----

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

/biomapper/src/actions/entities/proteins/annotation/extract_uniprot_from_xrefs.py (actual implementation with regex pattern and multiple handling modes)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class)
/biomapper/src/actions/registry.py (self-registration via @register_action decorator)
/biomapper/CLAUDE.md (2025 standardization requirements for parameter naming)
/biomapper/pyproject.toml (pandas dependency for DataFrame operations)