PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
Extract UniProt accession IDs from compound xrefs fields in protein datasets.
Purpose
This action extracts UniProt accession IDs from xrefs fields commonly found in KG2c and SPOKE protein datasets. It provides:
* Pattern-based extraction using regex matching
* Multiple output format options
* Isoform handling (keep or strip -1, -2 suffixes)
* Validation of extracted UniProt IDs
* Row expansion for multiple matches
* Comprehensive statistics and metadata
Parameters
Required Parameters
- input_key (string)
Key of the dataset in context to process.
- xrefs_column (string)
Name of the column containing xrefs data with UniProt references.
Optional Parameters
- output_column (string)
Name of the output column for extracted UniProt IDs. Default: "uniprot_id"
- output_key (string)
Optional output dataset key. If not provided, modifies dataset in-place. Default: None
- handle_multiple (string)
How to handle multiple UniProt IDs: "list", "first", or "expand_rows". Default: "list"
- keep_isoforms (boolean)
Whether to keep isoform suffixes (e.g., P12345-1, P12345-2). Default: false
- drop_na (boolean)
Whether to drop rows with no UniProt IDs found. Default: true
UniProt Extraction Pattern
The action uses the regex pattern: UniProtKB:([A-Z0-9]+(?:-\d+)?)
This pattern matches:
* Standard UniProt format: UniProtKB:P12345
* Isoform variants: UniProtKB:P12345-1
* Newer formats: UniProtKB:A0A123B4C5
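As a minimal sketch of how this pattern behaves, the snippet below applies it to a sample xrefs string with Python's `re` module. The helper name `extract_uniprot_ids` and the order-preserving de-duplication are illustrative assumptions, not the action's actual code.

```python
import re

# Pattern quoted in the documentation above.
UNIPROT_PATTERN = re.compile(r"UniProtKB:([A-Z0-9]+(?:-\d+)?)")


def extract_uniprot_ids(xrefs: str, keep_isoforms: bool = False) -> list:
    """Hypothetical helper: return the UniProt IDs found in an xrefs string."""
    ids = UNIPROT_PATTERN.findall(xrefs)
    if not keep_isoforms:
        # Strip a trailing isoform suffix such as "-1" or "-2".
        ids = [re.sub(r"-\d+$", "", uid) for uid in ids]
    # Drop duplicates while keeping first-seen order (an assumption).
    return list(dict.fromkeys(ids))


xrefs = "NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1"
print(extract_uniprot_ids(xrefs))                      # ['P12345']
print(extract_uniprot_ids(xrefs, keep_isoforms=True))  # ['P12345', 'P12345-1']
```

With `keep_isoforms=False` the canonical and isoform entries collapse to a single ID, which is why the de-duplication step matters in this sketch.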
Handle Multiple Options
- list (default)
Keep all extracted UniProt IDs as a list in the output column.
- first
Take only the first UniProt ID found and store as a single value.
- expand_rows
Create separate rows for each UniProt ID found.
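The three modes can be illustrated with a small standard-library sketch that works on rows represented as dictionaries. The function `apply_handle_multiple` is hypothetical, and the choice to keep a single `None` entry for rows without IDs in `expand_rows` mode is an assumption (with `drop_na` enabled such rows would be removed instead).

```python
def apply_handle_multiple(rows, ids_per_row, mode="list", output_column="uniprot_id"):
    """Hypothetical sketch: attach extracted IDs to rows according to the mode."""
    result = []
    for row, ids in zip(rows, ids_per_row):
        if mode == "list":
            # Keep every extracted ID as a list.
            result.append({**row, output_column: list(ids)})
        elif mode == "first":
            # Keep only the first ID as a single value.
            result.append({**row, output_column: ids[0] if ids else None})
        elif mode == "expand_rows":
            # One output row per ID; rows without IDs keep a single None entry.
            for uid in ids or [None]:
                result.append({**row, output_column: uid})
        else:
            raise ValueError(f"Unknown handle_multiple mode: {mode!r}")
    return result


rows = [{"gene_name": "EXAMPLE1"}, {"gene_name": "EXAMPLE2"}]
ids_per_row = [["P12345"], ["Q67890", "Q67890-1"]]

for mode in ("list", "first", "expand_rows"):
    print(mode, apply_handle_multiple(rows, ids_per_row, mode))
```

Only `expand_rows` changes the number of rows: here two input rows become three output rows.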
Example Usage
Basic UniProt Extraction
- name: extract_uniprot_ids
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "kg2c_proteins"
      xrefs_column: "all_node_curie"
      output_column: "uniprot_id"
      handle_multiple: "list"
      keep_isoforms: false
First Match Only
- name: extract_primary_uniprot
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "spoke_proteins"
      xrefs_column: "xrefs"
      output_column: "primary_uniprot"
      handle_multiple: "first"
      drop_na: true
Expand Rows for Each UniProt ID
- name: expand_uniprot_matches
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "protein_data"
      xrefs_column: "external_refs"
      output_column: "uniprot_id"
      handle_multiple: "expand_rows"
      keep_isoforms: true
Keep Isoform Information
- name: extract_with_isoforms
  action:
    type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
    params:
      input_key: "detailed_proteins"
      xrefs_column: "cross_references"
      output_column: "uniprot_accession"
      handle_multiple: "list"
      keep_isoforms: true
      drop_na: false
Input Data Format
Typical xrefs format:
# Example xrefs content
"NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1"
# Multiple references separated by pipes
"ENSEMBL:ENSG123|UniProtKB:Q67890|RefSeq:NP_001234"
# Complex format with various databases
"MONDO:0001234|HP:5678901|UniProtKB:O11111|UniProtKB:O11111-2|KEGG:hsa:999"
Expected input dataset structure:
[
    {"gene_name": "EXAMPLE1",
     "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
     "description": "Example protein 1"},
    {"gene_name": "EXAMPLE2",
     "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
     "description": "Example protein 2"},
]
Output Formats
List Output (handle_multiple="list"):
[
    {"gene_name": "EXAMPLE1",
     "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
     "uniprot_id": ["P12345"],
     "description": "Example protein 1"},
    {"gene_name": "EXAMPLE2",
     "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
     "uniprot_id": ["Q67890"],  # Isoforms stripped if keep_isoforms=false
     "description": "Example protein 2"},
]
First Match Output (handle_multiple="first"):
[
    {"gene_name": "EXAMPLE1",
     "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
     "uniprot_id": "P12345",
     "description": "Example protein 1"},
]
Expanded Rows Output (handle_multiple="expand_rows"):
[
    {"gene_name": "EXAMPLE1",
     "all_node_curie": "NCBIGene:1234|UniProtKB:P12345|HGNC:5678",
     "uniprot_id": "P12345",
     "description": "Example protein 1"},
    {"gene_name": "EXAMPLE2",
     "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
     "uniprot_id": "Q67890",
     "description": "Example protein 2"},
    {"gene_name": "EXAMPLE2",
     "all_node_curie": "UniProtKB:Q67890|UniProtKB:Q67890-1",
     "uniprot_id": "Q67890",  # If keep_isoforms=false, duplicates removed
     "description": "Example protein 2"},
]
Statistics and Metadata
The action provides detailed statistics in the context:
{
  "statistics": {
    "uniprot_extraction": {
      "total_rows_processed": 1000,
      "rows_with_uniprot_ids": 847,
      "extraction_rate": 0.847
    }
  }
}
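The `extraction_rate` field appears to be the ratio of rows with at least one UniProt ID to total rows processed (847 / 1000 = 0.847 in the example). The sketch below reproduces that arithmetic; the function name `extraction_statistics` and the zero-row fallback are assumptions.

```python
def extraction_statistics(ids_per_row):
    """Hypothetical helper: summarize extraction results per row."""
    total = len(ids_per_row)
    # A row counts as a hit if at least one ID was extracted from it.
    with_ids = sum(1 for ids in ids_per_row if ids)
    return {
        "total_rows_processed": total,
        "rows_with_uniprot_ids": with_ids,
        # Avoid division by zero for empty datasets (an assumption).
        "extraction_rate": with_ids / total if total else 0.0,
    }


# 847 rows with one ID each and 153 rows without any ID.
sample = [["P12345"]] * 847 + [[]] * 153
stats = extraction_statistics(sample)
print(stats)
```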
UniProt ID Validation
Valid Format Patterns:
* Standard: 6-10 alphanumeric characters (e.g., P12345, Q9Y6K1)
* Newer format: Up to 10 characters (e.g., A0A123B4C5)
* Isoforms: Base ID + dash + number (e.g., P12345-1)
Invalid IDs are filtered out:
* Too short: < 6 characters
* Too long: > 10 characters (excluding isoform suffix)
* Invalid characters: Only A-Z and 0-9 allowed
* Malformed isoforms: Invalid suffix patterns
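The rules above can be approximated with a single regular expression. This is a sketch of the stated rules only (6-10 uppercase alphanumeric characters plus an optional numeric isoform suffix); the name `is_valid_uniprot_id` is hypothetical, and the check is looser than UniProt's official accession format.

```python
import re

# Assumed rule set: 6-10 characters from A-Z and 0-9, optional "-<digits>" suffix.
VALID_ID = re.compile(r"[A-Z0-9]{6,10}(?:-\d+)?")


def is_valid_uniprot_id(candidate: str) -> bool:
    """Hypothetical validator mirroring the rules listed above."""
    return VALID_ID.fullmatch(candidate) is not None


candidates = [
    "P12345",        # standard
    "A0A123B4C5",    # newer 10-character format
    "P12345-1",      # isoform
    "P123",          # too short
    "A0A123B4C5D6",  # too long
    "p12345",        # invalid characters (lowercase)
    "P12345-x",      # malformed isoform suffix
]
valid = [c for c in candidates if is_valid_uniprot_id(c)]
print(valid)  # ['P12345', 'A0A123B4C5', 'P12345-1']
```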
Error Handling
- Column not found
Error: Column 'missing_xrefs' not found in dataset
Solution: Verify the xrefs_column name matches exactly.
- Dataset not found
Error: Dataset key 'missing_data' not found in context
Solution: Ensure dataset exists in context from previous actions.
- No UniProt IDs found
Warning: No valid UniProt IDs extracted from dataset
Solution: Check xrefs format and UniProt reference patterns.
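To make these conditions concrete, the following sketch reproduces the three messages with plain Python. Everything here is an assumption for illustration: the context is modeled as a dictionary of dataset keys to lists of row dictionaries, the helper names are hypothetical, and the exception types used by the real action may differ.

```python
import warnings


def get_rows(datasets: dict, input_key: str, xrefs_column: str) -> list:
    """Hypothetical pre-flight check mirroring the documented error messages."""
    if input_key not in datasets:
        raise KeyError(f"Dataset key '{input_key}' not found in context")
    rows = datasets[input_key]
    # Only the first row is inspected, assuming all rows share the same columns.
    if rows and xrefs_column not in rows[0]:
        raise KeyError(f"Column '{xrefs_column}' not found in dataset")
    return rows


def warn_if_nothing_extracted(ids_per_row: list) -> None:
    """Emit the documented warning when no row yielded a UniProt ID."""
    if not any(ids_per_row):
        warnings.warn("No valid UniProt IDs extracted from dataset")


datasets = {"protein_data": [{"external_refs": "HGNC:5678"}]}
rows = get_rows(datasets, "protein_data", "external_refs")
warn_if_nothing_extracted([[] for _ in rows])  # triggers the warning
```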
Best Practices
* Inspect xrefs format before extraction to understand data structure
* Choose appropriate handling for multiple IDs based on downstream needs
* Consider isoform requirements - biological significance vs. analysis complexity
* Validate extraction results by checking statistics and sample outputs
* Use expand_rows carefully - can significantly increase dataset size
* Filter empty results appropriately with drop_na parameter
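One way to inspect the xrefs format before extraction is to count the CURIE prefixes that occur in a sample of values. The sketch below assumes pipe-separated entries, as in the examples in this document; `count_curie_prefixes` is a hypothetical helper and not part of the action.

```python
from collections import Counter


def count_curie_prefixes(xrefs_values, separator="|"):
    """Hypothetical helper: count database prefixes across xrefs strings."""
    counts = Counter()
    for value in xrefs_values:
        for curie in value.split(separator):
            # Split on the first colon only, so "KEGG:hsa:999" yields "KEGG".
            prefix, found, _ = curie.partition(":")
            if found:
                counts[prefix] += 1
    return counts


samples = [
    "NCBIGene:1234|UniProtKB:P12345|HGNC:5678|UniProtKB:P12345-1",
    "ENSEMBL:ENSG123|UniProtKB:Q67890|RefSeq:NP_001234",
    "MONDO:0001234|HP:5678901|UniProtKB:O11111|UniProtKB:O11111-2|KEGG:hsa:999",
]
prefix_counts = count_curie_prefixes(samples)
print(prefix_counts.most_common())
```

If the `UniProtKB` prefix is missing or spelled differently in the counts, the extraction pattern will not match and the xrefs column should be checked first.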
Performance Notes
* Regex extraction is efficient for datasets of 100K+ rows
* Row expansion can significantly increase memory usage
* Validation adds minimal overhead
* Processing time scales linearly with dataset size and xrefs complexity
Common Use Cases
- Knowledge Graph Integration
Extract UniProt IDs from KG2c or SPOKE protein nodes for mapping
- Data Standardization
Convert complex xrefs to standardized UniProt identifiers
- Multi-Database Reconciliation
Extract UniProt IDs as primary keys for cross-database mapping
- Protein Network Analysis
Prepare protein datasets with clean UniProt identifiers
Integration
This action typically follows data loading and precedes mapping operations:
steps:
  # 1. Load protein data with xrefs
  - name: load_kg2c_proteins
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/kg2c_proteins.csv"
        identifier_column: "node_id"
        output_key: "kg2c_raw"
  # 2. Extract UniProt IDs
  - name: extract_uniprot
    action:
      type: PROTEIN_EXTRACT_UNIPROT_FROM_XREFS
      params:
        input_key: "kg2c_raw"
        xrefs_column: "all_node_curie"
        output_column: "uniprot_id"
        handle_multiple: "first"
        keep_isoforms: false
        drop_na: true
  # 3. Continue with protein mapping
  - name: map_to_reference
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        source_dataset_key: "kg2c_raw"
        target_dataset_key: "reference_proteins"
        output_key: "mapped_proteins"
Verification Sources
Last verified: 2025-08-22
This documentation was verified against the following project resources:
/biomapper/src/actions/entities/proteins/annotation/extract_uniprot_from_xrefs.py (actual implementation with regex pattern and multiple handling modes)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class)
/biomapper/src/actions/registry.py (self-registration via @register_action decorator)
/biomapper/CLAUDE.md (2025 standardization requirements for parameter naming)
/biomapper/pyproject.toml (pandas dependency for DataFrame operations)