UniProt Historical ID Resolution
This document explains how BioMapper handles historical, secondary, and demerged UniProt identifiers through the MERGE_WITH_UNIPROT_RESOLUTION action and related protein normalization actions.
Background
UniProt protein identifiers can change over time as protein entries are:
Merged: Multiple entries are merged into a single entry, causing some accessions to become secondary IDs
Split/Demerged: One entry is split into multiple entries, where the original ID becomes a secondary ID to multiple primary IDs
Obsoleted: Entries are removed from the database when they are no longer considered valid proteins
Updated: Primary accessions can become secondary accessions when entries are reorganized
When mapping protein identifiers from one system to another, these historical changes must be handled to ensure accurate and complete mapping.
Types of UniProt IDs
The Biomapper framework handles these types of UniProt identifiers:
Primary Accessions: Current, active UniProt identifiers (e.g., P01308 for human insulin)
Secondary Accessions: Former primary IDs that now point to a current primary ID (e.g., Q99895 → P01308)
Demerged Accessions: IDs that now point to multiple primary IDs after being split (e.g., P0CG05 → P0DOY2, P0DOY3)
Obsolete Accessions: IDs that no longer exist in UniProt
Implementation in BioMapper Actions
BioMapper provides several actions for UniProt ID handling:
PROTEIN_NORMALIZE_ACCESSIONS
Standardizes UniProt accessions by removing isoform suffixes and validating format:
- name: normalize_proteins
action:
type: PROTEIN_NORMALIZE_ACCESSIONS
params:
input_key: "raw_proteins"
output_key: "normalized_proteins"
remove_isoforms: true # P01308-1 → P01308
validate_format: true # Validates UniProt regex pattern
UniProt Historical Resolution Client
BioMapper provides a UniProtHistoricalResolverClient for resolving historical and secondary UniProt IDs via API calls. This client can be integrated into custom actions for historical ID resolution.
How It Works
The client submits queries to the UniProt REST API to search for both primary and secondary accessions
It searches in both the primary accession and secondary accession fields
For each match, it processes the response to determine the correct resolution:
If the ID is found as a primary accession, it returns it unchanged
If the ID is found as a secondary accession, it returns the matching primary accession(s)
If the ID appears as a secondary accession in multiple entries, it returns all primary accessions (demerged case)
If no match is found, it marks the ID as obsolete
The client includes rich metadata in the return value to indicate the resolution type
Resolution Results in Context
The actions store resolution results in the shared execution context:
context["datasets"]["merged_dataset"] = [
{
"source_id": "P01308",
"target_id": "P01308",
"match_type": "primary",
"match_confidence": 1.0
},
{
"source_id": "Q99895",
"target_id": "P01308",
"match_type": "secondary",
"match_confidence": 0.9
},
{
"source_id": "P0CG05",
"target_id": "P0DOY2,P0DOY3",
"match_type": "demerged",
"match_confidence": 0.8
}
]
YAML Strategy Configuration
A complete protein harmonization strategy with UniProt resolution:
name: PROTEIN_HARMONIZATION
description: Harmonize protein datasets with historical ID resolution
parameters:
source_file: "${SOURCE_FILE}"
target_file: "${TARGET_FILE}"
output_dir: "${OUTPUT_DIR:-/tmp/results}"
steps:
# Step 1: Load source proteins
- name: load_source
action:
type: LOAD_DATASET_IDENTIFIERS
params:
file_path: "${parameters.source_file}"
identifier_column: "uniprot_id"
output_key: "source_proteins_raw"
# Step 2: Normalize source proteins
- name: normalize_source
action:
type: PROTEIN_NORMALIZE_ACCESSIONS
params:
input_key: "source_proteins_raw"
output_key: "source_proteins"
remove_isoforms: true
validate_format: true
# Step 3: Load target proteins
- name: load_target
action:
type: LOAD_DATASET_IDENTIFIERS
params:
file_path: "${parameters.target_file}"
identifier_column: "protein_accession"
output_key: "target_proteins"
# Step 4: Merge datasets (Note: Historical UniProt resolution would be implemented as a custom action)
- name: merge_datasets
action:
type: MERGE_DATASETS
params:
dataset1_key: "source_proteins"
dataset2_key: "target_proteins"
merge_column1: "identifier"
merge_column2: "identifier"
output_key: "merged_proteins"
# Step 5: Calculate overlap statistics
- name: analyze_overlap
action:
type: CALCULATE_SET_OVERLAP
params:
merged_dataset_key: "merged_proteins"
source_name: "Source"
target_name: "Target"
output_key: "overlap_stats"
output_directory: "${parameters.output_dir}"
Python Client Usage
Execute the strategy using BiomapperClient:
from src.client.client_v2 import BiomapperClient
client = BiomapperClient(base_url="http://localhost:8000")
# Execute protein harmonization with UniProt resolution
result = client.run(
strategy_name="PROTEIN_HARMONIZATION",
parameters={
"source_file": "/data/ukbb_proteins.tsv",
"target_file": "/data/hpa_proteins.csv",
"output_dir": "/results/protein_harmonization"
}
)
print(f"Job completed: {result['status']}")
print(f"Merged proteins: {result['results']['merged_proteins_count']}")
print(f"Resolution stats: {result['results']['resolution_stats']}")
Testing Historical Resolution
Test the MERGE_WITH_UNIPROT_RESOLUTION action:
# tests/unit/core/strategy_actions/test_merge_with_uniprot_resolution.py
class TestMergeWithUniprotResolution:
@pytest.mark.asyncio
async def test_secondary_id_resolution(self, mock_context):
"""Test resolution of secondary UniProt IDs."""
# Setup test data with known secondary IDs
mock_context["datasets"]["source"] = [
{"identifier": "Q99895"}, # Secondary ID
{"identifier": "P01308"} # Primary ID
]
mock_context["datasets"]["target"] = [
{"identifier": "P01308"} # Primary ID
]
action = MergeWithUniprotResolutionAction()
params = MergeWithUniprotResolutionParams(
source_dataset_key="source",
target_dataset_key="target",
source_id_column="identifier",
target_id_column="identifier",
output_key="merged",
enable_api_resolution=True
)
result = await action.execute_typed(params, mock_context)
assert result.success
merged = mock_context["datasets"]["merged"]
# Both Q99895 and P01308 should map to P01308
assert len([m for m in merged if m["target_id"] == "P01308"]) == 2
Performance Considerations
Batch Processing
The MERGE_WITH_UNIPROT_RESOLUTION action automatically batches API requests:
Default batch size: 250 IDs
Configurable via
batch_sizeparameterAutomatic retry on API failures
Caching
Results cached in execution context
SQLite persistence for job recovery
Consider using CHUNK_PROCESSOR for very large datasets
API Rate Limits
UniProt API has rate limits
Action implements exponential backoff
Large datasets may take time to process
Best Practices
Always normalize first: Use PROTEIN_NORMALIZE_ACCESSIONS before merging
Set appropriate confidence thresholds: Use 0.5-0.7 for historical matches
Monitor API resolution: Check statistics for resolution success rates
Use chunking for large datasets: Combine with CHUNK_PROCESSOR action
Validate results: Use CALCULATE_SET_OVERLAP to verify mappings
Verification Sources
Last verified: 2025-01-18
This documentation was verified against the following project resources:
/home/ubuntu/biomapper/src/integrations/clients/uniprot_historical_resolver_client.py(UniProt historical resolution client)/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/(Protein normalization actions)/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/normalize_accessions.py(PROTEIN_NORMALIZE_ACCESSIONS action)/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/extract_uniprot_from_xrefs.py(UniProt extraction action)/home/ubuntu/biomapper/src/client/client_v2.py(BiomapperClient with strategy execution)/home/ubuntu/biomapper/src/actions/merge_datasets.py(Dataset merging action)/home/ubuntu/biomapper/CLAUDE.md(Protein action documentation and historical ID handling patterns)