# UniProt Historical ID Resolution This document explains how BioMapper handles historical, secondary, and demerged UniProt identifiers through the `MERGE_WITH_UNIPROT_RESOLUTION` action and related protein normalization actions. ## Background UniProt protein identifiers can change over time as protein entries are: 1. **Merged**: Multiple entries are merged into a single entry, causing some accessions to become secondary IDs 2. **Split/Demerged**: One entry is split into multiple entries, where the original ID becomes a secondary ID to multiple primary IDs 3. **Obsoleted**: Entries are removed from the database when they are no longer considered valid proteins 4. **Updated**: Primary accessions can become secondary accessions when entries are reorganized When mapping protein identifiers from one system to another, these historical changes must be handled to ensure accurate and complete mapping. ## Types of UniProt IDs The Biomapper framework handles these types of UniProt identifiers: 1. **Primary Accessions**: Current, active UniProt identifiers (e.g., P01308 for human insulin) 2. **Secondary Accessions**: Former primary IDs that now point to a current primary ID (e.g., Q99895 → P01308) 3. **Demerged Accessions**: IDs that now point to multiple primary IDs after being split (e.g., P0CG05 → P0DOY2, P0DOY3) 4. **Obsolete Accessions**: IDs that no longer exist in UniProt ## Implementation in BioMapper Actions BioMapper provides several actions for UniProt ID handling: ### PROTEIN_NORMALIZE_ACCESSIONS Standardizes UniProt accessions by removing isoform suffixes and validating format: ```python - name: normalize_proteins action: type: PROTEIN_NORMALIZE_ACCESSIONS params: input_key: "raw_proteins" output_key: "normalized_proteins" remove_isoforms: true # P01308-1 → P01308 validate_format: true # Validates UniProt regex pattern ``` ### UniProt Historical Resolution Client BioMapper provides a UniProtHistoricalResolverClient for resolving historical and secondary UniProt IDs via API calls. This client can be integrated into custom actions for historical ID resolution. ### How It Works 1. The client submits queries to the UniProt REST API to search for both primary and secondary accessions 2. It searches in both the primary accession and secondary accession fields 3. For each match, it processes the response to determine the correct resolution: - If the ID is found as a primary accession, it returns it unchanged - If the ID is found as a secondary accession, it returns the matching primary accession(s) - If the ID appears as a secondary accession in multiple entries, it returns all primary accessions (demerged case) - If no match is found, it marks the ID as obsolete 4. The client includes rich metadata in the return value to indicate the resolution type ### Resolution Results in Context The actions store resolution results in the shared execution context: ```python context["datasets"]["merged_dataset"] = [ { "source_id": "P01308", "target_id": "P01308", "match_type": "primary", "match_confidence": 1.0 }, { "source_id": "Q99895", "target_id": "P01308", "match_type": "secondary", "match_confidence": 0.9 }, { "source_id": "P0CG05", "target_id": "P0DOY2,P0DOY3", "match_type": "demerged", "match_confidence": 0.8 } ] ``` ## YAML Strategy Configuration A complete protein harmonization strategy with UniProt resolution: ```yaml name: PROTEIN_HARMONIZATION description: Harmonize protein datasets with historical ID resolution parameters: source_file: "${SOURCE_FILE}" target_file: "${TARGET_FILE}" output_dir: "${OUTPUT_DIR:-/tmp/results}" steps: # Step 1: Load source proteins - name: load_source action: type: LOAD_DATASET_IDENTIFIERS params: file_path: "${parameters.source_file}" identifier_column: "uniprot_id" output_key: "source_proteins_raw" # Step 2: Normalize source proteins - name: normalize_source action: type: PROTEIN_NORMALIZE_ACCESSIONS params: input_key: "source_proteins_raw" output_key: "source_proteins" remove_isoforms: true validate_format: true # Step 3: Load target proteins - name: load_target action: type: LOAD_DATASET_IDENTIFIERS params: file_path: "${parameters.target_file}" identifier_column: "protein_accession" output_key: "target_proteins" # Step 4: Merge datasets (Note: Historical UniProt resolution would be implemented as a custom action) - name: merge_datasets action: type: MERGE_DATASETS params: dataset1_key: "source_proteins" dataset2_key: "target_proteins" merge_column1: "identifier" merge_column2: "identifier" output_key: "merged_proteins" # Step 5: Calculate overlap statistics - name: analyze_overlap action: type: CALCULATE_SET_OVERLAP params: merged_dataset_key: "merged_proteins" source_name: "Source" target_name: "Target" output_key: "overlap_stats" output_directory: "${parameters.output_dir}" ``` ## Python Client Usage Execute the strategy using BiomapperClient: ```python from src.client.client_v2 import BiomapperClient client = BiomapperClient(base_url="http://localhost:8000") # Execute protein harmonization with UniProt resolution result = client.run( strategy_name="PROTEIN_HARMONIZATION", parameters={ "source_file": "/data/ukbb_proteins.tsv", "target_file": "/data/hpa_proteins.csv", "output_dir": "/results/protein_harmonization" } ) print(f"Job completed: {result['status']}") print(f"Merged proteins: {result['results']['merged_proteins_count']}") print(f"Resolution stats: {result['results']['resolution_stats']}") ``` ## Testing Historical Resolution Test the MERGE_WITH_UNIPROT_RESOLUTION action: ```python # tests/unit/core/strategy_actions/test_merge_with_uniprot_resolution.py class TestMergeWithUniprotResolution: @pytest.mark.asyncio async def test_secondary_id_resolution(self, mock_context): """Test resolution of secondary UniProt IDs.""" # Setup test data with known secondary IDs mock_context["datasets"]["source"] = [ {"identifier": "Q99895"}, # Secondary ID {"identifier": "P01308"} # Primary ID ] mock_context["datasets"]["target"] = [ {"identifier": "P01308"} # Primary ID ] action = MergeWithUniprotResolutionAction() params = MergeWithUniprotResolutionParams( source_dataset_key="source", target_dataset_key="target", source_id_column="identifier", target_id_column="identifier", output_key="merged", enable_api_resolution=True ) result = await action.execute_typed(params, mock_context) assert result.success merged = mock_context["datasets"]["merged"] # Both Q99895 and P01308 should map to P01308 assert len([m for m in merged if m["target_id"] == "P01308"]) == 2 ``` ## Performance Considerations ### Batch Processing The MERGE_WITH_UNIPROT_RESOLUTION action automatically batches API requests: - Default batch size: 250 IDs - Configurable via `batch_size` parameter - Automatic retry on API failures ### Caching - Results cached in execution context - SQLite persistence for job recovery - Consider using CHUNK_PROCESSOR for very large datasets ### API Rate Limits - UniProt API has rate limits - Action implements exponential backoff - Large datasets may take time to process ## Best Practices 1. **Always normalize first**: Use PROTEIN_NORMALIZE_ACCESSIONS before merging 2. **Set appropriate confidence thresholds**: Use 0.5-0.7 for historical matches 3. **Monitor API resolution**: Check statistics for resolution success rates 4. **Use chunking for large datasets**: Combine with CHUNK_PROCESSOR action 5. **Validate results**: Use CALCULATE_SET_OVERLAP to verify mappings ## Related Actions - `PROTEIN_EXTRACT_UNIPROT_FROM_XREFS`: Extract UniProt IDs from compound fields - `PROTEIN_MULTI_BRIDGE`: Multi-source protein identifier resolution - `CALCULATE_MAPPING_QUALITY`: Assess quality of UniProt mappings - `GENERATE_ENHANCEMENT_REPORT`: Detailed report on resolution statistics --- --- ## Verification Sources *Last verified: 2025-01-18* This documentation was verified against the following project resources: - `/home/ubuntu/biomapper/src/integrations/clients/uniprot_historical_resolver_client.py` (UniProt historical resolution client) - `/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/` (Protein normalization actions) - `/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/normalize_accessions.py` (PROTEIN_NORMALIZE_ACCESSIONS action) - `/home/ubuntu/biomapper/src/actions/entities/proteins/annotation/extract_uniprot_from_xrefs.py` (UniProt extraction action) - `/home/ubuntu/biomapper/src/client/client_v2.py` (BiomapperClient with strategy execution) - `/home/ubuntu/biomapper/src/actions/merge_datasets.py` (Dataset merging action) - `/home/ubuntu/biomapper/CLAUDE.md` (Protein action documentation and historical ID handling patterns)