protein_normalize_accessions
The PROTEIN_NORMALIZE_ACCESSIONS action standardizes UniProt accession identifiers to ensure consistent formatting across protein datasets.
Overview
UniProt accessions can appear in various formats that need normalization:
Different cases (P12345 vs p12345)
Various prefixes (sp|P12345|GENE, tr|Q67890|PROTEIN)
Version suffixes (P12345.1, P12345.2)
Isoform suffixes (P12345-1, P12345-2)
This action normalizes these variations to a consistent format for accurate protein matching.
Parameters
action:
  type: PROTEIN_NORMALIZE_ACCESSIONS
  params:
    input_key: "protein_data"
    id_columns: ["uniprot_id", "accession"]
    strip_isoforms: true
    strip_versions: true
    validate_format: true
    output_key: "normalized_proteins"
    add_normalization_log: true
Required Parameters
- input_key (str)
  Dataset key from context['datasets'] containing protein identifiers
- id_columns (list[str])
  Column names containing UniProt IDs to normalize
- output_key (str)
  Where to store the normalized dataset
Optional Parameters
- strip_isoforms (bool, default=True)
  Remove isoform suffixes (-1, -2, etc.)
- strip_versions (bool, default=True)
  Remove version numbers (.1, .2, etc.)
- validate_format (bool, default=True)
  Validate UniProt ID format and flag invalid entries
- add_normalization_log (bool, default=True)
  Add columns showing what was normalized
Normalization Rules
Case Normalization: All accessions are converted to uppercase.
Prefix Removal: Common prefixes are stripped:
  - sp|P12345|GENE → P12345
  - tr|Q67890|PROTEIN → Q67890
  - UniProt:P12345 → P12345
Version Removal: Version suffixes are removed if enabled:
  - P12345.1 → P12345
  - Q67890.2 → Q67890
Isoform Handling: Isoform suffixes are removed if enabled:
  - P12345-1 → P12345
  - Q67890-2 → Q67890
Format Validation: Identifiers are validated against the UniProt pattern:
  - Standard: [A-Z][0-9][A-Z0-9]{4,8}
  - Examples: P12345, Q123A5, A0A123456
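The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the action's actual implementation: the function name normalize_accession, the regular expressions, and the order in which the steps run are all assumptions made for this example.

```python
import re

# Hypothetical patterns for this sketch; the real action may differ.
_PIPE_PREFIX = re.compile(r"^(?:SP|TR)\|([^|]+)(?:\|.*)?$")  # sp|P12345|GENE
_DB_PREFIX = re.compile(r"^UNIPROT:")                        # UniProt:P12345
_VERSION = re.compile(r"\.\d+$")                             # P12345.1
_ISOFORM = re.compile(r"-\d+$")                              # P12345-1


def normalize_accession(raw, strip_isoforms=True, strip_versions=True):
    """Apply the documented normalization rules to one identifier string."""
    acc = raw.strip().upper()            # case normalization
    match = _PIPE_PREFIX.match(acc)
    if match:
        acc = match.group(1)             # keep only the accession field
    acc = _DB_PREFIX.sub("", acc)        # drop a leading "UniProt:" prefix
    if strip_versions:
        acc = _VERSION.sub("", acc)      # drop a trailing ".N"
    if strip_isoforms:
        acc = _ISOFORM.sub("", acc)      # drop a trailing "-N"
    return acc


for raw in ["sp|P12345|GENE", "UniProt:P12345", "p12345", "P12345.1", "P12345-1"]:
    print(raw, "->", normalize_accession(raw))
```

Versions are stripped before isoforms in this sketch so that an identifier carrying both suffixes (for example P12345-2.1) is reduced in two passes; whether the action uses the same order is not stated here.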
Example Usage
Basic Normalization
steps:
  - name: normalize_uniprot_ids
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "raw_protein_data"
        id_columns: ["protein_id"]
        output_key: "normalized_proteins"
Multiple Columns with Detailed Logging
steps:
  - name: normalize_multiple_columns
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "protein_annotations"
        id_columns: ["primary_accession", "secondary_accession", "related_proteins"]
        strip_isoforms: true
        strip_versions: true
        validate_format: true
        add_normalization_log: true
        output_key: "clean_proteins"
Conservative Normalization
steps:
  - name: conservative_normalize
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "sensitive_protein_data"
        id_columns: ["uniprot_id"]
        strip_isoforms: false   # Keep isoform information
        strip_versions: false   # Keep version information
        validate_format: false  # Don't reject potentially valid IDs
        output_key: "conservatively_normalized"
Output Format
The action outputs a dataset with normalized identifiers and optional logging columns:
original_data + normalized columns + (optional) logging columns
Example output with logging enabled:
protein_name | uniprot_id | uniprot_id_original | uniprot_id_normalized
Insulin      | P01308     | sp|P01308|INS_HUMAN | true
Hemoglobin   | P69905     | P69905.1            | true
Albumin      | P02768     | p02768              | true
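To make the shape of the logging columns concrete, here is a small sketch that builds them for one ID column. It assumes the dataset can be viewed as a list of row dictionaries and that the `<column>_normalized` flag records whether the value was changed; both are assumptions for illustration, and add_log_columns is a hypothetical helper, not part of the action's API.

```python
def add_log_columns(rows, column, normalize):
    """Return new rows with `<column>_original` and `<column>_normalized` added.

    `normalize` is any callable that maps a raw ID string to its cleaned form.
    """
    logged = []
    for row in rows:
        original = row[column]
        cleaned = normalize(original)
        new_row = dict(row)                       # leave the input row untouched
        new_row[column] = cleaned                 # normalized value replaces the raw one
        new_row[f"{column}_original"] = original  # raw value kept for auditing
        # Assumption: the flag means "the value was changed by normalization".
        new_row[f"{column}_normalized"] = cleaned != original
        logged.append(new_row)
    return logged


rows = [{"protein_name": "Albumin", "uniprot_id": "p02768"}]
result = add_log_columns(rows, "uniprot_id", lambda s: s.strip().upper())
print(result[0])
```

Passing the normalizer in as a callable keeps the sketch independent of any particular set of normalization rules.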
Statistics Tracking
The action tracks comprehensive normalization statistics:
{
  "total_processed": 1000,
  "case_normalized": 150,
  "prefixes_stripped": 200,
  "versions_removed": 50,
  "isoforms_handled": 30,
  "validation_failures": 5
}
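Raw counts are easier to compare across datasets when expressed as fractions of the total. The sketch below assumes the statistics are available as a Python dict with the keys shown above; how the action exposes them is not specified here, and normalization_rates is a hypothetical helper.

```python
def normalization_rates(stats):
    """Convert raw counts into fractions of total_processed."""
    total = stats.get("total_processed", 0)
    if total == 0:
        return {}  # nothing processed, so no meaningful rates
    return {
        key: count / total
        for key, count in stats.items()
        if key != "total_processed"
    }


stats = {
    "total_processed": 1000,
    "case_normalized": 150,
    "prefixes_stripped": 200,
    "versions_removed": 50,
    "isoforms_handled": 30,
    "validation_failures": 5,
}
rates = normalization_rates(stats)
for key, rate in rates.items():
    print(f"{key}: {rate:.1%}")
```

A high prefix or case rate usually points at the source of the identifiers (for example FASTA headers), while a rising validation failure rate is a signal to inspect the input before matching.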
Validation Patterns
The action uses strict UniProt format validation:
Standard Format: [A-Z][0-9][A-Z0-9]{4,8}
Must start: Letter followed by digit
Length: 6-10 characters total
Examples: P12345, Q123A5, A0A123456, O95342
Invalid examples that would be flagged:
- PP12345 (starts with two letters)
- 123456 (starts with digit)
- P1234 (too short)
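The documented pattern can be checked directly with Python's re module. This sketch assumes the whole string must match (hence fullmatch); the helper name is_valid_accession is hypothetical.

```python
import re

# The pattern as documented above; fullmatch anchors it to the entire string.
ACCESSION_PATTERN = re.compile(r"[A-Z][0-9][A-Z0-9]{4,8}")


def is_valid_accession(accession):
    """Return True if the (already normalized) accession matches the pattern."""
    return ACCESSION_PATTERN.fullmatch(accession) is not None


candidates = ["P12345", "Q123A5", "A0A123456", "O95342", "PP12345", "123456", "P1234"]
for candidate in candidates:
    print(candidate, is_valid_accession(candidate))
```

Note that UniProt's own published accession pattern is stricter (real accessions are 6 or 10 characters long), so a check based on the pattern above may accept some strings that are not actual UniProt accessions.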
Error Handling
The action handles various error conditions gracefully:
Missing columns: Returns error with specific column names
Empty values: Skips null/empty entries without errors
Invalid formats: Logs warnings but continues processing
Non-string values: Converts to string before processing
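The four behaviours listed above might look like the following in code. This is a sketch under stated assumptions: rows are plain dictionaries, a missing column raises KeyError, and the function and logger names are made up for the example. The action's real error types and messages are not documented here.

```python
import logging
import re

logger = logging.getLogger("protein_normalize_sketch")  # hypothetical logger name
PATTERN = re.compile(r"[A-Z][0-9][A-Z0-9]{4,8}")


def normalize_rows(rows, id_columns):
    """Uppercase IDs in place and return the number of validation failures."""
    # Missing columns: fail early and name every absent column.
    present = set(rows[0]) if rows else set(id_columns)
    missing = [col for col in id_columns if col not in present]
    if missing:
        raise KeyError(f"Missing ID columns: {missing}")

    failures = 0
    for row in rows:
        for col in id_columns:
            value = row.get(col)
            # Empty values: skip null or blank entries without raising.
            if value is None or str(value).strip() == "":
                continue
            # Non-string values: convert to string before processing.
            accession = str(value).strip().upper()
            # Invalid formats: warn, count, and keep going.
            if not PATTERN.fullmatch(accession):
                logger.warning("Invalid accession in %s: %r", col, value)
                failures += 1
            row[col] = accession
    return failures


rows = [{"id": "p12345"}, {"id": None}, {"id": 12345}, {"id": ""}]
print(normalize_rows(rows, ["id"]), rows)
```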
Best Practices
Always validate: Keep validate_format=True to catch data quality issues
Log changes: Use add_normalization_log=True for audit trails
Handle isoforms carefully: Consider whether your analysis needs isoform-specific data
Batch process: Process multiple columns together for efficiency
Review statistics: Check normalization statistics to identify data quality patterns
Integration Examples
With Database Matching
steps:
  - name: normalize_for_database
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "experimental_proteins"
        id_columns: ["protein_accession"]
        strip_isoforms: true   # Database typically stores canonical forms
        strip_versions: true   # Use latest version
        validate_format: true  # Ensure compatibility
        output_key: "db_ready_proteins"
  - name: match_to_uniprot
    action:
      type: MERGE_WITH_UNIPROT_RESOLUTION
      params:
        dataset_key: "db_ready_proteins"
        # ... other parameters
With Cross-Dataset Comparison
steps:
  - name: normalize_dataset_a
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "dataset_a"
        id_columns: ["protein_id"]
        output_key: "normalized_a"
  - name: normalize_dataset_b
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: "dataset_b"
        id_columns: ["uniprot_accession"]
        output_key: "normalized_b"
  - name: calculate_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_a: "normalized_a"
        dataset_b: "normalized_b"
        # ... other parameters
Performance Notes
Memory efficient: Processes data in-place when possible
Regex optimized: Uses compiled patterns for fast validation
Statistics tracking: Minimal overhead for comprehensive metrics
Batch friendly: Handles large datasets efficiently
The action is designed to scale to large protein datasets while preserving data integrity and, through the optional logging columns, providing an audit trail of what was changed.