Parse Composite Identifiers
Overview
The PARSE_COMPOSITE_IDENTIFIERS action handles the common biological data challenge of composite identifiers - single fields containing multiple identifiers separated by various delimiters. This action safely parses, expands, and normalizes composite identifiers while preserving original values for traceability.
This is essential for processing real-world biological datasets where identifiers are often stored as comma-separated or pipe-separated values (e.g., “P12345,P67890|Q11128”).
Key Features
Multi-delimiter Support: Handles commas, pipes, semicolons, and custom separators
Normalization: Standardizes identifier formats (removes prefixes, versions)
Expansion: Creates multiple rows from single composite entries
Preservation: Maintains original composite values for audit trails
Validation: Filters invalid or malformed identifiers
Common Use Cases
KG2c xrefs Fields: “UniProtKB:P12345|RefSeq:NP_001234|KEGG:K12345”
SPOKE Identifiers: “P12345,P67890;Q11128”
Literature Mining: “protein1|protein2|protein3”
Cross-references: Multiple database IDs in single field
Parameters
Parameter |
Type |
Required |
Description |
|---|---|---|---|
|
string |
Yes |
Key for the input dataset |
|
string |
Yes |
Key for the expanded output dataset |
|
string |
Yes |
Column containing composite identifiers |
|
string |
No |
Regex pattern for separators (default: “[,;|]”) |
|
string |
No |
Regex pattern to remove prefixes (default: UniProt patterns) |
|
boolean |
No |
Enable ID normalization (default: true) |
|
boolean |
No |
Keep original composite value (default: true) |
|
boolean |
No |
Remove invalid identifiers (default: true) |
Example Usage
YAML Strategy
steps:
- name: parse_composites
action:
type: PARSE_COMPOSITE_IDENTIFIERS
params:
input_key: kg2c_proteins
output_key: expanded_proteins
identifier_column: xrefs
separator_pattern: "[,;|]"
normalize_ids: true
preserve_original: true
Python Client
from src.client.client_v2 import BiomapperClient
client = BiomapperClient(base_url="http://localhost:8000")
# Load dataset with composite identifiers
context = {"datasets": {"composite_data": composite_df}}
result = await client.run_action(
action_type="PARSE_COMPOSITE_IDENTIFIERS",
params={
"input_key": "composite_data",
"output_key": "parsed_data",
"identifier_column": "protein_ids",
"separator_pattern": "[,|]",
"normalize_ids": True
},
context=context
)
Input/Output Example
Input Dataset
gene_name |
xrefs |
|---|---|
CD4 |
UniProtKB:P01730|RefSeq:NP_000607 |
TP53 |
UniProtKB:P04637,P04637-1|KEGG:hsa:7157 |
BRCA1 |
P38398;Q3LRH6|UniProtKB:P38398 |
Output Dataset
gene_name |
extracted_uniprot |
_original_xrefs |
_row_id |
|---|---|---|---|
CD4 |
P01730 |
UniProtKB:P01730|RefSeq:NP_000607 |
1 |
TP53 |
P04637 |
UniProtKB:P04637,P04637-1|KEGG:hsa:7157 |
2a |
TP53 |
P04637 |
UniProtKB:P04637,P04637-1|KEGG:hsa:7157 |
2b |
BRCA1 |
P38398 |
P38398;Q3LRH6|UniProtKB:P38398 |
3a |
BRCA1 |
Q3LRH6 |
P38398;Q3LRH6|UniProtKB:P38398 |
3b |
BRCA1 |
P38398 |
P38398;Q3LRH6|UniProtKB:P38398 |
3c |
Processing Steps
Parsing Phase
Split composite strings using separator pattern
Handle nested separators (commas within pipe-separated groups)
Remove empty strings and whitespace
Normalization Phase
Remove database prefixes (UniProtKB:, RefSeq:, etc.)
Strip version numbers (P12345.2 → P12345)
Handle isoform suffixes (P12345-1 → P12345 or preserve)
Apply format validation
Expansion Phase
Create multiple rows for each parsed identifier
Preserve all original columns
Add tracking columns (_original_*, _row_id)
Validation Phase
Filter malformed identifiers
Remove duplicates within same composite
Validate against expected patterns
Advanced Configuration
Custom Separator Patterns
# For complex separators
separator_pattern: "[,;|\\s]+" # Includes whitespace
# For specific formats
separator_pattern: "\\s*[,|]\\s*" # Comma or pipe with optional spaces
Custom Prefix Removal
# Remove specific database prefixes
prefix_pattern: "^(UniProtKB|RefSeq|KEGG):"
# Remove version numbers
prefix_pattern: "\\.[0-9]+$"
Normalization Options
# Preserve isoforms
normalize_ids: true
isoform_handling: "preserve" # keep -1, -2 suffixes
# Strict normalization
normalize_ids: true
isoform_handling: "remove" # P12345-1 → P12345
Performance Considerations
Processing Speed
Typical performance for different dataset sizes: - Small (1K rows): <1 second - Medium (10K rows): 2-5 seconds - Large (100K rows): 20-60 seconds - Very Large (1M+ rows): 2-10 minutes
Performance factors: - Number of composite identifiers per row - Complexity of separator patterns - Normalization processing enabled - Output dataset size after expansion
Memory Usage
Memory usage increases with expansion ratio: - Input: 10K rows with average 2 IDs per composite - Output: ~20K rows (2x expansion) - Memory: ~3x input size during processing
Optimization Tips
Batch Processing: Process large datasets in chunks
Pattern Optimization: Use simple patterns when possible
Selective Normalization: Disable if not needed
Memory Management: Monitor expansion ratios
Real-World Examples
KG2c Protein Processing
# Process KG2c xrefs for UniProt extraction
- name: parse_kg2c_xrefs
action:
type: PARSE_COMPOSITE_IDENTIFIERS
params:
input_key: kg2c_proteins
output_key: expanded_proteins
identifier_column: xrefs
separator_pattern: "[|,]"
prefix_pattern: "^UniProtKB:"
normalize_ids: true
SPOKE Identifier Expansion
# Handle SPOKE multi-identifier format
- name: expand_spoke_ids
action:
type: PARSE_COMPOSITE_IDENTIFIERS
params:
input_key: spoke_data
output_key: individual_proteins
identifier_column: protein_identifiers
separator_pattern: "[,;]"
preserve_original: true
Coverage Impact Analysis
Typical coverage improvements: - Before parsing: 1,165 composite entries - After parsing: 2,500+ individual identifiers - Unique identifiers: 1,800+ after deduplication - Coverage gain: 15-25% in subsequent matching stages
Best Practices
Pattern Testing: Validate separator patterns on sample data
Original Preservation: Always preserve original composite values
Row Tracking: Use _row_id for linking back to original entries
Validation: Check expansion ratios for reasonableness
Deduplication: Handle duplicates in downstream processing
Common Issues and Solutions
Issue |
Solution |
|---|---|
Unexpected expansion ratio |
Check separator pattern specificity |
Missing identifiers after parsing |
Verify prefix_pattern doesn’t over-filter |
Performance issues |
Process in smaller batches |
Memory errors |
Reduce chunk size or increase memory limits |
Duplicate identifiers |
Enable deduplication in downstream actions |
Integration with Matching Actions
The parsed identifiers typically feed into matching actions:
steps:
- name: parse_composites
action:
type: PARSE_COMPOSITE_IDENTIFIERS
params:
input_key: raw_data
output_key: expanded_data
- name: direct_matching
action:
type: PROTEIN_NORMALIZE_ACCESSIONS
params:
input_key: expanded_data # Uses parsed output
output_key: normalized_data
- name: merge_results
action:
type: MERGE_DATASETS
params:
input_keys: [normalized_data, reference_data]
output_key: matched_results
See Also
PROTEIN_EXTRACT_UNIPROT_FROM_XREFS - UniProt-specific extraction
protein_normalize_accessions - Identifier normalization
../workflows/protein_mapping - Complete protein workflows
Real-World Case Studies - Production use cases