Parse Composite Identifiers

Overview

The PARSE_COMPOSITE_IDENTIFIERS action handles a common problem in biological data: composite identifiers, where a single field holds several identifiers separated by delimiters such as commas, pipes, or semicolons. The action parses each composite value, expands it into one row per identifier, and normalizes the results, while preserving the original value for traceability.

This is essential for processing real-world biological datasets where identifiers are often stored as comma-separated or pipe-separated values (e.g., “P12345,P67890|Q11128”).

Key Features

Multi-delimiter Support: Handles commas, pipes, semicolons, and custom separators
Normalization: Standardizes identifier formats (removes prefixes, versions)
Expansion: Creates multiple rows from single composite entries
Preservation: Maintains original composite values for audit trails
Validation: Filters invalid or malformed identifiers

Common Use Cases

KG2c xrefs Fields: “UniProtKB:P12345|RefSeq:NP_001234|KEGG:K12345”
SPOKE Identifiers: “P12345,P67890;Q11128”
Literature Mining: “protein1|protein2|protein3”
Cross-references: Multiple database IDs in single field

Parameters

Parameter	Type	Required	Description
`input_key`	string	Yes	Key for the input dataset
`output_key`	string	Yes	Key for the expanded output dataset
`identifier_column`	string	Yes	Column containing composite identifiers
`separator_pattern`	string	No	Regex pattern for separators (default: “[,;|]”)
`prefix_pattern`	string	No	Regex pattern to remove prefixes (default: UniProt patterns)
`normalize_ids`	boolean	No	Enable ID normalization (default: true)
`preserve_original`	boolean	No	Keep original composite value (default: true)
`filter_invalid`	boolean	No	Remove invalid identifiers (default: true)
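
As a rough illustration of the table above, the parameters and their defaults could be mirrored in a small Python dataclass. This is only a sketch: `ParseCompositeParams` is a hypothetical name, not the action's real parameter class, and because the exact default for `prefix_pattern` ("UniProt patterns") is not spelled out here, `None` is used as a stand-in.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseCompositeParams:
    """Hypothetical mirror of the parameter table; not the action's real class."""

    input_key: str
    output_key: str
    identifier_column: str
    separator_pattern: str = r"[,;|]"
    # The documented default is "UniProt patterns"; the exact regex is not
    # given on this page, so None stands in for "use the built-in default".
    prefix_pattern: Optional[str] = None
    normalize_ids: bool = True
    preserve_original: bool = True
    filter_invalid: bool = True

    def __post_init__(self) -> None:
        # Fail early on a malformed regex rather than midway through a run.
        re.compile(self.separator_pattern)
        if self.prefix_pattern is not None:
            re.compile(self.prefix_pattern)


params = ParseCompositeParams(
    input_key="kg2c_proteins",
    output_key="expanded_proteins",
    identifier_column="xrefs",
)
print(params)
```

Compiling the patterns at construction time is a design choice for the sketch: a typo in a regex then surfaces before any data is processed.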

Example Usage

YAML Strategy

steps:
  - name: parse_composites
    action:
      type: PARSE_COMPOSITE_IDENTIFIERS
      params:
        input_key: kg2c_proteins
        output_key: expanded_proteins
        identifier_column: xrefs
        separator_pattern: "[,;|]"
        normalize_ids: true
        preserve_original: true

Python Client

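import asyncio

import pandas as pd

from src.client.client_v2 import BiomapperClient


async def main() -> None:
    client = BiomapperClient(base_url="http://localhost:8000")

    # Load dataset with composite identifiers (illustrative single-row example)
    composite_df = pd.DataFrame({"protein_ids": ["P12345,P67890|Q11128"]})
    context = {"datasets": {"composite_data": composite_df}}

    # `await` is only valid inside a coroutine, so the call lives in main()
    result = await client.run_action(
        action_type="PARSE_COMPOSITE_IDENTIFIERS",
        params={
            "input_key": "composite_data",
            "output_key": "parsed_data",
            "identifier_column": "protein_ids",
            "separator_pattern": "[,|]",
            "normalize_ids": True,
        },
        context=context,
    )
    print(result)


asyncio.run(main())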

Input/Output Example

Input Dataset

gene_name	xrefs
CD4	UniProtKB:P01730|RefSeq:NP_000607
TP53	UniProtKB:P04637,P04637-1|KEGG:hsa:7157
BRCA1	P38398;Q3LRH6|UniProtKB:P38398

Output Dataset

gene_name	extracted_uniprot	_original_xrefs	_row_id
CD4	P01730	UniProtKB:P01730|RefSeq:NP_000607	1
TP53	P04637	UniProtKB:P04637,P04637-1|KEGG:hsa:7157	2a
TP53	P04637	UniProtKB:P04637,P04637-1|KEGG:hsa:7157	2b
BRCA1	P38398	P38398;Q3LRH6|UniProtKB:P38398	3a
BRCA1	Q3LRH6	P38398;Q3LRH6|UniProtKB:P38398	3b
BRCA1	P38398	P38398;Q3LRH6|UniProtKB:P38398	3c

Processing Steps

Parsing Phase
- Split composite strings using separator pattern
- Handle nested separators (commas within pipe-separated groups)
- Remove empty strings and whitespace
Normalization Phase
- Remove database prefixes (UniProtKB:, RefSeq:, etc.)
- Strip version numbers (P12345.2 → P12345)
- Handle isoform suffixes (P12345-1 → P12345 or preserve)
- Apply format validation
Expansion Phase
- Create multiple rows for each parsed identifier
- Preserve all original columns
- Add tracking columns (_original_*, _row_id)
Validation Phase
- Filter malformed identifiers
- Remove duplicates within same composite
- Validate against expected patterns
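
The four phases above can be sketched in plain Python. This is a minimal illustration, not the action's actual implementation. It makes several assumptions: rows are plain dicts, the "expected pattern" is the accession shape published by UniProt, the `_row_id` format (`<row>.<n>`) is invented for the sketch, and `expand_row` is a hypothetical helper. It drops repeated identifiers within a composite, as the validation phase describes, so a repeated accession yields a single row.

```python
import re

SEPARATOR = re.compile(r"[,;|]")
PREFIX = re.compile(r"^(UniProtKB|RefSeq|KEGG):")  # prefixes named on this page
VERSION = re.compile(r"\.\d+$")                    # P12345.2 -> P12345
ISOFORM = re.compile(r"-\d+$")                     # P12345-1 -> P12345
# Accession shape published by UniProt; used here as the "expected pattern".
UNIPROT = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def expand_row(row, column, row_id):
    """Return one output row per valid, distinct identifier in row[column]."""
    original = row[column]
    expanded, seen = [], set()
    # Parsing: split, trim whitespace, drop empty tokens
    tokens = [t.strip() for t in SEPARATOR.split(original) if t.strip()]
    for token in tokens:
        # Normalization: prefix first, then version, then isoform suffix
        ident = ISOFORM.sub("", VERSION.sub("", PREFIX.sub("", token)))
        # Validation: skip malformed IDs and duplicates within this composite
        if not UNIPROT.fullmatch(ident) or ident in seen:
            continue
        seen.add(ident)
        # Expansion: copy every original column and add tracking columns
        new_row = dict(row)
        new_row[column] = ident
        new_row[f"_original_{column}"] = original
        new_row["_row_id"] = f"{row_id}.{len(expanded) + 1}"
        expanded.append(new_row)
    return expanded


rows = [
    {"gene_name": "CD4", "xrefs": "UniProtKB:P01730|RefSeq:NP_000607"},
    {"gene_name": "TP53", "xrefs": "UniProtKB:P04637,P04637-1|KEGG:hsa:7157"},
    {"gene_name": "BRCA1", "xrefs": "P38398;Q3LRH6|UniProtKB:P38398"},
]
output = [
    new_row
    for i, row in enumerate(rows, start=1)
    for new_row in expand_row(row, "xrefs", i)
]
for r in output:
    print(r["gene_name"], r["xrefs"], r["_row_id"])
```

Normalizing before validating matters in this sketch: `UniProtKB:P04637` only matches the accession pattern once its prefix is gone, while `RefSeq:NP_000607` and `KEGG:hsa:7157` still fail after prefix removal and are filtered out.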

Advanced Configuration

Custom Separator Patterns

# For complex separators
separator_pattern: "[,;|\\s]+"  # Includes whitespace

# For specific formats
separator_pattern: "\\s*[,|]\\s*"  # Comma or pipe with optional spaces
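
Before putting a pattern into a strategy, it can help to try it on a sample string. The snippet below is a small sketch using Python's standard `re` module; `split_ids` is a hypothetical helper, and whether the action trims and filters tokens in exactly this way is an assumption. Note that a YAML double-quoted `"\\s"` reaches the regex engine as `\s`.

```python
import re


def split_ids(value, pattern):
    """Split a composite string and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in re.split(pattern, value) if piece.strip()]


# YAML "[,;|\\s]+" reaches the regex engine as [,;|\s]+
print(split_ids("P12345 P67890;Q11128", r"[,;|\s]+"))

# YAML "\\s*[,|]\\s*" reaches the regex engine as \s*[,|]\s*
print(split_ids("P12345, P67890 | Q11128", r"\s*[,|]\s*"))
```

One thing to watch for: if the pattern contains a capturing group, `re.split` also returns the matched separators, so a non-capturing group `(?:...)` is the safer choice when alternation is needed.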

Custom Prefix Removal

# Remove specific database prefixes
prefix_pattern: "^(UniProtKB|RefSeq|KEGG):"

# Strip trailing version numbers (e.g. P12345.2 → P12345)
prefix_pattern: "\\.[0-9]+$"
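
The effect of the two patterns above can be checked with `re.sub`. This is a sketch that assumes the action removes whatever the pattern matches; `strip_prefix_and_version` is a hypothetical helper that applies both patterns in sequence.

```python
import re

PREFIX = re.compile(r"^(UniProtKB|RefSeq|KEGG):")
VERSION = re.compile(r"\.[0-9]+$")


def strip_prefix_and_version(token):
    """Remove one leading database prefix, then a trailing version suffix."""
    return VERSION.sub("", PREFIX.sub("", token))


for token in ["UniProtKB:P12345.2", "RefSeq:NP_001234", "KEGG:hsa:7157", "P12345"]:
    print(token, "->", strip_prefix_and_version(token))
```

Because the prefix pattern is anchored with `^`, only the leading database name is removed: `KEGG:hsa:7157` becomes `hsa:7157`, keeping the organism code.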

Normalization Options

# Option 1: preserve isoforms
normalize_ids: true
isoform_handling: "preserve"  # keep -1, -2 suffixes

# Option 2: strict normalization (use one option, not both)
normalize_ids: true
isoform_handling: "remove"   # P12345-1 → P12345
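
The two `isoform_handling` modes can be pictured with a short function. This is a sketch only: `apply_isoform_handling` is a hypothetical name, and it assumes an isoform suffix is a hyphen followed by digits at the end of the accession.

```python
import re

ISOFORM_SUFFIX = re.compile(r"-\d+$")


def apply_isoform_handling(accession, mode="remove"):
    """Keep or strip a trailing isoform suffix such as -1 or -2."""
    if mode == "preserve":
        return accession
    if mode == "remove":
        return ISOFORM_SUFFIX.sub("", accession)
    raise ValueError(f"unknown isoform_handling mode: {mode!r}")


print(apply_isoform_handling("P12345-1", mode="preserve"))
print(apply_isoform_handling("P12345-1", mode="remove"))
```

With `"remove"`, distinct isoforms of the same protein collapse to one accession, which can introduce duplicates after expansion.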

Performance Considerations

Processing Speed

Typical performance for different dataset sizes:
- Small (1K rows): <1 second
- Medium (10K rows): 2-5 seconds
- Large (100K rows): 20-60 seconds
- Very Large (1M+ rows): 2-10 minutes

Performance factors:
- Number of composite identifiers per row
- Complexity of separator patterns
- Normalization processing enabled
- Output dataset size after expansion

Memory Usage

Memory usage increases with expansion ratio:
- Input: 10K rows with average 2 IDs per composite
- Output: ~20K rows (2x expansion)
- Memory: ~3x input size during processing
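
To get a feel for the expansion ratio before running the action, the ratio can be estimated on a sample of the identifier column. The helper below is a hypothetical sketch (`expansion_ratio` is not part of the action) that counts non-empty tokens per value with the default separator pattern.

```python
import re


def expansion_ratio(values, separator_pattern=r"[,;|]"):
    """Average number of non-empty identifiers per composite value."""
    separator = re.compile(separator_pattern)
    rows = 0
    identifiers = 0
    for value in values:
        rows += 1
        identifiers += sum(1 for t in separator.split(value) if t.strip())
    return identifiers / rows if rows else 0.0


sample = ["P12345,P67890", "Q11128|P04637|P38398", "P01730"]
ratio = expansion_ratio(sample)
print(f"expansion ratio: {ratio:.1f}x")
print(f"estimated output rows for 10,000 inputs: {round(10_000 * ratio):,}")
```

This counts tokens before validation and deduplication, so it is an upper bound on the number of output rows rather than an exact prediction.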

Optimization Tips

Batch Processing: Process large datasets in chunks
Pattern Optimization: Use simple patterns when possible
Selective Normalization: Disable if not needed
Memory Management: Monitor expansion ratios
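
The batch-processing and expansion-monitoring tips can be combined in a small generator. This is a sketch under assumptions: `iter_chunks` and `expand_in_chunks` are hypothetical helpers, the chunk size and `max_ratio` threshold are illustrative values, and the real action may batch differently.

```python
import re
from itertools import islice

SEPARATOR = re.compile(r"[,;|]")


def iter_chunks(values, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def expand_in_chunks(values, size=10_000, max_ratio=5.0):
    """Expand composites chunk by chunk, failing fast on runaway expansion."""
    for chunk in iter_chunks(values, size):
        expanded = [
            t.strip() for value in chunk for t in SEPARATOR.split(value) if t.strip()
        ]
        ratio = len(expanded) / len(chunk)
        if ratio > max_ratio:
            raise ValueError(f"expansion ratio {ratio:.1f} exceeds {max_ratio}")
        yield expanded


for batch in expand_in_chunks(["P12345,P67890", "Q11128", "P01730|P04637"], size=2):
    print(batch)
```

Because each chunk is expanded and yielded independently, only one chunk's output is held in memory at a time.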

Real-World Examples

KG2c Protein Processing

# Process KG2c xrefs for UniProt extraction
- name: parse_kg2c_xrefs
  action:
    type: PARSE_COMPOSITE_IDENTIFIERS
    params:
      input_key: kg2c_proteins
      output_key: expanded_proteins
      identifier_column: xrefs
      separator_pattern: "[|,]"
      prefix_pattern: "^UniProtKB:"
      normalize_ids: true

SPOKE Identifier Expansion

# Handle SPOKE multi-identifier format
- name: expand_spoke_ids
  action:
    type: PARSE_COMPOSITE_IDENTIFIERS
    params:
      input_key: spoke_data
      output_key: individual_proteins
      identifier_column: protein_identifiers
      separator_pattern: "[,;]"
      preserve_original: true

Coverage Impact Analysis

Typical coverage improvements:
- Before parsing: 1,165 composite entries
- After parsing: 2,500+ individual identifiers
- Unique identifiers: 1,800+ after deduplication
- Coverage gain: 15-25% in subsequent matching stages

Best Practices

Pattern Testing: Validate separator patterns on sample data
Original Preservation: Always preserve original composite values
Row Tracking: Use _row_id for linking back to original entries
Validation: Check expansion ratios for reasonableness
Deduplication: Handle duplicates in downstream processing
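
As one way to apply the row-tracking and deduplication practices, expanded rows can be grouped back to their source entry. The sketch below assumes `_row_id` values begin with the source row number followed by an optional letter suffix (such as `2a`, `2b`); `group_by_source_row` and `dedupe` are hypothetical helpers for downstream code.

```python
import re
from collections import defaultdict


def group_by_source_row(expanded_rows, id_column):
    """Map each source row number to the identifiers expanded from it."""
    groups = defaultdict(list)
    for row in expanded_rows:
        # "2a" and "2b" both map back to source row "2"
        source = re.match(r"\d+", row["_row_id"]).group()
        groups[source].append(row[id_column])
    return dict(groups)


def dedupe(identifiers):
    """Remove repeats while keeping first-seen order."""
    return list(dict.fromkeys(identifiers))


expanded = [
    {"extracted_uniprot": "P01730", "_row_id": "1"},
    {"extracted_uniprot": "P38398", "_row_id": "3a"},
    {"extracted_uniprot": "Q3LRH6", "_row_id": "3b"},
    {"extracted_uniprot": "P38398", "_row_id": "3c"},
]
grouped = group_by_source_row(expanded, "extracted_uniprot")
print({source: dedupe(ids) for source, ids in grouped.items()})
```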

Common Issues and Solutions

Issue	Solution
Unexpected expansion ratio	Check separator pattern specificity
Missing identifiers after parsing	Verify prefix_pattern doesn’t over-filter
Performance issues	Process in smaller batches
Memory errors	Reduce chunk size or increase memory limits
Duplicate identifiers	Enable deduplication in downstream actions

Integration with Matching Actions

The parsed identifiers typically feed into matching actions:

steps:
  - name: parse_composites
    action:
      type: PARSE_COMPOSITE_IDENTIFIERS
      params:
        input_key: raw_data
        output_key: expanded_data
        identifier_column: xrefs  # required parameter; column name is illustrative

  - name: direct_matching
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: expanded_data  # Uses parsed output
        output_key: normalized_data

  - name: merge_results
    action:
      type: MERGE_DATASETS
      params:
        input_keys: [normalized_data, reference_data]
        output_key: matched_results