Parse Composite Identifiers

Overview

The PARSE_COMPOSITE_IDENTIFIERS action handles the common biological data challenge of composite identifiers - single fields containing multiple identifiers separated by various delimiters. This action safely parses, expands, and normalizes composite identifiers while preserving original values for traceability.

This is essential for processing real-world biological datasets where identifiers are often stored as comma-separated or pipe-separated values (e.g., “P12345,P67890|Q11128”).

Key Features

  • Multi-delimiter Support: Handles commas, pipes, semicolons, and custom separators

  • Normalization: Standardizes identifier formats (removes prefixes, versions)

  • Expansion: Creates multiple rows from single composite entries

  • Preservation: Maintains original composite values for audit trails

  • Validation: Filters invalid or malformed identifiers

Common Use Cases

  1. KG2c xrefs Fields: “UniProtKB:P12345|RefSeq:NP_001234|KEGG:K12345”

  2. SPOKE Identifiers: “P12345,P67890;Q11128”

  3. Literature Mining: “protein1|protein2|protein3”

  4. Cross-references: Multiple database IDs in single field

Parameters

Parameter

Type

Required

Description

input_key

string

Yes

Key for the input dataset

output_key

string

Yes

Key for the expanded output dataset

identifier_column

string

Yes

Column containing composite identifiers

separator_pattern

string

No

Regex pattern for separators (default: “[,;|]”)

prefix_pattern

string

No

Regex pattern to remove prefixes (default: UniProt patterns)

normalize_ids

boolean

No

Enable ID normalization (default: true)

preserve_original

boolean

No

Keep original composite value (default: true)

filter_invalid

boolean

No

Remove invalid identifiers (default: true)

Example Usage

YAML Strategy

steps:
  - name: parse_composites
    action:
      type: PARSE_COMPOSITE_IDENTIFIERS
      params:
        input_key: kg2c_proteins
        output_key: expanded_proteins
        identifier_column: xrefs
        separator_pattern: "[,;|]"
        normalize_ids: true
        preserve_original: true

Python Client

from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

# Load dataset with composite identifiers
context = {"datasets": {"composite_data": composite_df}}

result = await client.run_action(
    action_type="PARSE_COMPOSITE_IDENTIFIERS",
    params={
        "input_key": "composite_data",
        "output_key": "parsed_data",
        "identifier_column": "protein_ids",
        "separator_pattern": "[,|]",
        "normalize_ids": True
    },
    context=context
)

Input/Output Example

Input Dataset

gene_name

xrefs

CD4

UniProtKB:P01730|RefSeq:NP_000607

TP53

UniProtKB:P04637,P04637-1|KEGG:hsa:7157

BRCA1

P38398;Q3LRH6|UniProtKB:P38398

Output Dataset

gene_name

extracted_uniprot

_original_xrefs

_row_id

CD4

P01730

UniProtKB:P01730|RefSeq:NP_000607

1

TP53

P04637

UniProtKB:P04637,P04637-1|KEGG:hsa:7157

2a

TP53

P04637

UniProtKB:P04637,P04637-1|KEGG:hsa:7157

2b

BRCA1

P38398

P38398;Q3LRH6|UniProtKB:P38398

3a

BRCA1

Q3LRH6

P38398;Q3LRH6|UniProtKB:P38398

3b

BRCA1

P38398

P38398;Q3LRH6|UniProtKB:P38398

3c

Processing Steps

  1. Parsing Phase

    • Split composite strings using separator pattern

    • Handle nested separators (commas within pipe-separated groups)

    • Remove empty strings and whitespace

  2. Normalization Phase

    • Remove database prefixes (UniProtKB:, RefSeq:, etc.)

    • Strip version numbers (P12345.2 → P12345)

    • Handle isoform suffixes (P12345-1 → P12345 or preserve)

    • Apply format validation

  3. Expansion Phase

    • Create multiple rows for each parsed identifier

    • Preserve all original columns

    • Add tracking columns (_original_*, _row_id)

  4. Validation Phase

    • Filter malformed identifiers

    • Remove duplicates within same composite

    • Validate against expected patterns

Advanced Configuration

Custom Separator Patterns

# For complex separators
separator_pattern: "[,;|\\s]+"  # Includes whitespace

# For specific formats
separator_pattern: "\\s*[,|]\\s*"  # Comma or pipe with optional spaces

Custom Prefix Removal

# Remove specific database prefixes
prefix_pattern: "^(UniProtKB|RefSeq|KEGG):"

# Remove version numbers
prefix_pattern: "\\.[0-9]+$"

Normalization Options

# Preserve isoforms
normalize_ids: true
isoform_handling: "preserve"  # keep -1, -2 suffixes

# Strict normalization
normalize_ids: true
isoform_handling: "remove"   # P12345-1 → P12345

Performance Considerations

Processing Speed

Typical performance for different dataset sizes: - Small (1K rows): <1 second - Medium (10K rows): 2-5 seconds - Large (100K rows): 20-60 seconds - Very Large (1M+ rows): 2-10 minutes

Performance factors: - Number of composite identifiers per row - Complexity of separator patterns - Normalization processing enabled - Output dataset size after expansion

Memory Usage

Memory usage increases with expansion ratio: - Input: 10K rows with average 2 IDs per composite - Output: ~20K rows (2x expansion) - Memory: ~3x input size during processing

Optimization Tips

  1. Batch Processing: Process large datasets in chunks

  2. Pattern Optimization: Use simple patterns when possible

  3. Selective Normalization: Disable if not needed

  4. Memory Management: Monitor expansion ratios

Real-World Examples

KG2c Protein Processing

# Process KG2c xrefs for UniProt extraction
- name: parse_kg2c_xrefs
  action:
    type: PARSE_COMPOSITE_IDENTIFIERS
    params:
      input_key: kg2c_proteins
      output_key: expanded_proteins
      identifier_column: xrefs
      separator_pattern: "[|,]"
      prefix_pattern: "^UniProtKB:"
      normalize_ids: true

SPOKE Identifier Expansion

# Handle SPOKE multi-identifier format
- name: expand_spoke_ids
  action:
    type: PARSE_COMPOSITE_IDENTIFIERS
    params:
      input_key: spoke_data
      output_key: individual_proteins
      identifier_column: protein_identifiers
      separator_pattern: "[,;]"
      preserve_original: true

Coverage Impact Analysis

Typical coverage improvements: - Before parsing: 1,165 composite entries - After parsing: 2,500+ individual identifiers - Unique identifiers: 1,800+ after deduplication - Coverage gain: 15-25% in subsequent matching stages

Best Practices

  1. Pattern Testing: Validate separator patterns on sample data

  2. Original Preservation: Always preserve original composite values

  3. Row Tracking: Use _row_id for linking back to original entries

  4. Validation: Check expansion ratios for reasonableness

  5. Deduplication: Handle duplicates in downstream processing

Common Issues and Solutions

Issue

Solution

Unexpected expansion ratio

Check separator pattern specificity

Missing identifiers after parsing

Verify prefix_pattern doesn’t over-filter

Performance issues

Process in smaller batches

Memory errors

Reduce chunk size or increase memory limits

Duplicate identifiers

Enable deduplication in downstream actions

Integration with Matching Actions

The parsed identifiers typically feed into matching actions:

steps:
  - name: parse_composites
    action:
      type: PARSE_COMPOSITE_IDENTIFIERS
      params:
        input_key: raw_data
        output_key: expanded_data

  - name: direct_matching
    action:
      type: PROTEIN_NORMALIZE_ACCESSIONS
      params:
        input_key: expanded_data  # Uses parsed output
        output_key: normalized_data

  - name: merge_results
    action:
      type: MERGE_DATASETS
      params:
        input_keys: [normalized_data, reference_data]
        output_key: matched_results

See Also