Progressive Semantic Match

Overview

The PROGRESSIVE_SEMANTIC_MATCH action orchestrates a 4-stage metabolite matching pipeline that applies progressively looser matching strategies to increase coverage. It is the primary action for metabolomics identifier harmonization in the biomapper framework.

The progressive approach starts with high-confidence exact matches and gradually relaxes criteria to capture more difficult cases, achieving typical coverage rates of 75-80% for well-curated metabolomics datasets.

Pipeline Architecture

4-Stage Progressive Strategy

graph TD
  A[Input Dataset] --> B[Stage 1: Direct Matching]
  B --> C[Stage 2: Fuzzy String Matching]
  C --> D[Stage 3: RampDB API Bridge]
  D --> E[Stage 4: HMDB Vector Matching]
  E --> F[Results Consolidation]
  F --> G[Coverage Analysis & Reporting]

Stage Breakdown

Stage 1: Direct/Exact Matching
  • Method: NIGHTINGALE_NMR_MATCH

  • Coverage: 45-55% (high confidence)

  • Speed: <2 seconds for 10K identifiers

  • Strategy: Exact string matching against curated reference

Stage 2: Fuzzy String Matching
  • Method: METABOLITE_FUZZY_STRING_MATCH

  • Coverage: +15-20% additional

  • Speed: 5-10 seconds

  • Strategy: Levenshtein distance with biological awareness

Stage 3: External API Integration
  • Method: METABOLITE_RAMPDB_BRIDGE

  • Coverage: +8-12% additional

  • Speed: 30-60 seconds (API dependent)

  • Strategy: Cross-database lookups via RampDB

Stage 4: Vector Semantic Matching
  • Method: HMDB_VECTOR_MATCH

  • Coverage: +5-10% additional

  • Speed: 10-20 seconds

  • Strategy: Embedding-based semantic similarity
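The stage breakdown above can be read as a cascade in which each stage only sees the identifiers that earlier stages left unmatched. The sketch below illustrates that reading; it is not the biomapper implementation. The `run_progressive` helper, the two toy matchers, and the placeholder IDs are hypothetical, and the assumption that only unmatched identifiers are passed forward is inferred from the "additional" coverage figures rather than confirmed.

```python
from typing import Callable, Dict, List, Optional

# A matcher takes one identifier and returns a matched ID, or None if it has no match.
Matcher = Callable[[str], Optional[str]]


def run_progressive(identifiers: List[str], stages: List[Matcher]) -> Dict[str, dict]:
    """Try each stage in order; later stages only see what is still unmatched."""
    results: Dict[str, dict] = {}
    remaining = list(identifiers)
    for stage_number, matcher in enumerate(stages, start=1):
        still_unmatched = []
        for identifier in remaining:
            match = matcher(identifier)
            if match is None:
                still_unmatched.append(identifier)
            else:
                results[identifier] = {"matched_id": match, "matching_stage": stage_number}
        remaining = still_unmatched
    return results


# Toy reference data with placeholder IDs (hypothetical, for illustration only).
REFERENCE = {"glucose": "EXAMPLE-ID-1", "alanine": "EXAMPLE-ID-2"}


def exact(identifier: str) -> Optional[str]:
    return REFERENCE.get(identifier)


def case_insensitive(identifier: str) -> Optional[str]:
    return REFERENCE.get(identifier.strip().lower())


matches = run_progressive(["glucose", " Alanine", "unknown-x"], [exact, case_insensitive])
```

Here "glucose" is resolved by the first matcher, " Alanine" falls through to the second, and "unknown-x" stays unmatched.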

Parameters

Parameter               Type     Required  Description
----------------------  -------  --------  -----------------------------------------------------------
input_key               string   Yes       Key for the input dataset containing metabolite identifiers
output_key              string   Yes       Key for the final consolidated output dataset
identifier_column       string   Yes       Column name containing metabolite identifiers to match
stage1_threshold        float    No        Direct matching threshold (default: 0.95)
stage2_threshold        float    No        Fuzzy matching threshold (default: 0.8)
stage3_batch_size       integer  No        RampDB API batch size (default: 50)
stage4_threshold        float    No        Vector similarity threshold (default: 0.75)
enable_quality_control  boolean  No        Enable inter-stage validation (default: true)
export_stage_results    boolean  No        Export individual stage results (default: false)

Performance Metrics

Expected Coverage by Dataset Type

Based on production runs with real biological datasets:

Dataset Type          Expected Coverage     Processing Time  Confidence Level
--------------------  --------------------  ---------------  ----------------
Arivale Metabolomics  75-80% (1,053/1,351)  2-3 minutes      High
UK Biobank Subset     40-45% (varies)       1-2 minutes      Medium
Custom Datasets       50-70% (varies)       Variable         Variable

Cumulative Coverage Progression

Typical progression for a 1,000 metabolite dataset:

  • After Stage 1: ~500 matched (50%)

  • After Stage 2: ~650 matched (65%)

  • After Stage 3: ~720 matched (72%)

  • After Stage 4: ~750 matched (75%)
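The per-stage gains implied by this progression can be worked out with a few lines of arithmetic. The snippet uses only the cumulative counts listed above; the variable names are illustrative.

```python
# Cumulative matched counts after each stage, taken from the progression above.
total = 1000
cumulative_matched = [500, 650, 720, 750]

previous = 0
stage_gains = []
for stage, matched in enumerate(cumulative_matched, start=1):
    gain = matched - previous  # identifiers newly matched by this stage
    stage_gains.append(gain)
    print(f"Stage {stage}: +{gain} ({100 * gain / total:.1f}%), "
          f"cumulative {100 * matched / total:.1f}%")
    previous = matched
```

The gains are 500, 150, 70, and 30 identifiers, so most of the coverage comes from the first two stages in this example.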

Example Usage

YAML Strategy

steps:
  - name: progressive_metabolite_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: raw_metabolites
        output_key: matched_metabolites
        identifier_column: compound_name
        stage1_threshold: 0.95
        stage2_threshold: 0.8
        stage3_batch_size: 40
        stage4_threshold: 0.75
        enable_quality_control: true
        export_stage_results: true

Python Client

import asyncio

from src.client.client_v2 import BiomapperClient


async def main(metabolite_df):
    client = BiomapperClient(base_url="http://localhost:8000")

    # Load metabolite dataset (metabolite_df is supplied by the caller)
    context = {"datasets": {"metabolites": metabolite_df}}

    # run_action is awaited, so it has to be called from inside a coroutine
    result = await client.run_action(
        action_type="PROGRESSIVE_SEMANTIC_MATCH",
        params={
            "input_key": "metabolites",
            "output_key": "harmonized_metabolites",
            "identifier_column": "metabolite_name",
            "stage1_threshold": 0.9,
            "stage2_threshold": 0.8,
            "enable_quality_control": True
        },
        context=context
    )

    # Access consolidated results
    matched_data = result.context["datasets"]["harmonized_metabolites"]
    print(f"Coverage achieved: {len(matched_data)} metabolites")
    return matched_data


# metabolite_df must be loaded before this call
asyncio.run(main(metabolite_df))

Output Format

The action returns a comprehensive dataset with matching provenance:

Column            Description
----------------  ------------------------------------------------
original_id       Original metabolite identifier from input
matched_hmdb_id   Final HMDB identifier (primary output)
matched_name      Standardized metabolite name
matching_stage    Stage that produced the match (1-4)
matching_method   Specific method used (exact, fuzzy, api, vector)
confidence_score  Overall confidence score (0.0-1.0)
stage1_candidate  Stage 1 matching candidate (if any)
stage2_candidate  Stage 2 matching candidate (if any)
stage3_candidate  Stage 3 matching candidate (if any)
stage4_candidate  Stage 4 matching candidate (if any)
quality_flags     Quality control flags and warnings
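As an illustration of how these provenance columns might be used downstream, the sketch below counts matches per stage and lists low-confidence rows for manual review. The rows are invented sample data that reuse the column names above, and the 0.8 review cutoff is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical rows using the column names from the output table.
rows = [
    {"original_id": "glucose", "matching_stage": 1,
     "matching_method": "exact", "confidence_score": 1.0},
    {"original_id": "L-alanin", "matching_stage": 2,
     "matching_method": "fuzzy", "confidence_score": 0.86},
    {"original_id": "vitamin C", "matching_stage": 4,
     "matching_method": "vector", "confidence_score": 0.78},
]

# How many matches each stage contributed.
per_stage = Counter(row["matching_stage"] for row in rows)

# Identifiers whose confidence falls below an (assumed) review cutoff.
REVIEW_CUTOFF = 0.8
needs_review = [row["original_id"] for row in rows
                if row["confidence_score"] < REVIEW_CUTOFF]
```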

Stage Implementation Details

Stage 1: Direct Matching

High-confidence exact matches using curated reference data:

stage1_configuration:
  method: NIGHTINGALE_NMR_MATCH
  reference_data: nightingale_nmr_metabolites
  matching_strategy: exact_string
  case_sensitive: false
  normalize_whitespace: true
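The effect of `case_sensitive: false` and `normalize_whitespace: true` can be sketched as a normalization step applied before a dictionary lookup. This is a minimal illustration, not the NIGHTINGALE_NMR_MATCH code; the `normalize` and `exact_match` helpers and the reference entry are hypothetical.

```python
import re
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace to a single space."""
    return re.sub(r"\s+", " ", name.strip()).lower()


def exact_match(name: str, reference: Dict[str, str]) -> Optional[str]:
    """Look up a name in a reference keyed by normalized names."""
    return reference.get(normalize(name))


# Hypothetical reference table: normalized name -> placeholder identifier.
reference = {normalize("Total cholesterol"): "REF-001"}

hit = exact_match("  total   CHOLESTEROL ", reference)
miss = exact_match("cholesterol", reference)
```

After normalization the match is still exact: "cholesterol" alone does not match "total cholesterol", which is the kind of case left for later stages.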

Stage 2: Fuzzy String Matching

Captures near-matches with controlled edit distance:

stage2_configuration:
  method: METABOLITE_FUZZY_STRING_MATCH
  algorithm: biological_levenshtein
  max_edit_distance: 2
  ignore_case: true
  ignore_punctuation: true
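To make `max_edit_distance`, `ignore_case`, and `ignore_punctuation` concrete, here is a plain Levenshtein sketch. It does not reproduce the "biological awareness" of the `biological_levenshtein` algorithm, whose details are not described here, and the helper names are hypothetical.

```python
import string


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def clean(name: str) -> str:
    """Apply ignore_case and ignore_punctuation before comparing."""
    return name.lower().translate(str.maketrans("", "", string.punctuation))


def is_fuzzy_match(query: str, candidate: str, max_edit_distance: int = 2) -> bool:
    """Accept the pair when the cleaned strings are within the edit budget."""
    return levenshtein(clean(query), clean(candidate)) <= max_edit_distance
```

With these settings "L-Alanin" matches "l-alanine" (one edit after cleaning), while "glucose" and "fructose" are three edits apart and are rejected.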

Stage 3: External API Integration

Leverages RampDB for cross-database resolution:

stage3_configuration:
  method: METABOLITE_RAMPDB_BRIDGE
  target_databases: [hmdb, kegg, chebi]
  timeout: 45
  retry_attempts: 3
  rate_limit_delay: 1.0
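The batching, retry, and rate-limit settings can be sketched independently of the actual RampDB client. In the example below the `lookup` callable stands in for the API call and is entirely hypothetical, as is the simulated outage; no real RampDB endpoint or signature is assumed.

```python
import time
from typing import Callable, Dict, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def lookup_all(
    names: List[str],
    lookup: Callable[[List[str]], Dict[str, str]],
    batch_size: int = 50,
    retry_attempts: int = 3,
    rate_limit_delay: float = 1.0,
) -> Dict[str, str]:
    """Send names in batches, retrying a failed batch up to retry_attempts times."""
    results: Dict[str, str] = {}
    for batch in batched(names, batch_size):
        for attempt in range(1, retry_attempts + 1):
            try:
                results.update(lookup(batch))
                break
            except ConnectionError:
                if attempt == retry_attempts:
                    raise  # give up on this batch after the last attempt
        time.sleep(rate_limit_delay)  # pause between batches
    return results


# Simulated API (hypothetical): fails on the first call, then succeeds.
calls = {"count": 0}


def flaky_lookup(batch: List[str]) -> Dict[str, str]:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated outage")
    return {name: name.upper() for name in batch}


resolved = lookup_all(["a", "b", "c"], flaky_lookup, batch_size=2, rate_limit_delay=0.0)
```

The first batch is retried once after the simulated failure, so three calls are made for two batches.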

Stage 4: Vector Semantic Matching

Uses embedding-based similarity for complex cases:

stage4_configuration:
  method: HMDB_VECTOR_MATCH
  embedding_model: all-MiniLM-L6-v2
  vector_db: qdrant_hmdb_collection
  similarity_metric: cosine
  max_candidates: 5
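The roles of `similarity_metric: cosine`, `max_candidates`, and the Stage 4 threshold can be shown with a small in-memory example. This sketch uses hand-written 2-dimensional vectors instead of real embeddings and a plain dictionary instead of a vector database; the `top_candidates` helper and the index contents are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_candidates(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 0.75,
    max_candidates: int = 5,
) -> List[Tuple[str, float]]:
    """Score every indexed vector, keep those at or above threshold, best first."""
    scored = [(name, cosine(query, vector)) for name, vector in index.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:max_candidates]


# Toy index with made-up vectors (not real embeddings).
index = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
hits = top_candidates([0.8, 0.6], index)
```

In this toy index, "b" and "a" clear the 0.75 threshold in that order, while "c" scores about 0.6 and is dropped.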

Quality Control Features

Inter-Stage Validation

Validates consistency between matching stages:

  • Conflict Detection: Identifies when stages produce conflicting matches

  • Confidence Scoring: Weights matches based on stage reliability

  • Consensus Building: Resolves conflicts using multiple signals
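One way to picture conflict detection and consensus building is a weighted vote over the per-stage candidates. The sketch below is an assumption about how such a resolution could work, not the documented algorithm; the stage weights and the `resolve` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Hypothetical reliability weights per stage; earlier stages are trusted more.
STAGE_WEIGHTS = {1: 1.0, 2: 0.8, 3: 0.7, 4: 0.6}


def resolve(candidates: Dict[int, Optional[str]]) -> Tuple[Optional[str], bool]:
    """Return (winning candidate, conflict flag) for one identifier.

    candidates maps a stage number to that stage's candidate, or None.
    """
    votes: Dict[str, float] = defaultdict(float)
    for stage, candidate in candidates.items():
        if candidate is not None:
            votes[candidate] += STAGE_WEIGHTS[stage]
    if not votes:
        return None, False
    winner = max(votes, key=votes.get)
    conflict = len(votes) > 1  # more than one distinct candidate was proposed
    return winner, conflict
```

For example, if Stage 2 proposes "X" while Stages 3 and 4 both propose "Y", the combined weight of "Y" (1.3) beats "X" (0.8) and the row is flagged as a conflict.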

Quality Metrics Tracking

Monitors pipeline health and performance:

  • Coverage Progression: Tracks cumulative coverage by stage

  • Confidence Distribution: Analyzes confidence score distributions

  • Error Rate Analysis: Identifies systematic matching failures

  • Performance Benchmarks: Compares against expected baselines

Example Quality Report

{
  "total_identifiers": 1351,
  "total_matched": 1053,
  "overall_coverage": 77.9,
  "stage_breakdown": {
    "stage1": {"matched": 692, "coverage": 51.2},
    "stage2": {"matched": 201, "coverage": 14.9},
    "stage3": {"matched": 105, "coverage": 7.8},
    "stage4": {"matched": 55, "coverage": 4.0}
  },
  "quality_metrics": {
    "high_confidence": 847,
    "medium_confidence": 156,
    "low_confidence": 50,
    "flagged_for_review": 23
  }
}
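A report like this can be sanity-checked with simple arithmetic: the per-stage counts and the confidence buckets should each sum to `total_matched`, and the overall coverage should follow from the totals. The check below uses the figures from the example report; the variable names are illustrative.

```python
import json

report = json.loads("""
{
  "total_identifiers": 1351,
  "total_matched": 1053,
  "overall_coverage": 77.9,
  "stage_breakdown": {
    "stage1": {"matched": 692},
    "stage2": {"matched": 201},
    "stage3": {"matched": 105},
    "stage4": {"matched": 55}
  },
  "quality_metrics": {
    "high_confidence": 847,
    "medium_confidence": 156,
    "low_confidence": 50
  }
}
""")

# Matches contributed by the four stages.
stage_total = sum(stage["matched"] for stage in report["stage_breakdown"].values())

# Matches split by confidence bucket.
confidence_total = sum(report["quality_metrics"].values())

# Overall coverage recomputed from the raw totals, in percent.
coverage = round(100 * report["total_matched"] / report["total_identifiers"], 1)
```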

Advanced Configuration

Threshold Optimization

Fine-tune thresholds based on dataset characteristics:

# Conservative profile (high precision)
conservative_config:
  stage1_threshold: 0.98
  stage2_threshold: 0.85
  stage4_threshold: 0.80

# Aggressive profile (high recall)
aggressive_config:
  stage1_threshold: 0.92
  stage2_threshold: 0.75
  stage4_threshold: 0.70

# Balanced profile (recommended)
balanced_config:
  stage1_threshold: 0.95
  stage2_threshold: 0.80
  stage4_threshold: 0.75
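When calling the action from Python, the profiles above could be kept in a dictionary and merged into the action parameters. The `build_params` helper is a hypothetical convenience, not part of biomapper; the threshold values are the ones listed in the profiles.

```python
from typing import Any, Dict

PROFILES: Dict[str, Dict[str, float]] = {
    "conservative": {"stage1_threshold": 0.98, "stage2_threshold": 0.85, "stage4_threshold": 0.80},
    "aggressive": {"stage1_threshold": 0.92, "stage2_threshold": 0.75, "stage4_threshold": 0.70},
    "balanced": {"stage1_threshold": 0.95, "stage2_threshold": 0.80, "stage4_threshold": 0.75},
}


def build_params(profile: str, **overrides: Any) -> Dict[str, Any]:
    """Start from a named profile, then apply caller-supplied overrides."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    return {**PROFILES[profile], **overrides}


params = build_params(
    "aggressive",
    input_key="raw_metabolites",
    output_key="matched_metabolites",
    identifier_column="compound_name",
    stage4_threshold=0.72,  # override a single threshold from the profile
)
```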

Performance Optimization

For large datasets (>10K metabolites):

performance_config:
  enable_parallel_stages: true
  stage2_chunk_size: 2000
  stage3_batch_size: 100
  stage4_batch_size: 500
  cache_intermediate_results: true

Custom Stage Configuration

Override individual stage parameters:

custom_stage_config:
  stage1_params:
    reference_source: "custom_metabolite_db"
    normalization_rules: ["remove_stereochemistry"]
  stage2_params:
    algorithm: "jaro_winkler"
    biological_synonyms: true
  stage3_params:
    target_databases: ["hmdb", "kegg", "chebi", "pubchem"]
    concurrent_requests: 3
  stage4_params:
    use_llm_validation: true
    embedding_cache: true

Error Handling and Recovery

Stage-Level Error Recovery

Each stage includes independent error handling:

  • Stage Isolation: Failure in one stage doesn’t affect others

  • Partial Results: Successfully completed stages preserve their matches

  • Error Reporting: Detailed logs for debugging stage-specific issues

  • Graceful Degradation: Pipeline continues with reduced functionality

Common Recovery Patterns

error_handling:
  stage1_fallback: "skip_to_stage2"
  stage2_timeout_action: "reduce_batch_size"
  stage3_api_failure: "retry_with_exponential_backoff"
  stage4_memory_error: "process_in_smaller_chunks"
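Stage isolation and exponential backoff can be sketched in a few lines. The example below is an illustration of the two ideas, not biomapper's recovery code; `backoff_delays`, `run_isolated`, and the simulated stage functions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def backoff_delays(retry_attempts: int = 3, base_delay: float = 1.0,
                   factor: float = 2.0) -> List[float]:
    """Delays before each retry: base_delay, then multiplied by factor each time."""
    return [base_delay * factor ** attempt for attempt in range(retry_attempts)]


def run_isolated(stages: Dict[str, Callable[[], int]]) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Run every stage; a failure is recorded but does not stop later stages."""
    results: Dict[str, int] = {}
    errors: Dict[str, str] = {}
    for name, stage in stages.items():
        try:
            results[name] = stage()
        except Exception as exc:  # keep partial results from the other stages
            errors[name] = str(exc)
    return results, errors


def failing_stage() -> int:
    raise TimeoutError("simulated API timeout")


# Each stage returns its number of matches in this toy example.
results, errors = run_isolated({
    "stage1": lambda: 3,
    "stage3": failing_stage,
    "stage4": lambda: 1,
})
```

The failed stage shows up in `errors` while the matches from the other stages are preserved, which mirrors the partial-results behavior described above.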

Monitoring and Alerting

Built-in monitoring for production deployments:

  • Progress Tracking: Real-time progress updates for long-running pipelines

  • Performance Alerts: Notifications when stages exceed expected runtimes

  • Quality Alerts: Warnings when coverage drops below baselines

  • Resource Monitoring: Memory and API quota usage tracking

Real-World Case Studies

Arivale Metabolomics Dataset

  • Dataset: 1,351 unique metabolites from personalized medicine platform

  • Result: 77.9% coverage (1,053 matched metabolites)

  • Processing Time: 2 minutes 34 seconds

  • Stage Performance:

      - Stage 1: 692 matches (51.2%)

      - Stage 2: 201 additional (14.9%)

      - Stage 3: 105 additional (7.8%)

      - Stage 4: 55 additional (4.0%)

UK Biobank Subset

  • Dataset: 2,847 metabolite measurements from population study

  • Result: 42.3% coverage (1,204 matched metabolites)

  • Challenge: Heterogeneous naming conventions

  • Processing Time: 1 minute 47 seconds

Best Practices

  1. Start with Balanced Configuration: Use recommended thresholds initially

  2. Monitor Stage Performance: Track individual stage contributions

  3. Validate Results: Manually review sample matches from each stage

  4. Document Parameters: Record threshold choices and rationale

  5. Benchmark Regularly: Compare performance against known datasets

Integration Patterns

Complete Pipeline Integration

name: comprehensive_metabolomics_pipeline
steps:
  - name: load_metabolites
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"
        identifier_column: metabolite_name
        output_key: raw_metabolites

  - name: progressive_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: raw_metabolites
        output_key: matched_metabolites
        identifier_column: metabolite_name

  - name: generate_report
    action:
      type: GENERATE_METABOLOMICS_REPORT
      params:
        input_key: matched_metabolites
        output_key: coverage_report

  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: matched_metabolites
        file_path: "${parameters.output_file}"

Multi-Dataset Comparison

# Compare matching across multiple datasets
steps:
  - name: match_dataset_a
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
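      params:
        input_key: dataset_a
        output_key: matched_a
        identifier_column: metabolite_name  # required parameter; column name assumed here

  - name: match_dataset_b
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: dataset_b
        output_key: matched_b
        identifier_column: metabolite_name  # required parameter; column name assumed here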

  - name: compare_coverage
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset1_key: matched_a
        dataset2_key: matched_b

Troubleshooting

Low Overall Coverage

  1. Check Input Data Quality: Verify metabolite names are clean and standardized

  2. Adjust Thresholds: Lower thresholds to increase recall

  3. Enable All Stages: Ensure no stages are being skipped

  4. Review Reference Data: Confirm reference databases are current

Stage-Specific Issues

  • Stage 1 Low Coverage: Check reference data alignment and normalization

  • Stage 2 Timeouts: Reduce batch sizes or enable chunking

  • Stage 3 API Failures: Verify API credentials and network connectivity

  • Stage 4 Performance: Ensure vector database is properly indexed

Performance Problems

  1. Memory Issues: Enable chunked processing for large datasets

  2. API Timeouts: Increase timeout values and reduce batch sizes

  3. Slow Processing: Enable parallel processing where supported

  4. Resource Limits: Monitor CPU and memory usage during execution

See Also