Progressive Semantic Match

Overview

The PROGRESSIVE_SEMANTIC_MATCH action orchestrates a 4-stage metabolite matching pipeline. Each stage applies a different matching strategy to the identifiers that earlier stages left unmatched, so coverage accumulates from stage to stage. This is the primary action for metabolomics identifier harmonization in the biomapper framework.

The progressive approach starts with high-confidence exact matches and gradually relaxes the matching criteria to capture harder cases. Typical coverage is 75-80% for well-curated metabolomics datasets.

Pipeline Architecture

4-Stage Progressive Strategy

graph TD
  A[Input Dataset] --> B[Stage 1: Direct Matching]
  B --> C[Stage 2: Fuzzy String Matching]
  C --> D[Stage 3: RampDB API Bridge]
  D --> E[Stage 4: HMDB Vector Matching]
  E --> F[Results Consolidation]
  F --> G[Coverage Analysis & Reporting]

Stage Breakdown

Stage 1: Direct/Exact Matching

Method: NIGHTINGALE_NMR_MATCH
Coverage: 45-55% (high confidence)
Speed: <2 seconds for 10K identifiers
Strategy: Exact string matching against curated reference

Stage 2: Fuzzy String Matching

Method: METABOLITE_FUZZY_STRING_MATCH
Coverage: +15-20% additional
Speed: 5-10 seconds
Strategy: Levenshtein distance with biological awareness

Stage 3: External API Integration

Method: METABOLITE_RAMPDB_BRIDGE
Coverage: +8-12% additional
Speed: 30-60 seconds (API dependent)
Strategy: Cross-database lookups via RampDB

Stage 4: Vector Semantic Matching

Method: HMDB_VECTOR_MATCH
Coverage: +5-10% additional
Speed: 10-20 seconds
Strategy: Embedding-based semantic similarity
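
The stage breakdown above implies a cascade in which each stage sees only the identifiers that earlier stages could not match. The following is a minimal sketch of that control flow, assuming each stage can be modeled as a function from a list of names to a dictionary of matches. The function `run_progressive`, the toy stages, and the placeholder ID are hypothetical and are not the biomapper actions themselves.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage signature: takes unmatched names, returns {name: matched_id}.
StageFn = Callable[[List[str]], Dict[str, str]]


def run_progressive(names: List[str], stages: List[Tuple[str, StageFn]]):
    """Run stages in order, passing only unmatched names to the next stage."""
    matches: Dict[str, Tuple[str, str]] = {}  # name -> (stage label, matched id)
    remaining = list(names)
    for label, stage in stages:
        if not remaining:
            break  # everything is matched, later stages have nothing to do
        found = stage(remaining)
        for name, matched_id in found.items():
            matches[name] = (label, matched_id)
        remaining = [n for n in remaining if n not in found]
    return matches, remaining


# Toy stand-ins for two stages; the reference entry uses a placeholder ID.
REFERENCE = {"glucose": "ID_GLUCOSE"}


def exact_stage(names):
    return {n: REFERENCE[n] for n in names if n in REFERENCE}


def lowercase_stage(names):
    return {n: REFERENCE[n.lower()] for n in names if n.lower() in REFERENCE}


matched, unmatched = run_progressive(
    ["glucose", "Glucose", "unknown compound"],
    [("stage1", exact_stage), ("stage2", lowercase_stage)],
)
```

Recording the stage label next to each match is what makes a provenance column such as `matching_stage` possible later on.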

Parameters

Parameter	Type	Required	Description
`input_key`	string	Yes	Key for the input dataset containing metabolite identifiers
`output_key`	string	Yes	Key for the final consolidated output dataset
`identifier_column`	string	Yes	Column name containing metabolite identifiers to match
`stage1_threshold`	float	No	Direct matching threshold (default: 0.95)
`stage2_threshold`	float	No	Fuzzy matching threshold (default: 0.8)
`stage3_batch_size`	integer	No	RampDB API batch size (default: 50)
`stage4_threshold`	float	No	Vector similarity threshold (default: 0.75)
`enable_quality_control`	boolean	No	Enable inter-stage validation (default: true)
`export_stage_results`	boolean	No	Export individual stage results (default: false)
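
To keep the defaults from the table in one place on the caller's side, the parameters can be mirrored in a small data class. This is only a sketch: the class name `ProgressiveMatchParams` is hypothetical, and the range checks (thresholds between 0 and 1, batch size of at least 1) are assumptions rather than documented validation rules.

```python
from dataclasses import dataclass, asdict


@dataclass
class ProgressiveMatchParams:
    # Required parameters (no defaults).
    input_key: str
    output_key: str
    identifier_column: str
    # Optional parameters with the defaults listed in the table.
    stage1_threshold: float = 0.95
    stage2_threshold: float = 0.8
    stage3_batch_size: int = 50
    stage4_threshold: float = 0.75
    enable_quality_control: bool = True
    export_stage_results: bool = False

    def __post_init__(self):
        # Assumption: thresholds are similarity scores in [0, 1].
        for name in ("stage1_threshold", "stage2_threshold", "stage4_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        if self.stage3_batch_size < 1:
            raise ValueError("stage3_batch_size must be at least 1")


# Build a params dict that overrides only the RampDB batch size.
params = asdict(ProgressiveMatchParams(
    input_key="raw_metabolites",
    output_key="matched_metabolites",
    identifier_column="compound_name",
    stage3_batch_size=40,
))
```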

Performance Metrics

Expected Coverage by Dataset Type

Based on production runs with real biological datasets:

Dataset Type	Expected Coverage	Processing Time	Confidence Level
Arivale Metabolomics	75-80% (1,053/1,351)	2-3 minutes	High
UK Biobank Subset	40-45% (varies)	1-2 minutes	Medium
Custom Datasets	50-70% (varies)	Variable	Variable

Cumulative Coverage Progression

Typical progression for a 1,000 metabolite dataset:

After Stage 1: ~500 matched (50%)
After Stage 2: ~650 matched (65%)
After Stage 3: ~720 matched (72%)
After Stage 4: ~750 matched (75%)
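
The progression above is a running sum of the new matches each stage contributes. A short sketch of that arithmetic, using per-stage increments derived from the figures above (500, then 150, 70, and 30 additional matches):

```python
from itertools import accumulate

total = 1000
# New matches contributed by each stage, derived from the progression above.
new_matches = {"stage1": 500, "stage2": 150, "stage3": 70, "stage4": 30}

# Running total of matched identifiers after each stage.
cumulative = list(accumulate(new_matches.values()))
coverage = {
    stage: 100 * count / total
    for stage, count in zip(new_matches, cumulative)
}
for stage, pct in coverage.items():
    print(f"After {stage}: {pct:.0f}%")
```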

Example Usage

YAML Strategy

steps:
  - name: progressive_metabolite_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: raw_metabolites
        output_key: matched_metabolites
        identifier_column: compound_name
        stage1_threshold: 0.95
        stage2_threshold: 0.8
        stage3_batch_size: 40
        stage4_threshold: 0.75
        enable_quality_control: true
        export_stage_results: true

Python Client

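import asyncio

from src.client.client_v2 import BiomapperClient


async def main(metabolite_df):
    """Run the action on a DataFrame that has a `metabolite_name` column."""
    client = BiomapperClient(base_url="http://localhost:8000")

    # Load metabolite dataset
    context = {"datasets": {"metabolites": metabolite_df}}

    result = await client.run_action(
        action_type="PROGRESSIVE_SEMANTIC_MATCH",
        params={
            "input_key": "metabolites",
            "output_key": "harmonized_metabolites",
            "identifier_column": "metabolite_name",
            "stage1_threshold": 0.9,
            "stage2_threshold": 0.8,
            "enable_quality_control": True,
        },
        context=context,
    )

    # Access consolidated results
    matched_data = result.context["datasets"]["harmonized_metabolites"]
    print(f"Rows in consolidated output: {len(matched_data)}")
    return matched_data


# `await` is only valid inside a coroutine, so run the call through asyncio.
# metabolite_df is assumed to be a pandas DataFrame loaded elsewhere:
# asyncio.run(main(metabolite_df))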

Output Format

The action returns a comprehensive dataset with matching provenance:

Column	Description
`original_id`	Original metabolite identifier from input
`matched_hmdb_id`	Final HMDB identifier (primary output)
`matched_name`	Standardized metabolite name
`matching_stage`	Stage that produced the match (1-4)
`matching_method`	Specific method used (exact, fuzzy, api, vector)
`confidence_score`	Overall confidence score (0.0-1.0)
`stage1_candidate`	Stage 1 matching candidate (if any)
`stage2_candidate`	Stage 2 matching candidate (if any)
`stage3_candidate`	Stage 3 matching candidate (if any)
`stage4_candidate`	Stage 4 matching candidate (if any)
`quality_flags`	Quality control flags and warnings
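
The provenance columns make it straightforward to summarize where matches came from. The sketch below uses the documented column names on a few hand-written rows; the IDs, scores, and the 0.8 review cutoff are made up for illustration.

```python
from collections import Counter

# Illustrative rows using the documented column names; values are made up.
rows = [
    {"original_id": "glucose", "matched_hmdb_id": "ID_A", "matching_stage": 1,
     "matching_method": "exact", "confidence_score": 1.0},
    {"original_id": "glucos", "matched_hmdb_id": "ID_A", "matching_stage": 2,
     "matching_method": "fuzzy", "confidence_score": 0.86},
    {"original_id": "blood sugar", "matched_hmdb_id": "ID_A", "matching_stage": 4,
     "matching_method": "vector", "confidence_score": 0.78},
]

# Count matches per stage to see how much each stage contributed.
matches_per_stage = Counter(row["matching_stage"] for row in rows)

# Collect identifiers below an arbitrary review cutoff of 0.8.
needs_review = [r["original_id"] for r in rows if r["confidence_score"] < 0.8]
```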

Stage Implementation Details

Stage 1: Direct Matching

High-confidence exact matches using curated reference data:

stage1_configuration:
  method: NIGHTINGALE_NMR_MATCH
  reference_data: nightingale_nmr_metabolites
  matching_strategy: exact_string
  case_sensitive: false
  normalize_whitespace: true
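
With `case_sensitive: false` and `normalize_whitespace: true`, exact matching amounts to comparing normalized strings. A minimal sketch of that idea follows. The helper names and the reference entries are hypothetical, and the real NIGHTINGALE_NMR_MATCH action may normalize differently.

```python
def normalize(name: str) -> str:
    """Collapse whitespace runs, trim the ends, and fold case."""
    return " ".join(name.split()).casefold()


def exact_match(names, reference):
    """Return {input name: reference id} for names equal after normalization."""
    index = {normalize(ref_name): ref_id for ref_name, ref_id in reference.items()}
    return {n: index[normalize(n)] for n in names if normalize(n) in index}


# Illustrative reference entries with placeholder IDs.
reference = {"Total cholesterol": "REF_1", "Glucose": "REF_2"}
result = exact_match(["  total   CHOLESTEROL ", "glucose", "lactate"], reference)
```

Building the normalized index once keeps each lookup at constant time, which is consistent with this stage being the fastest in the pipeline.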

Stage 2: Fuzzy String Matching

Captures near-matches with controlled edit distance:

stage2_configuration:
  method: METABOLITE_FUZZY_STRING_MATCH
  algorithm: biological_levenshtein
  max_edit_distance: 2
  ignore_case: true
  ignore_punctuation: true
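
The settings above describe a bounded edit-distance comparison. The sketch below uses plain Levenshtein distance as a stand-in, since the details of `biological_levenshtein` are not described here. The helper names are hypothetical.

```python
import string


def clean(name: str) -> str:
    """Apply ignore_case and ignore_punctuation before comparing."""
    table = str.maketrans("", "", string.punctuation)
    return name.lower().translate(table)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(name, candidates, max_edit_distance=2):
    """Return the closest candidate within max_edit_distance, or None."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: levenshtein(clean(name), clean(c)))
    distance = levenshtein(clean(name), clean(best))
    return best if distance <= max_edit_distance else None
```

For example, `fuzzy_match("L-Alanin", ["L-Alanine", "Glycine"])` selects `"L-Alanine"`, because the cleaned strings differ by a single insertion.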

Stage 3: External API Integration

Leverages RampDB for cross-database resolution:

stage3_configuration:
  method: METABOLITE_RAMPDB_BRIDGE
  target_databases: [hmdb, kegg, chebi]
  timeout: 45
  retry_attempts: 3
  rate_limit_delay: 1.0
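
The `retry_attempts` and `rate_limit_delay` settings, together with `stage3_batch_size`, suggest a loop that sends identifiers in batches, retries failed requests, and pauses between calls. The sketch below illustrates that pattern with an injected `fetch` callable. It does not call RampDB, and both `lookup_in_batches` and the `fetch` signature are assumptions.

```python
import time
from typing import Callable, Dict, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_in_batches(
    names: List[str],
    fetch: Callable[[List[str]], Dict[str, str]],  # hypothetical API wrapper
    batch_size: int = 50,
    retry_attempts: int = 3,
    rate_limit_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Look up names batch by batch, retrying each batch on connection errors."""
    results: Dict[str, str] = {}
    for batch in batched(names, batch_size):
        for attempt in range(1, retry_attempts + 1):
            try:
                results.update(fetch(batch))
                break
            except ConnectionError:
                if attempt == retry_attempts:
                    raise  # give up after the configured number of attempts
                sleep(rate_limit_delay)  # wait before retrying the same batch
        sleep(rate_limit_delay)  # pause between batches to respect rate limits
    return results
```

Passing `sleep` as a parameter keeps the function testable without real delays.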

Stage 4: Vector Semantic Matching

Uses embedding-based similarity for complex cases:

stage4_configuration:
  method: HMDB_VECTOR_MATCH
  embedding_model: all-MiniLM-L6-v2
  vector_db: qdrant_hmdb_collection
  similarity_metric: cosine
  max_candidates: 5
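
Conceptually, this stage ranks reference entries by cosine similarity to the query embedding, keeps those above the threshold, and returns at most `max_candidates`. The sketch below shows that ranking on toy 3-dimensional vectors. It does not load the embedding model or query Qdrant, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_candidates(query, index, threshold=0.75, max_candidates=5):
    """Return (name, score) pairs above threshold, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    kept = [(n, s) for n, s in scored if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:max_candidates]


# Toy vectors standing in for real embeddings.
index = {
    "glucose": [1.0, 0.0, 0.0],
    "fructose": [0.8, 0.6, 0.0],
    "urea": [0.0, 0.0, 1.0],
}
hits = top_candidates([1.0, 0.1, 0.0], index)
```

Here `urea` is orthogonal to the query and falls below the threshold, so only the two sugars are returned.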

Quality Control Features

Inter-Stage Validation

Validates consistency between matching stages:

Conflict Detection: Identifies when stages produce conflicting matches
Confidence Scoring: Weights matches based on stage reliability
Consensus Building: Resolves conflicts using multiple signals
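
One way to picture conflict detection and consensus building is a weighted vote over the per-stage candidates. The sketch below is an illustration under stated assumptions: the stage weights are invented, and the actual resolution logic of the action is not documented here.

```python
# Assumed stage reliability weights (earlier stages trusted more).
STAGE_WEIGHTS = {"stage1": 1.0, "stage2": 0.8, "stage3": 0.7, "stage4": 0.6}


def resolve(candidates):
    """candidates: {stage: candidate id or None}. Returns (winner, has_conflict)."""
    votes = {}
    for stage, candidate in candidates.items():
        if candidate is not None:
            votes[candidate] = votes.get(candidate, 0.0) + STAGE_WEIGHTS[stage]
    if not votes:
        return None, False  # no stage proposed anything
    winner = max(votes, key=votes.get)
    # A conflict exists when stages proposed more than one distinct candidate.
    return winner, len(votes) > 1
```

A row where Stage 2 proposes one ID and Stages 3 and 4 agree on another would be flagged as a conflict, with the combined weight of the two later stages deciding the winner.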

Quality Metrics Tracking

Monitors pipeline health and performance:

Coverage Progression: Tracks cumulative coverage by stage
Confidence Distribution: Analyzes confidence score distributions
Error Rate Analysis: Identifies systematic matching failures
Performance Benchmarks: Compares against expected baselines

Example Quality Report

{
  "total_identifiers": 1351,
  "total_matched": 1053,
  "overall_coverage": 77.9,
  "stage_breakdown": {
    "stage1": {"matched": 692, "coverage": 51.2},
    "stage2": {"matched": 201, "coverage": 14.9},
    "stage3": {"matched": 105, "coverage": 7.8},
    "stage4": {"matched": 55, "coverage": 4.0}
  },
  "quality_metrics": {
    "high_confidence": 847,
    "medium_confidence": 156,
    "low_confidence": 50,
    "flagged_for_review": 23
  }
}
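
A report like this can be sanity-checked with a few sums. The sketch below assumes the three confidence buckets partition the matched set and that `flagged_for_review` is a separate, overlapping count. The function name `check_report` is hypothetical.

```python
report = {
    "total_identifiers": 1351,
    "total_matched": 1053,
    "overall_coverage": 77.9,
    "stage_breakdown": {
        "stage1": {"matched": 692, "coverage": 51.2},
        "stage2": {"matched": 201, "coverage": 14.9},
        "stage3": {"matched": 105, "coverage": 7.8},
        "stage4": {"matched": 55, "coverage": 4.0},
    },
    "quality_metrics": {
        "high_confidence": 847,
        "medium_confidence": 156,
        "low_confidence": 50,
        "flagged_for_review": 23,
    },
}


def check_report(r):
    """Return a list of inconsistency messages (empty if the totals add up)."""
    problems = []
    stage_total = sum(s["matched"] for s in r["stage_breakdown"].values())
    if stage_total != r["total_matched"]:
        problems.append("stage counts do not sum to total_matched")
    q = r["quality_metrics"]
    buckets = q["high_confidence"] + q["medium_confidence"] + q["low_confidence"]
    if buckets != r["total_matched"]:
        problems.append("confidence buckets do not sum to total_matched")
    expected = round(100 * r["total_matched"] / r["total_identifiers"], 1)
    if abs(expected - r["overall_coverage"]) > 0.05:
        problems.append(f"overall_coverage should be {expected}")
    return problems
```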

Advanced Configuration

Threshold Optimization

Fine-tune thresholds based on dataset characteristics:

# Conservative profile (high precision)
conservative_config:
  stage1_threshold: 0.98
  stage2_threshold: 0.85
  stage4_threshold: 0.80

# Aggressive profile (high recall)
aggressive_config:
  stage1_threshold: 0.92
  stage2_threshold: 0.75
  stage4_threshold: 0.70

# Balanced profile (recommended)
balanced_config:
  stage1_threshold: 0.95
  stage2_threshold: 0.80
  stage4_threshold: 0.75
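
When the action is driven from Python, the three profiles can be kept in a lookup table and merged into the action parameters. This is a sketch only: `PROFILES` and `build_params` are hypothetical helpers, not part of the biomapper API.

```python
# Threshold profiles copied from the configuration above.
PROFILES = {
    "conservative": {"stage1_threshold": 0.98, "stage2_threshold": 0.85,
                     "stage4_threshold": 0.80},
    "aggressive": {"stage1_threshold": 0.92, "stage2_threshold": 0.75,
                   "stage4_threshold": 0.70},
    "balanced": {"stage1_threshold": 0.95, "stage2_threshold": 0.80,
                 "stage4_threshold": 0.75},
}


def build_params(base, profile="balanced", **overrides):
    """Merge base params, a named profile, then explicit overrides (last wins)."""
    if profile not in PROFILES:
        raise KeyError(f"unknown profile: {profile}")
    return {**base, **PROFILES[profile], **overrides}


base = {
    "input_key": "raw_metabolites",
    "output_key": "matched_metabolites",
    "identifier_column": "compound_name",
}
params = build_params(base, profile="conservative", stage4_threshold=0.82)
```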

Performance Optimization

For large datasets (>10K metabolites):

performance_config:
  enable_parallel_stages: true
  stage2_chunk_size: 2000
  stage3_batch_size: 100
  stage4_batch_size: 500
  cache_intermediate_results: true

Custom Stage Configuration

Override individual stage parameters:

custom_stage_config:
  stage1_params:
    reference_source: "custom_metabolite_db"
    normalization_rules: ["remove_stereochemistry"]
  stage2_params:
    algorithm: "jaro_winkler"
    biological_synonyms: true
  stage3_params:
    target_databases: ["hmdb", "kegg", "chebi", "pubchem"]
    concurrent_requests: 3
  stage4_params:
    use_llm_validation: true
    embedding_cache: true

Error Handling and Recovery

Stage-Level Error Recovery

Each stage includes independent error handling:

Stage Isolation: Failure in one stage doesn’t affect others
Partial Results: Successfully completed stages preserve their matches
Error Reporting: Detailed logs for debugging stage-specific issues
Graceful Degradation: Pipeline continues with reduced functionality
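
Stage isolation and partial results can be pictured as a loop that wraps each stage in its own error handler. The sketch below assumes stages are plain callables; `run_with_isolation` is a hypothetical name and the logging setup is illustrative.

```python
import logging

logger = logging.getLogger("progressive_match")


def run_with_isolation(names, stages):
    """Run each stage in order; a failing stage is logged and skipped.

    Matches from stages that completed are preserved, and the names a failed
    stage would have handled stay in the pool for later stages.
    """
    matches, failed = {}, []
    remaining = list(names)
    for label, stage in stages:
        try:
            found = stage(remaining)
        except Exception as exc:  # broad on purpose: isolate any stage failure
            logger.warning("%s failed: %s", label, exc)
            failed.append(label)
            continue
        matches.update(found)
        remaining = [n for n in remaining if n not in found]
    return matches, remaining, failed
```

Returning the list of failed stages alongside the matches lets a caller report reduced functionality instead of aborting the whole run.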

Common Recovery Patterns

error_handling:
  stage1_fallback: "skip_to_stage2"
  stage2_timeout_action: "reduce_batch_size"
  stage3_api_failure: "retry_with_exponential_backoff"
  stage4_memory_error: "process_in_smaller_chunks"

Monitoring and Alerting

Built-in monitoring for production deployments:

Progress Tracking: Real-time progress updates for long-running pipelines
Performance Alerts: Notifications when stages exceed expected runtimes
Quality Alerts: Warnings when coverage drops below baselines
Resource Monitoring: Memory and API quota usage tracking

Real-World Case Studies

Arivale Metabolomics Dataset

Dataset: 1,351 unique metabolites from personalized medicine platform
Result: 77.9% coverage (1,053 matched metabolites)
Processing Time: 2 minutes 34 seconds
Stage Performance:
- Stage 1: 692 matches (51.2%)
- Stage 2: 201 additional (14.9%)
- Stage 3: 105 additional (7.8%)
- Stage 4: 55 additional (4.0%)

UK Biobank Subset

Dataset: 2,847 metabolite measurements from population study
Result: 42.3% coverage (1,204 matched metabolites)
Challenge: Heterogeneous naming conventions
Processing Time: 1 minute 47 seconds

Best Practices

Start with Balanced Configuration: Use recommended thresholds initially
Monitor Stage Performance: Track individual stage contributions
Validate Results: Manually review sample matches from each stage
Document Parameters: Record threshold choices and rationale
Benchmark Regularly: Compare performance against known datasets

Integration Patterns

Complete Pipeline Integration

name: comprehensive_metabolomics_pipeline
steps:
  - name: load_metabolites
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"
        identifier_column: metabolite_name
        output_key: raw_metabolites

  - name: progressive_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: raw_metabolites
        output_key: matched_metabolites
        identifier_column: metabolite_name

  - name: generate_report
    action:
      type: GENERATE_METABOLOMICS_REPORT
      params:
        input_key: matched_metabolites
        output_key: coverage_report

  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: matched_metabolites
        file_path: "${parameters.output_file}"

Multi-Dataset Comparison

# Compare matching across multiple datasets
steps:
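  - name: match_dataset_a
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: dataset_a
        output_key: matched_a
        identifier_column: metabolite_name  # required; use your dataset's column name

  - name: match_dataset_b
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: dataset_b
        output_key: matched_b
        identifier_column: metabolite_name  # required; use your dataset's column name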

  - name: compare_coverage
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset1_key: matched_a
        dataset2_key: matched_b

Troubleshooting

Low Overall Coverage

Check Input Data Quality: Verify metabolite names are clean and standardized
Adjust Thresholds: Lower thresholds to increase recall
Enable All Stages: Ensure no stages are being skipped
Review Reference Data: Confirm reference databases are current

Stage-Specific Issues

Stage 1 Low Coverage: Check reference data alignment and normalization
Stage 2 Timeouts: Reduce batch sizes or enable chunking
Stage 3 API Failures: Verify API credentials and network connectivity
Stage 4 Performance: Ensure vector database is properly indexed

Performance Problems

Memory Issues: Enable chunked processing for large datasets
API Timeouts: Increase timeout values and reduce batch sizes
Slow Processing: Enable parallel processing where supported
Resource Limits: Monitor CPU and memory usage during execution