Metabolomics Progressive Pipeline

Overview

The Metabolomics Progressive Pipeline v4.0 harmonizes metabolite identifiers across biological datasets. Its 4-stage progressive matching achieves 75-80% coverage on well-curated datasets such as Arivale; coverage on more heterogeneous datasets is lower (see Expected Performance Metrics).

The pipeline implements a “progressive matching” strategy: it starts with high-confidence exact matches and relaxes the matching criteria at each later stage to capture harder cases. Each stage operates only on the identifiers left unmatched by the previous one. Version 4.0 adds consolidated debugging, pre-flight validation, and incremental stage enabling for systematic testing.
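
The cascade described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: `progressive_match`, the toy reference dictionary, and the two stage functions are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A matcher takes a name and returns a target ID, or None if it cannot match.
Matcher = Callable[[str], Optional[str]]


def progressive_match(
    names: List[str], stages: List[Tuple[str, Matcher]]
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Run stages in order; each stage only sees names earlier stages missed."""
    matched: Dict[str, Tuple[str, str]] = {}  # name -> (target_id, stage_label)
    remaining = list(names)
    for label, matcher in stages:
        still_unmatched = []
        for name in remaining:
            target = matcher(name)
            if target is None:
                still_unmatched.append(name)
            else:
                matched[name] = (target, label)  # record which stage matched
        remaining = still_unmatched
    return matched, remaining


# Toy reference and two hypothetical stages: strict first, relaxed second.
reference = {"glucose": "REF:1", "lactate": "REF:2"}
stages = [
    ("exact", lambda n: reference.get(n)),
    ("case_insensitive", lambda n: reference.get(n.strip().lower())),
]

matched, unmatched = progressive_match(["glucose", " Lactate ", "unknown-x"], stages)
print(matched)
print(unmatched)
```

Storing the stage label next to each match keeps provenance available for later quality review.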

Pipeline Architecture

4-Stage Progressive Matching

graph TD
  A[Input Dataset] --> B[Stage 1: Direct Matching]
  B --> C[Stage 2: Fuzzy String Matching]
  C --> D[Stage 3: RampDB Bridge]
  D --> E[Stage 4: HMDB Vector Matching]
  E --> F[Results Export & Visualization]
  F --> G[Google Drive Upload]

Stage Breakdown

Stage 1: Direct/Exact Matching

Action: NIGHTINGALE_NMR_MATCH
Coverage: ~45-55% (high confidence)
Speed: <2 seconds for 10K identifiers
Method: Exact string matching against Nightingale reference with fuzzy fallback
Features: Built-in biomarker patterns, abbreviation expansion, lipoprotein recognition

Stage 2: Fuzzy String Matching

Action: METABOLITE_FUZZY_STRING_MATCH
Coverage: +15-20% additional
Speed: ~5-10 seconds
Method: Token sort ratio with fuzzywuzzy (threshold: 85%)
Cost: $0.00 (algorithmic matching, no API calls)
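
To make the Stage 2 method concrete, here is a rough sketch of token sort ratio matching. It uses the standard library's `difflib` as a stand-in so that it runs without third-party packages. Scores from fuzzywuzzy's `fuzz.token_sort_ratio` may differ slightly because its preprocessing is not reproduced here, and `best_fuzzy_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def token_sort_ratio(a: str, b: str) -> float:
    """Score 0-100 after lowercasing and sorting whitespace-separated tokens."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def best_fuzzy_match(
    name: str, candidates: List[str], threshold: float = 85.0
) -> Optional[Tuple[str, float]]:
    """Return the best-scoring candidate at or above the threshold, else None."""
    if not candidates:
        return None
    scored = [(c, token_sort_ratio(name, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Word order does not matter once tokens are sorted.
print(best_fuzzy_match("Acid Citric", ["citric acid", "lactic acid"]))
# A name with no close candidate falls through to the next stage.
print(best_fuzzy_match("xyz", ["citric acid", "lactic acid"]))
```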

Stage 3: RampDB API Bridge

Action: METABOLITE_RAMPDB_BRIDGE
Coverage: +8-12% additional
Speed: ~30-60 seconds (API dependent)
Method: External API calls to RampDB service
Note: Requires active RampDB API access

Stage 4: Vector Semantic Matching

Action: HMDB_VECTOR_MATCH
Coverage: +5-10% additional
Speed: ~10-20 seconds
Method: FastEmbed with Qdrant vector database similarity search
Requirements: Qdrant storage with pre-computed HMDB embeddings
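
The idea behind Stage 4 is nearest-neighbour search over embeddings with a similarity cutoff. The sketch below illustrates that idea in plain Python with toy 3-dimensional vectors. It does not call FastEmbed or Qdrant, and the reference IDs and the `nearest_above_threshold` helper are made up for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_above_threshold(
    query: List[float], index: Dict[str, List[float]], threshold: float = 0.75
) -> Optional[Tuple[str, float]]:
    """Return (reference_id, score) for the closest vector if it clears the threshold."""
    best_id, best_score = None, -1.0
    for ref_id, vector in index.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_id, best_score = ref_id, score
    if best_id is not None and best_score >= threshold:
        return best_id, best_score
    return None


# Toy "embeddings"; real ones would come from the embedding model.
index = {"ref_a": [1.0, 0.0, 0.0], "ref_b": [0.0, 1.0, 0.0]}
close_hit = nearest_above_threshold([0.9, 0.1, 0.0], index)
no_hit = nearest_above_threshold([1.0, 1.0, 1.0], index)
print(close_hit)
print(no_hit)
```

A vector database performs the same comparison with an approximate index so that it scales to a full reference collection.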

Expected Performance Metrics

Coverage Statistics

Based on production runs with real biological datasets:

Dataset Type	Expected Coverage	Processing Time	Confidence Level
Arivale Metabolomics	75-80% (1,053/1,351)	2-3 minutes	High
UK Biobank	40-45% (varies by subset)	1-2 minutes	Medium
Custom Datasets	50-70% (depends on quality)	Variable	Variable

Stage-by-Stage Coverage Accumulation

Typical progression for a 1,000 metabolite dataset:

After Stage 1: ~500 matched (50%)
After Stage 2: ~650 matched (65%)
After Stage 3: ~720 matched (72%)
After Stage 4: ~750 matched (75%)
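
The running totals above follow from the per-stage increments (500, then 150, 70, and 30 new matches). A short sketch of the arithmetic, with `cumulative_coverage` as a hypothetical helper name:

```python
from itertools import accumulate
from typing import List


def cumulative_coverage(total: int, new_matches_per_stage: List[int]) -> List[float]:
    """Cumulative coverage in percent after each stage."""
    return [100 * running / total for running in accumulate(new_matches_per_stage)]


increments = [500, 150, 70, 30]  # new matches contributed by stages 1-4
coverage = cumulative_coverage(1000, increments)
for stage, (added, pct) in enumerate(zip(increments, coverage), start=1):
    print(f"After Stage {stage}: +{added} new, {pct:.0f}% cumulative")
```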

Implementation

YAML Strategy Configuration

name: met_arv_to_ukbb_progressive_v4.0
description: |
  Consolidated progressive metabolomics mapping pipeline with extensive debugging.
  Features systematic stage-by-stage execution with comprehensive logging.

parameters:
  # Core paths - MUST be absolute paths
  file_path: /procedure/data/local_data/MAPPING_ONTOLOGIES/arivale/metabolomics_metadata.tsv
  reference_file: /procedure/data/local_data/MAPPING_ONTOLOGIES/ukbb/UKBB_NMR_Meta.tsv
  output_dir: ${OUTPUT_DIR:-/tmp/biomapper/met_arv_to_ukbb_v4.0}

  # Debug controls - CRITICAL for troubleshooting
  debug_mode: true
  verbose_logging: true
  fail_on_warning: false
  validate_parameters: true

  # Stage control - Enable incrementally for testing
  stages_to_run: [1,2,3,4]  # Full pipeline

  # Column specifications
  identifier_column: BIOCHEMICAL_NAME
  hmdb_column: HMDB
  pubchem_column: PUBCHEM
  kegg_column: KEGG
  cas_column: CAS

  # Thresholds (conservative)
  stage_1_threshold: 0.95
  stage_2_threshold: 0.85
  stage_3_threshold: 0.70
  stage_4_threshold: 0.75

steps:
  # Pre-flight validation
  - name: validate_environment
    action:
      type: CUSTOM_TRANSFORM
      params:
        input_key: dummy
        output_key: validation_results
        transformations:
          - column: timestamp
            expression: |
              # Create the output directory, then record a validation timestamp
              from datetime import datetime
              from pathlib import Path
              Path("${parameters.output_dir}").mkdir(parents=True, exist_ok=True)
              datetime.now().isoformat()

  # Stage 1: Nightingale NMR matching
  # NOTE: assumes earlier load steps (not shown here) populate arivale_raw and reference_raw
  - name: stage_1_nightingale_match
    action:
      type: NIGHTINGALE_NMR_MATCH
      params:
        input_key: arivale_raw
        output_key: nightingale_matched
        biomarker_column: "${parameters.identifier_column}"
        match_threshold: "${parameters.stage_1_threshold}"
        target_format: both
        add_metadata: true

  # Stage 2: Fuzzy string matching
  - name: stage_2_fuzzy_match
    action:
      type: METABOLITE_FUZZY_STRING_MATCH
      params:
        unmapped_key: nightingale_unmapped
        reference_key: reference_raw
        output_key: fuzzy_matched
        final_unmapped_key: fuzzy_unmapped
        fuzzy_threshold: "${parameters.stage_2_threshold}"
    condition: 2 in ${parameters.stages_to_run}

  # Stage 3: RampDB API bridge
  - name: stage_3_rampdb_bridge
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params:
        unmapped_key: fuzzy_unmapped
        output_key: rampdb_matched
        final_unmapped_key: rampdb_unmapped
        confidence_threshold: "${parameters.stage_3_threshold}"
    condition: 3 in ${parameters.stages_to_run}

  # Stage 4: HMDB vector matching
  - name: stage_4_hmdb_vector
    action:
      type: HMDB_VECTOR_MATCH
      params:
        input_key: rampdb_unmapped
        output_key: stage_4_matched
        unmatched_key: stage_4_unmatched
        identifier_column: "${parameters.identifier_column}"
        threshold: "${parameters.stage_4_threshold}"
        collection_name: hmdb_metabolites
        qdrant_path: /home/ubuntu/biomapper/data/qdrant_storage
        embedding_model: sentence-transformers/all-MiniLM-L6-v2
        enable_llm_validation: false
    condition: 4 in ${parameters.stages_to_run}

  # Results consolidation
  - name: merge_all_matches
    action:
      type: MERGE_DATASETS
      params:
        dataset_keys:
          - nightingale_matched
          - fuzzy_matched
          - rampdb_matched
          - stage_4_matched
        merge_type: concat
        deduplicate: true
        output_key: all_matches

  # Export final results
  - name: export_final_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: all_matches
        output_path: "${parameters.output_dir}/final_results.tsv"
        format: tsv
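
Before submitting a strategy like the one above, it can help to check its parameters locally. The following is one possible pre-flight check, written under the assumption that parameters arrive as a plain dictionary. `validate_parameters` is a hypothetical helper and not part of the pipeline's API.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, List


def validate_parameters(params: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []

    # Core paths must be present and absolute.
    for key in ("file_path", "reference_file"):
        value = params.get(key)
        if not value:
            problems.append(f"{key} is missing")
        elif not PurePosixPath(value).is_absolute():
            problems.append(f"{key} must be an absolute path: {value}")

    # Every stage_N_threshold must be a number between 0 and 1.
    for key, value in params.items():
        if key.startswith("stage_") and key.endswith("_threshold"):
            if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
                problems.append(f"{key} must be a number in [0, 1]: {value!r}")

    # stages_to_run must name at least one of the four stages.
    stages = params.get("stages_to_run", [])
    if not stages or any(s not in (1, 2, 3, 4) for s in stages):
        problems.append(f"stages_to_run must be a non-empty subset of 1-4: {stages!r}")

    return problems


good = {
    "file_path": "/data/arivale_metabolites.tsv",
    "reference_file": "/data/ukbb_nmr_reference.tsv",
    "stage_2_threshold": 0.85,
    "stages_to_run": [1, 2],
}
bad = {"file_path": "relative/path.tsv", "stage_2_threshold": 85, "stages_to_run": []}

print(validate_parameters(good))
for problem in validate_parameters(bad):
    print(problem)
```

Returning every problem at once, instead of raising on the first one, suits incremental testing because a single run surfaces all misconfigurations.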

Python Client Usage

from src.client.client_v2 import BiomapperClient
import asyncio

async def run_metabolomics_pipeline():
    client = BiomapperClient(base_url="http://localhost:8000")

    # Run the complete v4.0 pipeline
    result = await client.run_strategy(
        strategy_name="met_arv_to_ukbb_progressive_v4.0",
        parameters={
            "file_path": "/data/arivale_metabolites.tsv",
            "reference_file": "/data/ukbb_nmr_reference.tsv",
            "output_dir": "/results/metabolomics_v4",
            "stages_to_run": [1, 2, 3, 4],  # Full pipeline
            "debug_mode": True
        }
    )

    print(f"Pipeline completed with {result.total_matched} matches")
    print(f"Coverage: {result.coverage:.1f}%")
    print(f"Results saved to: {result.output_files}")

    return result

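# Synchronous wrapper for scripts (runs with the strategy's default parameters)
def run_pipeline_sync():
    client = BiomapperClient()
    return client.run("met_arv_to_ukbb_progressive_v4.0")

# Run the pipeline
if __name__ == "__main__":
    # Async variant with explicit parameters:
    #     result = asyncio.run(run_metabolomics_pipeline())
    result = run_pipeline_sync()
    print(f"Final coverage: {result.coverage:.1f}%")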

Advanced Configuration

Threshold Optimization

Fine-tune matching thresholds based on your dataset characteristics:

# Option A: conservative (higher precision)
stage_1_threshold: 0.98
stage_2_threshold: 0.85
stage_4_threshold: 0.80

# Option B: aggressive (higher recall); use instead of Option A, not together
stage_1_threshold: 0.90
stage_2_threshold: 0.75
stage_4_threshold: 0.70
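
One way to choose between such settings is to score a small, manually reviewed sample at each candidate threshold. The sketch below assumes you have (score, is_correct) pairs from such a review. The sample values are invented purely for illustration and `precision_recall` is a hypothetical helper.

```python
from typing import List, Tuple


def precision_recall(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of accepting every candidate with score >= threshold."""
    accepted = [ok for score, ok in scored if score >= threshold]
    total_correct = sum(1 for _, ok in scored if ok)
    true_positives = sum(1 for ok in accepted if ok)
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


# Invented review sample: (match score, reviewer judged the match correct).
sample = [
    (0.99, True), (0.93, True), (0.88, True),
    (0.82, False), (0.78, True), (0.72, False),
]

for threshold in (0.90, 0.75):
    p, r = precision_recall(sample, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy sample the higher threshold accepts only correct matches but misses half of them, which is the precision/recall trade-off the two presets represent.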

Performance Tuning

For large datasets (>10K metabolites):

# Enable chunking for large datasets
chunk_processing: true
chunk_size: 5000

# Optimize API calls
rampdb_batch_size: 100
rampdb_timeout: 45

# Vector search optimization
vector_max_results: 5
vector_batch_size: 200
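
Chunked processing amounts to slicing the input into fixed-size batches and handling one batch at a time. A minimal sketch follows; `iter_chunks` is a hypothetical helper and the generated names are placeholders.

```python
from typing import Iterator, Sequence


def iter_chunks(items: Sequence[str], chunk_size: int = 5000) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


names = [f"metabolite_{i}" for i in range(12000)]  # placeholder identifiers
sizes = [len(chunk) for chunk in iter_chunks(names, chunk_size=5000)]
print(sizes)  # [5000, 5000, 2000]
```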

Quality Control Configuration

Add validation and quality checks:

# Enable LLM validation for Stage 4
enable_llm_validation: true
llm_confidence_threshold: 0.7

# Add quality metrics tracking
track_confidence_scores: true
generate_quality_report: true

# Export unmatched for manual review
export_unmatched: true
unmatched_file_path: "${parameters.output_dir}/unmatched_metabolites.csv"

Real-World Case Studies

Arivale Metabolomics Dataset

Dataset Characteristics:
- Size: 1,351 unique metabolites after filtering
- Source: Arivale personalized medicine platform
- Quality: High-quality, curated metabolite names

Results:
- Total Coverage: 77.9% (1,053 matched metabolites)
- Stage 1: 692 matches (51.2%)
- Stage 2: 201 additional matches (14.9%)
- Stage 3: 105 additional matches (7.8%)
- Stage 4: 55 additional matches (4.0%)
- Processing Time: 2 minutes 34 seconds

UK Biobank Subset

Dataset Characteristics:
- Size: 2,847 metabolite measurements
- Source: UK Biobank metabolomics data
- Quality: Variable, research-grade identifiers

Results:
- Total Coverage: 42.3% (1,204 matched metabolites)
- Processing Time: 1 minute 47 seconds
- Challenge: More heterogeneous naming conventions

Troubleshooting Common Issues

Low Coverage Issues

Check Data Quality
- Verify metabolite names are clean (no extra whitespace)
- Check for non-standard naming conventions
- Review identifier column selection
Adjust Thresholds
- Lower fuzzy matching threshold (0.85 → 0.75)
- Increase vector similarity candidates
- Enable LLM validation for borderline cases
Data Preprocessing
- Normalize metabolite names (case, punctuation)
- Handle synonyms and alternative names
- Remove or standardize chemical formulas
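
As an illustration of the preprocessing items above, here is a conservative name normalizer. It is only a sketch: `normalize_name` is a hypothetical helper, and it deliberately leaves hyphens, commas, and parentheses alone because they carry meaning in chemical names.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize case and whitespace without touching chemically meaningful punctuation."""
    text = unicodedata.normalize("NFKC", name)  # unify Unicode variants
    text = text.casefold()                      # case-insensitive comparison form
    text = text.replace("_", " ")               # underscores often stand in for spaces
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text.rstrip(" .;")                   # drop stray trailing punctuation


for raw in ["  Citric_Acid ", "L-Lactate\t", "1,2-Propanediol;"]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Applying the same function to both the input names and the reference names keeps the comparison symmetric.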

Performance Issues

API Timeouts (Stage 3)
- Increase RampDB timeout settings
- Reduce batch sizes for API calls
- Implement retry logic with exponential backoff
Memory Issues
- Enable chunked processing for large datasets
- Reduce vector search candidates
- Process dataset in smaller batches
Slow Processing
- Skip stages with low expected yield
- Parallelize independent operations
- Use cached results when available
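
For the Stage 3 timeout advice, retry logic with exponential backoff could look roughly like this. It is a generic sketch, not RampDB-specific: `call_with_backoff` is a hypothetical helper, and `flaky` simulates an API call that times out twice before succeeding.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given errors with delays of 1x, 2x, 4x... base_delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    """Simulated API call: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays: List[float] = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing the sleep function as a parameter keeps the helper testable without real waiting.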

Quality Validation

Confidence Score Review
- Check distribution of matching scores
- Manually validate low-confidence matches
- Adjust thresholds based on validation results
Coverage Analysis
- Compare against expected baselines
- Identify systematic naming issues
- Review unmatched metabolites for patterns
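
The confidence score review can start from two simple summaries: a histogram of scores and a list of matches that only narrowly cleared the threshold. The sketch below assumes matches are available as (name, score) pairs. Both helper names and the sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def score_histogram(scores: Iterable[float]) -> Dict[str, int]:
    """Count scores in bins of width 0.1 (a score of 1.0 goes in the top bin)."""
    counts = Counter(min(round(s * 100) // 10, 9) for s in scores)
    return {f"{b / 10:.1f}-{(b + 1) / 10:.1f}": counts[b] for b in sorted(counts)}


def flag_for_review(
    matches: List[Tuple[str, float]], threshold: float, margin: float = 0.05
) -> List[str]:
    """Names whose score is within `margin` above the acceptance threshold."""
    return [name for name, score in matches if threshold <= score < threshold + margin]


# Invented matches for illustration.
matches = [("a", 0.99), ("b", 0.86), ("c", 0.87), ("d", 0.93)]
histogram = score_histogram(score for _, score in matches)
borderline = flag_for_review(matches, threshold=0.85)
print(histogram)
print(borderline)
```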

Best Practices

Pipeline Design

Start Conservative: Use high thresholds initially, then relax
Track Provenance: Maintain matching source information
Quality Metrics: Monitor confidence scores throughout
Incremental Improvement: Optimize one stage at a time

Data Preparation

Clean Input Data: Remove duplicates, normalize formatting
Validate Identifiers: Check for common naming issues
Backup Originals: Preserve original identifiers for reference
Document Assumptions: Record data preprocessing decisions

Production Deployment

Version Control: Tag strategy versions for reproducibility
Monitoring: Track pipeline performance over time
Validation: Regular spot-checks of matching quality
Documentation: Maintain parameter reasoning and tuning history

Integration with Other Pipelines

Multi-Omics Workflows

The metabolomics pipeline integrates with protein and chemistry pipelines:

# Combined multi-omics strategy
steps:
  - name: process_metabolites
    strategy: met_arv_to_ukbb_progressive_v4.0

  - name: process_proteins
    strategy: protein_harmonization_v2

  - name: cross_validate_results
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset1_key: "metabolite_results"
        dataset2_key: "protein_results"

Downstream Analysis Integration

Pipeline results feed into analysis workflows:

Pathway Analysis: Matched identifiers → pathway enrichment
Network Analysis: Cross-dataset connections and interactions
Visualization: Comprehensive multi-omics visualizations
Statistics: Coverage and quality metrics reporting