Metabolomics Progressive Pipeline

Overview

The Metabolomics Progressive Pipeline v4.0 harmonizes metabolite identifiers across biological datasets. Its 4-stage progressive matching reaches 75-80% coverage on the Arivale metabolomics dataset; coverage on other datasets is typically lower (see Expected Performance Metrics).

The pipeline uses a “progressive matching” strategy: it starts with high-confidence exact matches and relaxes the matching criteria stage by stage, passing only the still-unmatched identifiers to the next stage. Version 4.0 adds consolidated debugging, pre-flight validation, and incremental stage enabling for systematic testing.
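
The progressive idea can be illustrated with a minimal sketch. It is not the pipeline's implementation: `run_progressive`, the two toy matchers, and the `REF_*` identifiers are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A matcher takes one identifier and returns a reference ID, or None if no match.
Matcher = Callable[[str], Optional[str]]


def run_progressive(
    identifiers: List[str], stages: List[Tuple[str, Matcher]]
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Try each stage in order; only unmatched identifiers move to the next stage."""
    matched: Dict[str, Tuple[str, str]] = {}
    remaining = list(identifiers)
    for stage_name, matcher in stages:
        still_unmatched = []
        for name in remaining:
            hit = matcher(name)
            if hit is not None:
                matched[name] = (hit, stage_name)  # keep the stage as provenance
            else:
                still_unmatched.append(name)
        remaining = still_unmatched
    return matched, remaining


# Toy reference with made-up IDs (assumption, not real database entries).
REFERENCE = {"Glucose": "REF_GLC", "Alanine": "REF_ALA"}
LOWERED = {key.lower(): value for key, value in REFERENCE.items()}


def exact(name: str) -> Optional[str]:
    return REFERENCE.get(name)


def relaxed(name: str) -> Optional[str]:
    return LOWERED.get(name.strip().lower())


matched, unmatched = run_progressive(
    ["Glucose", " alanine ", "unknown-42"],
    [("exact", exact), ("relaxed", relaxed)],
)
print(matched)
print(unmatched)
```

Recording the stage name next to each match keeps the provenance needed to review lower-confidence stages separately.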

Pipeline Architecture

4-Stage Progressive Matching

graph TD
  A[Input Dataset] --> B[Stage 1: Direct Matching]
  B --> C[Stage 2: Fuzzy String Matching]
  C --> D[Stage 3: RampDB Bridge]
  D --> E[Stage 4: HMDB Vector Matching]
  E --> F[Results Export & Visualization]
  F --> G[Google Drive Upload]

Stage Breakdown

Stage 1: Direct/Exact Matching
  • Action: NIGHTINGALE_NMR_MATCH

  • Coverage: ~45-55% (high confidence)

  • Speed: <2 seconds for 10K identifiers

  • Method: Exact string matching against Nightingale reference with fuzzy fallback

  • Features: Built-in biomarker patterns, abbreviation expansion, lipoprotein recognition

Stage 2: Fuzzy String Matching
  • Action: METABOLITE_FUZZY_STRING_MATCH

  • Coverage: +15-20% additional

  • Speed: ~5-10 seconds

  • Method: Token sort ratio with fuzzywuzzy (threshold: 85%)

  • Cost: $0.00 (algorithmic matching, no API calls)
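
A token sort ratio compares two names after sorting their tokens, so word order does not affect the score. The sketch below approximates the idea with the standard library's `difflib` rather than fuzzywuzzy, so its scores may differ slightly from the library's; `token_sort_score` and `best_match` are hypothetical helper names.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def token_sort_score(a: str, b: str) -> int:
    """Order-insensitive similarity on a 0-100 scale."""

    def normalize(text: str) -> str:
        # Lowercase, split on whitespace, and sort tokens alphabetically.
        return " ".join(sorted(text.lower().split()))

    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


def best_match(
    name: str, candidates: List[str], threshold: int = 85
) -> Optional[Tuple[str, int]]:
    """Return the best-scoring candidate if it reaches the threshold, else None."""
    if not candidates:
        return None
    score, candidate = max((token_sort_score(name, c), c) for c in candidates)
    return (candidate, score) if score >= threshold else None


print(token_sort_score("HDL cholesterol", "cholesterol HDL"))
print(best_match("total cholesterol", ["Cholesterol total", "Glucose"]))
print(best_match("xyz", ["Glucose"]))
```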

Stage 3: RampDB API Bridge
  • Action: METABOLITE_RAMPDB_BRIDGE

  • Coverage: +8-12% additional

  • Speed: ~30-60 seconds (API dependent)

  • Method: External API calls to RampDB service

  • Note: Requires active RampDB API access

Stage 4: Vector Semantic Matching
  • Action: HMDB_VECTOR_MATCH

  • Coverage: +5-10% additional

  • Speed: ~10-20 seconds

  • Method: FastEmbed with Qdrant vector database similarity search

  • Requirements: Qdrant storage with pre-computed HMDB embeddings
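
The core of vector matching is a nearest-neighbour search with a similarity cutoff. The sketch below shows that step in pure Python with toy 2-dimensional vectors; it stands in for FastEmbed (which would produce the embeddings) and Qdrant (which would run the search), and `TOY_INDEX` and `nearest` are hypothetical names.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(
    query: List[float], index: Dict[str, List[float]], threshold: float = 0.75
) -> Optional[Tuple[str, float]]:
    """Return the most similar reference entry if it reaches the threshold."""
    best_id, best_score = None, -1.0
    for ref_id, vector in index.items():
        score = cosine(query, vector)
        if score > best_score:
            best_id, best_score = ref_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score


# Toy "embeddings"; real ones would have hundreds of dimensions.
TOY_INDEX = {"ref_a": [1.0, 0.0], "ref_b": [0.0, 1.0]}

print(nearest([0.9, 0.1], TOY_INDEX))  # close to ref_a
print(nearest([1.0, 1.0], TOY_INDEX))  # equally far from both, below threshold
```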

Expected Performance Metrics

Coverage Statistics

Based on production runs with real biological datasets:

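| Dataset Type         | Expected Coverage           | Processing Time | Confidence Level |
|----------------------|-----------------------------|-----------------|------------------|
| Arivale Metabolomics | 75-80% (1,053/1,351)        | 2-3 minutes     | High             |
| UK Biobank           | 40-45% (varies by subset)   | 1-2 minutes     | Medium           |
| Custom Datasets      | 50-70% (depends on quality) | Variable        | Variable         |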

Stage-by-Stage Coverage Accumulation

Typical progression for a 1,000 metabolite dataset:

  • After Stage 1: ~500 matched (50%)

  • After Stage 2: ~650 matched (65%)

  • After Stage 3: ~720 matched (72%)

  • After Stage 4: ~750 matched (75%)
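
The cumulative figures above follow from the per-stage increments (500, 150, 70, and 30 new matches). A small helper, named `cumulative_coverage` here purely for illustration, reproduces the arithmetic:

```python
from typing import List, Tuple


def cumulative_coverage(
    total: int, new_matches_per_stage: List[int]
) -> List[Tuple[int, int, float]]:
    """Return (stage, cumulative matches, cumulative coverage %) for each stage."""
    rows = []
    matched = 0
    for stage, new_matches in enumerate(new_matches_per_stage, start=1):
        matched += new_matches
        rows.append((stage, matched, round(100 * matched / total, 1)))
    return rows


rows = cumulative_coverage(1000, [500, 150, 70, 30])
for stage, matched, percent in rows:
    print(f"After Stage {stage}: {matched} matched ({percent}%)")
```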

Implementation

YAML Strategy Configuration

name: met_arv_to_ukbb_progressive_v4.0
description: |
  Consolidated progressive metabolomics mapping pipeline with extensive debugging.
  Features systematic stage-by-stage execution with comprehensive logging.

parameters:
  # Core paths - MUST be absolute paths
  file_path: /procedure/data/local_data/MAPPING_ONTOLOGIES/arivale/metabolomics_metadata.tsv
  reference_file: /procedure/data/local_data/MAPPING_ONTOLOGIES/ukbb/UKBB_NMR_Meta.tsv
  output_dir: ${OUTPUT_DIR:-/tmp/biomapper/met_arv_to_ukbb_v4.0}

  # Debug controls - CRITICAL for troubleshooting
  debug_mode: true
  verbose_logging: true
  fail_on_warning: false
  validate_parameters: true

  # Stage control - Enable incrementally for testing
  stages_to_run: [1,2,3,4]  # Full pipeline

  # Column specifications
  identifier_column: BIOCHEMICAL_NAME
  hmdb_column: HMDB
  pubchem_column: PUBCHEM
  kegg_column: KEGG
  cas_column: CAS

  # Thresholds (conservative)
  stage_1_threshold: 0.95
  stage_2_threshold: 0.85
  stage_3_threshold: 0.70
  stage_4_threshold: 0.75

steps:
  # Pre-flight validation
  - name: validate_environment
    action:
      type: CUSTOM_TRANSFORM
      params:
        input_key: dummy
        output_key: validation_results
        transformations:
          - column: timestamp
            expression: |
              # Validate output directory and parameters
              from datetime import datetime
              from pathlib import Path
              Path("${parameters.output_dir}").mkdir(parents=True, exist_ok=True)
              datetime.now().isoformat()

  # Stage 1: Nightingale NMR matching
  - name: stage_1_nightingale_match
    action:
      type: NIGHTINGALE_NMR_MATCH
      params:
        input_key: arivale_raw
        output_key: nightingale_matched
        biomarker_column: "${parameters.identifier_column}"
        match_threshold: "${parameters.stage_1_threshold}"
        target_format: both
        add_metadata: true

  # Stage 2: Fuzzy string matching
  - name: stage_2_fuzzy_match
    action:
      type: METABOLITE_FUZZY_STRING_MATCH
      params:
        unmapped_key: nightingale_unmapped
        reference_key: reference_raw
        output_key: fuzzy_matched
        final_unmapped_key: fuzzy_unmapped
        fuzzy_threshold: "${parameters.stage_2_threshold}"
    condition: 2 in ${parameters.stages_to_run}

  # Stage 3: RampDB API bridge
  - name: stage_3_rampdb_bridge
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params:
        unmapped_key: fuzzy_unmapped
        output_key: rampdb_matched
        final_unmapped_key: rampdb_unmapped
        confidence_threshold: "${parameters.stage_3_threshold}"
    condition: 3 in ${parameters.stages_to_run}

  # Stage 4: HMDB vector matching
  - name: stage_4_hmdb_vector
    action:
      type: HMDB_VECTOR_MATCH
      params:
        input_key: rampdb_unmapped
        output_key: stage_4_matched
        unmatched_key: stage_4_unmatched
        identifier_column: "${parameters.identifier_column}"
        threshold: "${parameters.stage_4_threshold}"
        collection_name: hmdb_metabolites
        qdrant_path: /home/ubuntu/biomapper/data/qdrant_storage
        embedding_model: sentence-transformers/all-MiniLM-L6-v2
        enable_llm_validation: false
    condition: 4 in ${parameters.stages_to_run}

  # Results consolidation
  - name: merge_all_matches
    action:
      type: MERGE_DATASETS
      params:
        dataset_keys:
          - nightingale_matched
          - fuzzy_matched
          - rampdb_matched
          - stage_4_matched
        merge_type: concat
        deduplicate: true
        output_key: all_matches

  # Export final results
  - name: export_final_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: all_matches
        output_path: "${parameters.output_dir}/final_results.tsv"
        format: tsv
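
Before submitting a strategy like the one above, it can help to check the parameters locally. The following is a sketch of such a pre-flight check, not the pipeline's own `validate_parameters` logic; the function name and the specific rules (absolute paths, stages in 1-4, thresholds in [0, 1]) are assumptions drawn from the comments in the configuration.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, List


def validate_parameters(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the parameters look sane."""
    problems = []

    # Core paths must be absolute.
    for key in ("file_path", "reference_file"):
        value = str(params.get(key, ""))
        if not PurePosixPath(value).is_absolute():
            problems.append(f"{key} must be an absolute path, got {value!r}")

    # Stages must be a non-empty subset of 1..4.
    stages = params.get("stages_to_run", [])
    if not stages or any(stage not in (1, 2, 3, 4) for stage in stages):
        problems.append(f"stages_to_run must be a non-empty subset of 1-4, got {stages!r}")

    # Thresholds, when present, must lie in [0, 1].
    for number in (1, 2, 3, 4):
        key = f"stage_{number}_threshold"
        threshold = params.get(key)
        if threshold is not None and not 0.0 <= threshold <= 1.0:
            problems.append(f"{key} must be between 0 and 1, got {threshold!r}")

    return problems


good = {
    "file_path": "/data/arivale_metabolites.tsv",
    "reference_file": "/data/ukbb_nmr_reference.tsv",
    "stages_to_run": [1, 2],
    "stage_2_threshold": 0.85,
}
bad = dict(good, file_path="data/relative.tsv", stages_to_run=[1, 5], stage_2_threshold=1.5)

print(validate_parameters(good))
for problem in validate_parameters(bad):
    print(problem)
```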

Python Client Usage

from src.client.client_v2 import BiomapperClient
import asyncio

async def run_metabolomics_pipeline():
    client = BiomapperClient(base_url="http://localhost:8000")

    # Run the complete v4.0 pipeline
    result = await client.run_strategy(
        strategy_name="met_arv_to_ukbb_progressive_v4.0",
        parameters={
            "file_path": "/data/arivale_metabolites.tsv",
            "reference_file": "/data/ukbb_nmr_reference.tsv",
            "output_dir": "/results/metabolomics_v4",
            "stages_to_run": [1, 2, 3, 4],  # Full pipeline
            "debug_mode": True
        }
    )

    print(f"Pipeline completed with {result.total_matched} matches")
    print(f"Coverage: {result.coverage:.1f}%")
    print(f"Results saved to: {result.output_files}")

    return result

# Synchronous wrapper for scripts
def run_pipeline_sync():
    client = BiomapperClient()
    return client.run("met_arv_to_ukbb_progressive_v4.0")

# Run the pipeline
if __name__ == "__main__":
    result = run_pipeline_sync()
    # Async alternative: result = asyncio.run(run_metabolomics_pipeline())
    print(f"Final coverage: {result.coverage:.1f}%")

Advanced Configuration

Threshold Optimization

Fine-tune matching thresholds based on your dataset characteristics:

# Conservative (higher precision)
stage_1_threshold: 0.98
stage_2_threshold: 0.85
stage_4_threshold: 0.80

# Aggressive (higher recall)
stage_1_threshold: 0.90
stage_2_threshold: 0.75
stage_4_threshold: 0.70
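
The precision/recall trade-off behind these presets can be checked on a small hand-labelled sample. The sketch below assumes you have candidate matches with a score and a manual correct/incorrect label; the `SAMPLE` values are made up for illustration and are not pipeline output.

```python
from typing import List, Tuple


def precision_recall(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when accepting every candidate with score >= threshold."""
    accepted = [is_correct for score, is_correct in scored if score >= threshold]
    total_correct = sum(1 for _, is_correct in scored if is_correct)
    true_positives = sum(accepted)
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


# (score, manually verified as correct?) -- hypothetical labelled sample
SAMPLE = [(0.99, True), (0.92, True), (0.88, False), (0.80, True), (0.72, False)]

for threshold in (0.95, 0.85, 0.75):
    precision, recall = precision_recall(SAMPLE, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy sample, raising the threshold increases precision and lowers recall, which is the pattern the conservative and aggressive presets trade on.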

Performance Tuning

For large datasets (>10K metabolites):

# Enable chunking for large datasets
chunk_processing: true
chunk_size: 5000

# Optimize API calls
rampdb_batch_size: 100
rampdb_timeout: 45

# Vector search optimization
vector_max_results: 5
vector_batch_size: 200
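
Chunked processing simply splits the identifier list into fixed-size slices that are handled one at a time. A minimal sketch of that splitting step, with `chunked` as a hypothetical helper name and the default size taken from `chunk_size` above:

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], chunk_size: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


identifiers = [f"metabolite_{i}" for i in range(12000)]
sizes = [len(chunk) for chunk in chunked(identifiers, chunk_size=5000)]
print(sizes)
```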

Quality Control Configuration

Add validation and quality checks:

# Enable LLM validation for Stage 4
enable_llm_validation: true
llm_confidence_threshold: 0.7

# Add quality metrics tracking
track_confidence_scores: true
generate_quality_report: true

# Export unmatched for manual review
export_unmatched: true
unmatched_file_path: "${parameters.output_dir}/unmatched_metabolites.csv"

Real-World Case Studies

Arivale Metabolomics Dataset

Dataset Characteristics:

  • Size: 1,351 unique metabolites after filtering

  • Source: Arivale personalized medicine platform

  • Quality: High-quality, curated metabolite names

Results:

  • Total Coverage: 77.9% (1,053 matched metabolites)

  • Stage 1: 692 matches (51.2%)

  • Stage 2: 201 additional matches (14.9%)

  • Stage 3: 105 additional matches (7.8%)

  • Stage 4: 55 additional matches (4.1%)

  • Processing Time: 2 minutes 34 seconds

UK Biobank Subset

Dataset Characteristics:

  • Size: 2,847 metabolite measurements

  • Source: UK Biobank metabolomics data

  • Quality: Variable, research-grade identifiers

Results:

  • Total Coverage: 42.3% (1,204 matched metabolites)

  • Processing Time: 1 minute 47 seconds

  • Challenge: More heterogeneous naming conventions

Troubleshooting Common Issues

Low Coverage Issues

  1. Check Data Quality

    • Verify metabolite names are clean (no extra whitespace)

    • Check for non-standard naming conventions

    • Review identifier column selection

  2. Adjust Thresholds

    • Lower fuzzy matching threshold (e.g., 0.85 → 0.75)

    • Increase vector similarity candidates

    • Enable LLM validation for borderline cases

  3. Data Preprocessing

    • Normalize metabolite names (case, punctuation)

    • Handle synonyms and alternative names

    • Remove or standardize chemical formulas
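
Name normalization can be sketched as follows. This is one possible preprocessing step, not part of the pipeline: `normalize_name` is a hypothetical helper, and replacing punctuation with spaces may be too aggressive for names where hyphens or primes carry chemical meaning, so keep the original alongside the normalized form.

```python
import re
import unicodedata
from typing import List, Tuple


def normalize_name(name: str) -> str:
    """Lowercase, unify Unicode forms, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", name).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()


def with_originals(names: List[str]) -> List[Tuple[str, str]]:
    """Pair each original name with its normalized form so nothing is lost."""
    return [(name, normalize_name(name)) for name in names]


pairs = with_originals(["  L-Alanine ", "Total-Cholesterol (HDL)", "glucose"])
for original, normalized in pairs:
    print(repr(original), "->", repr(normalized))
```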

Performance Issues

  1. API Timeouts (Stage 3)

    • Increase RampDB timeout settings

    • Reduce batch sizes for API calls

    • Implement retry logic with exponential backoff

  2. Memory Issues

    • Enable chunked processing for large datasets

    • Reduce vector search candidates

    • Process dataset in smaller batches

  3. Slow Processing

    • Skip stages with low expected yield

    • Parallelize independent operations

    • Use cached results when available
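
Retry with exponential backoff, suggested above for Stage 3 timeouts, can be sketched in a few lines. The example is generic: `call_with_backoff` is a hypothetical helper, the RampDB call is replaced by a fake function that times out twice, and catching only `TimeoutError` is an assumption about how timeouts surface in your client.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TimeoutError with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last timeout
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Demo: a fake lookup that times out twice, then succeeds.
attempts = {"count": 0}


def flaky_lookup() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays: List[float] = []
result = call_with_backoff(flaky_lookup, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.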

Quality Validation

  1. Confidence Score Review

    • Check distribution of matching scores

    • Manually validate low-confidence matches

    • Adjust thresholds based on validation results

  2. Coverage Analysis

    • Compare against expected baselines

    • Identify systematic naming issues

    • Review unmatched metabolites for patterns
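
A quick way to review the score distribution is to bucket confidence scores and list the matches below a review cutoff. The sketch assumes scores in [0, 1] keyed by metabolite name; `score_histogram`, `needs_review`, the 0.80 cutoff, and the sample scores are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List


def score_histogram(scores: Dict[str, float]) -> Dict[str, int]:
    """Count matches per 10-point confidence bucket (scores are in [0, 1])."""
    counts: Counter = Counter()
    for score in scores.values():
        bucket = min(int(round(score * 100)) // 10, 9)  # 1.0 falls in the top bucket
        counts[bucket] += 1
    return {f"{b * 10}-{b * 10 + 10}%": counts[b] for b in sorted(counts)}


def needs_review(scores: Dict[str, float], cutoff: float = 0.80) -> List[str]:
    """Names whose score is below the cutoff, sorted for stable output."""
    return sorted(name for name, score in scores.items() if score < cutoff)


# Hypothetical scores for illustration only.
SCORES = {"glucose": 0.98, "alanine": 0.91, "hdl_c": 0.86, "unknown_lipid": 0.72}

print(score_histogram(SCORES))
print(needs_review(SCORES))
```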

Best Practices

Pipeline Design

  1. Start Conservative: Use high thresholds initially, then relax

  2. Track Provenance: Maintain matching source information

  3. Quality Metrics: Monitor confidence scores throughout

  4. Incremental Improvement: Optimize one stage at a time

Data Preparation

  1. Clean Input Data: Remove duplicates, normalize formatting

  2. Validate Identifiers: Check for common naming issues

  3. Backup Originals: Preserve original identifiers for reference

  4. Document Assumptions: Record data preprocessing decisions

Production Deployment

  1. Version Control: Tag strategy versions for reproducibility

  2. Monitoring: Track pipeline performance over time

  3. Validation: Regular spot-checks of matching quality

  4. Documentation: Maintain parameter reasoning and tuning history

Integration with Other Pipelines

Multi-Omics Workflows

The metabolomics pipeline integrates with protein and chemistry pipelines:

# Combined multi-omics strategy
steps:
  - name: process_metabolites
    strategy: metabolomics_progressive_production_v3

  - name: process_proteins
    strategy: protein_harmonization_v2

  - name: cross_validate_results
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset1_key: "metabolite_results"
        dataset2_key: "protein_results"
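
As a rough illustration of what a set-overlap step computes, the sketch below derives intersection size, exclusive counts, and the Jaccard index for two identifier collections. It is not the `CALCULATE_SET_OVERLAP` implementation; `set_overlap` and the sample identifiers are hypothetical.

```python
from typing import Dict, Iterable, Union


def set_overlap(first: Iterable[str], second: Iterable[str]) -> Dict[str, Union[int, float]]:
    """Summarize how two identifier collections overlap."""
    a, b = set(first), set(second)
    intersection = a & b
    union = a | b
    return {
        "intersection": len(intersection),
        "only_first": len(a - b),
        "only_second": len(b - a),
        "jaccard": len(intersection) / len(union) if union else 0.0,
    }


summary = set_overlap(["x", "y", "z"], ["y", "z", "w"])
print(summary)
```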

Downstream Analysis Integration

Pipeline results feed into analysis workflows:

  • Pathway Analysis: Matched identifiers → pathway enrichment

  • Network Analysis: Cross-dataset connections and interactions

  • Visualization: Comprehensive multi-omics visualizations

  • Statistics: Coverage and quality metrics reporting

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

  • /biomapper/src/actions/entities/metabolites/matching/nightingale_nmr_match.py (Stage 1 action implementation with built-in patterns and fuzzy matching)

  • /biomapper/src/actions/entities/metabolites/matching/fuzzy_string_match.py (Stage 2 algorithmic fuzzy matching with fuzzywuzzy)

  • /biomapper/src/actions/entities/metabolites/matching/hmdb_vector_match.py (Stage 4 vector matching with FastEmbed and Qdrant)

  • /biomapper/src/configs/strategies/experimental/met_arv_to_ukbb_progressive_v4.0.yaml (Complete v4.0 strategy configuration with debugging features)

  • /biomapper/src/client/client_v2.py (Enhanced BiomapperClient with async/sync execution patterns)

  • /biomapper/README.md (Project architecture overview and action registry documentation)

  • /biomapper/CLAUDE.md (Development standards and 2025 standardizations including parameter naming conventions)

  • /biomapper/pyproject.toml (Dependencies including fuzzywuzzy, qdrant-client, fastembed, and sentence-transformers)