Metabolomics Progressive Pipeline
Overview
The Metabolomics Progressive Pipeline v4.0 harmonizes metabolite identifiers across biological datasets. Its 4-stage progressive matching reaches 75-80% coverage on well-curated inputs such as the Arivale metabolomics dataset; coverage on more heterogeneous datasets is lower (40-45% for UK Biobank subsets).
The pipeline uses a "progressive matching" strategy: it starts with high-confidence exact matches and relaxes the matching criteria stage by stage, so that each stage only processes the identifiers that earlier stages left unmatched. Version 4.0 adds consolidated debugging, pre-flight validation, and incremental stage enabling for systematic testing.
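The progressive strategy can be pictured as a loop in which every stage receives only the leftovers of the previous one. The sketch below is illustrative only: `run_progressive`, the two toy stage functions, and the `REFERENCE` table are hypothetical names invented for this example and are not the pipeline's actual actions.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A stage takes a name and returns a matched reference ID, or None.
Stage = Callable[[str], Optional[str]]


def run_progressive(
    names: Iterable[str],
    stages: List[Tuple[int, Stage]],
    stages_to_run: Iterable[int],
) -> Tuple[Dict[str, Tuple[int, str]], List[str]]:
    """Run enabled stages in order; each stage only sees what is still unmatched."""
    enabled = set(stages_to_run)
    matched: Dict[str, Tuple[int, str]] = {}
    unmatched = list(names)
    for stage_number, stage in stages:
        if stage_number not in enabled:
            continue  # incremental stage enabling: skip disabled stages
        still_unmatched = []
        for name in unmatched:
            hit = stage(name)
            if hit is None:
                still_unmatched.append(name)
            else:
                matched[name] = (stage_number, hit)  # keep provenance of the match
        unmatched = still_unmatched
    return matched, unmatched


# Toy reference table (hypothetical identifiers, for illustration only).
REFERENCE = {"glucose": "REF:1", "total cholesterol": "REF:2"}


def exact_stage(name: str) -> Optional[str]:
    return REFERENCE.get(name)


def case_insensitive_stage(name: str) -> Optional[str]:
    # Stands in for a "relaxed" stage; the real stages are far richer.
    return REFERENCE.get(name.strip().lower())


matched, unmatched = run_progressive(
    ["glucose", " Total Cholesterol ", "unknown-123"],
    stages=[(1, exact_stage), (2, case_insensitive_stage)],
    stages_to_run=[1, 2],
)
print(matched)
print(unmatched)
```

Passing `stages_to_run=[1]` to the same call would leave the second name unmatched, which is the behaviour incremental stage enabling is meant to expose during testing.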
Pipeline Architecture
4-Stage Progressive Matching

    graph TD
        A[Input Dataset] --> B[Stage 1: Direct Matching]
        B --> C[Stage 2: Fuzzy String Matching]
        C --> D[Stage 3: RampDB Bridge]
        D --> E[Stage 4: HMDB Vector Matching]
        E --> F[Results Export & Visualization]
        F --> G[Google Drive Upload]

Stage Breakdown
- Stage 1: Direct/Exact Matching
  - Action: NIGHTINGALE_NMR_MATCH
  - Coverage: ~45-55% (high confidence)
  - Speed: <2 seconds for 10K identifiers
  - Method: Exact string matching against Nightingale reference with fuzzy fallback
  - Features: Built-in biomarker patterns, abbreviation expansion, lipoprotein recognition
- Stage 2: Fuzzy String Matching
  - Action: METABOLITE_FUZZY_STRING_MATCH
  - Coverage: +15-20% additional
  - Speed: ~5-10 seconds
  - Method: Token sort ratio with fuzzywuzzy (threshold: 85%)
  - Cost: $0.00 (algorithmic matching, no API calls)
- Stage 3: RampDB API Bridge
  - Action: METABOLITE_RAMPDB_BRIDGE
  - Coverage: +8-12% additional
  - Speed: ~30-60 seconds (API dependent)
  - Method: External API calls to RampDB service
  - Note: Requires active RampDB API access
- Stage 4: Vector Semantic Matching
  - Action: HMDB_VECTOR_MATCH
  - Coverage: +5-10% additional
  - Speed: ~10-20 seconds
  - Method: FastEmbed with Qdrant vector database similarity search
  - Requirements: Qdrant storage with pre-computed HMDB embeddings
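To make the Stage 2 method concrete, here is a minimal sketch of a token-sort comparison against the 85% threshold. It is a standard-library approximation built on `difflib`, not a call into fuzzywuzzy, so its scores may differ slightly from the library's `token_sort_ratio`; the helper names `token_sort_ratio` and `passes` are defined here for illustration.

```python
import re
from difflib import SequenceMatcher


def token_sort_ratio(a: str, b: str) -> int:
    """Approximate token-sort similarity on a 0-100 scale (stdlib only)."""

    def normalize(text: str) -> str:
        # Lowercase, turn punctuation into spaces, then sort the tokens
        tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
        return " ".join(sorted(tokens))

    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def passes(a: str, b: str, threshold: float = 0.85) -> bool:
    """Apply a 0-1 threshold (as in stage_2_threshold) to the 0-100 score."""
    return token_sort_ratio(a, b) >= threshold * 100


print(token_sort_ratio("Cholesterol, total", "total cholesterol"))
print(passes("glucose", "lactate"))
```

Sorting tokens before comparing is what lets reordered names such as "Cholesterol, total" and "total cholesterol" score as identical, while unrelated names stay well below the threshold.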
Expected Performance Metrics
Coverage Statistics
Based on production runs with real biological datasets:
| Dataset Type | Expected Coverage | Processing Time | Confidence Level |
|---|---|---|---|
| Arivale Metabolomics | 75-80% (1,053/1,351) | 2-3 minutes | High |
| UK Biobank | 40-45% (varies by subset) | 1-2 minutes | Medium |
| Custom Datasets | 50-70% (depends on quality) | Variable | Variable |
Stage-by-Stage Coverage Accumulation
Typical progression for a 1,000 metabolite dataset:
After Stage 1: ~500 matched (50%)
After Stage 2: ~650 matched (65%)
After Stage 3: ~720 matched (72%)
After Stage 4: ~750 matched (75%)
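The cumulative figures above follow from the number of new matches each stage contributes (500, 150, 70, and 30 in this example). A short sketch of that arithmetic, with variable names chosen for illustration:

```python
from itertools import accumulate

total = 1000
# New matches contributed by stages 1-4, derived from the progression above
new_matches_per_stage = [500, 150, 70, 30]

cumulative = list(accumulate(new_matches_per_stage))
coverage = [100 * matched / total for matched in cumulative]

for stage, (matched, pct) in enumerate(zip(cumulative, coverage), start=1):
    print(f"After Stage {stage}: {matched} matched ({pct:.0f}%)")
```

Tracking the per-stage increment alongside the running total makes it easier to see which stage is underperforming when overall coverage falls short.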
Implementation
YAML Strategy Configuration

    name: met_arv_to_ukbb_progressive_v4.0
    description: |
      Consolidated progressive metabolomics mapping pipeline with extensive debugging.
      Features systematic stage-by-stage execution with comprehensive logging.
    parameters:
      # Core paths - MUST be absolute paths
      file_path: /procedure/data/local_data/MAPPING_ONTOLOGIES/arivale/metabolomics_metadata.tsv
      reference_file: /procedure/data/local_data/MAPPING_ONTOLOGIES/ukbb/UKBB_NMR_Meta.tsv
      output_dir: ${OUTPUT_DIR:-/tmp/biomapper/met_arv_to_ukbb_v4.0}
      # Debug controls - CRITICAL for troubleshooting
      debug_mode: true
      verbose_logging: true
      fail_on_warning: false
      validate_parameters: true
      # Stage control - Enable incrementally for testing
      stages_to_run: [1,2,3,4] # Full pipeline
      # Column specifications
      identifier_column: BIOCHEMICAL_NAME
      hmdb_column: HMDB
      pubchem_column: PUBCHEM
      kegg_column: KEGG
      cas_column: CAS
      # Thresholds (conservative)
      stage_1_threshold: 0.95
      stage_2_threshold: 0.85
      stage_3_threshold: 0.70
      stage_4_threshold: 0.75
    steps:
      # Pre-flight validation
      - name: validate_environment
        action:
          type: CUSTOM_TRANSFORM
          params:
            input_key: dummy
            output_key: validation_results
            transformations:
              - column: timestamp
                expression: |
                  # Validate output directory and parameters
                  from pathlib import Path
                  Path("${parameters.output_dir}").mkdir(parents=True, exist_ok=True)
                  datetime.now().isoformat()
      # Stage 1: Nightingale NMR matching
      - name: stage_1_nightingale_match
        action:
          type: NIGHTINGALE_NMR_MATCH
          params:
            input_key: arivale_raw
            output_key: nightingale_matched
            biomarker_column: "${parameters.identifier_column}"
            match_threshold: "${parameters.stage_1_threshold}"
            target_format: both
            add_metadata: true
      # Stage 2: Fuzzy string matching
      - name: stage_2_fuzzy_match
        action:
          type: METABOLITE_FUZZY_STRING_MATCH
          params:
            unmapped_key: nightingale_unmapped
            reference_key: reference_raw
            output_key: fuzzy_matched
            final_unmapped_key: fuzzy_unmapped
            fuzzy_threshold: "${parameters.stage_2_threshold}"
        condition: 2 in ${parameters.stages_to_run}
      # Stage 3: RampDB API bridge
      - name: stage_3_rampdb_bridge
        action:
          type: METABOLITE_RAMPDB_BRIDGE
          params:
            unmapped_key: fuzzy_unmapped
            output_key: rampdb_matched
            final_unmapped_key: rampdb_unmapped
            confidence_threshold: "${parameters.stage_3_threshold}"
        condition: 3 in ${parameters.stages_to_run}
      # Stage 4: HMDB vector matching
      - name: stage_4_hmdb_vector
        action:
          type: HMDB_VECTOR_MATCH
          params:
            input_key: rampdb_unmapped
            output_key: stage_4_matched
            unmatched_key: stage_4_unmatched
            identifier_column: "${parameters.identifier_column}"
            threshold: "${parameters.stage_4_threshold}"
            collection_name: hmdb_metabolites
            qdrant_path: /home/ubuntu/biomapper/data/qdrant_storage
            embedding_model: sentence-transformers/all-MiniLM-L6-v2
            enable_llm_validation: false
        condition: 4 in ${parameters.stages_to_run}
      # Results consolidation
      - name: merge_all_matches
        action:
          type: MERGE_DATASETS
          params:
            dataset_keys:
              - nightingale_matched
              - fuzzy_matched
              - rampdb_matched
              - stage_4_matched
            merge_type: concat
            deduplicate: true
            output_key: all_matches
      # Export final results
      - name: export_final_results
        action:
          type: EXPORT_DATASET
          params:
            input_key: all_matches
            output_path: "${parameters.output_dir}/final_results.tsv"
            format: tsv

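The configuration relies on two conventions: `${parameters.name}` references that are filled in from the `parameters` block, and `condition` entries that gate a stage on `stages_to_run`. The sketch below shows one way to reason about both. It is not biomapper's actual resolver; `resolve` and `stage_enabled` are hypothetical helpers written for this example.

```python
import re
from typing import Any, Dict

PLACEHOLDER = re.compile(r"\$\{parameters\.(\w+)\}")


def resolve(text: str, parameters: Dict[str, Any]) -> str:
    """Replace each ${parameters.name} with the string form of its value."""
    return PLACEHOLDER.sub(lambda m: str(parameters[m.group(1)]), text)


def stage_enabled(stage: int, parameters: Dict[str, Any]) -> bool:
    """Mirror the intent of `condition: N in ${parameters.stages_to_run}`."""
    return stage in parameters["stages_to_run"]


params = {
    "output_dir": "/tmp/biomapper/met_arv_to_ukbb_v4.0",
    "stages_to_run": [1, 2],  # enable stages incrementally while testing
}
print(resolve("${parameters.output_dir}/final_results.tsv", params))
print([s for s in (1, 2, 3, 4) if stage_enabled(s, params)])
```

With `stages_to_run: [1, 2]`, only the first two stages would pass their gate, which is the intended way to test the pipeline one stage at a time.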
Python Client Usage

    import asyncio

    from src.client.client_v2 import BiomapperClient


    async def run_metabolomics_pipeline():
        client = BiomapperClient(base_url="http://localhost:8000")
        # Run the complete v4.0 pipeline
        result = await client.run_strategy(
            strategy_name="met_arv_to_ukbb_progressive_v4.0",
            parameters={
                "file_path": "/data/arivale_metabolites.tsv",
                "reference_file": "/data/ukbb_nmr_reference.tsv",
                "output_dir": "/results/metabolomics_v4",
                "stages_to_run": [1, 2, 3, 4],  # Full pipeline
                "debug_mode": True,
            },
        )
        print(f"Pipeline completed with {result.total_matched} matches")
        print(f"Coverage: {result.coverage:.1f}%")
        print(f"Results saved to: {result.output_files}")
        return result


    # Synchronous wrapper for scripts
    def run_pipeline_sync():
        client = BiomapperClient()
        return client.run("met_arv_to_ukbb_progressive_v4.0")


    # Run the pipeline
    if __name__ == "__main__":
        result = run_pipeline_sync()
        # Async alternative: result = asyncio.run(run_metabolomics_pipeline())
        print(f"Final coverage: {result.coverage:.1f}%")

Advanced Configuration
Threshold Optimization
Fine-tune matching thresholds based on your dataset characteristics:

    # Option A: conservative (higher precision)
    stage_1_threshold: 0.98
    stage_2_threshold: 0.85
    stage_4_threshold: 0.80

    # Option B: aggressive (higher recall); use instead of Option A, not alongside it
    stage_1_threshold: 0.90
    stage_2_threshold: 0.75
    stage_4_threshold: 0.70

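One way to choose between conservative and aggressive settings is to score a small, manually validated sample at several thresholds and compare how many matches are accepted against how many of them are correct. The sketch below uses made-up scores and labels; `validated` and `sweep` are hypothetical names, not part of the pipeline.

```python
from typing import List, Sequence, Tuple

# (similarity score, whether a curator judged the match correct): made-up data
validated: List[Tuple[float, bool]] = [
    (0.99, True), (0.96, True), (0.91, True), (0.88, False),
    (0.86, True), (0.79, True), (0.76, False), (0.71, False),
]


def sweep(
    pairs: Sequence[Tuple[float, bool]], thresholds: Sequence[float]
) -> List[Tuple[float, int, float]]:
    """Return (threshold, accepted count, precision) for each candidate threshold."""
    rows = []
    for t in thresholds:
        accepted = [ok for score, ok in pairs if score >= t]
        precision = sum(accepted) / len(accepted) if accepted else float("nan")
        rows.append((t, len(accepted), precision))
    return rows


for t, n, p in sweep(validated, [0.90, 0.85, 0.75, 0.70]):
    print(f"threshold={t:.2f} accepted={n} precision={p:.2f}")
```

On this toy sample, lowering the threshold accepts more matches but a smaller share of them are correct, which is the precision/recall trade-off the two option sets represent.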
Performance Tuning
For large datasets (>10K metabolites):

    # Enable chunking for large datasets
    chunk_processing: true
    chunk_size: 5000
    # Optimize API calls
    rampdb_batch_size: 100
    rampdb_timeout: 45
    # Vector search optimization
    vector_max_results: 5
    vector_batch_size: 200

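Chunked processing amounts to splitting the identifier list into fixed-size slices and handling one slice at a time, which bounds peak memory. A minimal sketch, assuming a hypothetical helper `iter_chunks` and synthetic identifiers:

```python
from typing import Iterator, List, Sequence


def iter_chunks(items: Sequence[str], chunk_size: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# Synthetic identifiers standing in for a >10K metabolite dataset
identifiers = [f"metabolite_{i}" for i in range(12_000)]
sizes = [len(chunk) for chunk in iter_chunks(identifiers, chunk_size=5000)]
print(sizes)
```

The same pattern applies to API batching: a smaller slice size trades more requests for a lower chance that any single request times out.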
Quality Control Configuration
Add validation and quality checks:

    # Enable LLM validation for Stage 4
    use_llm_validation: true
    llm_confidence_threshold: 0.7
    # Add quality metrics tracking
    track_confidence_scores: true
    generate_quality_report: true
    # Export unmatched for manual review
    export_unmatched: true
    unmatched_file_path: "${parameters.output_dir}/unmatched_metabolites.csv"

Real-World Case Studies
Arivale Metabolomics Dataset
Dataset Characteristics:
- Size: 1,351 unique metabolites after filtering
- Source: Arivale personalized medicine platform
- Quality: High-quality, curated metabolite names

Results:
- Total Coverage: 77.9% (1,053 matched metabolites)
- Stage 1: 692 matches (51.2%)
- Stage 2: 201 additional matches (14.9%)
- Stage 3: 105 additional matches (7.8%)
- Stage 4: 55 additional matches (4.1%)
- Processing Time: 2 minutes 34 seconds
UK Biobank Subset
Dataset Characteristics:
- Size: 2,847 metabolite measurements
- Source: UK Biobank metabolomics data
- Quality: Variable, research-grade identifiers

Results:
- Total Coverage: 42.3% (1,204 matched metabolites)
- Processing Time: 1 minute 47 seconds
- Challenge: More heterogeneous naming conventions
Troubleshooting Common Issues
Low Coverage Issues
- Check Data Quality
  - Verify metabolite names are clean (no extra whitespace)
  - Check for non-standard naming conventions
  - Review identifier column selection
- Adjust Thresholds
  - Lower the fuzzy matching threshold (for example, 0.85 → 0.75)
  - Increase vector similarity candidates
  - Enable LLM validation for borderline cases
- Data Preprocessing
  - Normalize metabolite names (case, punctuation)
  - Handle synonyms and alternative names
  - Remove or standardize chemical formulas
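Name normalization is usually the cheapest of these fixes. The sketch below shows one possible normalizer; `normalize_name` is a hypothetical helper, and the choice to keep hyphens (they are meaningful in names such as 3-hydroxybutyrate) is an assumption that may not suit every dataset.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, unify Unicode forms, and collapse separators and whitespace."""
    text = unicodedata.normalize("NFKC", name).lower()
    text = re.sub(r"[_,;]+", " ", text)       # treat these separators as spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


for raw in ["  Total_Cholesterol ", "Glucose,\tfasting", "3-Hydroxybutyrate"]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Keep the original value next to the normalized one so that matches can always be traced back to the source identifier.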
Performance Issues
- API Timeouts (Stage 3)
  - Increase RampDB timeout settings
  - Reduce batch sizes for API calls
  - Implement retry logic with exponential backoff
- Memory Issues
  - Enable chunked processing for large datasets
  - Reduce vector search candidates
  - Process dataset in smaller batches
- Slow Processing
  - Skip stages with low expected yield
  - Parallelize independent operations
  - Use cached results when available
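For the Stage 3 timeouts, retry logic with exponential backoff can be sketched as follows. This is a generic pattern, not the RampDB action's own code; `call_with_backoff` and the simulated `flaky` call are hypothetical, and a real client may raise a different exception type than `TimeoutError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TimeoutError with delays of 1x, 2x, 4x, ... base_delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Simulated API call that times out twice, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = call_with_backoff(flaky, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Injecting the `sleep` function keeps the example fast and testable; in production the default `time.sleep` applies the real delays.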
Quality Validation
- Confidence Score Review
  - Check distribution of matching scores
  - Manually validate low-confidence matches
  - Adjust thresholds based on validation results
- Coverage Analysis
  - Compare against expected baselines
  - Identify systematic naming issues
  - Review unmatched metabolites for patterns
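Checking the score distribution can be as simple as counting matches per score band and looking for a cluster just above a threshold. A minimal sketch with made-up scores; `score_histogram` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable


def score_histogram(scores: Iterable[float]) -> Dict[str, int]:
    """Count scores in ten-point bands such as '0.70-0.79' and '0.80-0.89'."""
    counts: Counter = Counter()
    for score in scores:
        percent = round(score * 100)       # integer percent avoids float edge cases
        low = min(percent // 10 * 10, 90)  # a score of 1.00 is counted in the top band
        counts[f"{low / 100:.2f}-{(low + 9) / 100:.2f}"] += 1
    return dict(sorted(counts.items()))


# Made-up confidence scores for illustration
print(score_histogram([0.95, 0.91, 0.85, 0.72, 0.70, 1.0]))
```

A large share of matches in the lowest accepted band is a signal to spot-check those matches manually before relying on them.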
Best Practices
Pipeline Design
- Start Conservative: Use high thresholds initially, then relax
- Track Provenance: Maintain matching source information
- Quality Metrics: Monitor confidence scores throughout
- Incremental Improvement: Optimize one stage at a time
Data Preparation
- Clean Input Data: Remove duplicates, normalize formatting
- Validate Identifiers: Check for common naming issues
- Backup Originals: Preserve original identifiers for reference
- Document Assumptions: Record data preprocessing decisions
Production Deployment
- Version Control: Tag strategy versions for reproducibility
- Monitoring: Track pipeline performance over time
- Validation: Regular spot-checks of matching quality
- Documentation: Maintain parameter reasoning and tuning history
Integration with Other Pipelines
Multi-Omics Workflows
The metabolomics pipeline integrates with protein and chemistry pipelines:

    # Combined multi-omics strategy
    steps:
      - name: process_metabolites
        strategy: metabolomics_progressive_production_v3
      - name: process_proteins
        strategy: protein_harmonization_v2
      - name: cross_validate_results
        action:
          type: CALCULATE_SET_OVERLAP
          params:
            dataset1_key: "metabolite_results"
            dataset2_key: "protein_results"

Downstream Analysis Integration
Pipeline results feed into analysis workflows:
- Pathway Analysis: Matched identifiers → pathway enrichment
- Network Analysis: Cross-dataset connections and interactions
- Visualization: Comprehensive multi-omics visualizations
- Statistics: Coverage and quality metrics reporting
See Also
- NIGHTINGALE_NMR_MATCH: Stage 1 direct matching
- Metabolite Fuzzy String Match: Stage 2 fuzzy matching
- HMDB Vector Match: Stage 4 vector matching
- RampDB Integration: RampDB API integration
- Real-World Case Studies: Additional case studies
- Performance Optimization Guide: Performance tuning guide
---
## Verification Sources
Last verified: 2025-08-22
This documentation was verified against the following project resources:
/biomapper/src/actions/entities/metabolites/matching/nightingale_nmr_match.py (Stage 1 action implementation with built-in patterns and fuzzy matching)
/biomapper/src/actions/entities/metabolites/matching/fuzzy_string_match.py (Stage 2 algorithmic fuzzy matching with fuzzywuzzy)
/biomapper/src/actions/entities/metabolites/matching/hmdb_vector_match.py (Stage 4 vector matching with FastEmbed and Qdrant)
/biomapper/src/configs/strategies/experimental/met_arv_to_ukbb_progressive_v4.0.yaml (Complete v4.0 strategy configuration with debugging features)
/biomapper/src/client/client_v2.py (Enhanced BiomapperClient with async/sync execution patterns)
/biomapper/README.md (Project architecture overview and action registry documentation)
/biomapper/CLAUDE.md (Development standards and 2025 standardizations including parameter naming conventions)
/biomapper/pyproject.toml (Dependencies including fuzzywuzzy, qdrant-client, fastembed, and sentence-transformers)