Progressive Semantic Match
Overview
The PROGRESSIVE_SEMANTIC_MATCH action orchestrates a comprehensive 4-stage metabolite matching pipeline that progressively applies different matching strategies to achieve maximum coverage. This is the primary action for metabolomics identifier harmonization in the biomapper framework.
The progressive approach starts with high-confidence exact matches and gradually relaxes criteria to capture more difficult cases, achieving typical coverage rates of 75-80% for well-curated metabolomics datasets.
Pipeline Architecture
4-Stage Progressive Strategy
graph TD
    A[Input Dataset] --> B[Stage 1: Direct Matching]
    B --> C[Stage 2: Fuzzy String Matching]
    C --> D[Stage 3: RampDB API Bridge]
    D --> E[Stage 4: HMDB Vector Matching]
    E --> F[Results Consolidation]
    F --> G[Coverage Analysis & Reporting]
Stage Breakdown
- Stage 1: Direct/Exact Matching
  Method: NIGHTINGALE_NMR_MATCH
  Coverage: 45-55% (high confidence)
  Speed: <2 seconds for 10K identifiers
  Strategy: Exact string matching against curated reference
- Stage 2: Fuzzy String Matching
  Method: METABOLITE_FUZZY_STRING_MATCH
  Coverage: +15-20% additional
  Speed: 5-10 seconds
  Strategy: Levenshtein distance with biological awareness
- Stage 3: External API Integration
  Method: METABOLITE_RAMPDB_BRIDGE
  Coverage: +8-12% additional
  Speed: 30-60 seconds (API dependent)
  Strategy: Cross-database lookups via RampDB
- Stage 4: Vector Semantic Matching
  Method: HMDB_VECTOR_MATCH
  Coverage: +5-10% additional
  Speed: 10-20 seconds
  Strategy: Embedding-based semantic similarity
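The cascade can be pictured as a loop in which each stage only sees the identifiers that earlier stages left unmatched. The following is a minimal sketch of that idea, not the biomapper implementation: `run_progressive`, the two toy stage functions, and the `REF-001` identifier are hypothetical names introduced purely for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A stage takes the still-unmatched identifiers and returns {identifier: matched_id}.
Stage = Callable[[List[str]], Dict[str, str]]


def run_progressive(
    identifiers: Iterable[str], stages: List[Tuple[str, Stage]]
) -> Tuple[Dict[str, Tuple[str, int]], List[str]]:
    """Run stages in order; later stages only receive what is still unmatched."""
    unmatched = list(dict.fromkeys(identifiers))  # de-duplicate, keep input order
    matches: Dict[str, Tuple[str, int]] = {}
    for stage_number, (_name, stage) in enumerate(stages, start=1):
        if not unmatched:
            break  # nothing left for the remaining stages
        found = stage(unmatched)
        for identifier, target in found.items():
            matches[identifier] = (target, stage_number)  # record provenance
        unmatched = [i for i in unmatched if i not in found]
    return matches, unmatched


# Toy stand-ins for real stages, using placeholder reference data.
REFERENCE = {"glucose": "REF-001"}


def exact_stage(names: List[str]) -> Dict[str, str]:
    return {n: REFERENCE[n] for n in names if n in REFERENCE}


def lowercase_stage(names: List[str]) -> Dict[str, str]:
    return {n: REFERENCE[n.lower()] for n in names if n.lower() in REFERENCE}


matches, unmatched = run_progressive(
    ["glucose", "Glucose", "unknown compound"],
    [("exact", exact_stage), ("lowercase", lowercase_stage)],
)
print(matches)
print(unmatched)
```

Because every match stores the stage number that produced it, the consolidated result can report which strategy was responsible for each identifier.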
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_key | string | Yes | Key for the input dataset containing metabolite identifiers |
| output_key | string | Yes | Key for the final consolidated output dataset |
| identifier_column | string | Yes | Column name containing metabolite identifiers to match |
| stage1_threshold | float | No | Direct matching threshold (default: 0.95) |
| stage2_threshold | float | No | Fuzzy matching threshold (default: 0.8) |
| stage3_batch_size | integer | No | RampDB API batch size (default: 50) |
| stage4_threshold | float | No | Vector similarity threshold (default: 0.75) |
| enable_quality_control | boolean | No | Enable inter-stage validation (default: true) |
| export_stage_results | boolean | No | Export individual stage results (default: false) |
Performance Metrics
Expected Coverage by Dataset Type
Based on production runs with real biological datasets:
| Dataset Type | Expected Coverage | Processing Time | Confidence Level |
|---|---|---|---|
| Arivale Metabolomics | 75-80% (1,053/1,351) | 2-3 minutes | High |
| UK Biobank Subset | 40-45% (varies) | 1-2 minutes | Medium |
| Custom Datasets | 50-70% (varies) | Variable | Variable |
Cumulative Coverage Progression
Typical progression for a 1,000 metabolite dataset:
After Stage 1: ~500 matched (50%)
After Stage 2: ~650 matched (65%)
After Stage 3: ~720 matched (72%)
After Stage 4: ~750 matched (75%)
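The cumulative figures above can be turned into per-stage gains with simple subtraction. The helper below is a small illustrative sketch (`stage_gains` is a hypothetical name, not part of biomapper) that uses the approximate numbers from the list.

```python
from typing import List, Tuple


def stage_gains(cumulative: List[int], total: int) -> List[Tuple[int, float]]:
    """Return (identifiers added, percentage points gained) for each stage."""
    gains = []
    previous = 0
    for matched in cumulative:
        added = matched - previous  # matches contributed by this stage alone
        gains.append((added, round(100 * added / total, 1)))
        previous = matched
    return gains


cumulative_matches = [500, 650, 720, 750]  # approximate values from the list above
for stage, (added, points) in enumerate(stage_gains(cumulative_matches, 1000), start=1):
    print(f"Stage {stage}: +{added} identifiers (+{points} percentage points)")
```

Each later stage adds fewer matches than the one before it, which is the expected shape for a pipeline that handles the easiest cases first.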
Example Usage
YAML Strategy
steps:
  - name: progressive_metabolite_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: raw_metabolites
        output_key: matched_metabolites
        identifier_column: compound_name
        stage1_threshold: 0.95
        stage2_threshold: 0.8
        stage3_batch_size: 40
        stage4_threshold: 0.75
        enable_quality_control: true
        export_stage_results: true
Python Client
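import asyncio

from src.client.client_v2 import BiomapperClient


async def main(metabolite_df):
    """Run the progressive match on a dataset with a metabolite_name column."""
    client = BiomapperClient(base_url="http://localhost:8000")

    # Register the loaded metabolite dataset under the key used by input_key
    context = {"datasets": {"metabolites": metabolite_df}}

    result = await client.run_action(
        action_type="PROGRESSIVE_SEMANTIC_MATCH",
        params={
            "input_key": "metabolites",
            "output_key": "harmonized_metabolites",
            "identifier_column": "metabolite_name",
            "stage1_threshold": 0.9,
            "stage2_threshold": 0.8,
            "enable_quality_control": True,
        },
        context=context,
    )

    # Access consolidated results
    matched_data = result.context["datasets"]["harmonized_metabolites"]
    print(f"Coverage achieved: {len(matched_data)} metabolites")


# run_action is awaited, so it has to be called from a coroutine;
# metabolite_df is the dataset you loaded beforehand.
asyncio.run(main(metabolite_df))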
Output Format
The action returns a comprehensive dataset with matching provenance:
| Column | Description |
|---|---|
| | Original metabolite identifier from input |
| | Final HMDB identifier (primary output) |
| | Standardized metabolite name |
| | Stage that produced the match (1-4) |
| | Specific method used (exact, fuzzy, api, vector) |
| | Overall confidence score (0.0-1.0) |
| | Stage 1 matching candidate (if any) |
| | Stage 2 matching candidate (if any) |
| | Stage 3 matching candidate (if any) |
| | Stage 4 matching candidate (if any) |
| | Quality control flags and warnings |
Stage Implementation Details
Stage 1: Direct Matching
High-confidence exact matches using curated reference data:
stage1_configuration:
  method: NIGHTINGALE_NMR_MATCH
  reference_data: nightingale_nmr_metabolites
  matching_strategy: exact_string
  case_sensitive: false
  normalize_whitespace: true
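To make the effect of `case_sensitive: false` and `normalize_whitespace: true` concrete, here is a minimal sketch of exact matching after normalization. It is an assumption about how such settings are typically applied, not the NIGHTINGALE_NMR_MATCH code; the function names and the `REF-...` identifiers are hypothetical.

```python
from typing import Dict, Iterable


def normalize(name: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lower-case the name."""
    return " ".join(name.split()).lower()


def build_index(reference: Dict[str, str]) -> Dict[str, str]:
    """Index reference names by their normalized form for O(1) lookups."""
    return {normalize(name): ref_id for name, ref_id in reference.items()}


def exact_match(names: Iterable[str], index: Dict[str, str]) -> Dict[str, str]:
    """Return only the names whose normalized form is present in the index."""
    return {name: index[normalize(name)] for name in names if normalize(name) in index}


# Placeholder reference data for illustration only.
reference = {"Total cholesterol": "REF-001", "Glucose": "REF-002"}
index = build_index(reference)
result = exact_match(["  total   CHOLESTEROL ", "glucose", "glucose-6-phosphate"], index)
print(result)
```

Names that differ by more than case or spacing, such as `glucose-6-phosphate` here, fall through to the later stages.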
Stage 2: Fuzzy String Matching
Captures near-matches with controlled edit distance:
stage2_configuration:
  method: METABOLITE_FUZZY_STRING_MATCH
  algorithm: biological_levenshtein
  max_edit_distance: 2
  ignore_case: true
  ignore_punctuation: true
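The sketch below shows how an edit-distance cutoff such as `max_edit_distance: 2` behaves. It uses the plain Levenshtein distance, not the `biological_levenshtein` variant named in the configuration, whose details are not described here; `simplify` and `fuzzy_match` are hypothetical helper names.

```python
import string
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


_PUNCTUATION = str.maketrans("", "", string.punctuation)


def simplify(name: str) -> str:
    """Apply ignore_case and ignore_punctuation before comparing."""
    return " ".join(name.translate(_PUNCTUATION).lower().split())


def fuzzy_match(
    name: str, reference_names: Iterable[str], max_edit_distance: int = 2
) -> Optional[Tuple[str, int]]:
    """Return the closest reference name within the cutoff, or None."""
    query = simplify(name)
    best: Optional[Tuple[str, int]] = None
    for candidate in reference_names:
        distance = levenshtein(query, simplify(candidate))
        if distance <= max_edit_distance and (best is None or distance < best[1]):
            best = (candidate, distance)
    return best


print(fuzzy_match("L-Tyrosin", ["L-Tyrosine", "L-Tryptophan"]))
print(fuzzy_match("caffeine", ["L-Tyrosine"]))
```

A plain edit distance treats names such as `L-alanine` and `D-alanine` as one edit apart, which is presumably the kind of case a biologically aware variant is meant to guard against.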
Stage 3: External API Integration
Leverages RampDB for cross-database resolution:
stage3_configuration:
  method: METABOLITE_RAMPDB_BRIDGE
  target_databases: [hmdb, kegg, chebi]
  timeout: 45
  retry_attempts: 3
  rate_limit_delay: 1.0
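The interplay of batch size, `retry_attempts`, and `rate_limit_delay` can be sketched as follows. This is an illustrative outline only: `lookup` stands in for whatever function performs the real RampDB request, `ConnectionError` is an assumed failure type, and the per-request timeout is not modeled.

```python
import time
from typing import Callable, Dict, Iterator, List, Tuple


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def lookup_in_batches(
    identifiers: List[str],
    lookup: Callable[[List[str]], Dict[str, str]],  # hypothetical API call
    batch_size: int = 50,
    retry_attempts: int = 3,
    rate_limit_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], List[str]]:
    """Look up identifiers batch by batch, retrying failed batches."""
    results: Dict[str, str] = {}
    failed: List[str] = []
    for batch in batched(identifiers, batch_size):
        for _attempt in range(retry_attempts):
            try:
                results.update(lookup(batch))
                succeeded = True
            except ConnectionError:
                succeeded = False
            sleep(rate_limit_delay)  # pause after every request, successful or not
            if succeeded:
                break
        else:
            failed.extend(batch)  # every attempt for this batch failed
    return results, failed


# Simulated service that fails on the first call only.
calls = {"count": 0}


def flaky_lookup(batch: List[str]) -> Dict[str, str]:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated outage")
    return {name: f"REF-{name}" for name in batch}


results, failed = lookup_in_batches(
    ["a", "b", "c"], flaky_lookup, batch_size=2, sleep=lambda seconds: None
)
print(results, failed)
```

Identifiers in `failed` remain unmatched and would simply be passed on to the next stage.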
Stage 4: Vector Semantic Matching
Uses embedding-based similarity for complex cases:
stage4_configuration:
  method: HMDB_VECTOR_MATCH
  embedding_model: all-MiniLM-L6-v2
  vector_db: qdrant_hmdb_collection
  similarity_metric: cosine
  max_candidates: 5
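The roles of `similarity_metric: cosine`, the similarity threshold, and `max_candidates` can be illustrated with a small pure-Python sketch. It uses toy three-dimensional vectors in place of real embeddings and an in-memory dictionary in place of the vector database, so it shows only the ranking logic; `top_candidates` and the `REF-...` identifiers are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_candidates(
    query: Sequence[float],
    reference: Dict[str, Sequence[float]],
    threshold: float = 0.75,
    max_candidates: int = 5,
) -> List[Tuple[str, float]]:
    """Keep references at or above the threshold, best first, capped in number."""
    scored = [(ref_id, cosine_similarity(query, vector))
              for ref_id, vector in reference.items()]
    kept = [(ref_id, score) for ref_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_candidates]


# Toy vectors for illustration; real embeddings have far more dimensions.
reference_vectors = {
    "REF-001": [1.0, 0.0, 0.0],
    "REF-002": [0.8, 0.6, 0.0],
    "REF-003": [0.0, 0.0, 1.0],
}
candidates = top_candidates([1.0, 0.0, 0.0], reference_vectors)
print(candidates)
```

Raising the threshold trims weaker candidates first, which is why it acts as the main precision control for this stage.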
Quality Control Features
Inter-Stage Validation
Validates consistency between matching stages:
Conflict Detection: Identifies when stages produce conflicting matches
Confidence Scoring: Weights matches based on stage reliability
Consensus Building: Resolves conflicts using multiple signals
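One way to picture conflict detection and consensus building is a weighted vote across the per-stage candidates. The sketch below is an assumption about how such a scheme could work, not the documented behaviour: the `STAGE_WEIGHTS` values, the `resolve` function, and the `REF-...` identifiers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Hypothetical reliability weights: earlier stages are trusted more.
STAGE_WEIGHTS = {1: 1.0, 2: 0.8, 3: 0.7, 4: 0.5}


def resolve(candidates: Dict[int, Tuple[str, float]]) -> Tuple[Optional[str], bool]:
    """Pick a winner from {stage: (matched_id, confidence)} and flag conflicts.

    Returns (winning_id, conflict), where conflict is True when the stages
    proposed more than one distinct identifier.
    """
    if not candidates:
        return None, False
    scores: Dict[str, float] = defaultdict(float)
    for stage, (matched_id, confidence) in candidates.items():
        scores[matched_id] += STAGE_WEIGHTS[stage] * confidence
    conflict = len(scores) > 1
    winner = max(scores, key=scores.get)
    return winner, conflict


# Stages 2 and 4 agree, stage 3 disagrees.
print(resolve({2: ("REF-001", 0.85), 3: ("REF-002", 0.9), 4: ("REF-001", 0.8)}))
```

A record resolved with `conflict` set to True is a natural candidate for the quality control flags and for manual review.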
Quality Metrics Tracking
Monitors pipeline health and performance:
Coverage Progression: Tracks cumulative coverage by stage
Confidence Distribution: Analyzes confidence score distributions
Error Rate Analysis: Identifies systematic matching failures
Performance Benchmarks: Compares against expected baselines
Example Quality Report
{
  "total_identifiers": 1351,
  "total_matched": 1053,
  "overall_coverage": 77.9,
  "stage_breakdown": {
    "stage1": {"matched": 692, "coverage": 51.2},
    "stage2": {"matched": 201, "coverage": 14.9},
    "stage3": {"matched": 105, "coverage": 7.8},
    "stage4": {"matched": 55, "coverage": 4.1}
  },
  "quality_metrics": {
    "high_confidence": 847,
    "medium_confidence": 156,
    "low_confidence": 50,
    "flagged_for_review": 23
  }
}
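A report of this shape can be sanity-checked by recomputing its totals and percentages from the raw counts. The helper below is a hypothetical sketch (`check_report` is not a biomapper function); it takes only the count fields and derives the coverage figures itself.

```python
from typing import Dict, List, Tuple


def check_report(report: dict) -> Tuple[List[str], Dict[str, float], float]:
    """Return (problems, per-stage coverage in %, overall coverage in %)."""
    total = report["total_identifiers"]
    matched = report["total_matched"]
    stage_counts = {name: stage["matched"]
                    for name, stage in report["stage_breakdown"].items()}
    problems = []
    if sum(stage_counts.values()) != matched:
        problems.append("stage counts do not add up to total_matched")
    quality = report["quality_metrics"]
    by_confidence = (quality["high_confidence"]
                     + quality["medium_confidence"]
                     + quality["low_confidence"])
    if by_confidence != matched:
        problems.append("confidence buckets do not add up to total_matched")
    coverage = {name: round(100 * count / total, 1)
                for name, count in stage_counts.items()}
    overall = round(100 * matched / total, 1)
    return problems, coverage, overall


# Counts taken from the example report; percentages are recomputed.
report = {
    "total_identifiers": 1351,
    "total_matched": 1053,
    "stage_breakdown": {
        "stage1": {"matched": 692},
        "stage2": {"matched": 201},
        "stage3": {"matched": 105},
        "stage4": {"matched": 55},
    },
    "quality_metrics": {
        "high_confidence": 847,
        "medium_confidence": 156,
        "low_confidence": 50,
        "flagged_for_review": 23,
    },
}
problems, coverage, overall = check_report(report)
print(problems, coverage, overall)
```

Because each stage percentage is rounded independently, the rounded stage values may differ from the rounded overall figure by a tenth of a point.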
Advanced Configuration
Threshold Optimization
Fine-tune thresholds based on dataset characteristics:
# Conservative profile (high precision)
conservative_config:
  stage1_threshold: 0.98
  stage2_threshold: 0.85
  stage4_threshold: 0.80

# Aggressive profile (high recall)
aggressive_config:
  stage1_threshold: 0.92
  stage2_threshold: 0.75
  stage4_threshold: 0.70

# Balanced profile (recommended)
balanced_config:
  stage1_threshold: 0.95
  stage2_threshold: 0.80
  stage4_threshold: 0.75
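When the action is called from Python, the three profiles can be kept in one place and merged into the action parameters. This is a convenience sketch under that assumption; `PROFILES` and `build_params` are hypothetical names, and only the threshold values come from the profiles above.

```python
from typing import Dict

PROFILES: Dict[str, Dict[str, float]] = {
    "conservative": {"stage1_threshold": 0.98, "stage2_threshold": 0.85, "stage4_threshold": 0.80},
    "aggressive": {"stage1_threshold": 0.92, "stage2_threshold": 0.75, "stage4_threshold": 0.70},
    "balanced": {"stage1_threshold": 0.95, "stage2_threshold": 0.80, "stage4_threshold": 0.75},
}


def build_params(profile: str, **overrides) -> dict:
    """Start from a named profile, apply overrides, and validate thresholds."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    params = dict(PROFILES[profile])  # copy so the profile itself is not mutated
    params.update(overrides)
    for key, value in params.items():
        if key.endswith("_threshold") and not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must be between 0.0 and 1.0, got {value}")
    return params


params = build_params(
    "conservative",
    input_key="raw_metabolites",
    output_key="matched_metabolites",
    identifier_column="compound_name",
)
print(params)
```

Keeping the profile name alongside any overrides also makes it easier to record which thresholds were used for a given run.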
Performance Optimization
For large datasets (>10K metabolites):
performance_config:
  enable_parallel_stages: true
  stage2_chunk_size: 2000
  stage3_batch_size: 100
  stage4_batch_size: 500
  cache_intermediate_results: true
Custom Stage Configuration
Override individual stage parameters:
custom_stage_config:
  stage1_params:
    reference_source: "custom_metabolite_db"
    normalization_rules: ["remove_stereochemistry"]
  stage2_params:
    algorithm: "jaro_winkler"
    biological_synonyms: true
  stage3_params:
    target_databases: ["hmdb", "kegg", "chebi", "pubchem"]
    concurrent_requests: 3
  stage4_params:
    use_llm_validation: true
    embedding_cache: true
Error Handling and Recovery
Stage-Level Error Recovery
Each stage includes independent error handling:
Stage Isolation: Failure in one stage doesn’t affect others
Partial Results: Successfully completed stages preserve their matches
Error Reporting: Detailed logs for debugging stage-specific issues
Graceful Degradation: Pipeline continues with reduced functionality
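Stage isolation and partial results can be sketched as a loop that catches each stage's failure, records it, and carries on. This is an illustrative outline rather than the actual error handling; `run_with_isolation` and the toy stages are hypothetical.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("progressive_match_sketch")

Stage = Callable[[List[str]], Dict[str, str]]


def run_with_isolation(
    stages: List[Tuple[str, Stage]], identifiers: List[str]
) -> Tuple[Dict[str, str], Dict[str, str], List[str]]:
    """Run every stage; a failing stage is logged and skipped, not fatal."""
    matches: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    remaining = list(identifiers)
    for name, stage in stages:
        try:
            found = stage(remaining)
        except Exception as exc:  # broad on purpose: isolate any stage failure
            logger.warning("Stage %s failed: %s", name, exc)
            errors[name] = str(exc)
            continue  # earlier matches are kept, later stages still run
        matches.update(found)
        remaining = [i for i in remaining if i not in found]
    return matches, errors, remaining


def direct_stage(names: List[str]) -> Dict[str, str]:
    return {n: "REF-A" for n in names if n == "a"}


def api_stage(names: List[str]) -> Dict[str, str]:
    raise RuntimeError("service unavailable")  # simulated outage


def vector_stage(names: List[str]) -> Dict[str, str]:
    return {n: "REF-B" for n in names if n == "b"}


matches, errors, remaining = run_with_isolation(
    [("direct", direct_stage), ("api", api_stage), ("vector", vector_stage)],
    ["a", "b", "c"],
)
print(matches, errors, remaining)
```

The failed stage contributes nothing, so overall coverage drops, but the matches from the other stages are still returned.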
Common Recovery Patterns
error_handling:
  stage1_fallback: "skip_to_stage2"
  stage2_timeout_action: "reduce_batch_size"
  stage3_api_failure: "retry_with_exponential_backoff"
  stage4_memory_error: "process_in_smaller_chunks"
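Two of these recovery patterns, exponential backoff and batch size reduction, are easy to show in isolation. The sketch below is a generic illustration with assumed defaults (base delay, growth factor, and delay cap are not specified in the configuration above), and all function names are hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base_delay: float = 1.0,
                   factor: float = 2.0, max_delay: float = 30.0) -> List[float]:
    """Delays that grow geometrically and are capped at max_delay."""
    return [min(base_delay * factor ** n, max_delay) for n in range(attempts)]


def retry_with_backoff(operation: Callable[[], T], attempts: int = 3,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds; expects attempts >= 1."""
    for attempt, delay in enumerate(backoff_delays(attempts), start=1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
    raise ValueError("attempts must be at least 1")


def reduce_batch_size(current: int, minimum: int = 1) -> int:
    """Halve the batch size after a timeout, never going below minimum."""
    return max(current // 2, minimum)


# Simulated operation that fails twice before succeeding.
outcomes = [ConnectionError("down"), ConnectionError("down"), "ok"]
slept: List[float] = []


def operation() -> str:
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(retry_with_backoff(operation, attempts=3, sleep=slept.append))
print(slept)
print(reduce_batch_size(50))
```

Capping the delay keeps a long outage from stretching a single retry cycle indefinitely.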
Monitoring and Alerting
Built-in monitoring for production deployments:
Progress Tracking: Real-time progress updates for long-running pipelines
Performance Alerts: Notifications when stages exceed expected runtimes
Quality Alerts: Warnings when coverage drops below baselines
Resource Monitoring: Memory and API quota usage tracking
Real-World Case Studies
Arivale Metabolomics Dataset
Dataset: 1,351 unique metabolites from personalized medicine platform
Result: 77.9% coverage (1,053 matched metabolites)
Processing Time: 2 minutes 34 seconds
Stage Performance:
- Stage 1: 692 matches (51.2%)
- Stage 2: 201 additional (14.9%)
- Stage 3: 105 additional (7.8%)
- Stage 4: 55 additional (4.1%)
UK Biobank Subset
Dataset: 2,847 metabolite measurements from population study
Result: 42.3% coverage (1,204 matched metabolites)
Challenge: Heterogeneous naming conventions
Processing Time: 1 minute 47 seconds
Best Practices
Start with Balanced Configuration: Use recommended thresholds initially
Monitor Stage Performance: Track individual stage contributions
Validate Results: Manually review sample matches from each stage
Document Parameters: Record threshold choices and rationale
Benchmark Regularly: Compare performance against known datasets
Integration Patterns
Complete Pipeline Integration
name: comprehensive_metabolomics_pipeline
steps:
  - name: load_metabolites
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"
        identifier_column: metabolite_name
        output_key: raw_metabolites

  - name: progressive_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: raw_metabolites
        output_key: matched_metabolites
        identifier_column: metabolite_name

  - name: generate_report
    action:
      type: GENERATE_METABOLOMICS_REPORT
      params:
        input_key: matched_metabolites
        output_key: coverage_report

  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: matched_metabolites
        file_path: "${parameters.output_file}"
Multi-Dataset Comparison
# Compare matching across multiple datasets
steps:
  - name: match_dataset_a
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: dataset_a
        output_key: matched_a
        identifier_column: metabolite_name  # required; use your dataset's column name

  - name: match_dataset_b
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      params:
        input_key: dataset_b
        output_key: matched_b
        identifier_column: metabolite_name  # required; use your dataset's column name

  - name: compare_coverage
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset1_key: matched_a
        dataset2_key: matched_b
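The kind of comparison made in the final step can be approximated with ordinary set operations on the matched identifiers. The output of CALCULATE_SET_OVERLAP itself is not described here, so the sketch below is a generic illustration; `set_overlap` and the `REF-...` identifiers are hypothetical.

```python
from typing import Dict, Iterable, Union


def set_overlap(ids_a: Iterable[str], ids_b: Iterable[str]) -> Dict[str, Union[int, float]]:
    """Count shared and exclusive identifiers and compute the Jaccard index."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder matched identifiers for two datasets.
matched_a = ["REF-001", "REF-002", "REF-003"]
matched_b = ["REF-002", "REF-003", "REF-004", "REF-005"]
print(set_overlap(matched_a, matched_b))
```

Comparing on the harmonized identifiers rather than the original names is what makes the overlap meaningful across datasets with different naming conventions.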
Troubleshooting
Low Overall Coverage
Check Input Data Quality: Verify metabolite names are clean and standardized
Adjust Thresholds: Lower thresholds to increase recall
Enable All Stages: Ensure no stages are being skipped
Review Reference Data: Confirm reference databases are current
Stage-Specific Issues
Stage 1 Low Coverage: Check reference data alignment and normalization
Stage 2 Timeouts: Reduce batch sizes or enable chunking
Stage 3 API Failures: Verify API credentials and network connectivity
Stage 4 Performance: Ensure vector database is properly indexed
Performance Problems
Memory Issues: Enable chunked processing for large datasets
API Timeouts: Increase timeout values and reduce batch sizes
Slow Processing: Enable parallel processing where supported
Resource Limits: Monitor CPU and memory usage during execution
See Also
NIGHTINGALE_NMR_MATCH - Stage 1 direct matching implementation
Metabolite Fuzzy String Match - Stage 2 fuzzy matching details
Metabolite RampDB Bridge - Stage 3 API integration guide
HMDB Vector Match - Stage 4 vector similarity matching
Metabolomics Progressive Pipeline - Complete workflow examples
../examples/threshold_optimization - Parameter tuning guides
../performance/large_dataset_optimization - Scaling considerations