semantic_metabolite_match
The SEMANTIC_METABOLITE_MATCH action uses AI-powered semantic matching with embeddings and LLM validation to identify metabolite correspondences.
Overview
This advanced action combines embedding-based similarity search with Large Language Model (LLM) validation to match metabolites across datasets. It’s particularly useful for:
Complex metabolite names that don’t match exactly
Cross-platform metabolomics data integration
Pathway-aware matching using biological context
Quality validation of potential matches
The action uses OpenAI’s embedding models for similarity calculation and GPT models for biological validation.
Parameters
action:
  type: SEMANTIC_METABOLITE_MATCH
  params:
    unmatched_dataset: "unmatched_metabolites"
    reference_map: "nightingale_reference"
    context_fields:
      unmatched: ["BIOCHEMICAL_NAME", "SUPER_PATHWAY", "SUB_PATHWAY"]
      reference: ["unified_name", "description", "category"]
    embedding_model: "text-embedding-ada-002"
    llm_model: "gpt-4"
    confidence_threshold: 0.75
    output_key: "semantic_matches"
Required Parameters
- unmatched_dataset (str)
  Key for the dataset containing unmatched metabolites
- reference_map (str)
  Key for the reference dataset to match against
- context_fields (dict)
  Mapping from each dataset to the list of fields used to build its context string
- output_key (str)
  Key under which the semantic matches are stored
Optional Parameters
- embedding_model (str, default="text-embedding-ada-002")
  OpenAI embedding model for similarity calculation
- llm_model (str, default="gpt-4")
  LLM model for biological validation
- confidence_threshold (float, default=0.75)
  Minimum LLM confidence for accepting a match
- include_reasoning (bool, default=True)
  Include LLM reasoning in match results
- max_llm_calls (int, default=100)
  Maximum number of LLM API calls, to prevent runaway costs
- embedding_similarity_threshold (float, default=0.85)
  Minimum embedding similarity for a candidate to be sent to LLM validation
- batch_size (int, default=10)
  Batch size for embedding generation
- unmatched_key (str, default=None)
  Key under which metabolites that remain unmatched are stored
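As a rough illustration of how these parameters and defaults fit together, the sketch below mirrors them in a plain Python dataclass. The class name `SemanticMatchParams` and the range checks are assumptions made for this example; the action's real parameter model may be defined and validated differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SemanticMatchParams:
    """Hypothetical container mirroring the documented parameters."""

    # Required parameters
    unmatched_dataset: str
    reference_map: str
    context_fields: Dict[str, List[str]]
    output_key: str
    # Optional parameters with the documented defaults
    embedding_model: str = "text-embedding-ada-002"
    llm_model: str = "gpt-4"
    confidence_threshold: float = 0.75
    include_reasoning: bool = True
    max_llm_calls: int = 100
    embedding_similarity_threshold: float = 0.85
    batch_size: int = 10
    unmatched_key: Optional[str] = None

    def __post_init__(self):
        # Assumed sanity checks: both thresholds are scores in [0, 1].
        for name in ("confidence_threshold", "embedding_similarity_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        if self.max_llm_calls < 0:
            raise ValueError("max_llm_calls must not be negative")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")


params = SemanticMatchParams(
    unmatched_dataset="unmatched_metabolites",
    reference_map="nightingale_reference",
    context_fields={
        "unmatched": ["BIOCHEMICAL_NAME", "SUPER_PATHWAY", "SUB_PATHWAY"],
        "reference": ["unified_name", "description", "category"],
    },
    output_key="semantic_matches",
)
```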
Semantic Matching Process
1. Context Creation
   - Combines metabolite name, pathway, and description
   - Creates rich context strings for embedding
2. Embedding Generation
   - Uses the OpenAI embeddings API
   - Caches embeddings to reduce API calls
   - Processes in batches for efficiency
3. Similarity Search
   - Calculates cosine similarity between embeddings
   - Identifies top candidates above the threshold
4. LLM Validation
   - Submits candidates to GPT for biological validation
   - Gets confidence scores and reasoning
   - Filters based on the confidence threshold
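The similarity search step can be sketched as follows. This is a minimal, dependency-free illustration, not the action's actual code: the function names, the `top_k` limit, and the toy three-dimensional vectors are assumptions, and a real implementation would more likely use numpy or scikit-learn on full-size embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_candidates(query_vec, reference_vecs, threshold=0.85, top_k=3):
    """Return (reference_name, similarity) pairs at or above threshold, best first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in reference_vecs.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy vectors standing in for real embeddings (illustrative values only).
reference = {
    "Histidine": [0.9, 0.1, 0.0],
    "Glucose": [0.0, 0.2, 0.9],
}
candidates = top_candidates([1.0, 0.0, 0.0], reference)
```

Only candidates that survive this filter would go on to LLM validation, which is what keeps the number of LLM calls manageable.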
Context String Examples
The action creates rich context strings for semantic matching:
Unmatched Metabolite Context:
```
"Metabolite: 1-methylhistidine | SUPER_PATHWAY: Amino Acid | SUB_PATHWAY: Histidine Metabolism"
```
Reference Metabolite Context:
```
"Metabolite: Histidine | Description: Essential amino acid | Category: Amino acids | Platform: Nightingale NMR"
```
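A context string like the unmatched-metabolite example above could be assembled roughly as shown here. The helper `build_context` and its `name_field` argument are hypothetical, and the labelling rule (name field shown as "Metabolite", other fields shown under their column names) is inferred from that one example rather than taken from the implementation.

```python
def build_context(record, fields, name_field):
    """Join the selected fields of one record into a single context string."""
    parts = []
    for field in fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            continue  # skip missing or empty fields
        # Assumption: the name field is labelled "Metabolite"; other
        # fields keep their column name as the label.
        label = "Metabolite" if field == name_field else field
        parts.append(f"{label}: {str(value).strip()}")
    return " | ".join(parts)


row = {
    "BIOCHEMICAL_NAME": "1-methylhistidine",
    "SUPER_PATHWAY": "Amino Acid",
    "SUB_PATHWAY": "Histidine Metabolism",
}
context = build_context(
    row,
    ["BIOCHEMICAL_NAME", "SUPER_PATHWAY", "SUB_PATHWAY"],
    name_field="BIOCHEMICAL_NAME",
)
```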
Example Usage
Basic Semantic Matching
steps:
  - name: semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "unmatched_metabolomics"
        reference_map: "nightingale_nmr_map"
        context_fields:
          unmatched_metabolomics: ["BIOCHEMICAL_NAME", "SUPER_PATHWAY"]
          nightingale_nmr_map: ["unified_name", "description"]
        confidence_threshold: 0.80
        embedding_similarity_threshold: 0.85
        output_key: "semantic_matches"
Advanced Configuration
steps:
  - name: comprehensive_semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "complex_metabolites"
        reference_map: "comprehensive_reference"
        context_fields:
          complex_metabolites:
            - "BIOCHEMICAL_NAME"
            - "SUPER_PATHWAY"
            - "SUB_PATHWAY"
            - "PLATFORM"
          comprehensive_reference:
            - "unified_name"
            - "description"
            - "category"
            - "synonyms"
        embedding_model: "text-embedding-ada-002"
        llm_model: "gpt-4"
        confidence_threshold: 0.75
        include_reasoning: true
        max_llm_calls: 200
        embedding_similarity_threshold: 0.80
        batch_size: 20
        output_key: "validated_semantic_matches"
        unmatched_key: "still_unmatched"
Cost-Controlled Matching
steps:
  - name: budget_semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "priority_metabolites"
        reference_map: "core_reference"
        context_fields:
          priority_metabolites: ["BIOCHEMICAL_NAME"]
          core_reference: ["unified_name"]
        embedding_model: "text-embedding-ada-002"
        llm_model: "gpt-3.5-turbo"            # Lower cost model
        confidence_threshold: 0.85            # Higher threshold
        max_llm_calls: 50                     # Strict limit
        embedding_similarity_threshold: 0.90  # Pre-filter more strictly
        output_key: "budget_matches"
LLM Validation Process
The LLM receives structured prompts for biological validation:
Prompt Template:
```
I need to determine if these two metabolites are the same compound:

Metabolite A:
- Name: 1-methylhistidine
- Pathway: Amino Acid
- Sub-pathway: Histidine Metabolism
- Additional info: HMDB_ID: HMDB0000001

Metabolite B:
- Name: Histidine
- Description: Essential amino acid
- Category: Amino acids
- Platform: Nightingale NMR

Embedding similarity: 0.887

Are these the same metabolite? Respond with:
1. YES/NO/UNCERTAIN
2. Confidence (0-1)
3. Brief reasoning (1-2 sentences)

Format: YES|0.95|These are both referring to histidine-related compounds.
```
LLM Response Processing:
- Parses structured responses: Decision|Confidence|Reasoning
- Validates biological correctness
- Provides confidence scores for downstream filtering
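Parsing the `Decision|Confidence|Reasoning` format might look like the sketch below. The names `Validation`, `parse_llm_response`, and `is_accepted` are illustrative, and the choice to discard malformed responses (returning `None`) and to accept only `YES` decisions is an assumption about how such responses would reasonably be handled.

```python
from typing import NamedTuple, Optional


class Validation(NamedTuple):
    decision: str
    confidence: float
    reasoning: str


VALID_DECISIONS = {"YES", "NO", "UNCERTAIN"}


def parse_llm_response(text: str) -> Optional[Validation]:
    """Parse 'Decision|Confidence|Reasoning'; return None if malformed."""
    # Split at most twice so the reasoning may itself contain "|".
    parts = text.strip().split("|", 2)
    if len(parts) != 3:
        return None
    decision = parts[0].strip().upper()
    if decision not in VALID_DECISIONS:
        return None
    try:
        confidence = float(parts[1])
    except ValueError:
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return Validation(decision, confidence, parts[2].strip())


def is_accepted(validation: Optional[Validation], threshold: float = 0.75) -> bool:
    """Assumed acceptance rule: a YES decision at or above the threshold."""
    return (
        validation is not None
        and validation.decision == "YES"
        and validation.confidence >= threshold
    )


parsed = parse_llm_response(
    "YES|0.95|These are both referring to histidine-related compounds."
)
```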
Output Format
The action outputs enriched matches with validation metadata:
Each output row contains the original metabolite fields, the match information, and the validation data.
Example output:
BIOCHEMICAL_NAME | matched_name | match_confidence | embedding_similarity | match_reasoning
1-methylhistidine | Histidine | 0.85 | 0.887 | Related histidine compounds
Glucose-6-phosphate | Glucose | 0.92 | 0.901 | Same base metabolite
Unknown compound | | | |
Embedding Cache System
Intelligent caching reduces API costs and improves performance:
Memory Cache
- In-memory storage for session reuse
- MD5 hashing for efficient lookups
- LRU eviction for memory management
Disk Cache
- Persistent storage across sessions
- JSON serialization for portability
- TTL-based cache invalidation
Cache Statistics
- Hit/miss ratios tracked
- Performance metrics reported
- Cache efficiency monitoring
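The memory-cache layer described above (MD5 keys, LRU eviction, hit/miss counters) can be approximated with the standard library. The class `EmbeddingMemoryCache`, its `max_size` parameter, and the decision to include the model name in the hashed key are assumptions made for this sketch; the disk cache and TTL handling are left out.

```python
import hashlib
from collections import OrderedDict


class EmbeddingMemoryCache:
    """Hypothetical in-memory LRU cache keyed by an MD5 hash of the text."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, text):
        # MD5 is used only as a compact lookup key, not for security.
        return hashlib.md5(f"{model}:{text}".encode("utf-8")).hexdigest()

    def get(self, model, text):
        key = self._key(model, text)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, model, text, embedding):
        key = self._key(model, text)
        self._store[key] = embedding
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used


cache = EmbeddingMemoryCache(max_size=2)
cache.put("text-embedding-ada-002", "Metabolite: Histidine", [0.1, 0.2])
cache.put("text-embedding-ada-002", "Metabolite: Glucose", [0.3, 0.4])
cache.get("text-embedding-ada-002", "Metabolite: Histidine")  # hit
cache.put("text-embedding-ada-002", "Metabolite: Alanine", [0.5, 0.6])  # evicts Glucose
```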
Error Handling and Resilience
The action includes comprehensive error handling:
API Failures
- Graceful fallback when OpenAI APIs fail
- Retry logic with exponential backoff
- Partial results preservation
Rate Limiting
- Automatic rate limit detection
- Adaptive throttling
- Cost monitoring and alerting
Data Quality Issues
- Empty context field handling
- Invalid response parsing
- Confidence threshold validation
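Retry logic with exponential backoff, as listed above, generally follows the pattern sketched here. The helper `call_with_backoff`, its default delays, and the small random jitter are assumptions for illustration; production code would normally catch only transient errors (such as rate-limit or timeout exceptions from the API client) rather than every `Exception`.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Demonstration with a function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_call, sleep=recorded_delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.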
Statistics and Monitoring
The action reports statistics that help with tuning thresholds and monitoring cost:
{
    "matched_count": 45,
    "unmatched_count": 15,
    "llm_calls": 87,
    "cache_hits": 32,
    "confidence_distribution": {
        "high": 35,    # ≥0.9
        "medium": 10,  # 0.75-0.9
        "low": 0       # <0.75
    },
    "embedding_similarity_avg": 0.876,
    "llm_validation_rate": 0.52,
    "api_costs_estimated": 2.34
}
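The `confidence_distribution` buckets can be derived from a list of match confidences along the following lines. The function name and the exact boundary handling (a score of exactly 0.9 counts as "high", exactly 0.75 as "medium") are assumptions based on the bucket comments in the example above.

```python
def confidence_distribution(confidences, high=0.9, medium=0.75):
    """Count confidences per bucket: high (>= 0.9), medium (>= 0.75), low."""
    distribution = {"high": 0, "medium": 0, "low": 0}
    for confidence in confidences:
        if confidence >= high:
            distribution["high"] += 1
        elif confidence >= medium:
            distribution["medium"] += 1
        else:
            distribution["low"] += 1
    return distribution


buckets = confidence_distribution([0.95, 0.9, 0.8, 0.75, 0.5])
```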
Best Practices
Optimize context fields: Include pathway and description information for better embeddings
Set appropriate thresholds: Balance recall vs precision with confidence thresholds
Monitor costs: Use max_llm_calls to control OpenAI API expenses
Cache embeddings: Enable caching for repeated analyses
Validate results: Review LLM reasoning for biological accuracy
Batch efficiently: Use appropriate batch sizes for your API limits
Performance Optimization
Embedding Efficiency
- Batch processing for reduced API calls
- Intelligent caching strategy
- Deduplication of similar contexts
LLM Cost Management
- Pre-filtering with embedding similarity
- Configurable call limits
- Cost estimation and tracking
Memory Management
- Streaming processing for large datasets
- Cache size limitations
- Garbage collection optimization
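The LLM cost controls above (pre-filtering by embedding similarity, then capping the number of calls) can be sketched as a single selection step. The function `select_for_validation`, the tuple layout of a candidate, and the policy of spending the call budget on the most similar pairs first are assumptions for this example.

```python
def select_for_validation(candidates, embedding_similarity_threshold=0.85,
                          max_llm_calls=100):
    """Split candidates into (to_validate, skipped_for_budget).

    Each candidate is a tuple: (unmatched_name, reference_name, similarity).
    """
    # Pre-filter: only pairs at or above the similarity threshold qualify.
    eligible = [c for c in candidates if c[2] >= embedding_similarity_threshold]
    # Assumed policy: spend the limited LLM budget on the best pairs first.
    eligible.sort(key=lambda c: c[2], reverse=True)
    return eligible[:max_llm_calls], eligible[max_llm_calls:]


pairs = [
    ("1-methylhistidine", "Histidine", 0.887),
    ("Glucose-6-phosphate", "Glucose", 0.901),
    ("Unknown compound", "Histidine", 0.412),
]
to_validate, skipped = select_for_validation(pairs, max_llm_calls=1)
```

With a budget of one call, only the most similar pair is validated; the other eligible pair is reported as skipped rather than silently dropped.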
Integration Examples
With Traditional Matching
steps:
  - name: exact_match
    action:
      type: NIGHTINGALE_NMR_MATCH
      params:
        dataset_key: "metabolomics_data"
        output_key: "exact_matches"
        unmatched_key: "unmatched_after_exact"
  - name: semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "unmatched_after_exact"
        reference_map: "nightingale_reference"
        # context_fields is also required; omitted here for brevity
        output_key: "semantic_matches"
With Quality Assessment
steps:
  - name: semantic_matching
    action:
      type: SEMANTIC_METABOLITE_MATCH
      # ... parameters
  - name: validate_semantic_quality
    action:
      type: CALCULATE_MAPPING_QUALITY
      params:
        source_key: "unmatched_metabolites"
        mapped_key: "semantic_matches"
        confidence_column: "match_confidence"
        output_key: "semantic_quality_metrics"
Requirements
API Access
- OpenAI API key required
- Sufficient API credits for embeddings and LLM calls
- Network access to OpenAI endpoints
Dependencies
- openai Python package
- numpy for similarity calculations
- scikit-learn for cosine similarity
Environment Variables
- OPENAI_API_KEY: Your OpenAI API key
- SEMANTIC_MATCH_CACHE_DIR: Optional cache directory
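Before running the action, it can help to check the environment up front. The sketch below reads the two variables named above; the helper `load_settings` and its return shape are hypothetical and not part of the action's API.

```python
import os
from pathlib import Path


def load_settings(env=os.environ):
    """Read the documented environment variables, failing early if the key is missing."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # The cache directory is optional; None means no directory was configured.
    cache_dir = env.get("SEMANTIC_MATCH_CACHE_DIR")
    return {
        "api_key": api_key,
        "cache_dir": Path(cache_dir) if cache_dir else None,
    }


# Example with an explicit mapping instead of the real environment.
settings = load_settings({"OPENAI_API_KEY": "example-key"})
```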
The semantic matching action combines embedding similarity with LLM-based biological validation to match metabolites that exact-name matching misses, while keeping API costs under control.