semantic_metabolite_match

The SEMANTIC_METABOLITE_MATCH action uses AI-powered semantic matching with embeddings and LLM validation to identify metabolite correspondences.

Overview

This advanced action combines embedding-based similarity search with Large Language Model (LLM) validation to match metabolites across datasets. It’s particularly useful for:

Complex metabolite names that don’t match exactly
Cross-platform metabolomics data integration
Pathway-aware matching using biological context
Quality validation of potential matches

The action uses OpenAI’s embedding models for similarity calculation and GPT models for biological validation.

Parameters

action:
  type: SEMANTIC_METABOLITE_MATCH
  params:
    unmatched_dataset: "unmatched_metabolites"
    reference_map: "nightingale_reference"
    context_fields:
      unmatched: ["BIOCHEMICAL_NAME", "SUPER_PATHWAY", "SUB_PATHWAY"]
      reference: ["unified_name", "description", "category"]
    embedding_model: "text-embedding-ada-002"
    llm_model: "gpt-4"
    confidence_threshold: 0.75
    output_key: "semantic_matches"

Required Parameters

unmatched_dataset (str): Key for dataset containing unmatched metabolites
reference_map (str): Key for reference dataset to match against
context_fields (dict): Fields to use for context per dataset
output_key (str): Where to store semantic matches

Optional Parameters

embedding_model (str, default="text-embedding-ada-002"): OpenAI embedding model for similarity calculation
llm_model (str, default="gpt-4"): LLM model for biological validation
confidence_threshold (float, default=0.75): Minimum confidence for accepting matches
include_reasoning (bool, default=True): Include LLM reasoning in match results
max_llm_calls (int, default=100): Maximum LLM API calls to prevent runaway costs
embedding_similarity_threshold (float, default=0.85): Minimum embedding similarity for LLM validation
batch_size (int, default=10): Batch size for embedding generation
unmatched_key (str, default=None): Key to store final unmatched metabolites

Semantic Matching Process

1. Context Creation
   - Combines metabolite name, pathway, and description
   - Creates rich context strings for embedding
2. Embedding Generation
   - Uses OpenAI embeddings API
   - Caches embeddings to reduce API calls
   - Processes in batches for efficiency
3. Similarity Search
   - Calculates cosine similarity between embeddings
   - Identifies top candidates above threshold
4. LLM Validation
   - Submits candidates to GPT for biological validation
   - Gets confidence scores and reasoning
   - Filters based on confidence threshold
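The similarity search step can be illustrated with a small sketch. This is not the action's own code: the helper names `cosine_similarity` and `top_candidates`, the `top_k` parameter, and the toy vectors are assumptions made for illustration. It is written in pure Python so it runs without dependencies, whereas the action itself lists numpy and scikit-learn for this calculation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def top_candidates(query_vec, reference_vecs, threshold=0.85, top_k=3):
    """Return up to top_k (index, similarity) pairs at or above threshold.

    Hypothetical helper: results are sorted from most to least similar.
    """
    scored = [
        (index, cosine_similarity(query_vec, ref))
        for index, ref in enumerate(reference_vecs)
    ]
    scored = [pair for pair in scored if pair[1] >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
reference = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
candidates = top_candidates([1.0, 0.05, 0.0], reference, threshold=0.85)
```

Only the first two reference vectors point in nearly the same direction as the query, so only they would move on to LLM validation in this toy case.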

Context String Examples

The action creates rich context strings for semantic matching:

Unmatched Metabolite Context: "Metabolite: 1-methylhistidine | SUPER_PATHWAY: Amino Acid | SUB_PATHWAY: Histidine Metabolism"

Reference Metabolite Context: "Metabolite: Histidine | Description: Essential amino acid | Category: Amino acids | Platform: Nightingale NMR"
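A context string like the unmatched example above could be assembled as follows. This is a minimal sketch under stated assumptions: `build_context` is a hypothetical helper, and the labelling rule (first field labelled "Metabolite", remaining fields labelled with their column name, blank values skipped) is inferred from the unmatched example rather than taken from the action's source.

```python
def build_context(record, fields):
    """Join the selected fields of a record into one context string.

    Assumptions: the first field holds the metabolite name and is labelled
    "Metabolite"; other fields keep their column name as the label; missing
    or blank values are skipped so they do not pollute the embedding.
    """
    parts = []
    for position, field in enumerate(fields):
        value = record.get(field)
        if value is None or not str(value).strip():
            continue
        label = "Metabolite" if position == 0 else field
        parts.append(f"{label}: {str(value).strip()}")
    return " | ".join(parts)


row = {
    "BIOCHEMICAL_NAME": "1-methylhistidine",
    "SUPER_PATHWAY": "Amino Acid",
    "SUB_PATHWAY": "Histidine Metabolism",
}
context = build_context(row, ["BIOCHEMICAL_NAME", "SUPER_PATHWAY", "SUB_PATHWAY"])
```

Skipping blank values keeps sparse rows from producing strings such as "SUB_PATHWAY: " that add noise without adding information.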

Example Usage

Basic Semantic Matching

steps:
  - name: semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "unmatched_metabolomics"
        reference_map: "nightingale_nmr_map"
        context_fields:
          unmatched_metabolomics: ["BIOCHEMICAL_NAME", "SUPER_PATHWAY"]
          nightingale_nmr_map: ["unified_name", "description"]
        confidence_threshold: 0.80
        embedding_similarity_threshold: 0.85
        output_key: "semantic_matches"

Advanced Configuration

steps:
  - name: comprehensive_semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "complex_metabolites"
        reference_map: "comprehensive_reference"
        context_fields:
          complex_metabolites:
            - "BIOCHEMICAL_NAME"
            - "SUPER_PATHWAY"
            - "SUB_PATHWAY"
            - "PLATFORM"
          comprehensive_reference:
            - "unified_name"
            - "description"
            - "category"
            - "synonyms"
        embedding_model: "text-embedding-ada-002"
        llm_model: "gpt-4"
        confidence_threshold: 0.75
        include_reasoning: true
        max_llm_calls: 200
        embedding_similarity_threshold: 0.80
        batch_size: 20
        output_key: "validated_semantic_matches"
        unmatched_key: "still_unmatched"

Cost-Controlled Matching

steps:
  - name: budget_semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "priority_metabolites"
        reference_map: "core_reference"
        context_fields:
          priority_metabolites: ["BIOCHEMICAL_NAME"]
          core_reference: ["unified_name"]
        embedding_model: "text-embedding-ada-002"
        llm_model: "gpt-3.5-turbo"  # Lower cost model
        confidence_threshold: 0.85   # Higher threshold
        max_llm_calls: 50            # Strict limit
        embedding_similarity_threshold: 0.90  # Pre-filter more strictly
        output_key: "budget_matches"
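The interaction between `embedding_similarity_threshold` and `max_llm_calls` in the configuration above can be pictured with a short sketch. The function `select_for_validation` and the tuple layout are hypothetical, and spending the budget on the most similar pairs first is an assumption, not documented behaviour of the action.

```python
def select_for_validation(candidates, embedding_similarity_threshold=0.85,
                          max_llm_calls=100):
    """Split candidate pairs into those sent to the LLM and those skipped.

    candidates: list of (unmatched_name, reference_name, similarity) tuples.
    Assumption: when the call budget is smaller than the number of eligible
    pairs, the highest-similarity pairs are validated first.
    """
    eligible = [c for c in candidates if c[2] >= embedding_similarity_threshold]
    eligible.sort(key=lambda c: c[2], reverse=True)
    return eligible[:max_llm_calls], eligible[max_llm_calls:]


pairs = [
    ("a", "x", 0.95),
    ("b", "y", 0.91),
    ("c", "z", 0.80),  # below the pre-filter, never reaches the LLM
    ("d", "w", 0.99),
]
to_validate, over_budget = select_for_validation(
    pairs, embedding_similarity_threshold=0.90, max_llm_calls=2
)
```

Raising the pre-filter threshold shrinks the eligible pool, while `max_llm_calls` caps spending even if the pool is still large.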

LLM Validation Process

The LLM receives structured prompts for biological validation:

Prompt Template:

```
I need to determine if these two metabolites are the same compound:

Metabolite A:
- Name: 1-methylhistidine
- Pathway: Amino Acid
- Sub-pathway: Histidine Metabolism
- Additional info: HMDB_ID: HMDB0000001

Metabolite B:
- Name: Histidine
- Description: Essential amino acid
- Category: Amino acids
- Platform: Nightingale NMR

Embedding similarity: 0.887

Are these the same metabolite? Respond with:
1. YES/NO/UNCERTAIN
2. Confidence (0-1)
3. Brief reasoning (1-2 sentences)

Format: YES|0.95|These are both referring to histidine-related compounds.
```

LLM Response Processing:
- Parses structured responses: Decision|Confidence|Reasoning
- Validates biological correctness
- Provides confidence scores for downstream filtering
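Parsing the `Decision|Confidence|Reasoning` format might look like the sketch below. The names `Validation`, `parse_llm_response`, and `is_accepted` are hypothetical, and returning `None` for malformed replies is an assumed policy. The action's actual handling of invalid responses is not shown in this document.

```python
from typing import NamedTuple, Optional


class Validation(NamedTuple):
    decision: str      # "YES", "NO", or "UNCERTAIN"
    confidence: float  # between 0 and 1
    reasoning: str


def parse_llm_response(text: str) -> Optional[Validation]:
    """Parse 'Decision|Confidence|Reasoning'; return None if malformed."""
    # Split at most twice so a '|' inside the reasoning text is preserved.
    parts = text.strip().split("|", 2)
    if len(parts) != 3:
        return None
    decision = parts[0].strip().upper()
    if decision not in {"YES", "NO", "UNCERTAIN"}:
        return None
    try:
        confidence = float(parts[1])
    except ValueError:
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return Validation(decision, confidence, parts[2].strip())


def is_accepted(result: Optional[Validation], confidence_threshold: float = 0.75) -> bool:
    """Accept only a YES decision at or above the confidence threshold."""
    return (
        result is not None
        and result.decision == "YES"
        and result.confidence >= confidence_threshold
    )


parsed = parse_llm_response(
    "YES|0.95|These are both referring to histidine-related compounds."
)
```

Treating an unparseable reply as "no match" rather than raising keeps one bad response from aborting a whole run.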

Output Format

The action outputs enriched matches with validation metadata:

Original Metabolite + Match Info + Validation Data

Example output:

BIOCHEMICAL_NAME     | matched_name | match_confidence | embedding_similarity | match_reasoning
1-methylhistidine    | Histidine    | 0.85             | 0.887                | Related histidine compounds
Glucose-6-phosphate  | Glucose      | 0.92             | 0.901                | Same base metabolite
Unknown compound     |              |                  |                      |

Embedding Cache System

Intelligent caching reduces API costs and improves performance:

Memory Cache
- In-memory storage for session reuse
- MD5 hashing for efficient lookups
- LRU eviction for memory management

Disk Cache
- Persistent storage across sessions
- JSON serialization for portability
- TTL-based cache invalidation

Cache Statistics
- Hit/miss ratios tracked
- Performance metrics reported
- Cache efficiency monitoring
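The memory cache described above (MD5 keys, LRU eviction, hit/miss tracking) can be sketched with the standard library. The class name `EmbeddingCache`, its methods, and the choice to include the model name in the key are assumptions for illustration. The disk layer with JSON and TTL is left out to keep the example short.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """Hypothetical in-memory LRU cache keyed by an MD5 hash of the text."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._store = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text, model):
        # Include the model name so different models never share entries.
        return hashlib.md5(f"{model}:{text}".encode("utf-8")).hexdigest()

    def get(self, text, model):
        key = self._key(text, model)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, text, model, embedding):
        key = self._key(text, model)
        self._store[key] = embedding
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = EmbeddingCache(max_size=2)
cache.put("Metabolite: Histidine", "text-embedding-ada-002", [0.1, 0.2])
cached = cache.get("Metabolite: Histidine", "text-embedding-ada-002")
```

A caller would check `get` before requesting an embedding and call `put` afterwards, so repeated context strings cost one API request at most.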

Error Handling and Resilience

The action includes comprehensive error handling:

API Failures
- Graceful fallback when OpenAI APIs fail
- Retry logic with exponential backoff
- Partial results preservation

Rate Limiting
- Automatic rate limit detection
- Adaptive throttling
- Cost monitoring and alerting

Data Quality Issues
- Empty context field handling
- Invalid response parsing
- Confidence threshold validation
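Retry logic with exponential backoff, as listed under API Failures, generally follows the pattern below. This is a generic sketch, not the action's implementation: `call_with_backoff`, its parameters, and the jitter amount are assumptions, and a real integration would catch only transient errors (such as rate-limit or timeout exceptions) instead of every `Exception`.

```python
import random
import time


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying failures with exponentially growing delays.

    The delay before retry n (starting at 0) is base_delay * 2**n plus a
    small random jitter. The last failure is re-raised once max_retries
    retries have been used. `sleep` is injectable so tests need not wait.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:  # assumption: narrowed to transient errors in practice
            if attempt == max_retries:
                raise
            jitter = random.uniform(0, 0.1 * base_delay)
            sleep(base_delay * (2 ** attempt) + jitter)


# Example: a stand-in API call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays = []
result = call_with_backoff(flaky_call, base_delay=0.5, sleep=delays.append)
```

The jitter spreads out retries from concurrent workers so they do not all hit the API again at the same instant.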

Statistics and Monitoring

The action reports detailed statistics that help tune thresholds and track costs:

{
    "matched_count": 45,
    "unmatched_count": 15,
    "llm_calls": 87,
    "cache_hits": 32,
    "confidence_distribution": {
        "high": 35,    # ≥0.9
        "medium": 10,  # 0.75-0.9
        "low": 0       # <0.75
    },
    "embedding_similarity_avg": 0.876,
    "llm_validation_rate": 0.52,
    "api_costs_estimated": 2.34
}
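The `confidence_distribution` buckets in the statistics above can be reproduced with a few lines. The function names are hypothetical. The bucket boundaries (high at 0.9 and above, medium from 0.75 up to 0.9, low below 0.75) follow the comments in the example output.

```python
from collections import Counter


def confidence_bucket(confidence):
    """Map a confidence score to the bucket names used in the statistics."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.75:
        return "medium"
    return "low"


def confidence_distribution(confidences):
    """Count scores per bucket, always reporting all three buckets."""
    counts = Counter(confidence_bucket(c) for c in confidences)
    return {name: counts.get(name, 0) for name in ("high", "medium", "low")}


distribution = confidence_distribution([0.95, 0.92, 0.85, 0.80, 0.60])
```

With the default `confidence_threshold` of 0.75, accepted matches never fall in the low bucket, which is consistent with the zero shown in the example statistics.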

Best Practices

Optimize context fields: Include pathway and description information for better embeddings
Set appropriate thresholds: Balance recall vs precision with confidence thresholds
Monitor costs: Use max_llm_calls to control OpenAI API expenses
Cache embeddings: Enable caching for repeated analyses
Validate results: Review LLM reasoning for biological accuracy
Batch efficiently: Use appropriate batch sizes for your API limits

Performance Optimization

Embedding Efficiency
- Batch processing for reduced API calls
- Intelligent caching strategy
- Deduplication of similar contexts

LLM Cost Management
- Pre-filtering with embedding similarity
- Configurable call limits
- Cost estimation and tracking

Memory Management
- Streaming processing for large datasets
- Cache size limitations
- Garbage collection optimization
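Batching and deduplication for embedding requests can be sketched as follows. All names here are hypothetical, `embed_batch` is a stand-in for whatever function calls the embeddings API, and only exact duplicate strings are removed. How the action treats merely similar contexts is not specified in this document.

```python
def unique_in_order(items):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def batches(items, batch_size=10):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(contexts, embed_batch, batch_size=10):
    """Embed each distinct context once, batch_size strings per request.

    embed_batch: callable taking a list of strings and returning one vector
    per string, in the same order (assumed interface).
    """
    embeddings = {}
    for chunk in batches(unique_in_order(contexts), batch_size):
        for text, vector in zip(chunk, embed_batch(chunk)):
            embeddings[text] = vector
    return embeddings


# Stand-in embedder that records how many requests were made.
requests = []


def fake_embed_batch(texts):
    requests.append(len(texts))
    return [[float(len(text))] for text in texts]


contexts = [f"Metabolite: m{i % 12}" for i in range(25)]
vectors = embed_all(contexts, fake_embed_batch, batch_size=5)
```

In this example 25 context strings collapse to 12 distinct ones, which take three requests at a batch size of 5 instead of 25 single-item requests.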

Integration Examples

With Traditional Matching

steps:
  - name: exact_match
    action:
      type: NIGHTINGALE_NMR_MATCH
      params:
        dataset_key: "metabolomics_data"
        output_key: "exact_matches"
        unmatched_key: "unmatched_after_exact"

  - name: semantic_match
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        unmatched_dataset: "unmatched_after_exact"
        reference_map: "nightingale_reference"
        # context_fields is required; omitted here for brevity
        output_key: "semantic_matches"

With Quality Assessment

steps:
  - name: semantic_matching
    action:
      type: SEMANTIC_METABOLITE_MATCH
      # ... parameters

  - name: validate_semantic_quality
    action:
      type: CALCULATE_MAPPING_QUALITY
      params:
        source_key: "unmatched_metabolites"
        mapped_key: "semantic_matches"
        confidence_column: "match_confidence"
        output_key: "semantic_quality_metrics"

Requirements

API Access
- OpenAI API key required
- Sufficient API credits for embeddings and LLM calls
- Network access to OpenAI endpoints

Dependencies
- openai Python package
- numpy for similarity calculations
- scikit-learn for cosine similarity

Environment Variables
- OPENAI_API_KEY: Your OpenAI API key
- SEMANTIC_MATCH_CACHE_DIR: Optional cache directory
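Before running a pipeline it can help to check these environment variables up front. The helper below is a hypothetical pre-flight check, not part of the action. Only the two variable names come from this page.

```python
import os
from pathlib import Path


def load_settings(env=os.environ):
    """Read the two environment variables named above.

    Raises RuntimeError when OPENAI_API_KEY is missing or empty.
    SEMANTIC_MATCH_CACHE_DIR is optional and becomes None when unset.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    cache_dir = env.get("SEMANTIC_MATCH_CACHE_DIR")
    return {
        "api_key": api_key,
        "cache_dir": Path(cache_dir) if cache_dir else None,
    }


# Usage: settings = load_settings()  # reads the real process environment
```

Failing early with a clear message avoids discovering a missing key only after embeddings for part of the dataset have been requested.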

The semantic matching action combines embedding similarity with LLM-based biological validation to identify metabolites that exact matching misses, while keeping API costs bounded through pre-filtering, call limits, and caching.