HMDB Vector Match

Overview

The HMDB_VECTOR_MATCH action performs semantic similarity matching using vector embeddings to map metabolite identifiers against the Human Metabolome Database (HMDB). This action uses the Qdrant vector database for high-performance similarity search and FastEmbed for efficient embedding generation.

This action is intended for Stage 4 of progressive matching, after direct matching, fuzzy matching, and other approaches have been exhausted.

Parameters

Parameter	Type	Required	Description
`input_key`	string	Yes	Key for the input dataset containing metabolite identifiers
`output_key`	string	Yes	Key for the output dataset with matched identifiers
`identifier_column`	string	Yes	Column name containing metabolite identifiers to match
`threshold`	float	No	Minimum cosine similarity score (0.0-1.0) required to keep a match (default: 0.7)
`max_results`	integer	No	Maximum number of matches per metabolite (default: 5)
`use_llm_validation`	boolean	No	Enable LLM validation for high-confidence matches (default: false)
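
The parameters and defaults in the table can be mirrored in a small validation helper. This is a minimal sketch, not the action's actual implementation: the class name `HmdbVectorMatchParams` and the range checks are assumptions made for illustration. Only the field names and default values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HmdbVectorMatchParams:
    """Hypothetical container mirroring the documented parameters."""

    input_key: str
    output_key: str
    identifier_column: str
    threshold: float = 0.7            # documented default
    max_results: int = 5              # documented default
    use_llm_validation: bool = False  # documented default

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real action may validate differently.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        if self.max_results < 1:
            raise ValueError("max_results must be at least 1")


# Only the required parameters are given, so the defaults apply.
params = HmdbVectorMatchParams(
    input_key="stage3_unmatched",
    output_key="stage4_matched",
    identifier_column="metabolite_name",
)
```

Checking values before submitting a strategy catches mistakes such as a threshold of 75 instead of 0.75 early.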

Performance

Vector Search Speed: ~1-5ms per query
Batch Processing: Processes multiple metabolites efficiently
Memory Usage: Optimized embedding generation
Coverage Improvement: Typically adds 5-10% to total pipeline coverage

Expected coverage improvement over previous stages:

After direct matching: +15-25%
After fuzzy matching: +8-15%
After RampDB bridge: +5-10%

Example Usage

YAML Strategy

steps:
  - name: stage4_vector_matching
    action:
      type: HMDB_VECTOR_MATCH
      params:
        input_key: stage3_unmatched
        output_key: stage4_matched
        identifier_column: metabolite_name
        threshold: 0.75
        max_results: 3
        use_llm_validation: true

Python Client

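import asyncio

from src.client.client_v2 import BiomapperClient


async def run_vector_match(metabolite_df):
    """Run HMDB_VECTOR_MATCH on a DataFrame that has a compound_name column."""
    client = BiomapperClient(base_url="http://localhost:8000")

    # The dataset key here must match the input_key passed in params
    context = {"datasets": {"unmatched_metabolites": metabolite_df}}

    return await client.run_action(
        action_type="HMDB_VECTOR_MATCH",
        params={
            "input_key": "unmatched_metabolites",
            "output_key": "vector_matched",
            "identifier_column": "compound_name",
            "threshold": 0.8,
        },
        context=context,
    )


# Load your metabolite data into metabolite_df first.
# `await` is only valid inside a coroutine, so drive it with asyncio.run().
result = asyncio.run(run_vector_match(metabolite_df))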

Output Format

The action returns a dataset with the following structure:

Column	Description
`original_id`	Original metabolite identifier from input
`matched_hmdb_id`	HMDB identifier of the best match
`matched_name`	Name of the matched HMDB compound
`similarity_score`	Cosine similarity score (0.0-1.0)
`match_confidence`	Confidence level: high/medium/low
`llm_validation`	LLM validation result (if enabled)
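
Because `max_results` can yield several rows per input identifier, downstream code often needs only the top-scoring row for each `original_id`. The sketch below shows one way to do that over plain dictionaries that use the documented column names. The rows and the placeholder HMDB identifiers are invented for illustration, and the helper name `best_match_per_id` is hypothetical.

```python
from typing import Any, Dict, List

# Illustrative rows using the documented column names; all values are made up.
rows: List[Dict[str, Any]] = [
    {"original_id": "compound A", "matched_hmdb_id": "HMDB_PLACEHOLDER_1",
     "similarity_score": 0.91},
    {"original_id": "compound A", "matched_hmdb_id": "HMDB_PLACEHOLDER_2",
     "similarity_score": 0.78},
    {"original_id": "compound B", "matched_hmdb_id": "HMDB_PLACEHOLDER_3",
     "similarity_score": 0.84},
]


def best_match_per_id(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep the row with the highest similarity_score for each original_id."""
    best: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        key = row["original_id"]
        if key not in best or row["similarity_score"] > best[key]["similarity_score"]:
            best[key] = row
    return best


best = best_match_per_id(rows)
```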

Technical Details

Vector Database

Uses Qdrant vector database with pre-computed HMDB embeddings:

Collection: hmdb_metabolites
Vector Size: 384 dimensions (all-MiniLM-L6-v2)
Distance Metric: Cosine similarity
Index Type: HNSW for fast approximate search
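
To make the roles of `threshold` and `max_results` concrete, the following sketch ranks candidates by cosine similarity using an exhaustive scan. It is not how Qdrant works internally: HNSW is an approximate index, while this loop compares the query against every vector. The function names and the 3-dimensional toy vectors are assumptions for illustration (the real collection stores 384-dimensional vectors).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 0.7,
    max_results: int = 5,
) -> List[Tuple[str, float]]:
    """Exhaustive search: score every entry, filter by threshold, keep the top hits."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_results]


# Toy vectors standing in for stored embeddings.
toy_index = {
    "entry_a": [1.0, 0.0, 0.0],
    "entry_b": [0.8, 0.6, 0.0],
    "entry_c": [0.0, 0.0, 1.0],
}
hits = search([1.0, 0.0, 0.0], toy_index, threshold=0.7, max_results=2)
```

Here `entry_c` is orthogonal to the query and falls below the threshold, so only `entry_a` and `entry_b` are returned, in descending score order.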

Embedding Generation

Model: sentence-transformers/all-MiniLM-L6-v2
Library: FastEmbed for optimized performance
Preprocessing: Text normalization and cleaning
Batch Size: Configurable for memory optimization
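
The exact normalization rules are not specified here. As a rough sketch of what "text normalization and cleaning" might involve before embedding, the example below applies Unicode normalization, trims and collapses whitespace, lowercases, and removes duplicates. All of these steps and both function names are assumptions; lowercasing in particular may be undesirable where letter case carries chemical meaning.

```python
import re
import unicodedata
from typing import Iterable, List


def normalize_name(name: str) -> str:
    """Assumed cleaning steps: NFKC, trim, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", name)
    text = text.strip().lower()
    return re.sub(r"\s+", " ", text)


def prepare_names(names: Iterable[str]) -> List[str]:
    """Normalize names, dropping empty strings and duplicates (order preserved)."""
    seen = set()
    cleaned: List[str] = []
    for raw in names:
        norm = normalize_name(raw)
        if norm and norm not in seen:
            seen.add(norm)
            cleaned.append(norm)
    return cleaned


cleaned = prepare_names(["  L-Alanine ", "l-alanine", "Citric   Acid", ""])
```

Deduplicating before embedding avoids computing the same vector twice for names that differ only in spacing or case.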

LLM Validation (Optional)

When `use_llm_validation` is enabled, the action:

Uses a lightweight language model for validation
Compares original and matched compound names
Provides confidence assessment
Filters out obvious false positives

Best Practices

Threshold Selection: Start with 0.7-0.8 for balanced precision/recall
Progressive Use: Use as final stage after direct/fuzzy matching
Validation: Enable LLM validation for critical applications
Batch Processing: Process large datasets in chunks for optimal performance
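
Chunked processing can be sketched with a small generator. The helper name `chunked` and the chunk size are illustrative assumptions; each chunk would then be submitted as its own dataset to the action.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# Seven placeholder names split into chunks of three.
names = [f"compound_{i}" for i in range(7)]
chunks = list(chunked(names, chunk_size=3))
```

The last chunk is simply shorter when the input length is not a multiple of the chunk size.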

Troubleshooting

Common Issues

Qdrant Connection Error: Ensure the Qdrant vector database is running
Low Match Quality: Raise `threshold` or enable `use_llm_validation`
Performance Issues: Reduce the batch size or `max_results`
Missing Embeddings: Verify that the hmdb_metabolites collection exists and is populated

Performance Tuning

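Reduce threshold: Increases recall but may reduce precision
Increase max_results: Returns more candidates but slows processing
Enable batching: Process multiple queries together
Optimize embedding model: Use a smaller model for speed; queries and the HMDB collection must be embedded with the same model, so changing it requires rebuilding the collection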
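
The precision/recall trade-off of the threshold can be checked on a small hand-labeled sample before settling on a value. The sketch below uses synthetic scores and labels invented purely for illustration; they are not results from this action, and the function name `precision_recall` is hypothetical. Recall is measured relative to the correct candidates in the sample.

```python
from typing import List, Tuple

# Synthetic (similarity_score, is_correct) pairs for illustration only.
labeled: List[Tuple[float, bool]] = [
    (0.95, True), (0.88, True), (0.82, False), (0.78, True),
    (0.72, False), (0.65, True), (0.60, False),
]


def precision_recall(
    labeled: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when accepting every candidate scoring >= threshold."""
    accepted = [ok for score, ok in labeled if score >= threshold]
    total_correct = sum(1 for _, ok in labeled if ok)
    true_pos = sum(1 for ok in accepted if ok)
    precision = true_pos / len(accepted) if accepted else 0.0
    recall = true_pos / total_correct if total_correct else 0.0
    return precision, recall


for threshold in (0.9, 0.8, 0.7, 0.6):
    precision, recall = precision_recall(labeled, threshold)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy sample, lowering the threshold from 0.9 to 0.6 raises recall from 0.25 to 1.00 while precision falls from 1.00 to about 0.57.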