HMDB Vector Match
Overview
The HMDB_VECTOR_MATCH action performs semantic similarity matching using vector embeddings to map metabolite identifiers against the Human Metabolome Database (HMDB). This action uses the Qdrant vector database for high-performance similarity search and FastEmbed for efficient embedding generation.
This is particularly useful for Stage 4 progressive matching when direct matching, fuzzy matching, and other approaches have been exhausted.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_key | string | Yes | Key for the input dataset containing metabolite identifiers |
| output_key | string | Yes | Key for the output dataset with matched identifiers |
| identifier_column | string | Yes | Column name containing metabolite identifiers to match |
| threshold | float | No | Minimum similarity threshold (default: 0.7) |
| max_results | integer | No | Maximum number of matches per metabolite (default: 5) |
| use_llm_validation | boolean | No | Enable LLM validation for high-confidence matches (default: false) |
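The defaults in the table can be captured in a small validation sketch. The class name `VectorMatchParams` and the range checks are illustrative assumptions, not the action's actual parameter model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorMatchParams:
    """Hypothetical container mirroring the documented parameters."""

    input_key: str
    output_key: str
    identifier_column: str
    threshold: float = 0.7            # documented default
    max_results: int = 5              # documented default
    use_llm_validation: bool = False  # documented default

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real action may validate differently.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        if self.max_results < 1:
            raise ValueError("max_results must be at least 1")


# Only the three required parameters are given; the rest use defaults.
params = VectorMatchParams(
    input_key="stage3_unmatched",
    output_key="stage4_matched",
    identifier_column="metabolite_name",
)
```

Checking values up front like this is one way to reject an out-of-range threshold before any search is run.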
Performance
Vector Search Speed: ~1-5ms per query
Batch Processing: Processes multiple metabolites efficiently
Memory Usage: Optimized embedding generation
Coverage Improvement: Typically adds 5-10% to total pipeline coverage
Expected coverage improvement over previous stages:
- After direct matching: +15-25%
- After fuzzy matching: +8-15%
- After RampDB bridge: +5-10%
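As a worked example of how such a gain might be computed, the sketch below assumes the figures are percentage points of the full input dataset. The function name and the counts are made up for illustration.

```python
def coverage_gain(total: int, matched_before: int, newly_matched: int) -> float:
    """Return the coverage gain in percentage points of the full dataset."""
    if total <= 0:
        raise ValueError("total must be positive")
    if matched_before + newly_matched > total:
        raise ValueError("matched counts exceed total")
    return 100.0 * newly_matched / total


# Illustrative numbers only: 1,000 metabolites, 700 matched by earlier
# stages, 80 more matched by vector search.
gain = coverage_gain(total=1000, matched_before=700, newly_matched=80)
print(f"Stage 4 gain: +{gain:.1f} percentage points")
```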
Example Usage
YAML Strategy
steps:
  - name: stage4_vector_matching
    action:
      type: HMDB_VECTOR_MATCH
      params:
        input_key: stage3_unmatched
        output_key: stage4_matched
        identifier_column: metabolite_name
        threshold: 0.75
        max_results: 3
        use_llm_validation: true
Python Client
import asyncio

from src.client.client_v2 import BiomapperClient


async def main(metabolite_df):
    client = BiomapperClient(base_url="http://localhost:8000")
    # The input dataset must be registered under the key used in input_key
    context = {"datasets": {"unmatched_metabolites": metabolite_df}}
    result = await client.run_action(
        action_type="HMDB_VECTOR_MATCH",
        params={
            "input_key": "unmatched_metabolites",
            "output_key": "vector_matched",
            "identifier_column": "compound_name",
            "threshold": 0.8,
        },
        context=context,
    )
    return result


# Load your metabolite data first, then run the coroutine:
# result = asyncio.run(main(metabolite_df))
Output Format
The action returns a dataset with the following structure:
| Column | Description |
|---|---|
| | Original metabolite identifier from input |
| | HMDB identifier of the best match |
| | Name of the matched HMDB compound |
| | Cosine similarity score (0.0-1.0) |
| | Confidence level: high/medium/low |
| | LLM validation result (if enabled) |
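The source does not state how a similarity score is translated into a confidence level. The sketch below shows one plausible binning. The cutoffs 0.9 and 0.8 and the function name are assumptions chosen for illustration only.

```python
def confidence_level(score: float, high: float = 0.9, medium: float = 0.8) -> str:
    """Map a cosine similarity score to high/medium/low.

    The cutoffs are hypothetical; the action's real thresholds may differ.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


levels = [confidence_level(s) for s in (0.95, 0.85, 0.72)]
```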
Technical Details
Vector Database
Uses Qdrant vector database with pre-computed HMDB embeddings:
Collection: hmdb_metabolites
Vector Size: 384 dimensions (all-MiniLM-L6-v2)
Distance Metric: Cosine similarity
Index Type: HNSW for fast approximate search
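To make the roles of the cosine metric, `threshold`, and `max_results` concrete, here is a brute-force sketch in plain Python. It is a stand-in for illustration only: Qdrant's HNSW index performs an approximate search rather than scoring every vector, and the toy vectors and IDs below are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 0.7,
    max_results: int = 5,
) -> List[Tuple[str, float]]:
    """Score every entry, drop those below threshold, return the best few."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    kept = [item for item in scored if item[1] >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_results]


# Toy 3-dimensional vectors; the real embeddings have 384 dimensions.
toy_index = {
    "ID_A": [1.0, 0.0, 0.0],
    "ID_B": [0.9, 0.1, 0.0],
    "ID_C": [0.0, 1.0, 0.0],
}
matches = top_matches([1.0, 0.0, 0.0], toy_index, threshold=0.7, max_results=2)
```

Here `ID_C` is orthogonal to the query, so it scores 0.0 and is dropped by the threshold.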
Embedding Generation
Model: sentence-transformers/all-MiniLM-L6-v2
Library: FastEmbed for optimized performance
Preprocessing: Text normalization and cleaning
Batch Size: Configurable for memory optimization
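The exact normalization steps are not specified here. As a minimal sketch, assuming they amount to Unicode normalization, lowercasing, and whitespace cleanup, preprocessing could look like the following. `normalize_name` is a hypothetical helper, not part of the action's API.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Assumed cleanup: NFKC normalization, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", name)
    text = text.strip().lower()
    # Replace any run of whitespace (tabs, newlines, spaces) with one space.
    text = re.sub(r"\s+", " ", text)
    return text


cleaned = normalize_name("  L-Carnitine\t HCl ")
```

Applying the same cleanup to query names and to the names used to build the collection keeps the two sides comparable.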
LLM Validation (Optional)
When use_llm_validation is enabled:
Uses a lightweight language model for validation
Compares original and matched compound names
Provides confidence assessment
Filters out obvious false positives
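The validation step can be pictured as a filter over (original, matched) name pairs. In the sketch below the validator is a plain string-similarity check from Python's standard library, used as a placeholder where the LLM call would go. The function names and the 0.5 cutoff are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# A validator takes (original_name, matched_name) and returns accept/reject.
Validator = Callable[[str, str], bool]


def placeholder_validator(original: str, matched: str) -> bool:
    """Stand-in for an LLM call: accept pairs with similar spelling."""
    ratio = SequenceMatcher(None, original.lower(), matched.lower()).ratio()
    return ratio >= 0.5


def filter_matches(
    pairs: List[Tuple[str, str]], validate: Validator
) -> List[Tuple[str, str]]:
    """Keep only the pairs the validator accepts."""
    return [(orig, match) for orig, match in pairs if validate(orig, match)]


pairs = [("citric acid", "Citric acid"), ("citric acid", "Cholesterol")]
kept = filter_matches(pairs, placeholder_validator)
```

Passing the validator in as a callable means the placeholder can be swapped for a real model-backed check without changing the filter.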
Best Practices
Threshold Selection: Start with 0.7-0.8 for balanced precision/recall
Progressive Use: Use as final stage after direct/fuzzy matching
Validation: Enable LLM validation for critical applications
Batch Processing: Process large datasets in chunks for optimal performance
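Chunking a large list of identifiers can be sketched as follows. The helper name `chunked` and the chunk size are illustrative, and a suitable size depends on available memory.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


names = [f"compound_{i}" for i in range(10)]
batches = list(chunked(names, size=4))
```

Each batch can then be submitted as its own input dataset, so a failure affects only one chunk.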
Troubleshooting
Common Issues
Qdrant Connection Error: Ensure the vector database is running
Low Match Quality: Adjust threshold or enable LLM validation
Performance Issues: Reduce batch size or max_results
Missing Embeddings: Verify the HMDB collection exists and is populated
Performance Tuning
Reduce threshold: Increases recall but may reduce precision
Increase max_results: More candidates but slower processing
Enable batching: Process multiple queries together
Optimize embedding model: Use smaller model for speed
See Also
Metabolite Fuzzy String Match - String-based fuzzy matching
Metabolite RampDB Bridge - RampDB integration
Progressive Semantic Match - Multi-stage semantic matching
Metabolomics Progressive Pipeline - Complete workflow implementation