HMDB Vector Match

Overview

The HMDB_VECTOR_MATCH action performs semantic similarity matching using vector embeddings to map metabolite identifiers against the Human Metabolome Database (HMDB). This action uses the Qdrant vector database for high-performance similarity search and FastEmbed for efficient embedding generation.

The action is most useful as Stage 4 of progressive matching, once direct matching, fuzzy matching, and other approaches have been exhausted.

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| input_key | string | Yes | Key for the input dataset containing metabolite identifiers |
| output_key | string | Yes | Key for the output dataset with matched identifiers |
| identifier_column | string | Yes | Column name containing metabolite identifiers to match |
| threshold | float | No | Minimum similarity threshold (default: 0.7) |
| max_results | integer | No | Maximum number of matches per metabolite (default: 5) |
| use_llm_validation | boolean | No | Enable LLM validation for high-confidence matches (default: false) |
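The parameters and defaults listed here can be collected in a small helper before they are passed to the action. The following is a minimal sketch, not the action's actual parameter model: the class name `VectorMatchParams` and the range checks are assumptions made for illustration; only the field names and default values come from this page.

```python
from dataclasses import dataclass, asdict


@dataclass
class VectorMatchParams:
    """Hypothetical container mirroring the documented parameters."""

    input_key: str
    output_key: str
    identifier_column: str
    threshold: float = 0.7            # documented default
    max_results: int = 5              # documented default
    use_llm_validation: bool = False  # documented default

    def __post_init__(self):
        # Assumed sanity checks; the real action may validate differently.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        if self.max_results < 1:
            raise ValueError("max_results must be at least 1")


params = VectorMatchParams(
    input_key="stage3_unmatched",
    output_key="stage4_matched",
    identifier_column="metabolite_name",
)
params_dict = asdict(params)  # plain dict with all six parameters filled in
```

Only the three required fields have to be supplied; the optional ones fall back to the documented defaults.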

Performance

  • Vector Search Speed: ~1-5ms per query

  • Batch Processing: Processes multiple metabolites efficiently

  • Memory Usage: Optimized embedding generation

  • Coverage Improvement: Typically adds 5-10% to total pipeline coverage

Expected coverage improvement over previous stages:

  • After direct matching: +15-25%

  • After fuzzy matching: +8-15%

  • After RampDB bridge: +5-10%

Example Usage

YAML Strategy

steps:
  - name: stage4_vector_matching
    action:
      type: HMDB_VECTOR_MATCH
      params:
        input_key: stage3_unmatched
        output_key: stage4_matched
        identifier_column: metabolite_name
        threshold: 0.75
        max_results: 3
        use_llm_validation: true

Python Client

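import asyncio

import pandas as pd
from src.client.client_v2 import BiomapperClient


async def main():
    client = BiomapperClient(base_url="http://localhost:8000")

    # Load your metabolite data first (illustrative rows; replace with your own)
    metabolite_df = pd.DataFrame({"compound_name": ["glucose", "L-carnitine"]})
    context = {"datasets": {"unmatched_metabolites": metabolite_df}}

    # run_action is awaited, so it has to be called from inside a coroutine
    return await client.run_action(
        action_type="HMDB_VECTOR_MATCH",
        params={
            "input_key": "unmatched_metabolites",
            "output_key": "vector_matched",
            "identifier_column": "compound_name",
            "threshold": 0.8,
        },
        context=context,
    )


result = asyncio.run(main())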

Output Format

The action returns a dataset with the following structure:

| Column | Description |
|---|---|
| original_id | Original metabolite identifier from input |
| matched_hmdb_id | HMDB identifier of the best match |
| matched_name | Name of the matched HMDB compound |
| similarity_score | Cosine similarity score (0.0-1.0) |
| match_confidence | Confidence level: high/medium/low |
| llm_validation | LLM validation result (if enabled) |
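When `max_results` is greater than 1, it can help to reduce the output to one row per input. The sketch below assumes the output may contain several candidate rows for the same `original_id` (this page does not state whether it does) and keeps the highest-scoring one. The rows are made-up placeholders, not real HMDB records, and `best_match_per_id` is a hypothetical helper.

```python
# Placeholder rows using the documented column names; IDs and scores are invented.
rows = [
    {"original_id": "cmpd-1", "matched_hmdb_id": "PLACEHOLDER-A", "similarity_score": 0.84},
    {"original_id": "cmpd-1", "matched_hmdb_id": "PLACEHOLDER-B", "similarity_score": 0.91},
    {"original_id": "cmpd-2", "matched_hmdb_id": "PLACEHOLDER-C", "similarity_score": 0.77},
]


def best_match_per_id(rows):
    """Return {original_id: row}, keeping the row with the highest similarity_score."""
    best = {}
    for row in rows:
        key = row["original_id"]
        if key not in best or row["similarity_score"] > best[key]["similarity_score"]:
            best[key] = row
    return best


best = best_match_per_id(rows)
```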

Technical Details

Vector Database

Uses Qdrant vector database with pre-computed HMDB embeddings:

  • Collection: hmdb_metabolites

  • Vector Size: 384 dimensions (all-MiniLM-L6-v2)

  • Distance Metric: Cosine similarity

  • Index Type: HNSW for fast approximate search
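To make the roles of `threshold` and `max_results` concrete, here is a minimal sketch of cosine-similarity search in plain Python. It is a brute-force scan over a toy 3-dimensional index, not the HNSW approximate search that Qdrant performs over 384-dimensional vectors; the function names and the sample vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, index, threshold=0.7, max_results=5):
    """Score every vector, drop those below threshold, return the top max_results."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_results]


# Toy index: names map to made-up vectors.
index = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
hits = search([1.0, 0.0, 0.0], index, threshold=0.7, max_results=2)
```

Raising `threshold` removes weaker candidates before the `max_results` cut is applied, which is why the two parameters interact.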

Embedding Generation

  • Model: sentence-transformers/all-MiniLM-L6-v2

  • Library: FastEmbed for optimized performance

  • Preprocessing: Text normalization and cleaning

  • Batch Size: Configurable for memory optimization
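The preprocessing step is not specified in detail here. As a rough sketch of what text normalization before embedding might look like, the hypothetical `normalize_name` function below applies Unicode normalization, lowercasing, and whitespace collapsing; the action's real cleaning rules may differ.

```python
import re
import unicodedata


def normalize_name(name):
    """Assumed normalization: NFKC, trim, lowercase, collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", name)
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text


cleaned = [normalize_name(n) for n in ["  L-Carnitine\t", "D-Glucose   6-phosphate"]]
```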

LLM Validation (Optional)

When use_llm_validation is enabled:

  • Uses lightweight language model for validation

  • Compares original and matched compound names

  • Provides confidence assessment

  • Filters out obvious false positives
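The validation flow described above can be sketched with a pluggable validator. In this illustration the validator is a simple stub standing in for the language model call, so nothing here reflects the action's real prompt or model; `filter_matches`, `stub_validator`, and the `"accepted"` value written to `llm_validation` are all assumptions.

```python
def stub_validator(original_name, matched_name):
    """Stand-in for an LLM call: accept when the two names share at least one token."""
    original_tokens = set(original_name.lower().split())
    matched_tokens = set(matched_name.lower().split())
    return bool(original_tokens & matched_tokens)


def filter_matches(matches, validator):
    """Keep matches the validator accepts and record the verdict on each kept row."""
    kept = []
    for match in matches:
        if validator(match["original_id"], match["matched_name"]):
            kept.append({**match, "llm_validation": "accepted"})
    return kept


# Illustrative candidates for a single input name.
candidates = [
    {"original_id": "citric acid", "matched_name": "Citric acid"},
    {"original_id": "citric acid", "matched_name": "Glucose"},
]
validated = filter_matches(candidates, stub_validator)
```

Passing the validator as a callable keeps the filtering logic independent of whichever model is used.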

Best Practices

  1. Threshold Selection: Start with 0.7-0.8 for balanced precision/recall

  2. Progressive Use: Use as final stage after direct/fuzzy matching

  3. Validation: Enable LLM validation for critical applications

  4. Batch Processing: Process large datasets in chunks for optimal performance
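Chunking a large input, as suggested in item 4, can be as simple as the sketch below. The helper name `chunked` and the chunk size are illustrative; a suitable size depends on available memory and is not specified on this page.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


names = [f"metabolite_{i}" for i in range(7)]
batches = list(chunked(names, 3))  # three batches: 3 + 3 + 1 items
```

Each batch can then be submitted as its own dataset and the results concatenated afterwards.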

Troubleshooting

Common Issues

  • Qdrant Connection Error: Ensure vector database is running

  • Low Match Quality: Adjust threshold or enable LLM validation

  • Performance Issues: Reduce batch size or max_results

  • Missing Embeddings: Verify HMDB collection exists and is populated
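For the connection error in particular, a quick reachability probe can rule out the simplest cause. This sketch uses only the standard library and assumes Qdrant is served at `http://localhost:6333` and exposes a `/readyz` readiness endpoint; both are assumptions about the deployment, so adjust the URL and path to match yours.

```python
import urllib.error
import urllib.request


def qdrant_is_reachable(base_url="http://localhost:6333", timeout=2.0):
    """Return True if GET {base_url}/readyz answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP status.
        return False
```

A `False` result only shows that the service did not respond at that address; it says nothing about whether the `hmdb_metabolites` collection exists or is populated.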

Performance Tuning

  • Reduce threshold: Increases recall but may reduce precision

  • Increase max_results: More candidates but slower processing

  • Enable batching: Process multiple queries together

  • Optimize embedding model: Use smaller model for speed
