Metabolite RampDB Bridge
Overview
The METABOLITE_RAMPDB_BRIDGE action integrates with the RampDB API to perform external database lookups for metabolite identifiers. This action is typically used in Stage 3 of the progressive metabolomics pipeline to leverage RampDB’s comprehensive metabolite mapping capabilities.
RampDB provides cross-references between multiple metabolite databases including HMDB, KEGG, ChEBI, PubChem, and others, making it valuable for resolving identifiers that couldn’t be matched through direct or fuzzy approaches.
Key Features
External API Integration: Real-time queries to RampDB service
Cross-Database Mapping: Maps across HMDB, KEGG, ChEBI, PubChem, and more
Batch Processing: Optimized batch API calls for better performance
Retry Logic: Robust error handling with exponential backoff
Rate Limiting: Respects API rate limits to avoid service disruption
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_key | string | Yes | Key for the input dataset containing metabolite identifiers |
| output_key | string | Yes | Key for the output dataset with RampDB matches |
| identifier_column | string | Yes | Column containing metabolite identifiers to query |
| batch_size | integer | No | Number of identifiers per API call (default: 50) |
| timeout | integer | No | API request timeout in seconds (default: 30) |
| max_retries | integer | No | Maximum retry attempts for failed requests (default: 3) |
| target_databases | list | No | Specific databases to query (default: ["hmdb", "kegg", "chebi"]) |
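As a rough illustration of how these parameters and their defaults fit together, the sketch below models them as a plain dataclass with basic validation. The class name `RampDBBridgeParams` and the validation rules are assumptions made for this example; the action's actual parameter model may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RampDBBridgeParams:
    """Hypothetical container mirroring the parameter table above."""

    input_key: str
    output_key: str
    identifier_column: str
    batch_size: int = 50
    timeout: int = 30
    max_retries: int = 3
    target_databases: List[str] = field(
        default_factory=lambda: ["hmdb", "kegg", "chebi"]
    )

    def __post_init__(self) -> None:
        # Reject values that could never produce a valid API call.
        # These checks are illustrative, not the action's documented rules.
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.timeout <= 0:
            raise ValueError("timeout must be positive")
        if self.max_retries < 0:
            raise ValueError("max_retries cannot be negative")


# Only the required keys and one override are supplied; the rest use defaults.
params = RampDBBridgeParams(
    input_key="stage2_unmatched",
    output_key="stage3_matched",
    identifier_column="metabolite_name",
    batch_size=25,
)
```

Omitted optional fields fall back to the defaults listed in the table, so `params.timeout` is 30 and `params.max_retries` is 3 here.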
Performance Metrics
Expected performance for Stage 3 of the progressive pipeline:
Coverage Addition: +8-12% over previous stages
Processing Speed: 30-60 seconds for 1,000 metabolites (API dependent)
Success Rate: 90-95% API call success rate
Database Coverage: Access to 15+ metabolite databases
API Rate Limits:
- Requests per minute: 60 (varies by RampDB service level)
- Concurrent requests: 5 maximum recommended
- Daily quota: 10,000 requests (free tier)
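To see how batch size interacts with these limits, the following sketch estimates how many API calls a run needs and the minimum time the per-minute limit imposes. The function name is hypothetical, the limits are the figures quoted above, and retries and actual response times are ignored.

```python
import math


def estimate_request_budget(
    n_identifiers: int,
    batch_size: int = 50,
    requests_per_minute: int = 60,
    daily_quota: int = 10_000,
) -> dict:
    """Estimate API calls and the rate-limit floor on wall-clock time."""
    n_requests = math.ceil(n_identifiers / batch_size)
    # Minimum spacing between calls that stays under the per-minute limit.
    min_interval_s = 60.0 / requests_per_minute
    return {
        "requests": n_requests,
        "min_interval_s": min_interval_s,
        # The first call needs no wait, so n calls need n - 1 intervals.
        "min_duration_s": max(n_requests - 1, 0) * min_interval_s,
        "quota_fraction": n_requests / daily_quota,
    }


budget = estimate_request_budget(1_000, batch_size=50)
```

For 1,000 identifiers at a batch size of 50 this gives 20 requests, spaced at least one second apart, using 0.2% of the free-tier daily quota.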
Example Usage
YAML Strategy
steps:
  - name: stage3_rampdb_bridge
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params:
        input_key: stage2_unmatched
        output_key: stage3_matched
        identifier_column: metabolite_name
        batch_size: 25
        timeout: 45
        target_databases: ["hmdb", "kegg", "chebi", "pubchem"]
Python Client
from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

# unmatched_df holds the metabolites left unmatched after Stage 2
context = {"datasets": {"stage2_unmatched": unmatched_df}}

# Query RampDB for unmatched metabolites.
# `await` must run inside an async function or an async-aware REPL/notebook.
result = await client.run_action(
    action_type="METABOLITE_RAMPDB_BRIDGE",
    params={
        "input_key": "stage2_unmatched",
        "output_key": "rampdb_matches",
        "identifier_column": "compound_name",
        "batch_size": 30,
        "timeout": 60,
        "target_databases": ["hmdb", "kegg"],
    },
    context=context,
)
Output Format
The action returns matches from RampDB with cross-references:
| Column | Description |
|---|---|
| | Original metabolite identifier from input |
| | RampDB internal identifier |
| | Standardized metabolite name from RampDB |
| | HMDB identifier (if available) |
| | KEGG identifier (if available) |
| | ChEBI identifier (if available) |
| | PubChem CID (if available) |
| | RampDB confidence score (0.0-1.0) |
| | Time taken for API call (ms) |
Technical Implementation
API Integration
RampDB API endpoint structure:
# Base API configuration
RAMPDB_BASE_URL = "https://rampdb.nih.gov/api/v1/"
ENDPOINTS = {
    "search": "metabolites/search",
    "batch": "metabolites/batch_search",
    "crossref": "metabolites/crossref",
}
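One way to turn this configuration into full request URLs is the standard library's `urljoin`, sketched below. The helper `endpoint_url` is hypothetical, and the base URL and paths are copied from the configuration above without being checked against the live service.

```python
from urllib.parse import urljoin

# Values copied from the configuration above; not verified against the live API.
RAMPDB_BASE_URL = "https://rampdb.nih.gov/api/v1/"
ENDPOINTS = {
    "search": "metabolites/search",
    "batch": "metabolites/batch_search",
    "crossref": "metabolites/crossref",
}


def endpoint_url(name: str) -> str:
    """Return the absolute URL for a named endpoint (hypothetical helper)."""
    try:
        path = ENDPOINTS[name]
    except KeyError:
        raise ValueError(f"Unknown endpoint: {name!r}") from None
    # The trailing slash on the base URL matters: without it, urljoin
    # would replace the final "v1" path segment instead of appending to it.
    return urljoin(RAMPDB_BASE_URL, path)


batch_url = endpoint_url("batch")
```

Keeping the endpoint paths relative (no leading slash) is what lets `urljoin` append them under `/api/v1/` rather than resetting to the host root.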
Batch Processing Logic
Optimized batch processing to minimize API calls:
Batch Grouping: Groups identifiers into optimal batch sizes
Rate Limiting: Implements delays between API calls
Error Handling: Retries failed batches with exponential backoff
Result Aggregation: Combines batch results into unified dataset
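The grouping, delay, and aggregation steps above can be sketched as follows. This is a minimal illustration, not the action's implementation: `chunked`, `run_batches`, and the `query_batch` callback are hypothetical names, and retries are left out to keep the sketch short.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence

Row = Dict[str, str]


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size identifiers."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batches(
    identifiers: Sequence[str],
    batch_size: int,
    query_batch: Callable[[List[str]], List[Row]],
    delay_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Row]:
    """Query each batch in turn and combine the rows into one list."""
    results: List[Row] = []
    for index, batch in enumerate(chunked(identifiers, batch_size)):
        # Pause between calls (not before the first) to respect rate limits.
        if index > 0 and delay_s > 0:
            sleep(delay_s)
        results.extend(query_batch(batch))
    return results
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a recording function instead of actually waiting.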
Retry Strategy
Robust error handling for API reliability:
retry_configuration:
  max_retries: 3
  base_delay: 1.0  # seconds
  backoff_multiplier: 2.0
  max_delay: 30.0  # seconds
timeout_handling:
  connect_timeout: 10  # seconds
  read_timeout: 30  # seconds
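To make the backoff arithmetic concrete, the sketch below computes the delay before each retry as `base_delay * backoff_multiplier ** attempt`, capped at `max_delay`, and wraps a call in a simple retry loop. The function names are hypothetical, and catching every `Exception` is a simplification; real code would likely retry only transient errors.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def backoff_delays(
    max_retries: int = 3,
    base_delay: float = 1.0,
    backoff_multiplier: float = 2.0,
    max_delay: float = 30.0,
) -> List[float]:
    """Delay in seconds before each retry, capped at max_delay."""
    return [
        min(base_delay * backoff_multiplier ** attempt, max_delay)
        for attempt in range(max_retries)
    ]


def call_with_retry(
    func: Callable[[], T],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func once, then retry after each delay until it succeeds."""
    attempts = len(delays) + 1
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Out of retries: re-raise the last error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


default_delays = backoff_delays()
```

With the configuration above, the three retries wait 1, 2, and 4 seconds; the 30-second cap only takes effect from the sixth retry onward.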
Database Cross-References
RampDB provides cross-references to multiple databases:
Primary Databases
HMDB: Human Metabolome Database
KEGG: Kyoto Encyclopedia of Genes and Genomes
ChEBI: Chemical Entities of Biological Interest
PubChem: PubChem Compound Database
Secondary Databases
BioCyc: Metabolic pathway database
LIPID MAPS: Lipidomics database
MetaCyc: Metabolic pathway database
Reactome: Pathway database
WikiPathways: Community pathway database
Example Workflow Integration
Stage 3 in Progressive Pipeline
# Complete Stage 3 implementation
steps:
  # Previous stages completed
  - name: load_unmatched_from_stage2
    action:
      type: FILTER_DATASET
      params:
        input_key: stage2_results
        output_key: stage2_unmatched
        filter_expression: "matched_status == 'unmatched'"
  - name: stage3_rampdb_query
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params:
        input_key: stage2_unmatched
        output_key: stage3_matched
        identifier_column: metabolite_name
        batch_size: 40
        timeout: 45
  - name: combine_stage3_results
    action:
      type: MERGE_DATASETS
      params:
        input_keys: [stage2_matched, stage3_matched]
        output_key: stages_1_3_combined
Error Handling and Monitoring
Common API Issues
| Error Type | Handling Strategy |
|---|---|
| | Automatic retry with exponential backoff |
| | Retry with increased timeout values |
| | Skip batch and continue with remaining data |
| | Log error and mark identifiers as unprocessable |
| | Check API key configuration |
Monitoring Metrics
Track these metrics for API health:
Success Rate: Percentage of successful API calls
Average Response Time: Monitor API performance
Error Distribution: Track types of failures
Coverage Rate: Percentage of identifiers successfully mapped
Quota Usage: Monitor daily API quota consumption
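A small in-memory tracker is one way to collect these five metrics. The sketch below is an assumption-laden illustration: `ApiMetrics` and its field names are hypothetical, and the default quota is the free-tier figure quoted earlier in this page.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ApiMetrics:
    """Hypothetical tracker for the API health metrics listed above."""

    response_times_ms: List[float] = field(default_factory=list)
    errors: Counter = field(default_factory=Counter)
    calls: int = 0
    identifiers_sent: int = 0
    identifiers_mapped: int = 0

    def record(self, elapsed_ms: float, sent: int, mapped: int,
               error: Optional[str] = None) -> None:
        """Record one API call; pass an error label if the call failed."""
        self.calls += 1
        self.response_times_ms.append(elapsed_ms)
        self.identifiers_sent += sent
        self.identifiers_mapped += mapped
        if error is not None:
            self.errors[error] += 1

    def summary(self, daily_quota: int = 10_000) -> dict:
        failed = sum(self.errors.values())
        calls, sent = self.calls, self.identifiers_sent
        return {
            "success_rate": (calls - failed) / calls if calls else 0.0,
            "avg_response_ms": sum(self.response_times_ms) / calls if calls else 0.0,
            "error_distribution": dict(self.errors),
            "coverage_rate": self.identifiers_mapped / sent if sent else 0.0,
            "quota_used": calls / daily_quota,
        }


metrics = ApiMetrics()
metrics.record(elapsed_ms=200.0, sent=50, mapped=40)
metrics.record(elapsed_ms=400.0, sent=50, mapped=0, error="timeout")
report = metrics.summary()
```

In this example one of two calls failed, so the success rate is 0.5, the average response time is 300 ms, and 40 of 100 identifiers were mapped.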
Configuration Examples
High-Throughput Configuration
For large datasets with relaxed accuracy requirements:
params:
  batch_size: 100  # Larger batches
  timeout: 60  # Longer timeout
  max_retries: 2  # Fewer retries
  target_databases: ["hmdb", "kegg"]  # Focus on key databases
High-Accuracy Configuration
For critical datasets requiring comprehensive mapping:
params:
  batch_size: 20  # Smaller batches for reliability
  timeout: 90  # Extended timeout
  max_retries: 5  # More retries
  target_databases: ["hmdb", "kegg", "chebi", "pubchem", "lipidmaps"]
Development Configuration
For testing and development:
params:
  batch_size: 5  # Very small batches
  timeout: 30
  max_retries: 1
  debug_mode: true  # Enhanced logging
  dry_run: false  # Set to true to skip actual API calls
Best Practices
API Key Management: Store API keys securely in environment variables
Batch Size Optimization: Start with 25-50, adjust based on performance
Timeout Configuration: Set timeouts 2-3x longer than average response time
Error Logging: Log all API errors for debugging and monitoring
Quota Monitoring: Track daily API usage to avoid quota exhaustion
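Two of these practices translate directly into code: reading the API key from the environment and deriving a timeout from observed response times. In the sketch below, the environment variable name `RAMPDB_API_KEY`, both function names, the 2.5x factor (a midpoint of the 2-3x guidance), and the 30-second floor (the documented default timeout) are all assumptions for illustration.

```python
import math
import os
import statistics
from typing import Sequence


def load_api_key(env_var: str = "RAMPDB_API_KEY") -> str:
    """Read the API key from the environment (variable name is hypothetical)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable")
    return key


def suggest_timeout(
    response_times_s: Sequence[float],
    factor: float = 2.5,
    floor_s: int = 30,
) -> int:
    """Suggest a timeout of `factor` times the mean response time.

    The result is rounded up to whole seconds and never drops below floor_s.
    """
    mean_s = statistics.mean(response_times_s)
    return max(floor_s, math.ceil(mean_s * factor))


# Observed response times averaging 20 s suggest a 50 s timeout.
timeout = suggest_timeout([10, 20, 30])
```

Failing fast when the key is missing surfaces a configuration problem before any API call is attempted.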
Integration Patterns
Sequential Processing
Process in stages with error recovery:
steps:
  - name: stage3_batch1
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params:
        input_key: unmatched_batch1
        batch_size: 50
  - name: stage3_batch2
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params:
        input_key: unmatched_batch2
        batch_size: 50
Parallel Processing
For independent batches (advanced):
# Note: Requires workflow orchestration support
parallel_steps:
  - name: rampdb_batch_1
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params: {batch_size: 25, timeout: 30}
  - name: rampdb_batch_2
    action:
      type: METABOLITE_RAMPDB_BRIDGE
      params: {batch_size: 25, timeout: 30}
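Where the orchestration layer does not provide parallelism, a comparable effect can be approximated in client code. The sketch below uses `asyncio.Semaphore` to cap in-flight requests at five, matching the recommended concurrency limit stated earlier. The function names are hypothetical, and `fake_query` is a stand-in for a real API call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence

Row = Dict[str, str]


async def run_batches_concurrently(
    batches: Sequence[List[str]],
    query_batch: Callable[[List[str]], Awaitable[List[Row]]],
    max_concurrent: int = 5,
) -> List[Row]:
    """Query all batches, never running more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(batch: List[str]) -> List[Row]:
        # Each task waits here until one of the concurrency slots is free.
        async with semaphore:
            return await query_batch(batch)

    # gather() returns results in the order the batches were submitted.
    per_batch = await asyncio.gather(*(guarded(b) for b in batches))
    return [row for rows in per_batch for row in rows]


async def fake_query(batch: List[str]) -> List[Row]:
    """Stand-in for a real API call; echoes each identifier back."""
    await asyncio.sleep(0.01)
    return [{"query_id": name} for name in batch]


batches = [[f"met_{i}", f"met_{i}_b"] for i in range(8)]
rows = asyncio.run(run_batches_concurrently(batches, fake_query))
```

The semaphore limits concurrency only; it does not enforce the per-minute request limit, which would need separate pacing.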
Troubleshooting
Performance Issues
Slow API responses: Reduce batch size, increase timeout
Rate limiting: Decrease request frequency, implement delays
Memory usage: Process in smaller chunks
Network instability: Increase retry attempts with longer delays
Data Quality Issues
Low match rates: Verify input data quality and identifier formats
Inconsistent results: Check RampDB service status and version
Missing cross-references: Query additional target databases
Duplicate matches: Implement deduplication logic
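For duplicate matches, one simple deduplication policy is to keep only the highest-confidence row per input identifier. The sketch below assumes rows are plain dictionaries; the column names `query_id` and `confidence` are hypothetical placeholders, since the actual output column names are not listed here.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def deduplicate_matches(
    rows: Iterable[Row],
    key: str = "query_id",
    score: str = "confidence",
) -> List[Row]:
    """Keep the highest-scoring row for each input identifier.

    Ties keep the row seen first. Output order follows the first
    appearance of each identifier in the input.
    """
    best: Dict[Any, Row] = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or row[score] > current[score]:
            # Reassigning an existing dict key keeps its original position.
            best[row[key]] = row
    return list(best.values())


matches = [
    {"query_id": "glucose", "hmdb_id": "HMDB0000122", "confidence": 0.95},
    {"query_id": "glucose", "hmdb_id": "HMDB0003345", "confidence": 0.80},
    {"query_id": "lactate", "hmdb_id": "HMDB0000190", "confidence": 0.90},
]
unique = deduplicate_matches(matches)
```

Other policies, such as keeping all matches above a confidence threshold, may suit datasets where one input name legitimately maps to several compounds.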
API Configuration Issues
Authentication failures: Verify API key configuration
Quota exceeded: Monitor and manage daily usage
Service unavailable: Implement fallback strategies
Version compatibility: Check RampDB API version requirements
See Also
Metabolite Fuzzy String Match - Stage 2 fuzzy matching
HMDB Vector Match - Stage 4 vector similarity matching
Progressive Semantic Match - Multi-stage orchestration
Metabolomics Progressive Pipeline - Complete pipeline implementation
RampDB Integration - RampDB setup and configuration
../examples/api_error_handling - Error handling patterns