RampDB Integration

Overview

RampDB (RaMP-DB, the Relational database of Metabolomic Pathways) is a metabolite database that cross-references identifiers across multiple metabolomics databases, including HMDB, KEGG, ChEBI, and PubChem. Biomapper integrates with RampDB through its REST API to improve metabolite identifier resolution.

The RampDB integration is used mainly in Stage 3 of the progressive metabolomics pipeline through the METABOLITE_RAMPDB_BRIDGE action. In the current codebase this action is registered as RAMPDB_BRIDGE, which is the name used in the strategy and client examples in this guide.

Key Features

Cross-Database Mapping: Maps identifiers across 15+ metabolite databases
Real-Time API Access: Query RampDB service in real-time
Batch Processing: Optimized batch queries for better performance
Comprehensive Coverage: Access to pathway and reaction information
Standardized Outputs: Consistent identifier formats across databases

Supported Databases

Primary Metabolite Databases

Database	Description
HMDB	Human Metabolome Database - comprehensive human metabolite data
KEGG	Kyoto Encyclopedia of Genes and Genomes - metabolic pathways
ChEBI	Chemical Entities of Biological Interest - chemical compounds
PubChem	PubChem Compound Database - chemical information
BioCyc	Metabolic pathway database collection
LIPID MAPS	Lipidomics database and tools

Secondary Databases

MetaCyc - Metabolic pathway database
Reactome - Pathway database
WikiPathways - Community pathway database
CAS Registry - Chemical Abstracts Service registry numbers
InChI/InChIKey - IUPAC International Chemical Identifier and its hashed key

Setup and Configuration

API Access Setup

API Access Configuration

The RampDB client uses the modern RaMP database API for metabolite matching:

# RampDB integration is available without API key registration
# The client uses async HTTP requests with rate limiting
# No manual API key setup required

Configure Rate Limiting (Optional)

Set up environment variables for performance tuning:

# Optional: Configure rate limiting (defaults are built-in)
export RAMPDB_RATE_LIMIT="5"   # requests per second (default)
export RAMPDB_TIMEOUT="30"     # seconds (default)
export RAMPDB_BATCH_SIZE="50" # batch processing size
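A minimal sketch of how these variables might be read on the Python side, assuming the three variable names above and using the values shown as fallbacks. The `RampSettings` dataclass and `load_settings` helper are hypothetical names for illustration, not part of biomapper:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RampSettings:
    """Hypothetical container for the three tuning values."""
    rate_limit: float  # requests per second
    timeout: float     # seconds
    batch_size: int    # metabolite names per batch


def load_settings(env: Optional[Mapping[str, str]] = None) -> RampSettings:
    """Read RAMPDB_* variables, falling back to the values shown above."""
    env = os.environ if env is None else env
    settings = RampSettings(
        rate_limit=float(env.get("RAMPDB_RATE_LIMIT", "5")),
        timeout=float(env.get("RAMPDB_TIMEOUT", "30")),
        batch_size=int(env.get("RAMPDB_BATCH_SIZE", "50")),
    )
    # Reject values that would stall or disable the client
    if settings.rate_limit <= 0 or settings.timeout <= 0 or settings.batch_size <= 0:
        raise ValueError("RAMPDB_* settings must be positive")
    return settings
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.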

Install Required Dependencies

# Install all dependencies (includes aiohttp for RampDB client)
poetry install --with dev,docs,api

# No separate RampDB verification script needed
# Integration testing handled through biomapper test suite

Environment Configuration

Create a configuration file for RampDB settings:

# config/rampdb_config.yaml
rampdb:
  api:
    base_url: "https://rampdb.nih.gov/api/v1/"
    timeout: 30
    max_retries: 3
    rate_limit: 60  # requests per minute

  databases:
    primary: ["hmdb", "kegg", "chebi", "pubchem"]
    secondary: ["biocyc", "lipidmaps", "metacyc"]

  batch_processing:
    default_batch_size: 50
    max_batch_size: 100
    batch_delay: 1.0  # seconds between batches
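Before passing such a file to the client, it can help to sanity-check the values. The sketch below assumes the YAML has already been parsed into a plain dict (for example with `yaml.safe_load` from PyYAML) and that the keys match the file above. `check_config` is a hypothetical helper, not a biomapper function:

```python
from typing import Any, Dict, List

# Mirrors config/rampdb_config.yaml as a plain dict
EXAMPLE_CONFIG: Dict[str, Any] = {
    "rampdb": {
        "api": {
            "base_url": "https://rampdb.nih.gov/api/v1/",
            "timeout": 30,
            "max_retries": 3,
            "rate_limit": 60,  # requests per minute
        },
        "databases": {
            "primary": ["hmdb", "kegg", "chebi", "pubchem"],
            "secondary": ["biocyc", "lipidmaps", "metacyc"],
        },
        "batch_processing": {
            "default_batch_size": 50,
            "max_batch_size": 100,
            "batch_delay": 1.0,
        },
    }
}


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems: List[str] = []
    api = config["rampdb"]["api"]
    batch = config["rampdb"]["batch_processing"]

    if not str(api["base_url"]).startswith("https://"):
        problems.append("api.base_url should use https")
    if api["timeout"] <= 0:
        problems.append("api.timeout must be positive")
    if api["max_retries"] < 0:
        problems.append("api.max_retries must not be negative")
    if api["rate_limit"] <= 0:
        problems.append("api.rate_limit must be positive")
    if not 0 < batch["default_batch_size"] <= batch["max_batch_size"]:
        problems.append("default_batch_size must be between 1 and max_batch_size")
    if batch["batch_delay"] < 0:
        problems.append("batch_delay must not be negative")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every misconfigured value at once.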

API Usage Patterns

Basic Query

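from actions.entities.metabolites.external.ramp_client_modern import create_ramp_client

# The client is asynchronous: run these snippets inside an async function
# (for example via asyncio.run) or in an async REPL such as `python -m asyncio`.

# Initialize client with default configuration
client = create_ramp_client()

# Single metabolite query
result = await client.search_metabolite_by_name("glucose")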
print(f"Found {len(result)} matches")

# Access cross-references
for match in result:
    print(f"Common Name: {match.common_name}")
    print(f"Source ID: {match.source_id}")
    print(f"Database: {match.id_type}")

Batch Query

# Batch query for multiple metabolites
metabolite_names = ["glucose", "fructose", "galactose"]

batch_results = await client.batch_metabolite_search(
    metabolite_names=metabolite_names,
    batch_size=25
)

# Process batch results
for metabolite, matches in batch_results.items():
    print(f"{metabolite}: {len(matches)} matches found")
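How the client splits names internally is not described here. As an illustration of what a `batch_size` parameter typically implies, the sketch below divides a list of names into consecutive groups. `chunked` is a hypothetical helper, not part of the RampDB client:

```python
from typing import Iterator, List, Sequence


def chunked(names: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `names`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(names), batch_size):
        yield list(names[start:start + batch_size])


# Example: five names with batch_size=2 give groups of 2, 2, and 1
groups = list(chunked(["glucose", "fructose", "galactose", "ribose", "xylose"], 2))
```

The last group is simply shorter when the list length is not a multiple of the batch size, so no names are dropped or padded.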

Advanced Queries

# Restrict the search to a specific analyte type
result = await client.search_metabolite_by_name(
    name="citric acid",
    analyte_type="metabolite"
)

# Get pathway information (if available)
pathway_info = await client.get_pathways_from_analytes(
    analytes=["HMDB0000094"]
)

Integration with Biomapper Actions

YAML Strategy Configuration

steps:
  - name: rampdb_metabolite_mapping
    action:
      type: RAMPDB_BRIDGE
      params:
        unmapped_key: unmatched_metabolites
        output_key: rampdb_matches
        final_unmapped_key: rampdb_unmapped
        identifier_column: metabolite_name
        batch_size: 40
        timeout: 45
        max_retries: 3

Python Client Usage

from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

result = await client.run_action(
    action_type="RAMPDB_BRIDGE",
    params={
        "unmapped_key": "metabolites",
        "output_key": "rampdb_results",
        "final_unmapped_key": "still_unmapped",
        "identifier_column": "compound_name",
        "batch_size": 30
    },
    context={"datasets": {"metabolites": metabolite_df}}
)

Performance Optimization

Batch Size Optimization

Optimize batch sizes based on your use case:

Dataset Size	Recommended Batch Size	Expected Time	Memory Usage
< 100 metabolites	25	< 30 seconds	Low
100-1,000 metabolites	50	1-5 minutes	Medium
1,000-10,000 metabolites	75	10-30 minutes	High
> 10,000 metabolites	100	30+ minutes	Very High
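The table can be expressed as a small lookup. This is a sketch only: the table's ranges overlap at exactly 1,000 and 10,000, so the code assumes those boundary values fall into the lower tier. `recommend_batch_size` and `estimated_requests` are hypothetical helper names:

```python
import math


def recommend_batch_size(n_metabolites: int) -> int:
    """Map a dataset size to the batch size recommended in the table above."""
    if n_metabolites < 0:
        raise ValueError("n_metabolites must not be negative")
    if n_metabolites < 100:
        return 25
    if n_metabolites <= 1_000:   # assumption: 1,000 belongs to the 100-1,000 tier
        return 50
    if n_metabolites <= 10_000:  # assumption: 10,000 belongs to the 1,000-10,000 tier
        return 75
    return 100


def estimated_requests(n_metabolites: int) -> int:
    """Number of batches needed at the recommended batch size."""
    return math.ceil(n_metabolites / recommend_batch_size(n_metabolites))
```

For example, 2,500 metabolites map to a batch size of 75, which works out to 34 batches.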

Rate Limiting Configuration

# Choose one of the following profiles.

# Conservative (lowest request rate, highest reliability)
rate_limiting:
  requests_per_minute: 30
  batch_delay: 2.0
  exponential_backoff: true

# Aggressive (highest request rate, fastest processing)
rate_limiting:
  requests_per_minute: 100
  batch_delay: 0.5
  exponential_backoff: false

# Balanced (recommended)
rate_limiting:
  requests_per_minute: 60
  batch_delay: 1.0
  exponential_backoff: true
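One way to read `requests_per_minute` is as a minimum spacing between consecutive requests. The sketch below implements that interpretation with asyncio. It is an illustration under that assumption, not the limiter the RampDB client actually uses, and `MinIntervalLimiter` is a hypothetical name:

```python
import asyncio
import time


class MinIntervalLimiter:
    """Space out calls so that at most `requests_per_minute` start per minute."""

    def __init__(self, requests_per_minute: float) -> None:
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be positive")
        self.interval = 60.0 / requests_per_minute
        self._next_allowed = 0.0  # monotonic time of the next free slot

    async def acquire(self) -> None:
        """Wait until the caller's reserved slot, then return."""
        now = time.monotonic()
        # Reserve the slot before sleeping; there is no await between the read
        # and the write, so concurrent tasks each get a distinct slot.
        slot = max(now, self._next_allowed)
        self._next_allowed = slot + self.interval
        if slot > now:
            await asyncio.sleep(slot - now)
```

A caller would `await limiter.acquire()` immediately before each request. With the balanced profile (60 requests per minute), that spaces requests one second apart.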

Caching Strategy

# Enable caching for repeated queries
client_config = {
    "cache_enabled": True,
    "cache_ttl": 3600,  # 1 hour
    "cache_backend": "redis",  # or "memory"
    "cache_key_prefix": "rampdb_"
}

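# RampDBClient is assumed to be the configurable client class; it is not imported in the snippets above
client = RampDBClient(config=client_config)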
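To make the `cache_ttl` and `cache_key_prefix` options concrete, here is a minimal in-memory cache with expiry. It is a sketch of the general technique, assuming keys are built from the prefix plus a lowercased name. It does not describe how the client's "memory" or "redis" backends are implemented, and `TTLCache` is a hypothetical name:

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire `ttl_seconds` after being stored."""

    def __init__(
        self,
        ttl_seconds: float = 3600,
        prefix: str = "rampdb_",
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._prefix = prefix
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def _key(self, name: str) -> str:
        # Assumption: names are matched case-insensitively, ignoring outer whitespace
        return f"{self._prefix}{name.strip().lower()}"

    def get(self, name: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        key = self._key(name)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # evict stale entry
            return None
        return value

    def set(self, name: str, value: Any) -> None:
        """Store a value that expires after the configured TTL."""
        self._store[self._key(name)] = (self._clock() + self._ttl, value)
```

A caller would check `cache.get(name)` before querying RampDB and call `cache.set(name, matches)` after a successful lookup.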

Error Handling and Monitoring

Common API Errors

Error Type	Handling Strategy
`401 Unauthorized`	The default client needs no API key; if your deployment adds one, check its configuration and registration status
`429 Rate Limited`	Implement exponential backoff and reduce request rate
`500 Server Error`	Retry with exponential backoff, consider service status
`503 Service Unavailable`	Wait and retry, check RampDB service status
`Timeout`	Increase timeout values or reduce batch sizes

Monitoring Implementation

import logging
from src.integrations.rampdb_monitor import RampDBMonitor

# Setup monitoring
monitor = RampDBMonitor()

# Track API metrics
@monitor.track_api_call
async def query_rampdb(query):
    result = await client.search_metabolite(query)

    # Log metrics
    monitor.log_success_rate(result.success)
    monitor.log_response_time(result.response_time)
    monitor.log_quota_usage(result.quota_used)

    return result

Error Recovery Patterns

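import asyncio

from tenacity import retry, stop_after_attempt, wait_exponential

# RampDBAPIError is assumed to be the exception raised by the client;
# import it from wherever your client version defines it.

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
async def robust_rampdb_query(identifiers):
    try:
        return await client.batch_metabolite_search(metabolite_names=identifiers)
    except RampDBAPIError as e:
        if e.status_code == 429:  # Rate limited: wait as instructed, then let tenacity retry
            await asyncio.sleep(e.retry_after or 60)
            raise
        elif e.status_code >= 500:  # Server error: re-raise so tenacity retries
            raise
        else:  # Other client error: retrying will not help
            return None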

Data Quality and Validation

Result Validation

def validate_rampdb_results(results):
    """Validate RampDB API results for quality"""
    validation_report = {
        "total_queries": len(results),
        "successful_matches": 0,
        "failed_queries": 0,
        "quality_issues": []
    }

    for query, result in results.items():
        if result.success:
            validation_report["successful_matches"] += 1

            # Check for quality issues
            if not result.hmdb_id and not result.kegg_id:
                validation_report["quality_issues"].append(
                    f"No primary database IDs for {query}"
                )
        else:
            validation_report["failed_queries"] += 1

    return validation_report

Cross-Reference Consistency

def check_cross_reference_consistency(rampdb_result):
    """Verify cross-references are consistent"""
    issues = []

    # Check if HMDB and KEGG refer to same compound
    if rampdb_result.hmdb_id and rampdb_result.kegg_id:
        hmdb_name = get_compound_name_from_hmdb(rampdb_result.hmdb_id)
        kegg_name = get_compound_name_from_kegg(rampdb_result.kegg_id)

        if not compounds_are_equivalent(hmdb_name, kegg_name):
            issues.append("HMDB and KEGG cross-references inconsistent")

    return issues
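The helper functions called above are not defined in this guide. As one possible starting point, the sketch below gives a deliberately conservative `compounds_are_equivalent` that compares names after normalizing case, Unicode form, and separators. It does not resolve synonyms (for example a Greek letter versus its spelled-out form), so treat it as an assumption-laden illustration rather than the project's implementation:

```python
import re
import unicodedata


def normalize_compound_name(name: str) -> str:
    """Lowercase a name and collapse whitespace, hyphens, and underscores."""
    text = unicodedata.normalize("NFKC", name).casefold()
    # Treat runs of whitespace, hyphens, and underscores as a single space
    text = re.sub(r"[\s\-_]+", " ", text)
    return text.strip()


def compounds_are_equivalent(name_a: str, name_b: str) -> bool:
    """Return True when two names are identical after normalization."""
    return normalize_compound_name(name_a) == normalize_compound_name(name_b)
```

Because stereo descriptors such as `D-` and `L-` are kept, "D-Glucose" and "L-Glucose" remain distinct, which avoids reporting false agreement between cross-references.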

Best Practices

API Key Management (only if your deployment uses keys; the default client needs none)
- Store API keys securely in environment variables
- Use different keys for development and production
- Monitor API key usage and quotas
- Rotate keys regularly for security

Rate Limiting
- Respect RampDB rate limits to maintain service availability
- Implement exponential backoff for rate limit errors
- Monitor quota usage to avoid service interruption
- Use batch queries to maximize efficiency

Error Handling
- Implement comprehensive error handling for all API calls
- Log errors with sufficient context for debugging
- Use retry logic with exponential backoff
- Have fallback strategies for service unavailability

Data Quality
- Validate API responses before using results
- Check for cross-reference consistency
- Monitor match rates and quality metrics
- Flag low-confidence matches for manual review

Performance
- Use appropriate batch sizes for your dataset
- Cache results to avoid repeated API calls
- Process in parallel where possible
- Monitor response times and optimize accordingly

Troubleshooting Guide

Connection Issues

# Test basic connectivity (no Authorization header is needed, since the client requires no API key)
curl -X GET "https://rampdb.nih.gov/api/v1/status"

# Check DNS resolution
nslookup rampdb.nih.gov

# Test from Python
python -c "
import requests
# Always set a timeout so a hung connection does not block the check
response = requests.get('https://rampdb.nih.gov/api/v1/status', timeout=30)
print(f'Status: {response.status_code}')
"

Authentication Issues

Verify API key is correctly set in environment
Check API key has not expired
Confirm registration is active and approved
Test with simple API call to verify credentials

Performance Issues

Slow API Responses
- Reduce batch sizes
- Increase timeout values
- Check network connectivity
- Monitor RampDB service status

Rate Limiting
- Implement longer delays between requests
- Use exponential backoff
- Reduce concurrent requests
- Monitor quota usage patterns
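The exponential backoff mentioned above can be sketched as a delay schedule: each retry waits a multiple of the previous delay, up to a cap, with optional random jitter so that parallel workers do not retry in lockstep. The function name and default values are illustrative assumptions, not settings taken from the RampDB client:

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) before each of `attempts` retries."""
    rng = rng or random.Random()
    delays: List[float] = []
    for attempt in range(attempts):
        # Grow geometrically, but never beyond the cap
        delay = min(base * factor ** attempt, max_delay)
        if jitter > 0:
            # Add up to `jitter` (as a fraction of the delay) of random extra wait
            delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


# Without jitter the schedule is deterministic:
# backoff_delays(5) -> [1.0, 2.0, 4.0, 8.0, 16.0]
```

The cap matters for long retry sequences: with the defaults, the seventh and later delays are all held at 60 seconds.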

Data Quality Issues

Low Match Rates
- Verify input data quality and formatting
- Check metabolite name normalization
- Try different database targets
- Review confidence thresholds

Inconsistent Results
- Check RampDB service version and updates
- Validate cross-references manually
- Compare with alternative data sources
- Report data quality issues to RampDB team