Performance Optimization Guide
Overview
This guide provides comprehensive strategies for optimizing biomapper performance across different use cases, from real-time clinical applications to large-scale batch processing. Performance optimization in biomapper involves balancing accuracy, speed, memory usage, and API costs.
Performance Fundamentals
Key Performance Metrics
| Metric | Target Range | Impact |
|---|---|---|
| Processing Speed | <5 minutes for 10K identifiers | User experience, real-time feasibility |
| Memory Usage | <1GB peak for 100K identifiers | Resource costs, scalability |
| Coverage Rate | 70-80% for typical datasets | Scientific value, downstream analysis |
| API Success Rate | >95% for external calls | Reliability, data completeness |
| False Positive Rate | <5% for production use | Data quality, trust in results |
Performance Bottlenecks
Common Bottlenecks by Stage:
Stage 1 (Direct Matching): Usually fast, but large reference datasets can slow lookups
Stage 2 (Fuzzy Matching): Pairwise string comparisons grow as O(n·m) in the number of identifiers and reference entries, so cost rises quickly with dataset size
Stage 3 (API Calls): Network latency and rate limits
Stage 4 (Vector Search): Vector computation and database queries
Strategy-Level Optimizations
Pipeline Configuration
High-Speed Configuration (Optimize for speed):
# Optimized for <30 second processing
speed_optimized:
  stages_enabled: [1, 2]      # Skip API and vector stages
  stage1_threshold: 0.98      # Higher threshold = fewer candidates
  stage2_threshold: 0.9       # Higher threshold = less fuzzy matching
  stage2_max_distance: 1      # Stricter edit distance
  chunk_processing: true
  chunk_size: 5000
  parallel_processing: true
  enable_caching: true
High-Accuracy Configuration (Optimize for coverage):
# Optimized for maximum coverage
accuracy_optimized:
  stages_enabled: [1, 2, 3, 4]  # All stages
  stage1_threshold: 0.9         # Lower threshold = more matches
  stage2_threshold: 0.75        # More aggressive fuzzy matching
  stage3_batch_size: 25         # Smaller batches = more reliable
  stage4_threshold: 0.7         # Lower vector similarity threshold
  use_llm_validation: true      # Additional validation
  quality_control: enhanced
Balanced Configuration (Production default):
# Balanced speed and accuracy
production_optimized:
  stages_enabled: [1, 2, 3, 4]
  stage1_threshold: 0.95
  stage2_threshold: 0.8
  stage3_batch_size: 50
  stage4_threshold: 0.75
  adaptive_chunking: true
  progressive_timeouts: true
Stage-Specific Optimizations
Stage 1: Direct Matching Optimization
Reference Data Optimization:
# Pre-compute lookup indices
reference_optimization = {
    "create_hash_index": True,
    "normalize_keys": True,          # Pre-normalize for faster lookup
    "use_trie_structure": True,      # For prefix matching
    "case_insensitive_index": True,
}
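To make the `normalize_keys` and `case_insensitive_index` options concrete, here is a minimal sketch of a pre-normalized hash index. The function names and the normalization rule (lowercase, trim, collapse whitespace) are assumptions for illustration, not biomapper's actual implementation.

```python
def normalize(name: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed normalization rule)."""
    return " ".join(name.lower().split())


def build_normalized_index(reference_names):
    """Map each normalized name to its original spelling (first occurrence wins)."""
    index = {}
    for name in reference_names:
        index.setdefault(normalize(name), name)
    return index


def direct_match(identifier, index):
    """Return the reference name for an identifier, or None if nothing matches exactly."""
    return index.get(normalize(identifier))


# Illustrative reference data; normalization happens once at build time,
# so each lookup is a single average-case O(1) dictionary access.
reference = ["L-Alanine", "Citric  acid", "Glucose"]
index = build_normalized_index(reference)
hit = direct_match("  citric acid ", index)
miss = direct_match("Fructose", index)
```

Building the index once and reusing it across chunks keeps Stage 1 cost proportional to the number of identifiers rather than to the size of the reference set.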
Memory-Efficient Loading:
stage1_config:
  reference_loading:
    lazy_loading: true     # Load on demand
    memory_mapping: true   # Use mmap for large files
    compression: gzip      # Compress in memory
    cache_size: 10000      # LRU cache for frequent lookups
Stage 2: Fuzzy Matching Optimization
Algorithm Selection:
| Algorithm | Speed | Accuracy | Best For |
|---|---|---|---|
| Levenshtein | Fast | Good | General use |
| Jaro-Winkler | Medium | Better | Transposed characters |
| Biological | Slow | Best | Chemical nomenclature |
Performance Tuning:
stage2_optimization:
  # Pre-filtering to reduce candidate set
  pre_filter:
    length_difference_max: 5   # Skip strings whose lengths differ by more than 5
    first_char_match: true     # Require first character match
    common_prefix_min: 2       # Minimum common prefix
  # Parallel processing
  parallel_chunks: 4
  chunk_size: 2000
  # Early termination
  max_candidates: 5            # Stop after finding 5 good matches
  early_exit_threshold: 0.95   # Stop once a near-perfect match (score >= 0.95) is found
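The pre-filter settings above can be read as three cheap checks that run before any edit-distance computation. The following is a sketch under that reading; `passes_pre_filter` is a hypothetical helper and case-insensitive comparison is an assumption.

```python
import os


def passes_pre_filter(query, candidate, length_difference_max=5,
                      first_char_match=True, common_prefix_min=2):
    """Return True if the candidate is worth an expensive fuzzy comparison."""
    q, c = query.lower(), candidate.lower()
    # Strings of very different length cannot be close in edit distance
    if abs(len(q) - len(c)) > length_difference_max:
        return False
    # Optionally require the same first character
    if first_char_match and q[:1] != c[:1]:
        return False
    # Require a minimum shared prefix (commonprefix compares character by character)
    shared = len(os.path.commonprefix([q, c]))
    return shared >= common_prefix_min


candidates = ["glucose", "glutamine", "fructose", "glucose 6-phosphate dehydrogenase"]
survivors = [c for c in candidates if passes_pre_filter("glucsoe", c)]
```

Only the survivors go on to the fuzzy matcher, which is where the savings come from. The trade-off is that `first_char_match` and `common_prefix_min` will drop true matches whose typo falls in the first characters.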
Stage 3: API Optimization
Connection Management:
# Optimize HTTP connections
api_config = {
    "connection_pool_size": 10,
    "keep_alive": True,
    "timeout": (5, 30),      # (connect, read) timeouts in seconds
    "max_retries": 3,
    "backoff_factor": 1.0,
    "session_reuse": True,
}
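To see what `max_retries` and `backoff_factor` imply for latency, the sketch below computes a retry schedule. It assumes plain exponential backoff (`backoff_factor * 2**attempt`); HTTP libraries differ in the exact formula, so treat the numbers as illustrative rather than as what any particular client does.

```python
def backoff_delays(max_retries=3, backoff_factor=1.0, max_delay=30.0):
    """Seconds to wait before each retry, assuming factor * 2**attempt capped at max_delay."""
    return [min(backoff_factor * (2 ** attempt), max_delay)
            for attempt in range(max_retries)]


delays = backoff_delays(max_retries=3, backoff_factor=1.0)

# Worst case for one request, assuming every attempt exhausts both the
# connect (5 s) and read (30 s) timeouts: 4 attempts plus the waits between them.
connect_timeout, read_timeout = 5, 30
worst_case_seconds = (3 + 1) * (connect_timeout + read_timeout) + sum(delays)
```

Under these assumptions a single failing batch can hold a worker for roughly two and a half minutes, which is worth comparing against the overall processing-time target before raising `max_retries`.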
Batch Size Optimization:
# Adaptive batch sizing based on performance
stage3_adaptive_batching:
  initial_batch_size: 50
  min_batch_size: 10
  max_batch_size: 200
  # Adjust based on response time
  target_response_time: 30   # seconds
  size_increase_factor: 1.2
  size_decrease_factor: 0.8
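One plausible reading of these settings is a simple feedback rule: grow the batch when the last response beat the target, shrink it otherwise, and clamp to the configured bounds. The function below is a sketch of that rule; the name `next_batch_size` and the exact update policy are assumptions.

```python
def next_batch_size(current, response_time, target_response_time=30.0,
                    increase=1.2, decrease=0.8, min_size=10, max_size=200):
    """Return the batch size to use for the next request."""
    # Faster than target: try a larger batch. Slower: back off.
    factor = increase if response_time <= target_response_time else decrease
    proposed = round(current * factor)
    # Clamp to the configured bounds
    return max(min_size, min(max_size, proposed))


size = 50                              # initial_batch_size
size = next_batch_size(size, 12.0)     # fast response, grows
size = next_batch_size(size, 45.0)     # slow response, shrinks
```

Because the increase and decrease factors are asymmetric in effect (1.2 × 0.8 < 1), the batch size drifts downward under alternating fast and slow responses, which errs on the side of reliability.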
Caching Strategy:
# Multi-level caching
caching_config = {
    "level1_memory": {
        "size": 10000,
        "ttl": 3600,      # 1 hour
    },
    "level2_redis": {
        "size": 100000,
        "ttl": 86400,     # 24 hours
    },
    "level3_disk": {
        "size": 1000000,
        "ttl": 604800,    # 1 week
    },
}
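The `size` and `ttl` pair on each level suggests a bounded cache with expiry. As a sketch of how the in-memory level could enforce both, the class below combines least-recently-used eviction with per-entry expiry. `TTLCache` is a hypothetical name, and the injectable clock exists only to make the behaviour easy to demonstrate.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Bounded in-memory cache with LRU eviction and per-entry expiry."""

    def __init__(self, max_size, ttl, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return default
        self._data.move_to_end(key)  # mark as recently used
        return value


# Demonstration with a fake clock so expiry does not require waiting
now = [0.0]
cache = TTLCache(max_size=2, ttl=3600, clock=lambda: now[0])
cache.set("id-1", "value-1")
fresh = cache.get("id-1")
now[0] = 3601.0
stale = cache.get("id-1")
```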
Stage 4: Vector Search Optimization
Vector Database Tuning:
qdrant_optimization:
  # Index configuration
  index_type: hnsw
  m: 16                   # HNSW connections
  ef_construct: 200       # Build-time accuracy
  ef_search: 100          # Search-time accuracy
  # Memory management
  memory_threshold: 0.8   # Trigger cleanup at 80%
  batch_size: 1000        # Batch vector operations
  # Performance tuning
  parallel_indexing: true
  prefetch_factor: 2
Query Optimization:
# Optimize vector queries
vector_config = {
    "max_results": 5,          # Limit candidates
    "score_threshold": 0.7,    # Early filtering
    "batch_queries": True,     # Batch multiple queries
    "use_filters": True,       # Pre-filter by metadata
    "cache_embeddings": True,  # Cache computed embeddings
}
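To illustrate how `score_threshold` and `max_results` interact, here is a brute-force sketch in plain Python. A vector database applies the same two limits server-side using an approximate index; the helper names and the toy two-dimensional vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query, candidates, score_threshold=0.7, max_results=5):
    """Score every candidate, drop weak ones, and keep the best few."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_results]


candidates = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
results = top_matches([1.0, 0.0], candidates)
names = [name for name, _ in results]
```

Raising `score_threshold` shrinks the result set before ranking, while `max_results` caps it afterwards, so the threshold is the setting that reduces downstream validation work.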
Memory Management
Dataset Chunking
Adaptive Chunking Strategy:
def calculate_optimal_chunk_size(dataset_size, available_memory, sample_record):
    """Calculate a chunk size that fits in available memory (available_memory in KB)."""
    # Estimate memory per record (KB); estimate_memory_usage is assumed to be defined elsewhere
    memory_per_record = estimate_memory_usage(sample_record)
    # Target 70% of available memory
    target_memory = available_memory * 0.7
    # Calculate chunk size
    chunk_size = int(target_memory / memory_per_record)
    # Apply bounds
    chunk_size = max(1000, min(chunk_size, 50000))
    # A chunk never needs to be larger than the dataset itself
    return min(chunk_size, dataset_size)
Memory Monitoring:
memory_management:
  monitoring:
    check_interval: 30         # seconds
    warning_threshold: 0.8     # 80% memory usage
    critical_threshold: 0.95   # 95% memory usage
  actions:
    on_warning: reduce_batch_size
    on_critical: trigger_garbage_collection
    on_overflow: enable_disk_swap
Garbage Collection Optimization
import gc

# Optimize garbage collection for large datasets
def optimize_gc_for_biomapper():
    # Increase generation thresholds
    gc.set_threshold(1000, 15, 15)
    # Force collection between stages
    gc.collect()
    # Disable during intensive operations; re-enable even if processing raises
    gc.disable()
    try:
        ...  # intensive processing
    finally:
        gc.enable()
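If collection is paused in several places, the disable and enable pair can be wrapped in a context manager so it is always restored. The sketch below shows one way to do that; `gc_paused` is a hypothetical helper, and unlike a bare `gc.enable()` it only re-enables collection if it was enabled on entry.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable automatic garbage collection, restoring the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


with gc_paused():
    inside = gc.isenabled()  # automatic collection is off inside the block
    records = [{"id": i} for i in range(1000)]  # stand-in for intensive work
```

Reference counting still frees most objects while the collector is paused; only reference cycles accumulate, so long pauses are best followed by an explicit `gc.collect()`.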
Parallel Processing
Process-Level Parallelization
import math
from multiprocessing import Pool

def parallel_fuzzy_matching(identifiers, reference_data, num_workers=4):
    """Fuzzy-match identifiers in parallel worker processes."""
    if not identifiers:
        return []
    # Split identifiers into chunks; ceil avoids a zero chunk size for small inputs
    chunk_size = math.ceil(len(identifiers) / num_workers)
    chunks = [identifiers[i:i + chunk_size] for i in range(0, len(identifiers), chunk_size)]
    # Process chunks in parallel (fuzzy_match_chunk must be a picklable, module-level function)
    with Pool(num_workers) as pool:
        results = pool.starmap(fuzzy_match_chunk,
                               [(chunk, reference_data) for chunk in chunks])
    # Combine results
    return [item for sublist in results for item in sublist]
Async Processing
import asyncio
import aiohttp

async def async_api_calls(identifiers, batch_size=50):
    """Async API calls for better throughput"""
    semaphore = asyncio.Semaphore(10)  # Limit concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(0, len(identifiers), batch_size):
            batch = identifiers[i:i + batch_size]
            # limited_api_call is assumed to acquire the semaphore before sending the batch
            task = limited_api_call(session, batch, semaphore)
            tasks.append(task)
        # With return_exceptions=True, failed batches come back as exception objects
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results
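Because `gather(..., return_exceptions=True)` returns exceptions mixed in with successful results, callers need to separate the two. The sketch below shows the semaphore pattern and that separation using only the standard library. `fake_lookup` is a stand-in for a real HTTP request, and all names here are hypothetical.

```python
import asyncio


async def fake_lookup(batch):
    """Stand-in for a real HTTP request (hypothetical); echoes the batch after a delay."""
    await asyncio.sleep(0.01)
    return {identifier: f"mapped:{identifier}" for identifier in batch}


async def limited_call(batch, semaphore):
    # At most `max_concurrent` lookups hold the semaphore at any moment
    async with semaphore:
        return await fake_lookup(batch)


async def map_all(identifiers, batch_size=2, max_concurrent=2):
    semaphore = asyncio.Semaphore(max_concurrent)
    batches = [identifiers[i:i + batch_size]
               for i in range(0, len(identifiers), batch_size)]
    results = await asyncio.gather(
        *(limited_call(batch, semaphore) for batch in batches),
        return_exceptions=True,
    )
    merged, failures = {}, []
    for batch, result in zip(batches, results):
        if isinstance(result, Exception):
            failures.append(batch)  # keep failed batches so they can be retried
        else:
            merged.update(result)
    return merged, failures


merged, failures = asyncio.run(map_all(["a", "b", "c", "d", "e"]))
```

Returning the failed batches separately lets the caller retry them with a smaller batch size instead of silently losing identifiers.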
Caching Strategies
Multi-Level Caching Architecture
import sqlite3
import redis

class BiomapperCache:
    def __init__(self):
        self.l1_cache = {}  # In-memory (fast)
        # decode_responses=True makes Redis return str, matching the L1 and L3 value types
        self.l2_cache = redis.Redis(decode_responses=True)  # Redis (medium)
        self.l3_cache = sqlite3.connect("cache.db")  # Disk (slow); assumes a cache(key, value) table

    def get(self, key):
        # L1: Memory cache
        if key in self.l1_cache:
            return self.l1_cache[key]
        # L2: Redis cache (compare with None so empty values still count as hits)
        result = self.l2_cache.get(key)
        if result is not None:
            self.l1_cache[key] = result  # Promote to L1
            return result
        # L3: Disk cache
        row = self.l3_cache.execute("SELECT value FROM cache WHERE key=?", (key,)).fetchone()
        if row is not None:
            self.l2_cache.set(key, row[0])  # Promote to L2
            self.l1_cache[key] = row[0]  # Promote to L1
            return row[0]
        return None
Cache Invalidation Strategy:
cache_management:
  ttl_strategy:
    api_results: 86400      # 24 hours
    fuzzy_matches: 3600     # 1 hour
    vector_results: 7200    # 2 hours
    exact_matches: 604800   # 1 week (more stable)
  invalidation_triggers:
    - reference_data_update
    - parameter_change
    - manual_cache_clear
    - strategy_version_change
  cleanup_schedule:
    frequency: daily
    time: "02:00"           # 2 AM
    max_size: "10GB"
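One way to implement the `parameter_change` and `strategy_version_change` triggers without an explicit purge is to fold the parameters and version into the cache key, so that stale entries are simply never looked up again. The sketch below assumes that approach; `make_cache_key` and its arguments are hypothetical.

```python
import hashlib
import json


def make_cache_key(identifier, params, strategy_version):
    """Derive a cache key that changes whenever parameters or strategy version change."""
    payload = json.dumps(
        {"id": identifier, "params": params, "version": strategy_version},
        sort_keys=True,  # stable ordering so equal parameters give equal keys
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = make_cache_key("glucose", {"stage2_threshold": 0.8}, "1.0")
key_b = make_cache_key("glucose", {"stage2_threshold": 0.75}, "1.0")  # parameter changed
key_c = make_cache_key("glucose", {"stage2_threshold": 0.8}, "1.1")   # version changed
```

Orphaned entries under old keys still occupy space until their TTL expires or the scheduled cleanup removes them, so this complements the cleanup schedule rather than replacing it.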
Database Optimizations
Reference Database Tuning
-- Optimize reference database queries
CREATE INDEX idx_metabolite_name ON metabolites(name);
CREATE INDEX idx_metabolite_hmdb ON metabolites(hmdb_id);
CREATE INDEX idx_metabolite_kegg ON metabolites(kegg_id);
-- Compound index for common queries
CREATE INDEX idx_metabolite_compound ON metabolites(name, hmdb_id, kegg_id);
-- Full-text search index
CREATE VIRTUAL TABLE metabolite_fts USING fts5(name, synonyms);
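It is worth confirming that a query actually uses an index rather than scanning the table. The sketch below does this with SQLite's `EXPLAIN QUERY PLAN` on an in-memory database. The table schema is an assumption inferred from the column names in the statements above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table may have more columns
conn.execute("CREATE TABLE metabolites (name TEXT, hmdb_id TEXT, kegg_id TEXT)")
conn.execute("CREATE INDEX idx_metabolite_hmdb ON metabolites(hmdb_id)")

# Ask SQLite how it would execute a lookup by HMDB ID
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM metabolites WHERE hmdb_id = ?",
    ("X1",),
).fetchall()

# The last column of each plan row is a human-readable description
plan_text = " ".join(row[-1] for row in plan)
uses_index = "idx_metabolite_hmdb" in plan_text
conn.close()
```

If `uses_index` is false for a frequent query, the plan text usually shows a full table scan, which indicates a missing or unusable index.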
Vector Database Optimization
# Qdrant collection optimization
qdrant_config = {
    "vectors": {
        "size": 384,
        "distance": "Cosine",
        "hnsw_config": {
            "m": 16,                       # Number of connections
            "ef_construct": 200,           # Build-time accuracy
            "full_scan_threshold": 20000,  # Use full scan below this
            "max_indexing_threads": 4,     # Parallel indexing
        },
    },
    "optimizers_config": {
        "deleted_threshold": 0.2,          # Trigger cleanup at 20% deleted
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 2,       # Number of segments
    },
}
Monitoring and Profiling
Performance Monitoring
import functools
import time
import psutil
import logging

class PerformanceMonitor:
    def __init__(self):
        self.metrics = {}

    def track_stage(self, stage_name):
        def decorator(func):
            @functools.wraps(func)  # Preserve the wrapped function's name and docstring
            def wrapper(*args, **kwargs):
                start_time = time.time()
                # System-wide memory usage in percent, not just this process
                start_memory = psutil.virtual_memory().percent
                try:
                    result = func(*args, **kwargs)
                    end_time = time.time()
                    end_memory = psutil.virtual_memory().percent
                    self.metrics[stage_name] = {
                        'duration': end_time - start_time,
                        'memory_change': end_memory - start_memory,
                        'success': True
                    }
                    return result
                except Exception as e:
                    self.metrics[stage_name] = {
                        'duration': time.time() - start_time,
                        'error': str(e),
                        'success': False
                    }
                    raise
            return wrapper
        return decorator
Profiling Tools
CPU Profiling:
import cProfile
import pstats
# Profile biomapper execution
profiler = cProfile.Profile()
profiler.enable()
# Run biomapper pipeline
result = run_biomapper_pipeline(config)
profiler.disable()
# Analyze results
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions
Memory Profiling:
import tracemalloc
# Built-in memory tracking (available in Python 3.4+)
def profile_memory_usage(func, *args, **kwargs):
    tracemalloc.start()
    result = func(*args, **kwargs)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"Current memory: {current / 1024 / 1024:.2f} MB")
    print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
    return result
Troubleshooting Performance Issues
Common Performance Problems
| Problem | Symptoms | Solutions |
|---|---|---|
| Memory exhaustion | OOM errors, swapping | Enable chunking, reduce batch sizes |
| Slow API calls | Stage 3 timeouts | Increase batch size, add caching |
| Vector search slow | Stage 4 takes >5 minutes | Optimize index, reduce candidates |
| High CPU usage | Stage 2 uses 100% CPU | Enable parallel processing |
| Poor cache hit rate | Repeated slow operations | Review cache TTL, increase sizes |
Diagnostic Commands
# Monitor system resources during execution
top -p "$(pgrep -d, -f biomapper)"   # -d, joins multiple PIDs with commas, as top -p expects
# Track memory usage
pmap -x $(pgrep -f biomapper)
# Monitor disk I/O
iotop -p $(pgrep -f biomapper)
# Network monitoring for API calls
netstat -i 1 # Interface stats
# Check biomapper-specific performance
poetry run pytest tests/performance/test_algorithm_complexity.py -v
# Monitor Python memory usage
python -c "import tracemalloc; tracemalloc.start(); # your code here"
Performance Testing Framework
Benchmark Suite
import time

class BiomapperBenchmark:
    def __init__(self):
        self.test_datasets = {
            'small': 100,      # 100 metabolites
            'medium': 1000,    # 1K metabolites
            'large': 10000,    # 10K metabolites
            'xlarge': 100000   # 100K metabolites
        }

    def run_performance_suite(self):
        results = {}
        for size_name, size in self.test_datasets.items():
            # generate_test_dataset is assumed to be implemented on this class
            dataset = self.generate_test_dataset(size)
            start_time = time.time()
            result = run_biomapper_pipeline(dataset)
            end_time = time.time()
            results[size_name] = {
                'processing_time': end_time - start_time,
                'coverage': result.coverage_percentage,
                'memory_peak': result.peak_memory_usage
            }
        return results
Continuous Performance Monitoring
# Performance CI pipeline
performance_tests:
  schedule: daily
  benchmarks:
    - name: small_dataset_speed
      dataset_size: 1000
      max_time: 30        # seconds
      min_coverage: 70    # percent
    - name: large_dataset_memory
      dataset_size: 50000
      max_memory: 4       # GB
      min_coverage: 65
  alerts:
    - condition: performance_degradation > 20%
      action: slack_notification
      recipients: [dev-team]
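The alert condition compares a current measurement against a baseline. A minimal sketch of that check follows, assuming degradation is defined as the relative increase in processing time over a stored baseline. The function names and that definition are assumptions.

```python
def degradation_percent(baseline_seconds, current_seconds):
    """Relative slowdown versus the baseline, in percent (negative means faster)."""
    return (current_seconds - baseline_seconds) / baseline_seconds * 100.0


def should_alert(baseline_seconds, current_seconds, threshold_percent=20.0):
    """True when the run is slower than the baseline by more than the threshold."""
    return degradation_percent(baseline_seconds, current_seconds) > threshold_percent


slow_run = should_alert(baseline_seconds=10.0, current_seconds=12.5)   # 25% slower
normal_run = should_alert(baseline_seconds=10.0, current_seconds=11.0)  # 10% slower
```

Timing on shared CI runners is noisy, so comparing against a median of several recent runs rather than a single baseline reduces false alerts.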
Production Optimization Checklist
Pre-Deployment Checklist
[ ] Profiling Complete: CPU and memory profiling completed
[ ] Caching Enabled: Multi-level caching configured and tested
[ ] Resource Limits: Memory and CPU limits set appropriately
[ ] Monitoring Configured: Performance metrics collection enabled
[ ] Error Handling: Graceful degradation and recovery tested
[ ] Load Testing: Performance under expected load verified
[ ] Documentation: Performance characteristics documented
Deployment Configuration
# Production-optimized configuration
production_config:
  # Resource allocation
  memory_limit: "8GB"
  cpu_cores: 4
  # Performance tuning
  enable_chunking: true
  chunk_size: 5000
  parallel_workers: 4
  cache_enabled: true
  # Monitoring
  metrics_enabled: true
  performance_logging: true
  alert_thresholds:
    memory_usage: 80%
    processing_time: 300   # seconds
    error_rate: 5%
Algorithm Complexity Resources
BioMapper includes comprehensive algorithm complexity monitoring and optimization tools:
Core Efficiency Classes:
from core.algorithms.efficient_matching import EfficientMatcher
# Replace O(n*m) nested loops with O(n+m) indexed matching
target_index = EfficientMatcher.build_index(target_data, key_func)
matches = EfficientMatcher.match_with_index(source_data, target_index, key_func)
# Multi-key indexing for biological identifiers
protein_index = EfficientMatcher.multi_key_index(
    proteins,
    key_funcs=[
        lambda p: p.get('uniprot_id'),
        lambda p: p.get('gene_symbol'),
        lambda p: p.get('ensembl_id')
    ]
)
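For readers without the source at hand, the idea behind multi-key indexing can be sketched in a few lines. This is an independent illustration of the technique, not the `EfficientMatcher` implementation; the function names and sample records are made up.

```python
from collections import defaultdict


def build_multi_key_index(records, key_funcs):
    """Index each record under every non-missing key: one pass over the records."""
    index = defaultdict(list)
    for record in records:
        for key_func in key_funcs:
            key = key_func(record)
            if key is not None:
                index[key].append(record)
    return index


def match_with_index(source_ids, index):
    """Look up each source identifier: one dictionary probe per identifier."""
    return {sid: index[sid] for sid in source_ids if sid in index}


# Made-up records for illustration
proteins = [
    {"uniprot_id": "P1", "gene_symbol": "GENE1"},
    {"uniprot_id": "P2", "gene_symbol": None},
]
index = build_multi_key_index(
    proteins,
    key_funcs=[lambda p: p.get("uniprot_id"), lambda p: p.get("gene_symbol")],
)
matches = match_with_index(["GENE1", "P2", "missing"], index)
```

Building the index costs one pass over the targets and matching costs one pass over the sources, which is where the O(n+m) figure comes from, compared with O(n*m) for nested loops.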
Performance Testing:
# Run algorithm complexity tests
poetry run pytest tests/performance/test_algorithm_complexity.py -v
# Performance scaling verification
poetry run python tests/performance/test_algorithm_complexity.py
Algorithm Performance Estimator:
from core.algorithms.efficient_matching import EfficientMatcher
# Estimate performance before implementation
estimates = EfficientMatcher.estimate_performance(
    n_source=10000,
    n_target=100000,
    algorithm="hash_index"
)
print(f"Estimated time: {estimates['estimated_time']}")
print(f"Complexity: {estimates['complexity']}")
See Also
/biomapper/dev/standards/ALGORITHM_COMPLEXITY_GUIDE.md - Detailed algorithm best practices
/biomapper/src/core/algorithms/efficient_matching.py - Efficient matching implementations
/biomapper/tests/performance/test_algorithm_complexity.py - Performance benchmarks and tests
## Verification Sources
Last verified: 2025-08-22
This documentation was verified against the following project resources:
/biomapper/src/core/algorithms/efficient_matching.py (verified all performance optimization utilities and methods)
/biomapper/tests/performance/test_algorithm_complexity.py (confirmed benchmarking framework and test implementation)
/biomapper/dev/standards/ALGORITHM_COMPLEXITY_GUIDE.md (cross-referenced algorithm best practices and anti-patterns)
/biomapper/README.md (verified architectural components and performance features)
/biomapper/CLAUDE.md (confirmed standardizations and testing framework integration)
/biomapper/pyproject.toml (verified dependencies and testing configuration)