Performance Optimization Guide

Overview

This guide provides comprehensive strategies for optimizing biomapper performance across different use cases, from real-time clinical applications to large-scale batch processing. Performance optimization in biomapper involves balancing accuracy, speed, memory usage, and API costs.

Performance Fundamentals

Key Performance Metrics

Metric	Target Range	Impact
Processing Speed	<5 minutes for 10K identifiers	User experience, real-time feasibility
Memory Usage	<1GB peak for 100K identifiers	Resource costs, scalability
Coverage Rate	70-80% for typical datasets	Scientific value, downstream analysis
API Success Rate	>95% for external calls	Reliability, data completeness
False Positive Rate	<5% for production use	Data quality, trust in results

Performance Bottlenecks

Common Bottlenecks by Stage:

Stage 1 (Direct Matching): Usually fast, but large reference datasets can slow lookups
Stage 2 (Fuzzy Matching): Comparing every identifier against every reference entry is O(n×m), so cost grows quickly with dataset size
Stage 3 (API Calls): Network latency and rate limits
Stage 4 (Vector Search): Vector computation and database queries

Strategy-Level Optimizations

Pipeline Configuration

High-Speed Configuration (Optimize for speed):

# Optimized for <30 second processing
speed_optimized:
  stages_enabled: [1, 2]  # Skip API and vector stages
  stage1_threshold: 0.98  # Higher threshold = fewer candidates
  stage2_threshold: 0.9   # Higher threshold = less fuzzy matching
  stage2_max_distance: 1  # Stricter edit distance
  chunk_processing: true
  chunk_size: 5000
  parallel_processing: true
  enable_caching: true

High-Accuracy Configuration (Optimize for coverage):

# Optimized for maximum coverage
accuracy_optimized:
  stages_enabled: [1, 2, 3, 4]  # All stages
  stage1_threshold: 0.9          # Lower threshold = more matches
  stage2_threshold: 0.75         # More aggressive fuzzy matching
  stage3_batch_size: 25          # Smaller batches = more reliable
  stage4_threshold: 0.7          # Lower vector similarity threshold
  use_llm_validation: true       # Additional validation
  quality_control: enhanced

Balanced Configuration (Production default):

# Balanced speed and accuracy
production_optimized:
  stages_enabled: [1, 2, 3, 4]
  stage1_threshold: 0.95
  stage2_threshold: 0.8
  stage3_batch_size: 50
  stage4_threshold: 0.75
  adaptive_chunking: true
  progressive_timeouts: true

Stage-Specific Optimizations

Stage 1: Direct Matching Optimization

Reference Data Optimization:

# Pre-compute lookup indices
reference_optimization = {
    "create_hash_index": True,
    "normalize_keys": True,  # Pre-normalize for faster lookup
    "use_trie_structure": True,  # For prefix matching
    "case_insensitive_index": True
}
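
As a rough illustration of the hash-index and key-normalization options above, the sketch below builds a case-insensitive lookup table once and then resolves each identifier with a single dictionary lookup. The helpers `normalize_key`, `build_hash_index`, and `direct_match`, along with the sample records and their `REF:` identifiers, are hypothetical and not part of biomapper.

```python
import re

def normalize_key(name):
    """Lowercase, trim, and collapse runs of whitespace, hyphens, and underscores."""
    return re.sub(r"[\s_\-]+", " ", name.strip().lower())

def build_hash_index(records, field="name"):
    """Map each normalized key to its record (the first occurrence wins)."""
    index = {}
    for record in records:
        index.setdefault(normalize_key(record[field]), record)
    return index

# Made-up reference records, for illustration only
reference = [
    {"name": "L-Alanine", "id": "REF:0001"},
    {"name": "Citric  Acid", "id": "REF:0002"},
]
index = build_hash_index(reference)

def direct_match(query):
    """Average O(1) lookup; returns None when there is no exact normalized match."""
    return index.get(normalize_key(query))
```

Normalizing once at index-build time means each query pays only for normalizing its own string, not for scanning the reference set.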

Memory-Efficient Loading:

stage1_config:
  reference_loading:
    lazy_loading: true      # Load on demand
    memory_mapping: true    # Use mmap for large files
    compression: gzip       # Compress in memory
    cache_size: 10000       # LRU cache for frequent lookups

Stage 2: Fuzzy Matching Optimization

Algorithm Selection:

Algorithm	Speed	Accuracy	Best For
Levenshtein	Fast	Good	General use
Jaro-Winkler	Medium	Better	Transposed characters
Biological	Slow	Best	Chemical nomenclature

Performance Tuning:

stage2_optimization:
  # Pre-filtering to reduce candidate set
  pre_filter:
    length_difference_max: 5  # Skip very different length strings
    first_char_match: true    # Require first character match
    common_prefix_min: 2      # Minimum common prefix

  # Parallel processing
  parallel_chunks: 4
  chunk_size: 2000

  # Early termination
  max_candidates: 5         # Stop after finding 5 good matches
  early_exit_threshold: 0.95  # Stop as soon as a near-perfect match (score >= 0.95) is found
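
The pre-filter and early-termination settings above could work roughly as sketched below. This is an assumption-laden illustration, not biomapper's implementation: `passes_pre_filter` and `fuzzy_match` are hypothetical names, and `difflib.SequenceMatcher` from the standard library stands in for whichever similarity algorithm is configured.

```python
from difflib import SequenceMatcher

def passes_pre_filter(query, candidate, length_difference_max=5,
                      first_char_match=True, common_prefix_min=2):
    """Cheap checks that rule out a candidate before any similarity scoring."""
    if abs(len(query) - len(candidate)) > length_difference_max:
        return False
    if first_char_match and query[:1] != candidate[:1]:
        return False
    prefix = 0
    for a, b in zip(query, candidate):
        if a != b:
            break
        prefix += 1
    return prefix >= common_prefix_min

def fuzzy_match(query, references, threshold=0.8, max_candidates=5,
                early_exit_threshold=0.95):
    """Score only candidates that survive the pre-filter; stop on a near-exact hit."""
    matches = []
    for candidate in references:
        if not passes_pre_filter(query, candidate):
            continue
        score = SequenceMatcher(None, query, candidate).ratio()
        if score >= early_exit_threshold:
            return [(candidate, score)]
        if score >= threshold:
            matches.append((candidate, score))
    matches.sort(key=lambda match: match[1], reverse=True)
    return matches[:max_candidates]
```

The filters are deliberately cheaper than the similarity call, so most of the reference set is rejected without the expensive comparison.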

Stage 3: API Optimization

Connection Management:

# Optimize HTTP connections
api_config = {
    "connection_pool_size": 10,
    "keep_alive": True,
    "timeout": (5, 30),  # (connect, read) timeouts
    "max_retries": 3,
    "backoff_factor": 1.0,
    "session_reuse": True
}

Batch Size Optimization:

# Adaptive batch sizing based on performance
stage3_adaptive_batching:
  initial_batch_size: 50
  min_batch_size: 10
  max_batch_size: 200

  # Adjust based on response time
  target_response_time: 30  # seconds
  size_increase_factor: 1.2
  size_decrease_factor: 0.8
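
One way to read these settings is as a simple feedback rule: grow the batch after a fast response and shrink it after a slow one, always staying within the configured bounds. The function below is a minimal sketch under that assumption; `next_batch_size` is a hypothetical helper, and its defaults simply mirror the values above.

```python
def next_batch_size(current_size, response_time, target_response_time=30.0,
                    min_batch_size=10, max_batch_size=200,
                    size_increase_factor=1.2, size_decrease_factor=0.8):
    """Return the batch size to use for the next request."""
    if response_time > target_response_time:
        proposed = round(current_size * size_decrease_factor)
    else:
        proposed = round(current_size * size_increase_factor)
    # Clamp to the configured bounds
    return max(min_batch_size, min(proposed, max_batch_size))
```

Called after every request with the measured response time, this converges toward the largest batch the API serves within the target time.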

Caching Strategy:

# Multi-level caching
caching_config = {
    "level1_memory": {
        "size": 10000,
        "ttl": 3600  # 1 hour
    },
    "level2_redis": {
        "size": 100000,
        "ttl": 86400  # 24 hours
    },
    "level3_disk": {
        "size": 1000000,
        "ttl": 604800  # 1 week
    }
}
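
To make the `size` and `ttl` settings concrete, here is a minimal sketch of what the in-memory level might look like: a bounded cache that expires entries after a fixed time and evicts the least recently used entry when full. `TTLCache` is a hypothetical class for illustration; the Redis and disk levels would need their own clients.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Bounded in-memory cache with per-entry expiry and LRU eviction."""

    def __init__(self, size=10000, ttl=3600, clock=time.monotonic):
        self.size = size
        self.ttl = ttl
        self.clock = clock              # injectable so expiry can be tested
        self._entries = OrderedDict()   # key -> (expires_at, value)

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.size:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]      # entry has expired
            return default
        self._entries.move_to_end(key)  # mark as recently used
        return value
```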

Stage 4: Vector Search Optimization

Vector Database Tuning:

qdrant_optimization:
  # Index configuration
  index_type: hnsw
  m: 16                    # HNSW connections
  ef_construct: 200        # Build-time accuracy
  ef_search: 100           # Search-time accuracy

  # Memory management
  memory_threshold: 0.8    # Trigger cleanup at 80%
  batch_size: 1000         # Batch vector operations

  # Performance tuning
  parallel_indexing: true
  prefetch_factor: 2

Query Optimization:

# Optimize vector queries
vector_config = {
    "max_results": 5,        # Limit candidates
    "score_threshold": 0.7,  # Early filtering
    "batch_queries": True,   # Batch multiple queries
    "use_filters": True,     # Pre-filter by metadata
    "cache_embeddings": True # Cache computed embeddings
}
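
The effect of `max_results` and `score_threshold` can be shown with a brute-force cosine search. This sketch scans every candidate, so it illustrates only the filtering and ranking logic, not the HNSW index a vector database would use. `cosine_similarity` and `top_matches` are hypothetical helpers.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(query_vec, candidates, max_results=5, score_threshold=0.7):
    """candidates: iterable of (label, vector). Returns [(label, score)], best first."""
    scored = ((label, cosine_similarity(query_vec, vec)) for label, vec in candidates)
    kept = [(label, score) for label, score in scored if score >= score_threshold]
    return heapq.nlargest(max_results, kept, key=lambda item: item[1])
```

Dropping low-scoring candidates before ranking keeps the result list short, which in turn reduces the work done by any downstream validation step.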

Memory Management

Dataset Chunking

Adaptive Chunking Strategy:

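import sys

def estimate_memory_usage(record):
    """Rough per-record estimate in KB (shallow size only; a simple stand-in)."""
    return max(sys.getsizeof(record) / 1024, 0.001)

def calculate_optimal_chunk_size(dataset_size, available_memory, sample_record):
    """Calculate optimal chunk size based on available memory (in KB)"""

    # Estimate memory per record (KB)
    memory_per_record = estimate_memory_usage(sample_record)

    # Target 70% of available memory
    target_memory = available_memory * 0.7

    # Calculate chunk size
    chunk_size = int(target_memory / memory_per_record)

    # Apply bounds
    chunk_size = max(1000, min(chunk_size, 50000))

    # A chunk never needs to be larger than the dataset itself
    return min(chunk_size, dataset_size)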

Memory Monitoring:

memory_management:
  monitoring:
    check_interval: 30      # seconds
    warning_threshold: 0.8  # 80% memory usage
    critical_threshold: 0.95 # 95% memory usage

  actions:
    on_warning: reduce_batch_size
    on_critical: trigger_garbage_collection
    on_overflow: enable_disk_swap
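
A monitoring loop could translate the thresholds above into actions along the following lines. This is a sketch under assumptions: `apply_memory_policy` is a hypothetical helper, halving the batch size is one possible reading of `reduce_batch_size`, and the `on_overflow` action is left out because it depends on the deployment.

```python
import gc

def apply_memory_policy(used_fraction, batch_size, warning_threshold=0.8,
                        critical_threshold=0.95, min_batch_size=10):
    """Return (action, new_batch_size) for the observed memory usage fraction."""
    if used_fraction >= critical_threshold:
        gc.collect()  # reclaim unreachable objects immediately
        return "trigger_garbage_collection", batch_size
    if used_fraction >= warning_threshold:
        # Assumption: "reduce" means halving, bounded below by min_batch_size
        return "reduce_batch_size", max(min_batch_size, batch_size // 2)
    return "none", batch_size
```

The usage fraction would come from whatever the monitor samples every `check_interval` seconds, for example a process or container memory reading.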

Garbage Collection Optimization

import gc

# Optimize garbage collection for large datasets
def optimize_gc_for_biomapper():
    # Raise the generation thresholds so collections run less often
    gc.set_threshold(1000, 15, 15)

    # Force collection between stages
    gc.collect()

    # Disable during intensive operations; re-enable even if processing fails
    gc.disable()
    try:
        pass  # ... intensive processing ...
    finally:
        gc.enable()

Parallel Processing

Process-Level Parallelization

from multiprocessing import Pool

def parallel_fuzzy_matching(identifiers, reference_data, num_workers=4):
    """Parallel fuzzy matching implementation"""

    # Split identifiers into chunks (ceiling division keeps chunk_size >= 1,
    # so short inputs do not produce a zero step)
    chunk_size = max(1, -(-len(identifiers) // num_workers))
    chunks = [identifiers[i:i+chunk_size] for i in range(0, len(identifiers), chunk_size)]

    # Process chunks in parallel. fuzzy_match_chunk must be a module-level
    # function so it can be pickled; reference_data is copied to each worker.
    with Pool(num_workers) as pool:
        results = pool.starmap(fuzzy_match_chunk,
                              [(chunk, reference_data) for chunk in chunks])

    # Combine results
    return [item for sublist in results for item in sublist]

Async Processing

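import asyncio
import aiohttp

API_URL = "https://example.org/api/map"  # placeholder endpoint; replace with the real service

async def limited_api_call(session, batch, semaphore):
    """Send one batch while holding the semaphore (request shape is illustrative)."""
    async with semaphore:
        async with session.post(API_URL, json={"identifiers": batch}) as response:
            response.raise_for_status()
            return await response.json()

async def async_api_calls(identifiers, batch_size=50):
    """Async API calls for better throughput"""

    semaphore = asyncio.Semaphore(10)  # Limit concurrent requests

    async with aiohttp.ClientSession() as session:
        tasks = []

        for i in range(0, len(identifiers), batch_size):
            batch = identifiers[i:i+batch_size]
            task = limited_api_call(session, batch, semaphore)
            tasks.append(task)

        # Failed batches come back as exception objects in the results list
        results = await asyncio.gather(*tasks, return_exceptions=True)

    return results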
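
Because `return_exceptions=True` mixes exception objects into the results, callers need to separate failed batches from successful ones. The sketch below shows one way to do that using only the standard library. `fake_api_call` is a stand-in that sleeps instead of contacting a server, and `map_in_batches` is a hypothetical name.

```python
import asyncio

async def fake_api_call(batch):
    """Stand-in for a network request; sleeps briefly instead of calling a server."""
    await asyncio.sleep(0.01)
    return {identifier: identifier.upper() for identifier in batch}

async def map_in_batches(identifiers, batch_size=50, max_concurrent=10):
    """Run batches concurrently and return (mapped, failed_identifiers)."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(batch):
        async with semaphore:  # at most max_concurrent calls in flight
            return await fake_api_call(batch)

    batches = [identifiers[i:i + batch_size]
               for i in range(0, len(identifiers), batch_size)]
    results = await asyncio.gather(*(guarded(batch) for batch in batches),
                                   return_exceptions=True)

    mapped, failed = {}, []
    for batch, result in zip(batches, results):
        if isinstance(result, Exception):
            failed.extend(batch)   # keep these for a retry pass
        else:
            mapped.update(result)
    return mapped, failed
```

Collecting the failed identifiers separately makes it straightforward to retry them with a smaller batch size without repeating successful calls.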

Caching Strategies

Multi-Level Caching Architecture

import sqlite3

import redis

class BiomapperCache:
    def __init__(self):
        self.l1_cache = {}  # In-memory (fast)
        # decode_responses=True returns str instead of bytes, matching L1 and L3
        self.l2_cache = redis.Redis(decode_responses=True)  # Redis (medium)
        self.l3_cache = sqlite3.connect("cache.db")  # Disk (slow)
        self.l3_cache.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key):
        # L1: Memory cache
        if key in self.l1_cache:
            return self.l1_cache[key]

        # L2: Redis cache (compare with None so empty strings still count as hits)
        result = self.l2_cache.get(key)
        if result is not None:
            self.l1_cache[key] = result  # Promote to L1
            return result

        # L3: Disk cache
        row = self.l3_cache.execute("SELECT value FROM cache WHERE key=?", (key,)).fetchone()
        if row is not None:
            self.l2_cache.set(key, row[0])  # Promote to L2
            self.l1_cache[key] = row[0]     # Promote to L1
            return row[0]

        return None

Cache Invalidation Strategy:

cache_management:
  ttl_strategy:
    api_results: 86400      # 24 hours
    fuzzy_matches: 3600     # 1 hour
    vector_results: 7200    # 2 hours
    exact_matches: 604800   # 1 week (more stable)

  invalidation_triggers:
    - reference_data_update
    - parameter_change
    - manual_cache_clear
    - strategy_version_change

  cleanup_schedule:
    frequency: daily
    time: "02:00"  # 2 AM
    max_size: "10GB"
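
Several of the invalidation triggers above (reference data updates, parameter changes, strategy version changes) can be handled without explicit purging by folding them into the cache key, so that stale entries are simply never looked up again. The sketch below illustrates the idea; `make_cache_key` and its arguments are hypothetical.

```python
import hashlib
import json

def make_cache_key(identifier, params, reference_version, strategy_version):
    """Derive a key that changes whenever any of its inputs changes."""
    payload = json.dumps(
        {
            "id": identifier,
            "params": params,
            "reference": reference_version,
            "strategy": strategy_version,
        },
        sort_keys=True,  # make the key independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Entries written under an old key still occupy space until their TTL expires or the scheduled cleanup removes them, so this complements the cleanup schedule and does not replace it.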

Database Optimizations

Reference Database Tuning

-- Optimize reference database queries
CREATE INDEX idx_metabolite_name ON metabolites(name);
CREATE INDEX idx_metabolite_hmdb ON metabolites(hmdb_id);
CREATE INDEX idx_metabolite_kegg ON metabolites(kegg_id);

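-- Composite index for queries that filter on name together with hmdb_id and kegg_id
-- (lookups on hmdb_id or kegg_id alone still rely on the single-column indexes above)
CREATE INDEX idx_metabolite_compound ON metabolites(name, hmdb_id, kegg_id);

-- Full-text search index (SQLite FTS5 syntax)
CREATE VIRTUAL TABLE metabolite_fts USING fts5(name, synonyms);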
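
Whether an index is actually used is worth verifying and not just assuming. The sketch below, which assumes a SQLite reference database and uses synthetic rows, creates one of the indexes above in an in-memory database and inspects the query plan. The `query_plan` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metabolites (name TEXT, hmdb_id TEXT, kegg_id TEXT)")
# Synthetic rows for illustration only
conn.executemany(
    "INSERT INTO metabolites VALUES (?, ?, ?)",
    [(f"metabolite_{i}", f"H{i:05d}", f"K{i:05d}") for i in range(1000)],
)
conn.execute("CREATE INDEX idx_metabolite_hmdb ON metabolites(hmdb_id)")

def query_plan(sql, params=()):
    """Return the human-readable steps SQLite plans to run for this query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column

lookup = "SELECT name FROM metabolites WHERE hmdb_id = ?"
plan = query_plan(lookup, ("H00042",))
row = conn.execute(lookup, ("H00042",)).fetchone()
print(plan)
```

A plan step that mentions the index name indicates an index search; a step that only scans the table means the query is not benefiting from the index.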

Vector Database Optimization

# Qdrant collection optimization
qdrant_config = {
    "vectors": {
        "size": 384,
        "distance": "Cosine",
        "hnsw_config": {
            "m": 16,                 # Number of connections
            "ef_construct": 200,     # Build-time accuracy
            "full_scan_threshold": 20000,  # Vector data size (KB) below which a full scan is preferred
            "max_indexing_threads": 4,     # Parallel indexing
        }
    },
    "optimizers_config": {
        "deleted_threshold": 0.2,    # Trigger cleanup at 20% deleted
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 2,  # Number of segments
    }
}

Monitoring and Profiling

Performance Monitoring

import functools
import time

import psutil

class PerformanceMonitor:
    def __init__(self):
        self.metrics = {}

    def track_stage(self, stage_name):
        def decorator(func):
            @functools.wraps(func)  # preserve the wrapped function's name and docstring
            def wrapper(*args, **kwargs):
                start_time = time.perf_counter()  # monotonic clock for durations
                # Note: system-wide memory percentage, not this process alone
                start_memory = psutil.virtual_memory().percent

                try:
                    result = func(*args, **kwargs)

                    end_time = time.perf_counter()
                    end_memory = psutil.virtual_memory().percent

                    self.metrics[stage_name] = {
                        'duration': end_time - start_time,
                        'memory_change': end_memory - start_memory,
                        'success': True
                    }

                    return result

                except Exception as e:
                    self.metrics[stage_name] = {
                        'duration': time.perf_counter() - start_time,
                        'error': str(e),
                        'success': False
                    }
                    raise

            return wrapper
        return decorator

Profiling Tools

CPU Profiling:

import cProfile
import pstats

# Profile biomapper execution
profiler = cProfile.Profile()
profiler.enable()

# Run biomapper pipeline
result = run_biomapper_pipeline(config)

profiler.disable()

# Analyze results
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

Memory Profiling:

import tracemalloc

# Built-in memory tracking (available in Python 3.4+)
def profile_memory_usage(func, *args, **kwargs):
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # stop tracing even if func raises

    print(f"Current memory: {current / 1024 / 1024:.2f} MB")
    print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
    return result

Troubleshooting Performance Issues

Common Performance Problems

Problem	Symptoms	Solutions
Memory exhaustion	OOM errors, swapping	Enable chunking, reduce batch sizes
Slow API calls	Stage 3 timeouts	Reduce batch size if requests time out, add caching, reuse connections
Vector search slow	Stage 4 takes >5 minutes	Optimize index, reduce candidates
High CPU usage	Stage 2 uses 100% CPU	Tighten pre-filters to cut comparisons, then enable parallel processing
Poor cache hit rate	Repeated slow operations	Review cache TTL, increase sizes

Diagnostic Commands

# Monitor system resources during execution
top -p "$(pgrep -d, -f biomapper)"  # -d, joins multiple PIDs with commas, as top expects

# Track memory usage
pmap -x $(pgrep -f biomapper)

# Monitor disk I/O
iotop -p $(pgrep -f biomapper)

# Network monitoring for API calls
netstat -i 1  # Interface stats

# Check biomapper-specific performance
poetry run pytest tests/performance/test_algorithm_complexity.py -v

# Monitor Python memory usage
python -c "import tracemalloc; tracemalloc.start(); # your code here"

Performance Testing Framework

Benchmark Suite

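import time

class BiomapperBenchmark:
    def __init__(self):
        self.test_datasets = {
            'small': 100,     # 100 metabolites
            'medium': 1000,   # 1K metabolites
            'large': 10000,   # 10K metabolites
            'xlarge': 100000  # 100K metabolites
        }

    def generate_test_dataset(self, size):
        # Placeholder: synthetic names only; substitute representative identifiers
        return [f"metabolite_{i}" for i in range(size)]

    def run_performance_suite(self):
        # Assumes run_biomapper_pipeline is importable and returns an object with
        # coverage_percentage and peak_memory_usage attributes
        results = {}

        for size_name, size in self.test_datasets.items():
            dataset = self.generate_test_dataset(size)

            start_time = time.perf_counter()
            result = run_biomapper_pipeline(dataset)
            end_time = time.perf_counter()

            results[size_name] = {
                'processing_time': end_time - start_time,
                'coverage': result.coverage_percentage,
                'memory_peak': result.peak_memory_usage
            }

        return results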

Continuous Performance Monitoring

# Performance CI pipeline
performance_tests:
  schedule: daily

  benchmarks:
    - name: small_dataset_speed
      dataset_size: 1000
      max_time: 30  # seconds
      min_coverage: 70  # percent

    - name: large_dataset_memory
      dataset_size: 50000
      max_memory: 4  # GB
      min_coverage: 65

  alerts:
    - condition: performance_degradation > 20%
      action: slack_notification
      recipients: [dev-team]
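
The benchmark limits and the degradation alert above boil down to a few comparisons. The sketch below shows one possible way to evaluate them against a result dictionary with `processing_time`, `coverage`, and `memory_peak` keys. The helpers `check_benchmark` and `degraded` are hypothetical, and the memory unit (GB) is an assumption.

```python
def check_benchmark(result, max_time=None, max_memory=None, min_coverage=None):
    """Return a list of violations; an empty list means the benchmark passed."""
    failures = []
    if max_time is not None and result["processing_time"] > max_time:
        failures.append(f"time {result['processing_time']:.1f}s exceeds {max_time}s")
    if max_memory is not None and result["memory_peak"] > max_memory:
        failures.append(f"memory {result['memory_peak']:.2f}GB exceeds {max_memory}GB")
    if min_coverage is not None and result["coverage"] < min_coverage:
        failures.append(f"coverage {result['coverage']:.1f}% below {min_coverage}%")
    return failures

def degraded(current_time, baseline_time, tolerance=0.20):
    """True when the run is more than `tolerance` slower than the baseline."""
    return current_time > baseline_time * (1 + tolerance)
```

A CI job could fail the build when `check_benchmark` returns any violations and send the notification only when `degraded` is true against a stored baseline.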

Production Optimization Checklist

Pre-Deployment Checklist

[ ] Profiling Complete: CPU and memory profiling completed
[ ] Caching Enabled: Multi-level caching configured and tested
[ ] Resource Limits: Memory and CPU limits set appropriately
[ ] Monitoring Configured: Performance metrics collection enabled
[ ] Error Handling: Graceful degradation and recovery tested
[ ] Load Testing: Performance under expected load verified
[ ] Documentation: Performance characteristics documented

Deployment Configuration

# Production-optimized configuration
production_config:
  # Resource allocation
  memory_limit: "8GB"
  cpu_cores: 4

  # Performance tuning
  enable_chunking: true
  chunk_size: 5000
  parallel_workers: 4
  cache_enabled: true

  # Monitoring
  metrics_enabled: true
  performance_logging: true
  alert_thresholds:
    memory_usage: 80%
    processing_time: 300  # seconds
    error_rate: 5%

Algorithm Complexity Resources

The biomapper codebase includes tools for monitoring and reducing algorithmic complexity:

Core Efficiency Classes:

from core.algorithms.efficient_matching import EfficientMatcher

# Replace O(n*m) nested loops with O(n+m) indexed matching
target_index = EfficientMatcher.build_index(target_data, key_func)
matches = EfficientMatcher.match_with_index(source_data, target_index, key_func)

# Multi-key indexing for biological identifiers
protein_index = EfficientMatcher.multi_key_index(
    proteins,
    key_funcs=[
        lambda p: p.get('uniprot_id'),
        lambda p: p.get('gene_symbol'),
        lambda p: p.get('ensembl_id')
    ]
)

Performance Testing:

# Run algorithm complexity tests
poetry run pytest tests/performance/test_algorithm_complexity.py -v

# Performance scaling verification
poetry run python tests/performance/test_algorithm_complexity.py

Algorithm Performance Estimator:

from core.algorithms.efficient_matching import EfficientMatcher

# Estimate performance before implementation
estimates = EfficientMatcher.estimate_performance(
    n_source=10000,
    n_target=100000,
    algorithm="hash_index"
)
print(f"Estimated time: {estimates['estimated_time']}")
print(f"Complexity: {estimates['complexity']}")