Architecture Documentation

Comprehensive documentation of BioMapper’s architecture and design patterns.

Architecture Topics

Core Concepts

Three-Layer Architecture:

Client Layer - Python client library (biomapper_client), CLI tools, and Jupyter notebooks
API Layer - FastAPI REST service with background job management and SSE progress tracking
Core Layer - Business logic with self-registering actions and strategy execution engine

Key Design Patterns:

Registry Pattern - Actions self-register via @register_action decorator at import time
Strategy Pattern - YAML configurations define workflows as sequences of pluggable actions
Pipeline Pattern - Actions process data through shared execution context
Type Safety - Pydantic models provide runtime validation, and typed generics on actions enable static type checking

Quick Architecture Overview

Client Request → BiomapperClient → FastAPI Server → MapperService → MinimalStrategyService
                                                                 ↓
                                                ACTION_REGISTRY (Global Dict)
                                                                 ↓
                                          Self-Registering Action Classes
                                                                 ↓
                                             Execution Context (Dict[str, Any])

Component Responsibilities

BiomapperClient (src/client/client_v2.py)

Python client library providing both a synchronous wrapper and async interfaces. It is the primary entry point for programmatic access.

FastAPI Server (src/api/main.py)

REST API handling HTTP requests, validation, response formatting, and Server-Sent Events (SSE) for progress tracking.
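
As a rough illustration of the SSE wire format used for progress tracking, the sketch below serializes one progress event by hand. The function name format_sse and the payload fields (job_id, step, percent) are hypothetical and not taken from the BioMapper codebase; only the event:/data: framing and the blank-line terminator come from the SSE format itself.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Event: an event name, a JSON data line, and a blank line."""
    data = json.dumps(payload, sort_keys=True)
    return f"event: {event}\ndata: {data}\n\n"


# Hypothetical progress update for a running job (field names are assumptions)
message = format_sse("progress", {"job_id": "abc123", "step": "load_step", "percent": 50})
print(message, end="")
```

A streaming endpoint would emit one such string per update; a client reads events until the connection closes.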

MapperService (src/api/services/mapper_service.py)

Orchestrates job execution, manages background tasks, and handles SQLite persistence and checkpoint recovery.
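
The idea of SQLite-backed checkpoint recovery can be sketched as follows. This is a minimal illustration, not MapperService's actual schema: the checkpoints table, its columns, and the helper names are all assumptions made for the example.

```python
import json
import sqlite3
from typing import Any, Dict, Optional, Tuple


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per job with the last completed step and its context
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "job_id TEXT PRIMARY KEY, step_index INTEGER NOT NULL, context_json TEXT NOT NULL)"
    )


def save_checkpoint(conn: sqlite3.Connection, job_id: str, step_index: int,
                    context: Dict[str, Any]) -> None:
    # INSERT OR REPLACE keeps only the most recent checkpoint for each job
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (job_id, step_index, context_json) VALUES (?, ?, ?)",
        (job_id, step_index, json.dumps(context)),
    )
    conn.commit()


def load_checkpoint(conn: sqlite3.Connection, job_id: str) -> Optional[Tuple[int, Dict[str, Any]]]:
    row = conn.execute(
        "SELECT step_index, context_json FROM checkpoints WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
init_db(conn)
save_checkpoint(conn, "job-1", 0, {"datasets": {"data": ["P12345"]}})
save_checkpoint(conn, "job-1", 1, {"datasets": {"data": ["P12345"], "mapped": ["Q9Y6K9"]}})
resumed = load_checkpoint(conn, "job-1")
```

After a failure, a job could resume from the step following the stored step_index instead of starting over.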

MinimalStrategyService (src/core/minimal_strategy_service.py)

Core execution engine that loads YAML strategies from src/configs/strategies/ and executes actions sequentially.
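
Sequential execution can be pictured with the sketch below. It is a simplified stand-in, not the MinimalStrategyService implementation: the strategy is a plain dict shaped like a parsed YAML file, actions are bare async functions rather than classes, and the names LOAD_IDS, UPPERCASE_IDS, and run_strategy are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Action = Callable[[Dict[str, Any], Dict[str, Any]], Awaitable[None]]


async def load_ids(params: Dict[str, Any], context: Dict[str, Any]) -> None:
    # Stand-in loader: stores the given identifiers under the requested key
    context["datasets"][params["output_key"]] = list(params["identifiers"])


async def uppercase_ids(params: Dict[str, Any], context: Dict[str, Any]) -> None:
    # Reads the dataset written by an earlier step and writes a transformed copy
    source = context["datasets"][params["input_key"]]
    context["datasets"][params["output_key"]] = [s.upper() for s in source]


async def run_strategy(strategy: Dict[str, Any], registry: Dict[str, Action]) -> Dict[str, Any]:
    context: Dict[str, Any] = {"datasets": {}, "statistics": {}}
    for step in strategy["steps"]:  # steps run strictly in the order listed
        action_type = step["action"]["type"]
        if action_type not in registry:
            raise KeyError(f"Unknown action type: {action_type}")
        await registry[action_type](step["action"].get("params", {}), context)
        context["statistics"][step["name"]] = "completed"
    return context


strategy = {
    "name": "demo",
    "steps": [
        {"name": "load", "action": {"type": "LOAD_IDS",
                                    "params": {"identifiers": ["p12345"], "output_key": "raw"}}},
        {"name": "upper", "action": {"type": "UPPERCASE_IDS",
                                     "params": {"input_key": "raw", "output_key": "clean"}}},
    ],
}
final_context = asyncio.run(run_strategy(strategy, {"LOAD_IDS": load_ids,
                                                    "UPPERCASE_IDS": uppercase_ids}))
```

Because every step receives the same context object, the second step can read what the first one wrote without any explicit wiring between them.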

ACTION_REGISTRY (src/actions/registry.py)

Global dictionary where actions self-register at import time using the @register_action decorator.
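
A decorator-based registry of this kind can be sketched in a few lines. The version below is a minimal illustration of the pattern, not the contents of src/actions/registry.py; in particular, the duplicate-name check and the ExampleAction class are assumptions added for the example.

```python
from typing import Callable, Dict

ACTION_REGISTRY: Dict[str, type] = {}


def register_action(name: str) -> Callable[[type], type]:
    """Return a class decorator that stores the class in ACTION_REGISTRY under `name`."""
    def decorator(cls: type) -> type:
        if name in ACTION_REGISTRY:  # assumption: duplicate names are treated as an error
            raise ValueError(f"Action {name!r} is already registered")
        ACTION_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


@register_action("EXAMPLE_ACTION")  # runs when the module is imported
class ExampleAction:
    def run(self) -> str:
        return "ran"


# An executor can later look up the class by the name used in a strategy file
action_cls = ACTION_REGISTRY["EXAMPLE_ACTION"]
result = action_cls().run()
```

Registration happens as a side effect of importing the module that defines the class, which is why no central list of actions needs to be maintained by hand.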

Execution Context

Shared state (Dict[str, Any]) passed between actions containing:

datasets - Named datasets from previous actions
current_identifiers - Active identifier set
statistics - Accumulated metrics
output_files - Generated file paths
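
The four keys above can be written down as a typed dictionary to make the shape explicit. This is only a sketch: the ExecutionContext name, the new_context helper, and the value types chosen for each key (for example, a list of strings for output_files) are assumptions, since the source specifies only Dict[str, Any].

```python
from typing import Any, Dict, List, TypedDict


class ExecutionContext(TypedDict):
    datasets: Dict[str, Any]          # named datasets from previous actions
    current_identifiers: List[str]    # active identifier set
    statistics: Dict[str, Any]        # accumulated metrics
    output_files: List[str]           # generated file paths


def new_context() -> ExecutionContext:
    return {"datasets": {}, "current_identifiers": [], "statistics": {}, "output_files": []}


context = new_context()
# A first step records a dataset and updates the active identifiers
context["datasets"]["raw"] = ["P12345", "Q9Y6K9", "P12345"]
context["current_identifiers"] = sorted(set(context["datasets"]["raw"]))
# A later step reads what the first step left behind and adds a metric
context["statistics"]["unique_count"] = len(context["current_identifiers"])
```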

Action System

Actions are the fundamental units of work in BioMapper:

Self-Registration - Use @register_action("ACTION_NAME") decorator
Type Safety - Inherit from TypedStrategyAction[ParamsModel, ResultModel]
Parameter Validation - Pydantic models for inputs with field descriptions
Entity Organization - Grouped by biological entity type (proteins, metabolites, chemistry)

Example action implementation:

from actions.typed_base import TypedStrategyAction
from actions.registry import register_action
from pydantic import BaseModel, Field
from typing import Dict, Any

class MyActionParams(BaseModel):
    input_key: str = Field(..., description="Input dataset key")
    threshold: float = Field(0.8, ge=0.0, le=1.0)
    output_key: str = Field(..., description="Output dataset key")

# ActionResult is typically defined within each action module
class ActionResult(BaseModel):
    success: bool
    message: str = ""
    data: Dict[str, Any] = Field(default_factory=dict)

@register_action("MY_ACTION")
class MyAction(TypedStrategyAction[MyActionParams, ActionResult]):
    def get_params_model(self) -> type[MyActionParams]:
        return MyActionParams

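    async def execute_typed(self, params: MyActionParams, context: Dict[str, Any]) -> ActionResult:
        # Look up the input dataset produced by an earlier step
        input_data = context.get("datasets", {}).get(params.input_key)
        if input_data is None:
            # Fail explicitly rather than storing None under the output key
            return ActionResult(success=False, message=f"Dataset '{params.input_key}' not found")
        # Process and store results for later steps
        processed_data = input_data  # Processing logic here
        context.setdefault("datasets", {})[params.output_key] = processed_data
        return ActionResult(success=True, message="Processed successfully")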

YAML Strategy System

Strategies define workflows as YAML configurations:

name: my_strategy
description: Example strategy

parameters:
  input_file: "${DATA_DIR}/input.csv"
  threshold: 0.8

steps:
  - name: load_step
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"
        output_key: "data"

Variable substitution supports:

${parameters.key} - Strategy parameters
${env.VAR} - Environment variables
${metadata.field} - Metadata fields
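
A resolver for the three namespaced forms listed above might look like the sketch below. It is an illustration under assumptions, not BioMapper's substitution code: the substitute function, its regular expression, and the choice to raise KeyError on unknown keys are all hypothetical, and bare references such as ${DATA_DIR} are deliberately left untouched.

```python
import os
import re
from typing import Any, Dict

# Matches ${parameters.key}, ${env.VAR}, and ${metadata.field}
_PATTERN = re.compile(r"\$\{(parameters|env|metadata)\.([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(text: str, parameters: Dict[str, Any], metadata: Dict[str, Any]) -> str:
    """Replace namespaced ${...} references in `text` with their values."""
    def resolve(match: "re.Match[str]") -> str:
        namespace, key = match.group(1), match.group(2)
        if namespace == "parameters":
            source: Any = parameters
        elif namespace == "metadata":
            source = metadata
        else:
            source = os.environ
        if key not in source:
            raise KeyError(f"Unresolved variable: {match.group(0)}")
        return str(source[key])
    return _PATTERN.sub(resolve, text)


os.environ["OUTPUT_DIR"] = "/tmp/out"  # set here only so the example is reproducible
resolved = substitute(
    "${env.OUTPUT_DIR}/${metadata.version}/${parameters.input_file}",
    parameters={"input_file": "input.csv"},
    metadata={"version": "v1"},
)
print(resolved)
```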

Performance Considerations

Chunking: Large datasets are processed in chunks to bound memory use.
Caching: Results are cached in SQLite for recovery and reuse.
Async Processing: Actions are implemented as async coroutines, so I/O-bound steps can yield control rather than block the event loop.
Job Persistence: Jobs are persisted to the database, enabling recovery from failures.
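
Chunked processing can be sketched with a small generator. The iter_chunks helper and the chunk size of 3 are illustrative assumptions; the source does not state how BioMapper sizes or iterates its chunks.

```python
from typing import Iterator, List, Sequence


def iter_chunks(items: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


identifiers = [f"ID{i}" for i in range(7)]
chunks = list(iter_chunks(identifiers, chunk_size=3))
```

Only one chunk needs to be held in working memory at a time when the generator is consumed lazily, which is the point of the technique.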

Testing Architecture

Test Levels:

Unit Tests - Individual action testing
Integration Tests - Complete workflow testing
API Tests - REST endpoint testing
E2E Tests - Full system testing

Test Organization:

tests/
├── unit/
│   └── core/
│       └── strategy_actions/
├── integration/
│   └── strategies/
└── api/
    └── endpoints/
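
A unit test at the action level could look roughly like the sketch below. EchoAction is a stand-in written for this example and is not part of BioMapper, and the standard-library unittest module is used here only to keep the example self-contained; the project's own test runner and fixtures may differ.

```python
import asyncio
import unittest
from typing import Any, Dict


class EchoAction:
    """Stand-in action used only for this example."""

    async def execute(self, params: Dict[str, Any], context: Dict[str, Any]) -> bool:
        # Copy the input dataset to the output key unchanged
        context["datasets"][params["output_key"]] = context["datasets"][params["input_key"]]
        return True


class TestEchoAction(unittest.TestCase):
    def test_copies_dataset_to_output_key(self) -> None:
        context = {"datasets": {"in": ["P12345"]}}
        ok = asyncio.run(EchoAction().execute({"input_key": "in", "output_key": "out"}, context))
        self.assertTrue(ok)
        self.assertEqual(context["datasets"]["out"], ["P12345"])


# Run the test case programmatically so the outcome can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestEchoAction)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```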

Security Considerations

Input Validation - Pydantic models validate all inputs
Path Traversal - File paths validated and sandboxed
SQL Injection - SQLAlchemy ORM prevents injection
Rate Limiting - API endpoints rate limited
Error Handling - Sensitive data scrubbed from errors
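
One common way to guard against path traversal is to resolve the requested path and check that it stays under an allowed root. The sketch below shows that general technique; the resolve_inside helper and the sandbox directory are hypothetical, and the source does not describe how BioMapper performs this check.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve `user_path` under `base_dir`, rejecting anything that escapes it."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()  # collapses ".." segments and symlinks
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"Path escapes sandbox: {user_path}")
    return candidate


sandbox = Path("/srv/biomapper/data")  # hypothetical sandbox root
safe = resolve_inside(sandbox, "inputs/proteins.csv")
try:
    resolve_inside(sandbox, "../../etc/passwd")
    escaped = True
except ValueError:
    escaped = False
```

Absolute inputs such as /etc/passwd are rejected as well, because joining an absolute path onto the base discards the base entirely.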

Future Architecture Goals

Plugin System - Dynamic action loading from external packages
Distributed Execution - Celery/RQ for distributed processing
Stream Processing - Real-time data stream support
GraphQL API - Alternative API interface
Kubernetes Support - Cloud-native deployment

—

Verification Sources

Last verified: 2025-01-18

This documentation was verified against the following project resources:

/biomapper/src/actions/registry.py (Action registry implementation with global ACTION_REGISTRY)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class with execute_typed method)
/biomapper/src/core/minimal_strategy_service.py (Strategy execution engine with shared context)
/biomapper/src/api/main.py (FastAPI server with background job management)
/biomapper/src/api/services/mapper_service.py (MapperService orchestration logic)
/biomapper/src/client/client_v2.py (BiomapperClient synchronous wrapper)
/biomapper/README.md (Project overview and architecture documentation)
/biomapper/CLAUDE.md (TDD development guidelines and 2025 standardizations)
/biomapper/pyproject.toml (Poetry dependencies and package configuration)