Architecture Documentation
Comprehensive documentation of BioMapper’s architecture and design patterns.
Architecture Topics
Core Concepts
Three-Layer Architecture:
Client Layer - Python client library (biomapper_client), CLI tools, and Jupyter notebooks
API Layer - FastAPI REST service with background job management and SSE progress tracking
Core Layer - Business logic with self-registering actions and strategy execution engine
Key Design Patterns:
Registry Pattern - Actions self-register via the @register_action decorator at import time
Strategy Pattern - YAML configurations define workflows as sequences of pluggable actions
Pipeline Pattern - Actions process data through shared execution context
Type Safety - Pydantic models provide runtime validation, and their type annotations support static type checking
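The registry pattern listed above can be illustrated with a minimal sketch. This is not BioMapper's implementation: the names ACTION_REGISTRY and register_action come from this page, but the decorator body, the duplicate-name check, and the UPPERCASE_IDS action are assumptions made for illustration.

```python
from typing import Callable, Dict

# Stand-in for the global registry described on this page.
ACTION_REGISTRY: Dict[str, type] = {}


def register_action(name: str) -> Callable[[type], type]:
    """Return a class decorator that stores the decorated class under `name`."""

    def decorator(cls: type) -> type:
        # Assumption: duplicate names are rejected; the real registry may differ.
        if name in ACTION_REGISTRY:
            raise ValueError(f"Action {name!r} is already registered")
        ACTION_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged

    return decorator


@register_action("UPPERCASE_IDS")  # hypothetical action name
class UppercaseIds:
    def run(self, identifiers):
        return [identifier.upper() for identifier in identifiers]


# Registration happened when the class statement above executed, so a
# lookup by name succeeds without any explicit wiring.
action_cls = ACTION_REGISTRY["UPPERCASE_IDS"]
result = action_cls().run(["p12345", "q67890"])
```

Because registration is a side effect of executing the class statement, a module that defines actions only needs to be imported for its actions to become available by name.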
Quick Architecture Overview
Client Request → BiomapperClient → FastAPI Server → MapperService → MinimalStrategyService
↓
ACTION_REGISTRY (Global Dict)
↓
Self-Registering Action Classes
↓
Execution Context (Dict[str, Any])
Component Responsibilities
- BiomapperClient (src/client/client_v2.py)
Python client library providing a synchronous wrapper and async interfaces. Primary entry point for programmatic access.
- FastAPI Server (src/api/main.py)
REST API handling HTTP requests, validation, response formatting, and Server-Sent Events (SSE) for progress tracking.
- MapperService (src/api/services/mapper_service.py)
Orchestrates job execution, manages background tasks, and handles SQLite persistence and checkpoint recovery.
- MinimalStrategyService (src/core/minimal_strategy_service.py)
Core execution engine that loads YAML strategies from src/configs/strategies/ and executes actions sequentially.
- ACTION_REGISTRY (src/actions/registry.py)
Global dictionary where actions self-register at import time using the @register_action decorator.
- Execution Context
Shared state (Dict[str, Any]) passed between actions, containing:
datasets - Named datasets from previous actions
current_identifiers - Active identifier set
statistics - Accumulated metrics
output_files - Generated file paths
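The interaction between sequential execution and the shared context can be sketched as follows. This is a simplified illustration, not the MinimalStrategyService code: actions are plain async functions here rather than classes, the ACTIONS dict stands in for ACTION_REGISTRY, and the LOAD_IDS and DEDUPE actions are hypothetical. The four context keys come from the list above; their value types are assumptions.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Context = Dict[str, Any]
Params = Dict[str, Any]
ActionFn = Callable[[Params, Context], Awaitable[None]]


def new_context() -> Context:
    # Keys follow the description above; value types are assumed.
    return {
        "datasets": {},
        "current_identifiers": set(),
        "statistics": {},
        "output_files": [],
    }


async def load_ids(params: Params, context: Context) -> None:
    ids = list(params["ids"])
    context["datasets"][params["output_key"]] = ids
    context["current_identifiers"] = set(ids)


async def dedupe(params: Params, context: Context) -> None:
    ids = context["datasets"][params["input_key"]]
    unique = list(dict.fromkeys(ids))  # drops repeats, keeps first-seen order
    context["datasets"][params["output_key"]] = unique
    context["current_identifiers"] = set(unique)
    context["statistics"]["duplicates_removed"] = len(ids) - len(unique)


# Hypothetical registry of plain functions, standing in for ACTION_REGISTRY.
ACTIONS: Dict[str, ActionFn] = {"LOAD_IDS": load_ids, "DEDUPE": dedupe}


async def run_strategy(steps: List[Dict[str, Any]]) -> Context:
    context = new_context()
    for step in steps:  # one step at a time, in declaration order
        action = ACTIONS[step["action"]["type"]]
        await action(step["action"].get("params", {}), context)
    return context


steps = [
    {"name": "load", "action": {"type": "LOAD_IDS",
                                "params": {"ids": ["P1", "P2", "P1"], "output_key": "raw"}}},
    {"name": "dedupe", "action": {"type": "DEDUPE",
                                  "params": {"input_key": "raw", "output_key": "unique"}}},
]
final_context = asyncio.run(run_strategy(steps))
```

Each action reads what it needs from the context by key and writes its output under a new key, so later steps depend only on key names and not on which action produced the data.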
Action System
Actions are the fundamental units of work in BioMapper:
Self-Registration - Use the @register_action("ACTION_NAME") decorator
Type Safety - Inherit from TypedStrategyAction[ParamsModel, ResultModel]
Parameter Validation - Pydantic models for inputs with field descriptions
Entity Organization - Grouped by biological entity type (proteins, metabolites, chemistry)
Example action implementation:
from actions.typed_base import TypedStrategyAction
from actions.registry import register_action
from pydantic import BaseModel, Field
from typing import Dict, Any


class MyActionParams(BaseModel):
    input_key: str = Field(..., description="Input dataset key")
    threshold: float = Field(0.8, ge=0.0, le=1.0)
    output_key: str = Field(..., description="Output dataset key")


# ActionResult is typically defined within each action module
class ActionResult(BaseModel):
    success: bool
    message: str = ""
    data: Dict[str, Any] = Field(default_factory=dict)


@register_action("MY_ACTION")
class MyAction(TypedStrategyAction[MyActionParams, ActionResult]):
    def get_params_model(self) -> type[MyActionParams]:
        return MyActionParams

    async def execute_typed(self, params: MyActionParams, context: Dict[str, Any]) -> ActionResult:
        # Access input data; report a failure if the dataset is missing
        input_data = context["datasets"].get(params.input_key)
        if input_data is None:
            return ActionResult(success=False, message=f"Dataset '{params.input_key}' not found")

        # Process and store results
        processed_data = input_data  # Processing logic here
        context["datasets"][params.output_key] = processed_data
        return ActionResult(success=True, message="Processed successfully")
YAML Strategy System
Strategies define workflows as YAML configurations:
name: my_strategy
description: Example strategy
parameters:
  input_file: "${DATA_DIR}/input.csv"
  threshold: 0.8
steps:
  - name: load_step
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"
        output_key: "data"
Variable substitution supports:
${parameters.key} - Strategy parameters
${env.VAR} - Environment variables
${metadata.field} - Metadata fields
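One way to resolve the three placeholder forms listed above is a single regular-expression pass. This is a sketch under stated assumptions, not BioMapper's resolver: the substitute function, its signature, and the choice to raise on an unresolved key are all hypothetical. It handles only single-level keys and does not cover bare forms such as ${DATA_DIR}.

```python
import os
import re
from typing import Any, Mapping

# Matches ${parameters.key}, ${env.VAR}, and ${metadata.field}.
_PLACEHOLDER = re.compile(r"\$\{(parameters|env|metadata)\.([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(
    text: str,
    parameters: Mapping[str, Any],
    metadata: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Replace every recognised placeholder in `text` with its value."""
    sources = {"parameters": parameters, "env": env, "metadata": metadata}

    def replace(match: "re.Match[str]") -> str:
        namespace, key = match.group(1), match.group(2)
        source = sources[namespace]
        if key not in source:
            # Assumption: an unresolved placeholder is an error, not left as-is.
            raise KeyError(f"Unresolved placeholder: {match.group(0)}")
        return str(source[key])

    return _PLACEHOLDER.sub(replace, text)


resolved = substitute(
    "${parameters.input_file} (threshold=${parameters.threshold}, v${metadata.version})",
    parameters={"input_file": "/data/input.csv", "threshold": 0.8},
    metadata={"version": "1.0"},
)
from_env = substitute("${env.DATA_DIR}/input.csv", {}, {}, env={"DATA_DIR": "/data"})
```

Passing env explicitly, as in the second call, keeps the function testable without touching the process environment.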
Performance Considerations
- Chunking
Large datasets are processed in chunks to manage memory.
- Caching
Results are cached in SQLite for recovery and reuse.
- Async Processing
Actions are implemented as async coroutines, so I/O-bound steps do not block the API server's event loop.
- Job Persistence
Jobs persist to the database, enabling recovery from failures.
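The persistence and recovery idea above can be sketched with the standard-library sqlite3 module. The table layout, the run_job function, and the JSON-serialised context are assumptions for illustration; the actual MapperService schema is not described on this page.

```python
import json
import sqlite3
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], None]]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "job_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL, context TEXT NOT NULL)"
    )
    return conn


def run_job(conn: sqlite3.Connection, job_id: str, steps: List[Step]) -> Dict[str, Any]:
    """Run `steps` in order, resuming after the last checkpointed step."""
    row = conn.execute(
        "SELECT next_step, context FROM checkpoints WHERE job_id = ?", (job_id,)
    ).fetchone()
    next_step, context = (row[0], json.loads(row[1])) if row else (0, {"datasets": {}})
    for index in range(next_step, len(steps)):
        _, step_fn = steps[index]
        step_fn(context)
        with conn:  # commits the checkpoint only if the step succeeded
            conn.execute(
                "INSERT OR REPLACE INTO checkpoints (job_id, next_step, context) "
                "VALUES (?, ?, ?)",
                (job_id, index + 1, json.dumps(context)),
            )
    return context


calls: List[str] = []
attempts = {"dedupe": 0}


def load(ctx: Dict[str, Any]) -> None:
    calls.append("load")
    ctx["datasets"]["raw"] = ["P1", "P2", "P1"]


def flaky_dedupe(ctx: Dict[str, Any]) -> None:
    calls.append("dedupe")
    attempts["dedupe"] += 1
    if attempts["dedupe"] == 1:
        raise RuntimeError("simulated failure")  # first attempt fails
    ctx["datasets"]["unique"] = sorted(set(ctx["datasets"]["raw"]))


steps: List[Step] = [("load", load), ("dedupe", flaky_dedupe)]
conn = open_store()
try:
    run_job(conn, "job-1", steps)
except RuntimeError:
    pass  # the job stops here; the checkpoint after "load" is already stored
context = run_job(conn, "job-1", steps)  # resumes at "dedupe" without re-running "load"
```

The second call skips the completed load step because the stored checkpoint records both the index of the next step and the context as it stood after the last successful one.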
Testing Architecture
Test Levels:
Unit Tests - Individual action testing
Integration Tests - Complete workflow testing
API Tests - REST endpoint testing
E2E Tests - Full system testing
Test Organization:
tests/
├── unit/
│   └── core/
│       └── strategy_actions/
├── integration/
│   └── strategies/
└── api/
    └── endpoints/
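A unit test for a single action, the first level listed above, might look like the sketch below. The filter_by_score action is hypothetical and written as a plain async function so the example stays self-contained. The standard-library unittest module is used only because it needs no extra dependencies; this page does not state which test framework the project uses.

```python
import asyncio
import unittest
from typing import Any, Dict


async def filter_by_score(params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical action: keep rows whose score meets the threshold."""
    rows = context["datasets"][params["input_key"]]
    kept = [row for row in rows if row["score"] >= params["threshold"]]
    context["datasets"][params["output_key"]] = kept
    return {"success": True, "kept": len(kept)}


class FilterByScoreTest(unittest.TestCase):
    def setUp(self) -> None:
        # A fresh context per test keeps the tests independent of each other.
        self.context = {
            "datasets": {
                "raw": [{"id": "P1", "score": 0.9}, {"id": "P2", "score": 0.5}]
            }
        }

    def test_keeps_rows_at_or_above_threshold(self) -> None:
        params = {"input_key": "raw", "output_key": "filtered", "threshold": 0.8}
        result = asyncio.run(filter_by_score(params, self.context))
        self.assertTrue(result["success"])
        kept_ids = [row["id"] for row in self.context["datasets"]["filtered"]]
        self.assertEqual(kept_ids, ["P1"])

    def test_missing_input_key_raises(self) -> None:
        params = {"input_key": "absent", "output_key": "filtered", "threshold": 0.8}
        with self.assertRaises(KeyError):
            asyncio.run(filter_by_score(params, self.context))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FilterByScoreTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because an action only reads from and writes to the context, a unit test can build a small context by hand and assert on its contents afterwards, with no API server or strategy file involved.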
Security Considerations
Input Validation - Pydantic models validate all inputs
Path Traversal - File paths validated and sandboxed
SQL Injection - SQLAlchemy ORM prevents injection
Rate Limiting - API endpoints rate limited
Error Handling - Sensitive data scrubbed from errors
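The path traversal check listed above can be sketched with pathlib. The function name resolve_in_sandbox and the /srv/data base directory are hypothetical; how BioMapper actually validates paths is not shown on this page.

```python
from pathlib import Path
from typing import Union


def resolve_in_sandbox(base_dir: Union[str, Path], user_path: str) -> Path:
    """Resolve `user_path` under `base_dir`, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"Path escapes sandbox: {user_path!r}")
    return candidate


safe = resolve_in_sandbox("/srv/data", "inputs/data.csv")

try:
    resolve_in_sandbox("/srv/data", "../../etc/passwd")
    rejected = False
except ValueError:
    rejected = True
```

Resolving before comparing matters: checking the raw string for ".." would miss symlinks and absolute paths. An absolute user_path replaces the base when joined, so it fails the same containment check.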
Future Architecture Goals
Plugin System - Dynamic action loading from external packages
Distributed Execution - Celery/RQ for distributed processing
Stream Processing - Real-time data stream support
GraphQL API - Alternative API interface
Kubernetes Support - Cloud-native deployment
Verification Sources
Last verified: 2025-01-18
This documentation was verified against the following project resources:
/biomapper/src/actions/registry.py (Action registry implementation with global ACTION_REGISTRY)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class with execute_typed method)
/biomapper/src/core/minimal_strategy_service.py (Strategy execution engine with shared context)
/biomapper/src/api/main.py (FastAPI server with background job management)
/biomapper/src/api/services/mapper_service.py (MapperService orchestration logic)
/biomapper/src/client/client_v2.py (BiomapperClient synchronous wrapper)
/biomapper/README.md (Project overview and architecture documentation)
/biomapper/CLAUDE.md (TDD development guidelines and 2025 standardizations)
/biomapper/pyproject.toml (Poetry dependencies and package configuration)