Progressive Semantic Match ========================= .. automodule:: src.actions.entities.metabolites.matching.progressive_semantic_match Overview -------- The ``PROGRESSIVE_SEMANTIC_MATCH`` action orchestrates a comprehensive 4-stage metabolite matching pipeline that progressively applies different matching strategies to achieve maximum coverage. This is the primary action for metabolomics identifier harmonization in the biomapper framework. The progressive approach starts with high-confidence exact matches and gradually relaxes criteria to capture more difficult cases, achieving typical coverage rates of 75-80% for well-curated metabolomics datasets. Pipeline Architecture --------------------- 4-Stage Progressive Strategy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. mermaid:: graph TD A[Input Dataset] --> B[Stage 1: Direct Matching] B --> C[Stage 2: Fuzzy String Matching] C --> D[Stage 3: RampDB API Bridge] D --> E[Stage 4: HMDB Vector Matching] E --> F[Results Consolidation] F --> G[Coverage Analysis & Reporting] Stage Breakdown ~~~~~~~~~~~~~~~~ **Stage 1: Direct/Exact Matching** - Method: ``NIGHTINGALE_NMR_MATCH`` - Coverage: 45-55% (high confidence) - Speed: <2 seconds for 10K identifiers - Strategy: Exact string matching against curated reference **Stage 2: Fuzzy String Matching** - Method: ``METABOLITE_FUZZY_STRING_MATCH`` - Coverage: +15-20% additional - Speed: 5-10 seconds - Strategy: Levenshtein distance with biological awareness **Stage 3: External API Integration** - Method: ``METABOLITE_RAMPDB_BRIDGE`` - Coverage: +8-12% additional - Speed: 30-60 seconds (API dependent) - Strategy: Cross-database lookups via RampDB **Stage 4: Vector Semantic Matching** - Method: ``HMDB_VECTOR_MATCH`` - Coverage: +5-10% additional - Speed: 10-20 seconds - Strategy: Embedding-based semantic similarity Parameters ---------- .. list-table:: :widths: 25 15 10 50 :header-rows: 1 * - Parameter - Type - Required - Description * - ``input_key`` - string - Yes - Key for the input dataset containing metabolite identifiers * - ``output_key`` - string - Yes - Key for the final consolidated output dataset * - ``identifier_column`` - string - Yes - Column name containing metabolite identifiers to match * - ``stage1_threshold`` - float - No - Direct matching threshold (default: 0.95) * - ``stage2_threshold`` - float - No - Fuzzy matching threshold (default: 0.8) * - ``stage3_batch_size`` - integer - No - RampDB API batch size (default: 50) * - ``stage4_threshold`` - float - No - Vector similarity threshold (default: 0.75) * - ``enable_quality_control`` - boolean - No - Enable inter-stage validation (default: true) * - ``export_stage_results`` - boolean - No - Export individual stage results (default: false) Performance Metrics ------------------- Expected Coverage by Dataset Type ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Based on production runs with real biological datasets: .. list-table:: :widths: 30 20 20 30 :header-rows: 1 * - Dataset Type - Expected Coverage - Processing Time - Confidence Level * - Arivale Metabolomics - 75-80% (1,053/1,351) - 2-3 minutes - High * - UK Biobank Subset - 40-45% (varies) - 1-2 minutes - Medium * - Custom Datasets - 50-70% (varies) - Variable - Variable Cumulative Coverage Progression ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Typical progression for a 1,000 metabolite dataset: - **After Stage 1**: ~500 matched (50%) - **After Stage 2**: ~650 matched (65%) - **After Stage 3**: ~720 matched (72%) - **After Stage 4**: ~750 matched (75%) Example Usage ------------- YAML Strategy ~~~~~~~~~~~~~ .. code-block:: yaml steps: - name: progressive_metabolite_matching action: type: PROGRESSIVE_SEMANTIC_MATCH params: input_key: raw_metabolites output_key: matched_metabolites identifier_column: compound_name stage1_threshold: 0.95 stage2_threshold: 0.8 stage3_batch_size: 40 stage4_threshold: 0.75 enable_quality_control: true export_stage_results: true Python Client ~~~~~~~~~~~~~ .. code-block:: python from src.client.client_v2 import BiomapperClient client = BiomapperClient(base_url="http://localhost:8000") # Load metabolite dataset context = {"datasets": {"metabolites": metabolite_df}} result = await client.run_action( action_type="PROGRESSIVE_SEMANTIC_MATCH", params={ "input_key": "metabolites", "output_key": "harmonized_metabolites", "identifier_column": "metabolite_name", "stage1_threshold": 0.9, "stage2_threshold": 0.8, "enable_quality_control": True }, context=context ) # Access consolidated results matched_data = result.context["datasets"]["harmonized_metabolites"] print(f"Coverage achieved: {len(matched_data)} metabolites") Output Format ------------- The action returns a comprehensive dataset with matching provenance: .. list-table:: :widths: 25 75 :header-rows: 1 * - Column - Description * - ``original_id`` - Original metabolite identifier from input * - ``matched_hmdb_id`` - Final HMDB identifier (primary output) * - ``matched_name`` - Standardized metabolite name * - ``matching_stage`` - Stage that produced the match (1-4) * - ``matching_method`` - Specific method used (exact, fuzzy, api, vector) * - ``confidence_score`` - Overall confidence score (0.0-1.0) * - ``stage1_candidate`` - Stage 1 matching candidate (if any) * - ``stage2_candidate`` - Stage 2 matching candidate (if any) * - ``stage3_candidate`` - Stage 3 matching candidate (if any) * - ``stage4_candidate`` - Stage 4 matching candidate (if any) * - ``quality_flags`` - Quality control flags and warnings Stage Implementation Details ---------------------------- Stage 1: Direct Matching ~~~~~~~~~~~~~~~~~~~~~~~~~ High-confidence exact matches using curated reference data: .. code-block:: yaml stage1_configuration: method: NIGHTINGALE_NMR_MATCH reference_data: nightingale_nmr_metabolites matching_strategy: exact_string case_sensitive: false normalize_whitespace: true Stage 2: Fuzzy String Matching ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Captures near-matches with controlled edit distance: .. code-block:: yaml stage2_configuration: method: METABOLITE_FUZZY_STRING_MATCH algorithm: biological_levenshtein max_edit_distance: 2 ignore_case: true ignore_punctuation: true Stage 3: External API Integration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Leverages RampDB for cross-database resolution: .. code-block:: yaml stage3_configuration: method: METABOLITE_RAMPDB_BRIDGE target_databases: [hmdb, kegg, chebi] timeout: 45 retry_attempts: 3 rate_limit_delay: 1.0 Stage 4: Vector Semantic Matching ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Uses embedding-based similarity for complex cases: .. code-block:: yaml stage4_configuration: method: HMDB_VECTOR_MATCH embedding_model: all-MiniLM-L6-v2 vector_db: qdrant_hmdb_collection similarity_metric: cosine max_candidates: 5 Quality Control Features ------------------------ Inter-Stage Validation ~~~~~~~~~~~~~~~~~~~~~~ Validates consistency between matching stages: - **Conflict Detection**: Identifies when stages produce conflicting matches - **Confidence Scoring**: Weights matches based on stage reliability - **Consensus Building**: Resolves conflicts using multiple signals Quality Metrics Tracking ~~~~~~~~~~~~~~~~~~~~~~~~~ Monitors pipeline health and performance: - **Coverage Progression**: Tracks cumulative coverage by stage - **Confidence Distribution**: Analyzes confidence score distributions - **Error Rate Analysis**: Identifies systematic matching failures - **Performance Benchmarks**: Compares against expected baselines Example Quality Report ~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: json { "total_identifiers": 1351, "total_matched": 1053, "overall_coverage": 77.9, "stage_breakdown": { "stage1": {"matched": 692, "coverage": 51.2}, "stage2": {"matched": 201, "coverage": 14.9}, "stage3": {"matched": 105, "coverage": 7.8}, "stage4": {"matched": 55, "coverage": 4.0} }, "quality_metrics": { "high_confidence": 847, "medium_confidence": 156, "low_confidence": 50, "flagged_for_review": 23 } } Advanced Configuration ---------------------- Threshold Optimization ~~~~~~~~~~~~~~~~~~~~~~ Fine-tune thresholds based on dataset characteristics: .. code-block:: yaml # Conservative profile (high precision) conservative_config: stage1_threshold: 0.98 stage2_threshold: 0.85 stage4_threshold: 0.80 # Aggressive profile (high recall) aggressive_config: stage1_threshold: 0.92 stage2_threshold: 0.75 stage4_threshold: 0.70 # Balanced profile (recommended) balanced_config: stage1_threshold: 0.95 stage2_threshold: 0.80 stage4_threshold: 0.75 Performance Optimization ~~~~~~~~~~~~~~~~~~~~~~~~ For large datasets (>10K metabolites): .. code-block:: yaml performance_config: enable_parallel_stages: true stage2_chunk_size: 2000 stage3_batch_size: 100 stage4_batch_size: 500 cache_intermediate_results: true Custom Stage Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~ Override individual stage parameters: .. code-block:: yaml custom_stage_config: stage1_params: reference_source: "custom_metabolite_db" normalization_rules: ["remove_stereochemistry"] stage2_params: algorithm: "jaro_winkler" biological_synonyms: true stage3_params: target_databases: ["hmdb", "kegg", "chebi", "pubchem"] concurrent_requests: 3 stage4_params: use_llm_validation: true embedding_cache: true Error Handling and Recovery --------------------------- Stage-Level Error Recovery ~~~~~~~~~~~~~~~~~~~~~~~~~~ Each stage includes independent error handling: - **Stage Isolation**: Failure in one stage doesn't affect others - **Partial Results**: Successfully completed stages preserve their matches - **Error Reporting**: Detailed logs for debugging stage-specific issues - **Graceful Degradation**: Pipeline continues with reduced functionality Common Recovery Patterns ~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: yaml error_handling: stage1_fallback: "skip_to_stage2" stage2_timeout_action: "reduce_batch_size" stage3_api_failure: "retry_with_exponential_backoff" stage4_memory_error: "process_in_smaller_chunks" Monitoring and Alerting ~~~~~~~~~~~~~~~~~~~~~~~ Built-in monitoring for production deployments: - **Progress Tracking**: Real-time progress updates for long-running pipelines - **Performance Alerts**: Notifications when stages exceed expected runtimes - **Quality Alerts**: Warnings when coverage drops below baselines - **Resource Monitoring**: Memory and API quota usage tracking Real-World Case Studies ----------------------- Arivale Metabolomics Dataset ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Dataset**: 1,351 unique metabolites from personalized medicine platform **Result**: 77.9% coverage (1,053 matched metabolites) **Processing Time**: 2 minutes 34 seconds **Stage Performance**: - Stage 1: 692 matches (51.2%) - Stage 2: 201 additional (14.9%) - Stage 3: 105 additional (7.8%) - Stage 4: 55 additional (4.0%) UK Biobank Subset ~~~~~~~~~~~~~~~~~~ **Dataset**: 2,847 metabolite measurements from population study **Result**: 42.3% coverage (1,204 matched metabolites) **Challenge**: Heterogeneous naming conventions **Processing Time**: 1 minute 47 seconds Best Practices -------------- 1. **Start with Balanced Configuration**: Use recommended thresholds initially 2. **Monitor Stage Performance**: Track individual stage contributions 3. **Validate Results**: Manually review sample matches from each stage 4. **Document Parameters**: Record threshold choices and rationale 5. **Benchmark Regularly**: Compare performance against known datasets Integration Patterns -------------------- Complete Pipeline Integration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: yaml name: comprehensive_metabolomics_pipeline steps: - name: load_metabolites action: type: LOAD_DATASET_IDENTIFIERS params: file_path: "${parameters.input_file}" identifier_column: metabolite_name output_key: raw_metabolites - name: progressive_matching action: type: PROGRESSIVE_SEMANTIC_MATCH params: input_key: raw_metabolites output_key: matched_metabolites identifier_column: metabolite_name - name: generate_report action: type: GENERATE_METABOLOMICS_REPORT params: input_key: matched_metabolites output_key: coverage_report - name: export_results action: type: EXPORT_DATASET params: input_key: matched_metabolites file_path: "${parameters.output_file}" Multi-Dataset Comparison ~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: yaml # Compare matching across multiple datasets steps: - name: match_dataset_a action: type: PROGRESSIVE_SEMANTIC_MATCH params: input_key: dataset_a output_key: matched_a - name: match_dataset_b action: type: PROGRESSIVE_SEMANTIC_MATCH params: input_key: dataset_b output_key: matched_b - name: compare_coverage action: type: CALCULATE_SET_OVERLAP params: dataset1_key: matched_a dataset2_key: matched_b Troubleshooting --------------- Low Overall Coverage ~~~~~~~~~~~~~~~~~~~~ 1. **Check Input Data Quality**: Verify metabolite names are clean and standardized 2. **Adjust Thresholds**: Lower thresholds to increase recall 3. **Enable All Stages**: Ensure no stages are being skipped 4. **Review Reference Data**: Confirm reference databases are current Stage-Specific Issues ~~~~~~~~~~~~~~~~~~~~~ - **Stage 1 Low Coverage**: Check reference data alignment and normalization - **Stage 2 Timeouts**: Reduce batch sizes or enable chunking - **Stage 3 API Failures**: Verify API credentials and network connectivity - **Stage 4 Performance**: Ensure vector database is properly indexed Performance Problems ~~~~~~~~~~~~~~~~~~~~ 1. **Memory Issues**: Enable chunked processing for large datasets 2. **API Timeouts**: Increase timeout values and reduce batch sizes 3. **Slow Processing**: Enable parallel processing where supported 4. **Resource Limits**: Monitor CPU and memory usage during execution See Also -------- - :doc:`nightingale_nmr_match` - Stage 1 direct matching implementation - :doc:`metabolite_fuzzy_string_match` - Stage 2 fuzzy matching details - :doc:`metabolite_rampdb_bridge` - Stage 3 API integration guide - :doc:`hmdb_vector_match` - Stage 4 vector similarity matching - :doc:`../workflows/metabolomics_pipeline` - Complete workflow examples - :doc:`../examples/threshold_optimization` - Parameter tuning guides - :doc:`../performance/large_dataset_optimization` - Scaling considerations