NIGHTINGALE_NMR_MATCH

Match Nightingale NMR biomarkers to standard identifiers (HMDB/LOINC) with specialized platform knowledge.

Purpose

This action provides specialized matching for UK Biobank NMR metabolomics data from the Nightingale Health platform. It offers:

Exact matching for known Nightingale biomarkers
Fuzzy matching for naming variations
Lipoprotein particle pattern recognition
Abbreviation expansion and standardization
Category classification (lipids, amino acids, etc.)
Unit standardization
Integration with external reference files

Parameters

Required Parameters

input_key (string): Dataset key from context containing Nightingale biomarker data.
output_key (string): Key where matched results will be stored in context.

Optional Parameters

biomarker_column (string): Column containing Nightingale biomarker names. Default: "biomarker"
unit_column (string): Column containing measurement units (optional). Default: None
reference_file (string): Path to Nightingale reference mapping file. Default: "/procedure/data/local_data/references/nightingale_nmr_reference.csv"
use_cached_reference (boolean): Cache reference file in memory for performance. Default: true
target_format (string): Target identifier format: 'hmdb', 'loinc', or 'both'. Default: "hmdb"
match_threshold (float): Fuzzy match threshold for biomarker names (0.0-1.0). Default: 0.85
use_abbreviations (boolean): Expand and match common abbreviations. Default: true
case_sensitive (boolean): Case-sensitive matching. Default: false
add_metadata (boolean): Add Nightingale metadata columns to output. Default: true
include_units (boolean): Include standardized units in output. Default: true
include_categories (boolean): Include biomarker categories in output. Default: true
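
The parameters and defaults above can be summarized as a typed structure. This is a minimal sketch, not the action's actual parameter class: the name NightingaleParams and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NightingaleParams:
    """Hypothetical container mirroring the documented parameters and defaults."""

    input_key: str
    output_key: str
    biomarker_column: str = "biomarker"
    unit_column: Optional[str] = None
    reference_file: str = (
        "/procedure/data/local_data/references/nightingale_nmr_reference.csv"
    )
    use_cached_reference: bool = True
    target_format: str = "hmdb"
    match_threshold: float = 0.85
    use_abbreviations: bool = True
    case_sensitive: bool = False
    add_metadata: bool = True
    include_units: bool = True
    include_categories: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges (assumed behavior).
        if self.target_format not in {"hmdb", "loinc", "both"}:
            raise ValueError(f"Unsupported target_format: {self.target_format!r}")
        if not 0.0 <= self.match_threshold <= 1.0:
            raise ValueError("match_threshold must be between 0.0 and 1.0")


# Only the two required parameters need to be supplied.
params = NightingaleParams(input_key="ukbb_nmr_data", output_key="matched_biomarkers")
```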

Built-in Biomarker Patterns

The action includes built-in patterns for common Nightingale biomarkers:

Lipids and Lipoproteins

Total_C (Total cholesterol) → HMDB0000067, LOINC 2093-3
LDL_C (LDL cholesterol) → HMDB0000067, LOINC 13457-7
HDL_C (HDL cholesterol) → HMDB0000067, LOINC 2085-9
Triglycerides → HMDB0000827, LOINC 2571-8

Apolipoproteins

ApoA1 (Apolipoprotein A1) → LOINC 1869-7
ApoB (Apolipoprotein B) → LOINC 1884-6

Amino Acids

Ala (Alanine) → HMDB0000161, LOINC 1916-6
Gln (Glutamine) → HMDB0000641, LOINC 14681-2

Metabolic Markers

Glucose → HMDB0000122, LOINC 2345-7
Lactate → HMDB0000190, LOINC 2524-7
bOHbutyrate (Beta-hydroxybutyrate) → HMDB0000357, LOINC 53060-9

Inflammation

GlycA (Glycoprotein acetyls) → No standard IDs (Nightingale-specific)
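
One way to picture the built-in patterns is as a lookup table keyed by biomarker name. The sketch below covers only a subset of the entries listed above; the names BUILTIN_BIOMARKERS and lookup_builtin are hypothetical, and the action's internal representation may differ.

```python
from typing import Dict, Optional

# Subset of the documented built-in patterns; None means no standard ID exists.
BUILTIN_BIOMARKERS: Dict[str, Dict[str, Optional[str]]] = {
    "Total_C": {"hmdb_id": "HMDB0000067", "loinc_code": "2093-3"},
    "LDL_C": {"hmdb_id": "HMDB0000067", "loinc_code": "13457-7"},
    "HDL_C": {"hmdb_id": "HMDB0000067", "loinc_code": "2085-9"},
    "ApoA1": {"hmdb_id": None, "loinc_code": "1869-7"},
    "ApoB": {"hmdb_id": None, "loinc_code": "1884-6"},
    "Ala": {"hmdb_id": "HMDB0000161", "loinc_code": "1916-6"},
    "Glucose": {"hmdb_id": "HMDB0000122", "loinc_code": "2345-7"},
    "GlycA": {"hmdb_id": None, "loinc_code": None},
}


def lookup_builtin(
    name: str, target_format: str = "hmdb"
) -> Optional[Dict[str, Optional[str]]]:
    """Return the identifiers requested by target_format, or None if unknown."""
    entry = BUILTIN_BIOMARKERS.get(name)
    if entry is None:
        return None
    if target_format == "both":
        return dict(entry)
    key = "hmdb_id" if target_format == "hmdb" else "loinc_code"
    return {key: entry[key]}
```

Note that a biomarker such as GlycA is still recognized even though both identifiers are empty, which is distinct from a name that is not known at all.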

Lipoprotein Particle Patterns

The action recognizes complex lipoprotein particle naming patterns:

VLDL particles: XXL_VLDL_*, XL_VLDL_*, L_VLDL_*, etc.
LDL particles: L_LDL_*, M_LDL_*, S_LDL_*
HDL particles: XL_HDL_*, L_HDL_*, M_HDL_*, S_HDL_*

Each pattern includes appropriate units (nmol/L for particles, mmol/L for concentrations).
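
The particle naming scheme lends itself to a regular expression. The following is a rough sketch under stated assumptions: the pattern accepts any listed size prefix with any lipoprotein class, and it treats a trailing P as a particle count (nmol/L) and every other suffix as a concentration (mmol/L). The function name parse_lipoprotein is hypothetical.

```python
import re
from typing import Dict, Optional

# <size>_<class>_<measure>, e.g. XXL_VLDL_P or M_HDL_C.
PARTICLE_PATTERN = re.compile(r"^(XXL|XL|L|M|S)_(VLDL|LDL|HDL)_([A-Za-z]+)$")


def parse_lipoprotein(name: str) -> Optional[Dict[str, str]]:
    """Split a lipoprotein subclass name into its parts and assign a unit."""
    match = PARTICLE_PATTERN.match(name)
    if match is None:
        return None
    size, lipoprotein_class, measure = match.groups()
    # Assumption: suffix "P" denotes particles; anything else is a concentration.
    unit = "nmol/L" if measure == "P" else "mmol/L"
    return {
        "size": size,
        "class": lipoprotein_class,
        "measure": measure,
        "unit": unit,
    }
```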

Example Usage

Basic HMDB Matching

- name: match_nmr_biomarkers
  action:
    type: NIGHTINGALE_NMR_MATCH
    params:
      input_key: "ukbb_nmr_data"
      output_key: "matched_biomarkers"
      biomarker_column: "biomarker_name"
      target_format: "hmdb"
      match_threshold: 0.85

LOINC Code Mapping

- name: map_to_loinc
  action:
    type: NIGHTINGALE_NMR_MATCH
    params:
      input_key: "clinical_metabolites"
      output_key: "loinc_mapped"
      biomarker_column: "test_name"
      target_format: "loinc"
      include_units: true
      include_categories: true

Both HMDB and LOINC

- name: comprehensive_mapping
  action:
    type: NIGHTINGALE_NMR_MATCH
    params:
      input_key: "nmr_metabolomics"
      output_key: "fully_mapped"
      target_format: "both"
      add_metadata: true
      use_abbreviations: true

Custom Reference File

- name: custom_nightingale_match
  action:
    type: NIGHTINGALE_NMR_MATCH
    params:
      input_key: "biomarker_data"
      output_key: "custom_matched"
      reference_file: "/data/custom_nightingale_reference.csv"
      use_cached_reference: false
      match_threshold: 0.90

Strict Matching

- name: exact_matches_only
  action:
    type: NIGHTINGALE_NMR_MATCH
    params:
      input_key: "quality_controlled_data"
      output_key: "exact_matches"
      match_threshold: 1.0  # Only exact matches
      use_abbreviations: false
      case_sensitive: true

Input Data Format

Expected biomarker data structure:

[
    {
        "biomarker": "Total_C",
        "value": 5.2,
        "unit": "mmol/L",
        "sample_id": "UKB_001"
    },
    {
        "biomarker": "Ala",
        "value": 0.45,
        "unit": "mmol/L",
        "sample_id": "UKB_002"
    },
    {
        "biomarker": "XXL_VLDL_P",
        "value": 1.8,
        "unit": "nmol/L",
        "sample_id": "UKB_003"
    }
]

Output Format

HMDB format output:

[
    {
        "original_biomarker": "Total_C",
        "matched_name": "Total_C",
        "hmdb_id": "HMDB0000067",
        "description": "Total cholesterol",
        "category": "lipids",
        "unit": "mmol/L",
        "confidence": 1.0,
        "value": 5.2,
        "sample_id": "UKB_001"
    },
    {
        "original_biomarker": "Ala",
        "matched_name": "Ala",
        "hmdb_id": "HMDB0000161",
        "description": "Alanine",
        "category": "amino_acids",
        "unit": "mmol/L",
        "confidence": 1.0,
        "value": 0.45,
        "sample_id": "UKB_002"
    }
]

Both HMDB and LOINC format:

[
    {
        "original_biomarker": "Total_C",
        "matched_name": "Total_C",
        "hmdb_id": "HMDB0000067",
        "loinc_code": "2093-3",
        "description": "Total cholesterol",
        "category": "lipids",
        "unit": "mmol/L",
        "confidence": 1.0,
        "value": 5.2,
        "sample_id": "UKB_001"
    }
]

Reference File Format

If using a custom reference file, it should follow this CSV structure:

nightingale_name,hmdb_id,loinc_code,description,category,unit
Total_C,HMDB0000067,2093-3,Total cholesterol,lipids,mmol/L
LDL_C,HMDB0000067,13457-7,LDL cholesterol,lipids,mmol/L
Ala,HMDB0000161,1916-6,Alanine,amino_acids,mmol/L
GlycA,,,"Glycoprotein acetyls",inflammation,mmol/L
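
A reference file in this layout can be read with Python's standard csv module. The sketch below is illustrative only: load_reference is a hypothetical helper, and treating empty cells as missing identifiers is an assumption about how blank hmdb_id and loinc_code fields should be interpreted.

```python
import csv
import io
from typing import Dict, Optional, TextIO


def load_reference(handle: TextIO) -> Dict[str, Dict[str, Optional[str]]]:
    """Index reference rows by nightingale_name; empty cells become None."""
    reference: Dict[str, Dict[str, Optional[str]]] = {}
    for row in csv.DictReader(handle):
        name = row.pop("nightingale_name")
        reference[name] = {key: (value or None) for key, value in row.items()}
    return reference


# In-memory sample using two rows from the structure shown above.
SAMPLE_CSV = """nightingale_name,hmdb_id,loinc_code,description,category,unit
Total_C,HMDB0000067,2093-3,Total cholesterol,lipids,mmol/L
GlycA,,,"Glycoprotein acetyls",inflammation,mmol/L
"""

reference = load_reference(io.StringIO(SAMPLE_CSV))
```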

Matching Algorithm

The action uses a multi-step matching approach:

Exact match against reference file or built-in patterns
Lipoprotein pattern matching for particle measurements
Fuzzy matching with abbreviation expansion
Confidence scoring based on match quality
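
The steps above can be sketched roughly as follows. This is not the action's implementation: it uses difflib.SequenceMatcher from the standard library as a stand-in for whatever fuzzy scorer the action uses, it leaves out abbreviation expansion, and the names KNOWN, PARTICLE, and match_biomarker, along with the confidence of 1.0 for pattern matches, are assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Tiny stand-in for the reference file or built-in patterns.
KNOWN = {"Total_C": "HMDB0000067", "Ala": "HMDB0000161", "Glucose": "HMDB0000122"}
PARTICLE = re.compile(r"^(XXL|XL|L|M|S)_(VLDL|LDL|HDL)_[A-Za-z]+$")


def _normalize(text: str, case_sensitive: bool) -> str:
    return text if case_sensitive else text.lower()


def match_biomarker(
    name: str, threshold: float = 0.85, case_sensitive: bool = False
) -> Tuple[Optional[str], str, float]:
    """Return (matched_name, method, confidence) for one biomarker name."""
    target = _normalize(name, case_sensitive)

    # Step 1: exact match against known names.
    for known in KNOWN:
        if _normalize(known, case_sensitive) == target:
            return known, "exact", 1.0

    # Step 2: lipoprotein particle pattern.
    if PARTICLE.match(name):
        return name, "lipoprotein_pattern", 1.0

    # Step 3: fuzzy match, keeping the best-scoring candidate.
    best_name, best_score = None, 0.0
    for known in KNOWN:
        candidate = _normalize(known, case_sensitive)
        score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best_name, best_score = known, score

    # Step 4: the similarity score doubles as the confidence value.
    if best_name is not None and best_score >= threshold:
        return best_name, "fuzzy", round(best_score, 3)
    return None, "unmatched", 0.0
```

Ordering the steps from cheapest and most certain to most permissive means that a fuzzy comparison only runs for names that no exact or pattern rule has already claimed.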

Abbreviation Expansion

Common abbreviations are automatically expanded:

C → cholesterol
TG → triglycerides
PL → phospholipids
P → particles
XXL/XL/L/M/S → size descriptors
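
Token-based expansion might look like the sketch below. The measure abbreviations come from the list above; the wording chosen for the size descriptors (for example "extremely large" for XXL) and the rule that sizes are only expanded in the first position and measures only in the last are assumptions, as is the function name expand_abbreviations.

```python
from typing import List

MEASURE_ABBREVIATIONS = {
    "C": "cholesterol",
    "TG": "triglycerides",
    "PL": "phospholipids",
    "P": "particles",
}

# Assumed wording for the size descriptors.
SIZE_ABBREVIATIONS = {
    "XXL": "extremely large",
    "XL": "very large",
    "L": "large",
    "M": "medium",
    "S": "small",
}


def expand_abbreviations(name: str) -> str:
    """Expand a leading size token and a trailing measure token."""
    tokens = name.split("_")
    last = len(tokens) - 1
    expanded: List[str] = []
    for position, token in enumerate(tokens):
        if last > 0 and position == 0 and token in SIZE_ABBREVIATIONS:
            expanded.append(SIZE_ABBREVIATIONS[token])
        elif last > 0 and position == last and token in MEASURE_ABBREVIATIONS:
            expanded.append(MEASURE_ABBREVIATIONS[token])
        else:
            # Leave class names (VLDL, LDL, HDL) and unknown tokens unchanged.
            expanded.append(token)
    return " ".join(expanded)
```

Restricting each table to one position avoids ambiguity: in L_LDL_C the leading L is a size, while single-token names such as Ala are left untouched.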

Statistics and Metadata

The action provides comprehensive matching statistics:

{
    "statistics": {
        "nightingale_nmr_match": {
            "total_biomarkers": 150,
            "matched_biomarkers": 142,
            "match_rate": 0.947,
            "category_breakdown": {
                "lipids": 65,
                "amino_acids": 22,
                "glycolysis": 18,
                "lipoproteins": 25,
                "inflammation": 8,
                "unknown": 4
            }
        }
    }
}
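
For orientation, statistics of this shape could be derived from the output records along the following lines. This is a sketch only: summarize_matches is a hypothetical helper, and it assumes that unmatched records carry an empty matched_name and that matched records without a category are counted as "unknown".

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_matches(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute totals, match rate, and per-category counts for output records."""
    total = len(records)
    matched = [record for record in records if record.get("matched_name")]
    breakdown = Counter(record.get("category") or "unknown" for record in matched)
    return {
        "total_biomarkers": total,
        "matched_biomarkers": len(matched),
        # Guard against division by zero for empty datasets.
        "match_rate": round(len(matched) / total, 3) if total else 0.0,
        "category_breakdown": dict(breakdown),
    }


records = [
    {"matched_name": "Total_C", "category": "lipids"},
    {"matched_name": "Ala", "category": "amino_acids"},
    {"matched_name": None, "category": None},
]
stats = summarize_matches(records)
```

With the figures in the example above, 142 matched out of 150 biomarkers gives a match rate of 142 / 150 ≈ 0.947.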

Error Handling

Dataset not found

Error: Dataset 'missing_data' not found in context

Solution: Verify input_key exists in context datasets.

Missing biomarker column

Error: Column 'biomarker' not found in dataset

Solution: Check biomarker_column parameter matches dataset structure.

Reference file issues

Warning: Reference file not found, using built-in patterns only

Solution: Verify reference file path or rely on built-in patterns.
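
The three conditions above can be checked before running a strategy. The sketch below is hypothetical: it assumes the context is a dictionary with a "datasets" entry holding lists of records, which may not match the real context object, and check_inputs is not part of the action's API.

```python
import os
from typing import Any, Dict, List, Optional


def check_inputs(
    context: Dict[str, Any],
    input_key: str,
    biomarker_column: str = "biomarker",
    reference_file: Optional[str] = None,
) -> List[str]:
    """Raise on fatal problems and return a list of non-fatal warnings."""
    datasets = context.get("datasets", {})
    if input_key not in datasets:
        raise ValueError(f"Dataset '{input_key}' not found in context")

    rows = datasets[input_key]
    if rows and biomarker_column not in rows[0]:
        raise ValueError(f"Column '{biomarker_column}' not found in dataset")

    warnings: List[str] = []
    # A missing reference file is not fatal: built-in patterns still apply.
    if reference_file and not os.path.exists(reference_file):
        warnings.append("Reference file not found, using built-in patterns only")
    return warnings
```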

Best Practices

Choose the target format for the downstream use: HMDB for metabolomics, LOINC for clinical data
Set match_threshold according to data quality: use a higher value for clean, consistently named data
Enable abbreviation expansion when naming conventions vary
Include metadata for comprehensive biomarker annotation
Cache reference files for repeated strategy executions
Validate match rates: a low rate may indicate data format issues

Performance Notes

Built-in patterns provide fast exact matching
Fuzzy matching adds computational overhead but improves coverage
Reference file caching significantly improves repeated execution
Memory usage scales with dataset size and reference complexity
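
To illustrate why caching helps on repeated executions, the sketch below memoizes a reference loader with functools.lru_cache so the CSV is parsed once per path. It is only an analogy for use_cached_reference, whose actual mechanism is not described here; load_reference_cached is a hypothetical name, and a temporary file stands in for the real reference.

```python
import csv
import functools
import tempfile
from typing import Dict


@functools.lru_cache(maxsize=8)
def load_reference_cached(path: str) -> Dict[str, Dict[str, str]]:
    """Parse the reference CSV once per path; later calls reuse the result."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["nightingale_name"]: row for row in csv.DictReader(handle)}


# Temporary file standing in for the real reference file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("nightingale_name,hmdb_id,loinc_code,description,category,unit\n")
    tmp.write("Ala,HMDB0000161,1916-6,Alanine,amino_acids,mmol/L\n")

first = load_reference_cached(tmp.name)   # reads and parses the file
second = load_reference_cached(tmp.name)  # served from the in-memory cache
```

A cache like this trades memory for speed and will not notice edits to the file, which is one reason to disable caching while a custom reference file is still being revised.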

Common Use Cases

UK Biobank NMR Processing: Map Nightingale biomarker names to standard metabolomics identifiers
Clinical Data Integration: Convert platform-specific names to standardized clinical codes
Multi-Platform Studies: Harmonize biomarker names across different NMR platforms
Metabolomics Database Mapping: Prepare data for integration with metabolomics databases

Integration

This action typically follows data loading and precedes metabolomics analysis:

steps:
  # 1. Load Nightingale NMR data
  - name: load_nmr_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/ukbb_nmr_biomarkers.csv"
        identifier_column: "biomarker"
        output_key: "raw_nmr"

  # 2. Match to standard identifiers
  - name: standardize_biomarkers
    action:
      type: NIGHTINGALE_NMR_MATCH
      params:
        input_key: "raw_nmr"
        output_key: "standardized_nmr"
        target_format: "both"
        match_threshold: 0.85

  # 3. Continue with metabolomics analysis
  - name: analyze_metabolites
    action:
      type: SEMANTIC_METABOLITE_MATCH
      params:
        input_key: "standardized_nmr"
        target_database: "hmdb"