CHEMISTRY_FUZZY_TEST_MATCH

Fuzzy matching for clinical chemistry test names and descriptions.

Purpose

This action fuzzy-matches clinical chemistry test names against a reference test database by:

  • Matching test names using multiple algorithms (Levenshtein, Jaro-Winkler, token-based)

  • Handling abbreviations and synonyms in clinical test nomenclature

  • Normalizing units and reference ranges

  • Resolving ambiguous test mappings

  • Supporting LOINC code cross-referencing

Parameters

Required Parameters

input_key (string)

Context key containing source chemistry test data.

target_key (string)

Context key containing target reference test database.

output_key (string)

Context key to store matched results.

Optional Parameters

test_name_column (string)

Column containing test names in source data. Default: "test_name"

match_threshold (float)

Minimum similarity score for matches (0.0-1.0). Default: 0.85

matching_strategy (string)

Strategy: 'best_match', 'all_above_threshold', 'top_n'. Default: 'best_match'

top_n (integer)

Number of top matches to return when matching_strategy is 'top_n'. Default: 3

use_abbreviations (boolean)

Enable abbreviation expansion (e.g., 'Hgb' → 'Hemoglobin'). Default: true

use_synonyms (boolean)

Enable synonym matching from clinical dictionaries. Default: true

normalize_units (boolean)

Standardize measurement units before matching. Default: true
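
To illustrate how the three matching_strategy values differ, the following is a minimal sketch that selects from candidates that have already been scored. The helper name select_matches and the (name, score) tuple shape are hypothetical and not part of this action's API. The sketch also assumes that match_threshold is inclusive and applies under every strategy, which the parameter descriptions do not state explicitly.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (standard_name, similarity score in 0.0-1.0)


def select_matches(
    candidates: List[Candidate],
    matching_strategy: str = "best_match",
    match_threshold: float = 0.85,
    top_n: int = 3,
) -> List[Candidate]:
    """Hypothetical helper: filter scored candidates by strategy."""
    # Assumption: the threshold is inclusive and applies to all strategies.
    kept = sorted(
        (c for c in candidates if c[1] >= match_threshold),
        key=lambda c: c[1],
        reverse=True,
    )
    if matching_strategy == "best_match":
        return kept[:1]
    if matching_strategy == "all_above_threshold":
        return kept
    if matching_strategy == "top_n":
        return kept[:top_n]
    raise ValueError(f"Unknown matching_strategy: {matching_strategy!r}")


scored = [("Hemoglobin", 0.95), ("Hemoglobin A1c", 0.88), ("Hematocrit", 0.60)]
best = select_matches(scored)
top_two = select_matches(scored, "top_n", match_threshold=0.5, top_n=2)
```

With the defaults, only the single best candidate above 0.85 is kept; lowering the threshold and switching to 'top_n' returns a ranked shortlist for manual review.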

Example Usage

Basic Fuzzy Matching

- name: match_chemistry_tests
  action:
    type: CHEMISTRY_FUZZY_TEST_MATCH
    params:
      input_key: "lab_tests"
      target_key: "reference_tests"
      output_key: "matched_tests"
      match_threshold: 0.85

Advanced Configuration

- name: comprehensive_matching
  action:
    type: CHEMISTRY_FUZZY_TEST_MATCH
    params:
      input_key: "clinical_chemistry"
      target_key: "loinc_database"
      output_key: "matched_chemistry"
      test_name_column: "assay_name"
      match_threshold: 0.80
      matching_strategy: "top_n"
      top_n: 5
      use_abbreviations: true
      use_synonyms: true
      normalize_units: true

Input Format

Source Test Data

[
    {
        "test_name": "Glucose, Serum",
        "value": "95",
        "units": "mg/dL",
        "reference_range": "70-100"
    },
    {
        "test_name": "Hgb",  # Abbreviation
        "value": "14.5",
        "units": "g/dl",
        "reference_range": "13.5-17.5"
    }
]
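
The second record above spells its units "g/dl", while the reference database uses "g/dL". As a sketch of what normalize_units might do for such spelling differences, the snippet below maps case and whitespace variants to one canonical form. The CANONICAL_UNITS table and the normalize_unit helper are hypothetical; the sketch only fixes spelling and does not convert between units.

```python
# Hypothetical canonical spellings, keyed by lowercase unit string.
CANONICAL_UNITS = {
    "mg/dl": "mg/dL",
    "g/dl": "g/dL",
    "mmol/l": "mmol/L",
}


def normalize_unit(unit: str) -> str:
    """Return a canonical spelling for a unit, or the cleaned input if unknown."""
    cleaned = unit.strip().replace(" ", "")
    return CANONICAL_UNITS.get(cleaned.lower(), cleaned)
```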

Target Reference Database

[
    {
        "standard_name": "Glucose in Serum or Plasma",
        "loinc_code": "2345-7",
        "units": "mg/dL",
        "synonyms": ["Blood Glucose", "Serum Glucose"]
    },
    {
        "standard_name": "Hemoglobin",
        "loinc_code": "718-7",
        "units": "g/dL",
        "abbreviations": ["Hgb", "Hb"]
    }
]
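
One way to make the "synonyms" and "abbreviations" fields searchable is to flatten them into a single case-insensitive lookup. The sketch below assumes the record shape shown above; the build_alias_index helper is hypothetical and is not described by this action's documentation.

```python
from typing import Dict, List


def build_alias_index(reference: List[dict]) -> Dict[str, str]:
    """Map every lowercase name, synonym, and abbreviation to its standard name."""
    index: Dict[str, str] = {}
    for entry in reference:
        standard = entry["standard_name"]
        aliases = [standard]
        aliases += entry.get("synonyms", [])
        aliases += entry.get("abbreviations", [])
        for alias in aliases:
            # setdefault keeps the first owner if two entries share an alias.
            index.setdefault(alias.strip().lower(), standard)
    return index


reference_tests = [
    {
        "standard_name": "Glucose in Serum or Plasma",
        "loinc_code": "2345-7",
        "units": "mg/dL",
        "synonyms": ["Blood Glucose", "Serum Glucose"],
    },
    {
        "standard_name": "Hemoglobin",
        "loinc_code": "718-7",
        "units": "g/dL",
        "abbreviations": ["Hgb", "Hb"],
    },
]
alias_index = build_alias_index(reference_tests)
```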

Output Format

Matched Results

{
    "datasets": {
        "matched_tests": [
            {
                # Original fields
                "test_name": "Glucose, Serum",
                "value": "95",
                "units": "mg/dL",

                # Match metadata
                "matched_name": "Glucose in Serum or Plasma",
                "loinc_code": "2345-7",
                "match_score": 0.92,
                "match_method": "fuzzy_token",
                "match_confidence": "high",

                # Normalized values
                "normalized_units": "mg/dL",
                "standardized_value": 95.0
            }
        ]
    }
}

Matching Statistics

{
    "statistics": {
        "fuzzy_matching": {
            "total_tests": 150,
            "matched": 142,
            "unmatched": 8,
            "match_rate": 0.947,
            "confidence_distribution": {
                "high": 120,
                "medium": 22,
                "low": 0
            },
            "method_usage": {
                "exact": 45,
                "abbreviation": 28,
                "synonym": 15,
                "fuzzy_token": 54
            }
        }
    }
}
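
The counts above are internally consistent: 142 / 150 rounds to 0.947, and both the confidence and method breakdowns sum to 142. A sketch of how such a summary could be derived from per-test results follows. The summarize helper and the convention of using None for an unmatched test are assumptions for illustration.

```python
from collections import Counter
from typing import List, Optional


def summarize(results: List[Optional[dict]]) -> dict:
    """results holds one entry per source test; None marks an unmatched test."""
    matched = [r for r in results if r is not None]
    total = len(results)
    return {
        "total_tests": total,
        "matched": len(matched),
        "unmatched": total - len(matched),
        "match_rate": round(len(matched) / total, 3) if total else 0.0,
        "confidence_distribution": dict(
            Counter(r["match_confidence"] for r in matched)
        ),
        "method_usage": dict(Counter(r["match_method"] for r in matched)),
    }


sample_results = [
    {"match_confidence": "high", "match_method": "exact"},
    {"match_confidence": "high", "match_method": "abbreviation"},
    {"match_confidence": "medium", "match_method": "fuzzy_token"},
    None,  # unmatched test
]
stats = summarize(sample_results)
```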

Matching Algorithms

Matching Methods (in order)

  1. Exact Match: Direct string comparison

  2. Abbreviation Expansion: Hgb → Hemoglobin

  3. Synonym Matching: Uses clinical dictionaries

  4. Token-Based Fuzzy: Compares word tokens

  5. Levenshtein Distance: Character-level similarity

  6. Jaro-Winkler: Character-level similarity that gives extra weight to a shared prefix; well suited to short strings
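
The ordering above can be sketched as a cascade that returns the first method to produce a match. This is an illustration under stated assumptions, not the action's implementation: the match_name helper is hypothetical, abbreviation and synonym hits are assumed to score 1.0, token similarity is approximated with a Jaccard overlap (which will not reproduce the scores shown in Output Format), the "levenshtein" method label is invented, and Jaro-Winkler is omitted for brevity.

```python
import re
from typing import Dict, List, Optional, Tuple


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of word tokens (a stand-in for token-based fuzzy)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_name(
    name: str,
    standard_names: List[str],
    abbreviations: Dict[str, str],
    synonyms: Dict[str, str],
    threshold: float = 0.85,
) -> Optional[Tuple[str, float, str]]:
    """Return (standard_name, score, method) from the first method that hits."""
    key = name.strip().lower()
    by_lower = {s.lower(): s for s in standard_names}
    if key in by_lower:
        return by_lower[key], 1.0, "exact"
    if key in abbreviations:
        return abbreviations[key], 1.0, "abbreviation"
    if key in synonyms:
        return synonyms[key], 1.0, "synonym"
    fuzzy_methods = (
        ("fuzzy_token", token_similarity),
        ("levenshtein", levenshtein_similarity),
    )
    for method, scorer in fuzzy_methods:
        scored = [(scorer(key, s.lower()), s) for s in standard_names]
        if scored:
            score, best = max(scored)
            if score >= threshold:
                return best, score, method
    return None


standards = ["Glucose in Serum or Plasma", "Hemoglobin"]
abbr = {"hgb": "Hemoglobin", "hb": "Hemoglobin"}
syn = {"serum glucose": "Glucose in Serum or Plasma"}
by_abbreviation = match_name("Hgb", standards, abbr, syn)
by_edit_distance = match_name("Haemoglobin", standards, abbr, syn)
```

Here "Hgb" resolves at the abbreviation step, while the variant spelling "Haemoglobin" falls through to the edit-distance step, where it is one insertion away from "Hemoglobin".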

Confidence Scoring

  • High (>0.90): Exact or near-exact matches

  • Medium (0.80-0.90): Good fuzzy matches

  • Low (<0.80): Weak matches; these appear only when match_threshold is set below 0.80
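
The bands above translate directly into a small mapping function. The sketch below is one reading of the boundaries, with a score of exactly 0.90 treated as medium; the confidence_label name is hypothetical.

```python
def confidence_label(score: float) -> str:
    """Map a similarity score to the confidence bands listed above."""
    if score > 0.90:
        return "high"
    if score >= 0.80:
        return "medium"
    return "low"
```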

Best Practices

  1. Start with higher thresholds (0.85+) and adjust based on results

  2. Review unmatched tests to identify missing synonyms

  3. Use top_n strategy for manual validation workflows

  4. Enable all normalization options for heterogeneous data

  5. Validate LOINC codes when available

Performance Notes

  • Optimized for datasets with <10,000 tests

  • Uses indexed search for large reference databases

  • Caches abbreviation and synonym lookups

  • Parallel processing for batch matching
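
As an illustration of the lookup caching mentioned above, repeated test names can be memoized so that each distinct name is resolved only once. This sketch uses functools.lru_cache from the Python standard library; the ABBREVIATIONS table and the expand_abbreviation helper are hypothetical, and the action's actual caching mechanism is not documented here.

```python
from functools import lru_cache

# Illustrative table; a real lookup might read a dictionary file or service.
ABBREVIATIONS = {"hgb": "Hemoglobin", "hb": "Hemoglobin"}


@lru_cache(maxsize=4096)
def expand_abbreviation(name: str) -> str:
    """Return the expanded name, or the input unchanged if it is not known."""
    return ABBREVIATIONS.get(name.strip().lower(), name)
```

Because lab exports tend to repeat the same few hundred test names, a bounded cache like this avoids redoing the lookup for every row.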

Integration Example

name: clinical_chemistry_pipeline
description: Map clinical chemistry tests to standards

steps:
  - name: load_lab_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/lab_results.csv"
        identifier_column: "patient_id"
        output_key: "lab_data"

  - name: extract_loinc
    action:
      type: CHEMISTRY_EXTRACT_LOINC
      params:
        input_key: "lab_data"
        output_key: "loinc_extracted"

  - name: fuzzy_match
    action:
      type: CHEMISTRY_FUZZY_TEST_MATCH
      params:
        input_key: "loinc_extracted"
        target_key: "loinc_reference"  # must already exist in the context; its load step is not shown
        output_key: "matched_tests"
        match_threshold: 0.85

  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: "matched_tests"
        output_file: "/results/matched_chemistry.xlsx"
        format: "excel"

See Also

  • chemistry_extract_loinc - Extract LOINC codes

  • chemistry_vendor_harmonization - Harmonize vendor-specific tests

  • calculate_mapping_quality - Assess match quality