MERGE_DATASETS

Merge multiple datasets with optional deduplication and flexible join strategies.

Purpose

This action combines multiple datasets from the execution context into a single unified dataset. It provides:

  • Multiple merge strategies (concatenation and join)

  • Flexible deduplication options

  • Support for different join types

  • Comprehensive error handling and validation

  • Detailed provenance tracking

Parameters

Required Parameters

dataset_keys (list of strings)

List of dataset keys to merge from the execution context. For backward compatibility, a two-dataset merge may instead pass input_key (or its alias dataset1_key) and dataset2_key.

output_key (string)

Key name to store the merged dataset in the execution context.

Optional Parameters

deduplication_column (string)

Column name to use for deduplication. If not specified, no deduplication is performed. Default: None

keep (string)

Which duplicate to keep when deduplicating: ‘first’, ‘last’, or ‘all’. Default: ‘first’

merge_strategy (string)

How to merge datasets: ‘concat’ (stack rows) or ‘join’ (merge on common column). Default: ‘concat’

join_on (string)

Column name to join on when every dataset uses the same name for its join column. Applies to the ‘join’ strategy and is an alternative to join_columns. Default: None

join_columns (dict)

Map of dataset_key to column name for joins when datasets have different column names. Example: {"dataset1": "id", "dataset2": "identifier"}. Default: None

join_how (string)

Type of join to perform: ‘inner’, ‘outer’, ‘left’, ‘right’. Default: ‘outer’

handle_one_to_many (string)

How to handle one-to-many relationships: ‘keep_all’, ‘first’, ‘aggregate’. Default: ‘keep_all’

aggregate_func (string)

Aggregation function when handle_one_to_many=‘aggregate’ (e.g., ‘mean’, ‘sum’, ‘first’). Default: None

add_provenance (boolean)

Whether to add a provenance column tracking the source dataset for each row. Default: false

provenance_value (string)

Custom value for the provenance column when add_provenance=true. Default: None (uses dataset key as value)

Example Usage

Basic Dataset Concatenation

- name: merge_protein_datasets
  action:
    type: MERGE_DATASETS
    params:
      dataset_keys: ["ukbb_proteins", "arv_proteins", "kg2c_proteins"]
      output_key: "all_proteins"
      merge_strategy: "concat"
      deduplication_column: "uniprot_id"
      keep: "first"
      add_provenance: true

Join-Based Merging

- name: merge_with_metadata
  action:
    type: MERGE_DATASETS
    params:
      dataset_keys: ["protein_data", "protein_annotations"]
      output_key: "annotated_proteins"
      merge_strategy: "join"
      join_columns: {
        "protein_data": "uniprot_id",
        "protein_annotations": "protein_id"
      }
      join_how: "left"

Advanced Deduplication

- name: combine_metabolite_results
  action:
    type: MERGE_DATASETS
    params:
      dataset_keys: ["cts_matches", "hmdb_matches", "manual_matches"]
      output_key: "unified_metabolites"
      deduplication_column: "hmdb_id"
      keep: "last"  # Keep most recent match

Output Format

The action stores the merged dataset in the context under the specified output_key:

# Context after execution
{
    "datasets": {
        "all_proteins": [
            {
                "uniprot_id": "P12345",
                "gene_name": "EXAMPLE1",
                "source": "ukbb_proteins"
            },
            {
                "uniprot_id": "Q67890",
                "gene_name": "EXAMPLE2",
                "source": "arv_proteins"
            }
            # ... merged rows from all datasets
        ]
    }
}

Merge Strategies

Concatenation (concat)

Stacks datasets vertically, preserving all columns from all datasets. Columns missing from a dataset are filled with NaN. With add_provenance enabled, each row is tagged with its source dataset.
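
As a rough illustration of the concat behaviour described above, the following plain-Python sketch stacks lists of row dictionaries, fills absent columns with NaN, and optionally tags each row with its dataset key. It is not the action's implementation, which the Performance Notes describe as pandas-based. The function name concat_datasets is hypothetical, and the provenance column name "source" is an assumption borrowed from the Output Format example.

```python
def concat_datasets(datasets, add_provenance=False, provenance_column="source"):
    """Stack row dicts from several datasets (hypothetical helper, not the action's API).

    datasets maps a dataset key to a list of row dicts.
    """
    # Union of all column names, in first-seen order.
    columns = []
    for rows in datasets.values():
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)

    merged = []
    for key, rows in datasets.items():
        for row in rows:
            # Columns a row does not have are filled with NaN.
            out = {col: row.get(col, float("nan")) for col in columns}
            if add_provenance:
                out[provenance_column] = key  # assumed column name
            merged.append(out)
    return merged


# Illustrative rows; the "assay" column is made up for this example.
ukbb = [{"uniprot_id": "P12345", "gene_name": "EXAMPLE1"}]
arv = [{"uniprot_id": "Q67890", "assay": "panel_a"}]
merged = concat_datasets(
    {"ukbb_proteins": ukbb, "arv_proteins": arv}, add_provenance=True
)
```

Here merged holds two rows: the first has NaN for assay, the second has NaN for gene_name, and both carry the key of the dataset they came from.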

Join (join)

Merges datasets horizontally based on common columns. Supports different column names per dataset via join_columns. Column conflicts are resolved with suffixes (_x, _y).
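
The join behaviour can be pictured with a small plain-Python sketch for two datasets whose key columns have different names. This is only an approximation under stated assumptions: the helper join_two is hypothetical, None stands in for missing values, and row ordering for right and outer joins may differ from what the pandas-based action produces. The "score" column is invented to show the _x/_y suffixes.

```python
def join_two(left, right, left_on, right_on, how="outer"):
    """Join two lists of row dicts on (possibly differently named) key columns."""
    left_cols = list(dict.fromkeys(c for row in left for c in row))
    right_cols = list(dict.fromkeys(c for row in right for c in row))
    overlap = set(left_cols) & set(right_cols)
    shared_key = left_on == right_on
    if shared_key:
        overlap.discard(left_on)  # a shared key column is kept once, unsuffixed

    def combine(l_row, r_row):
        out = {}
        for col in left_cols:
            name = col + "_x" if col in overlap else col
            out[name] = l_row.get(col) if l_row is not None else None
        for col in right_cols:
            if shared_key and col == right_on:
                if l_row is None:
                    out[col] = r_row[col]  # key comes from the right row
                continue
            name = col + "_y" if col in overlap else col
            out[name] = r_row.get(col) if r_row is not None else None
        return out

    # Index right rows by key so one-to-many matches are all found.
    right_index = {}
    for row in right:
        right_index.setdefault(row[right_on], []).append(row)

    merged, matched = [], set()
    for l_row in left:
        matches = right_index.get(l_row[left_on], [])
        if matches:
            matched.add(l_row[left_on])
            merged.extend(combine(l_row, r_row) for r_row in matches)
        elif how in ("left", "outer"):
            merged.append(combine(l_row, None))
    if how in ("right", "outer"):
        for key, rows in right_index.items():
            if key not in matched:
                merged.extend(combine(None, r_row) for r_row in rows)
    return merged


protein_data = [
    {"uniprot_id": "P12345", "score": 0.9},
    {"uniprot_id": "Q67890", "score": 0.4},
]
protein_annotations = [
    {"protein_id": "P12345", "gene_name": "EXAMPLE1", "score": 1.0},
]
annotated = join_two(
    protein_data, protein_annotations, "uniprot_id", "protein_id", how="left"
)
```

With how="left", every row of protein_data survives; the unmatched Q67890 row gets None for the annotation columns, and the clashing score column appears as score_x and score_y.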

Deduplication Options

keep=‘first’

Keeps the first occurrence of each duplicate identifier.

keep=‘last’

Keeps the last occurrence of each duplicate identifier.

keep=‘all’

Keeps all rows (no deduplication performed).
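
The three keep options can be sketched in plain Python as follows. The helper deduplicate is hypothetical and only mirrors the semantics listed above; it assumes, as pandas-style deduplication does, that surviving rows stay in their original order.

```python
def deduplicate(rows, column, keep="first"):
    """Drop rows with a repeated value in `column` (hypothetical helper)."""
    if keep == "all":
        return list(rows)  # no deduplication performed
    if keep not in ("first", "last"):
        raise ValueError(f"Unsupported keep value: {keep!r}")

    chosen = {}  # identifier -> index of the row to keep
    for index, row in enumerate(rows):
        identifier = row.get(column)
        # 'first' records only the first index; 'last' keeps overwriting.
        if keep == "last" or identifier not in chosen:
            chosen[identifier] = index

    kept = set(chosen.values())
    return [row for index, row in enumerate(rows) if index in kept]


matches = [
    {"hmdb_id": "HMDB0000001", "source": "cts_matches"},
    {"hmdb_id": "HMDB0000002", "source": "hmdb_matches"},
    {"hmdb_id": "HMDB0000001", "source": "manual_matches"},
]
first_kept = deduplicate(matches, "hmdb_id", keep="first")
last_kept = deduplicate(matches, "hmdb_id", keep="last")
all_kept = deduplicate(matches, "hmdb_id", keep="all")
```

In this example keep="first" retains the cts_matches row for HMDB0000001, keep="last" retains the manual_matches row, and keep="all" returns all three rows.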

One-to-Many Handling

handle_one_to_many=‘keep_all’

Preserves all matching rows in one-to-many relationships.

handle_one_to_many=‘first’

Keeps only the first match in one-to-many relationships.

handle_one_to_many=‘aggregate’

Aggregates multiple matches using the function named in aggregate_func.
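
One way to picture the three options is the sketch below, which post-processes joined rows grouped by their key. How the action actually groups and aggregates is not specified here, so everything in this block is an assumption: the helper resolve_one_to_many, the idea that every non-key column is aggregated, and the illustrative "abundance" column.

```python
from statistics import mean

# Aggregators named in the aggregate_func parameter description.
AGGREGATORS = {
    "mean": mean,
    "sum": sum,
    "first": lambda values: values[0],
}


def resolve_one_to_many(rows, key, mode="keep_all", aggregate_func=None):
    """Reduce rows that share the same key value (hypothetical helper)."""
    if mode == "keep_all":
        return list(rows)

    groups = {}  # key value -> rows with that key, in input order
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    if mode == "first":
        return [group[0] for group in groups.values()]

    if mode == "aggregate":
        func = AGGREGATORS[aggregate_func]
        result = []
        for group in groups.values():
            out = {key: group[0][key]}
            for col in group[0]:
                if col != key:
                    # Assumption: every non-key column is aggregated.
                    out[col] = func([r[col] for r in group])
            result.append(out)
        return result

    raise ValueError(f"Unsupported mode: {mode!r}")


joined = [
    {"uniprot_id": "P12345", "abundance": 0.5},
    {"uniprot_id": "P12345", "abundance": 1.0},
    {"uniprot_id": "Q67890", "abundance": 2.0},
]
kept_all = resolve_one_to_many(joined, "uniprot_id", mode="keep_all")
kept_first = resolve_one_to_many(joined, "uniprot_id", mode="first")
averaged = resolve_one_to_many(
    joined, "uniprot_id", mode="aggregate", aggregate_func="mean"
)
```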

Error Handling

Missing datasets
Warning: Dataset 'missing_data' not found in context

Solution: Verify dataset keys exist in context from previous actions.

Join column missing
Error: join_columns or join_on required when merge_strategy='join'

Solution: Specify either join_on for uniform columns or join_columns for different column names per dataset.

Empty datasets
Warning: Dataset at index 1 is empty or invalid type

Solution: Ensure datasets contain valid data before merging.
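
The checks behind these messages might look roughly like the sketch below. It is a guess at the validation order, not the action's code: the helper preflight_warnings is hypothetical, the "datasets" key follows the Output Format example, and treating a non-list or empty list as "empty or invalid type" is an assumption.

```python
def preflight_warnings(context, params):
    """Validate merge parameters and return warning messages (hypothetical helper)."""
    strategy = params.get("merge_strategy", "concat")
    if strategy == "join" and not (params.get("join_on") or params.get("join_columns")):
        # Documented as an error rather than a warning.
        raise ValueError(
            "join_columns or join_on required when merge_strategy='join'"
        )

    datasets = context.get("datasets", {})
    warnings = []
    for index, key in enumerate(params["dataset_keys"]):
        if key not in datasets:
            warnings.append(f"Dataset '{key}' not found in context")
        elif not isinstance(datasets[key], list) or not datasets[key]:
            warnings.append(f"Dataset at index {index} is empty or invalid type")
    return warnings


context = {"datasets": {"ukbb_data": [{"UniProt": "P12345"}], "arv_data": []}}
found = preflight_warnings(
    context, {"dataset_keys": ["ukbb_data", "arv_data", "missing_data"]}
)
```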

Best Practices

  1. Use descriptive output keys such as “merged_proteins” rather than “result”.

  2. Choose the merge strategy to fit the task: concat for combining datasets with similar columns, join for adding metadata.

  3. Choose the keep option deliberately; keeping the first occurrence often preserves the original data.

  4. Validate that the join columns exist in all datasets before using the join strategy.

  5. Check that every dataset key exists in the context before merging; a missing dataset is reported only as a warning.

Performance Notes

  • Datasets are processed with pandas, which handles large inputs (>100K rows) efficiently

  • Memory usage scales with combined dataset size

  • Join operations may be slower than concatenation for large datasets

  • Uses UniversalContext for robust context handling across different execution environments

  • Supports both legacy parameter formats and new standardized formats for backward compatibility

Common Use Cases

Combining Multi-Source Data

Merge datasets from different platforms (UK Biobank, ArraySeq, etc.)

Adding Annotations

Join experimental data with reference annotations or metadata

Result Consolidation

Combine results from multiple matching algorithms with deduplication

Quality Control

Merge datasets while removing duplicates to ensure data integrity

Integration

This action typically follows data loading actions and precedes analysis:

steps:
  # 1. Load datasets
  - name: load_ukbb
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/ukbb_proteins.csv"
        identifier_column: "UniProt"
        output_key: "ukbb_data"

  - name: load_arv
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/arv_proteins.csv"
        identifier_column: "UniProt"
        output_key: "arv_data"

  # 2. Merge datasets
  - name: merge_all
    action:
      type: MERGE_DATASETS
      params:
        dataset_keys: ["ukbb_data", "arv_data"]
        output_key: "combined_proteins"
        deduplication_column: "UniProt"
        keep: "first"

  # 3. Continue with analysis
  - name: analyze_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        input_key: "combined_proteins"
        source_name: "UKBB"
        target_name: "ARV"
        mapping_combo_id: "UKBB_ARV"
        output_key: "overlap_stats"

Backward Compatibility

The action supports legacy parameter formats for seamless migration:

Legacy Two-Dataset Format

params:
  input_key: "dataset1"         # Mapped to dataset_keys[0]
  dataset2_key: "dataset2"      # Mapped to dataset_keys[1]
  join_column1: "id"            # Mapped to join_columns
  join_column2: "identifier"
  join_type: "outer"            # Mapped to join_how

# Alternative legacy format:
params:
  dataset1_key: "dataset1"      # Alias for input_key
  dataset2_key: "dataset2"

Modern Multi-Dataset Format

params:
  dataset_keys: ["dataset1", "dataset2", "dataset3"]
  join_columns: {
    "dataset1": "id",
    "dataset2": "identifier",
    "dataset3": "uid"
  }
  join_how: "outer"
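
The mapping from legacy to modern parameters described in the comments above could be expressed along these lines. This is a sketch of the mapping only; the helper normalize_params is hypothetical, and the precedence rule (modern parameters win when both forms are present) is an assumption.

```python
def normalize_params(params):
    """Translate legacy two-dataset parameters to the modern names (hypothetical helper)."""
    out = dict(params)

    # input_key / dataset1_key and dataset2_key -> dataset_keys
    input_key = out.pop("input_key", None)
    dataset1_key = out.pop("dataset1_key", None)  # alias for input_key
    dataset2_key = out.pop("dataset2_key", None)
    first = input_key or dataset1_key
    if "dataset_keys" not in out and first and dataset2_key:
        out["dataset_keys"] = [first, dataset2_key]

    # join_column1 / join_column2 -> join_columns keyed by dataset
    col1 = out.pop("join_column1", None)
    col2 = out.pop("join_column2", None)
    if "join_columns" not in out and col1 and col2 and "dataset_keys" in out:
        key1, key2 = out["dataset_keys"][:2]
        out["join_columns"] = {key1: col1, key2: col2}

    # join_type -> join_how
    join_type = out.pop("join_type", None)
    if join_type and "join_how" not in out:
        out["join_how"] = join_type

    return out


legacy = {
    "input_key": "dataset1",
    "dataset2_key": "dataset2",
    "join_column1": "id",
    "join_column2": "identifier",
    "join_type": "outer",
}
modern = normalize_params(legacy)
```

For the legacy example above, modern contains dataset_keys, join_columns, and join_how in the same shape as the first two entries of the modern format.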

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

  • /biomapper/src/actions/merge_datasets.py (actual implementation with flexible parameter format support)

  • /biomapper/src/actions/typed_base.py (TypedStrategyAction base class and StandardActionResult)

  • /biomapper/src/actions/registry.py (self-registration via @register_action decorator)

  • /biomapper/src/core/standards/context_handler.py (UniversalContext for unified context access)

  • /biomapper/CLAUDE.md (2025 standardizations and parameter naming conventions)