MERGE_DATASETS

Merge multiple datasets with optional deduplication and flexible join strategies.

Purpose

This action combines multiple datasets from the execution context into a single unified dataset. It provides:

Multiple merge strategies (concatenation and join)
Flexible deduplication options
Support for different join types
Comprehensive error handling and validation
Detailed provenance tracking

Parameters

Required Parameters

dataset_keys (list of strings): Keys of the datasets to merge from the execution context. For backward compatibility, the action also accepts input_key and dataset2_key for two-dataset merges.
output_key (string): Key name to store the merged dataset in the execution context.

Optional Parameters

deduplication_column (string): Column name to use for deduplication. If not specified, no deduplication is performed. Default: None
keep (string): Which duplicate to keep when deduplicating: 'first', 'last', or 'all'. Default: 'first'
merge_strategy (string): How to merge datasets: 'concat' (stack rows) or 'join' (merge on a common column). Default: 'concat'
join_on (string): Column name to join on when using the 'join' strategy and every dataset uses the same column name. Alternative to join_columns. Default: None
join_columns (dict): Map of dataset_key to column name for joins when datasets have different column names. Example: {"dataset1": "id", "dataset2": "identifier"}. Default: None
join_how (string): Type of join to perform: 'inner', 'outer', 'left', 'right'. Default: 'outer'
handle_one_to_many (string): How to handle one-to-many relationships: 'keep_all', 'first', 'aggregate'. Default: 'keep_all'
aggregate_func (string): Aggregation function when handle_one_to_many='aggregate' (e.g., 'mean', 'sum', 'first'). Default: None
add_provenance (boolean): Whether to add a provenance column tracking the source dataset for each row. Default: false
provenance_value (string): Custom value for the provenance column when add_provenance=true. Default: None (uses dataset key as value)

Example Usage

Basic Dataset Concatenation

- name: merge_protein_datasets
  action:
    type: MERGE_DATASETS
    params:
      dataset_keys: ["ukbb_proteins", "arv_proteins", "kg2c_proteins"]
      output_key: "all_proteins"
      merge_strategy: "concat"
      deduplication_column: "uniprot_id"
      keep: "first"
      add_provenance: true

Join-Based Merging

- name: merge_with_metadata
  action:
    type: MERGE_DATASETS
    params:
      dataset_keys: ["protein_data", "protein_annotations"]
      output_key: "annotated_proteins"
      merge_strategy: "join"
      join_columns: {
        "protein_data": "uniprot_id",
        "protein_annotations": "protein_id"
      }
      join_how: "left"

Advanced Deduplication

- name: combine_metabolite_results
  action:
    type: MERGE_DATASETS
    params:
      dataset_keys: ["cts_matches", "hmdb_matches", "manual_matches"]
      output_key: "unified_metabolites"
      deduplication_column: "hmdb_id"
      keep: "last"  # Keep the last occurrence in merge order

Output Format

The action stores the merged dataset in the context under the specified output_key:

# Context after execution
{
    "datasets": {
        "all_proteins": [
            {
                "uniprot_id": "P12345",
                "gene_name": "EXAMPLE1",
                "source": "ukbb_proteins"
            },
            {
                "uniprot_id": "Q67890",
                "gene_name": "EXAMPLE2",
                "source": "arv_proteins"
            }
            # ... merged rows from all datasets
        ]
    }
}

Merge Strategies

Concatenation (concat): Stacks datasets vertically, preserving all columns from all datasets. Columns missing from a dataset are filled with NaN. Supports provenance tracking to identify the source dataset for each row.
Join (join): Merges datasets horizontally on a common column. Supports a different column name per dataset via join_columns. Conflicting column names are resolved with suffixes (_x, _y).
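
The two strategies correspond roughly to the pandas operations below. This is a minimal sketch under stated assumptions, not the action's implementation: the sample frames, the "source" column name, and the direct calls to pd.concat and pd.merge are chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample datasets, keyed the way they would be in the context.
datasets = {
    "protein_data": pd.DataFrame(
        {"uniprot_id": ["P12345", "Q67890"], "score": [0.9, 0.7]}
    ),
    "protein_annotations": pd.DataFrame(
        {"protein_id": ["P12345", "A00001"], "score": [1.0, 2.0]}
    ),
}

# concat: stack rows; columns missing from a dataset are filled with NaN.
# The "source" column name is an assumption standing in for the provenance column.
frames = [df.assign(source=key) for key, df in datasets.items()]
stacked = pd.concat(frames, ignore_index=True)

# join: merge on a differently named column per dataset (as with join_columns),
# using a left join. The shared "score" column receives the _x/_y suffixes.
joined = pd.merge(
    datasets["protein_data"],
    datasets["protein_annotations"],
    left_on="uniprot_id",
    right_on="protein_id",
    how="left",
)

print(stacked)
print(joined)
```

In the stacked frame, rows from protein_annotations have NaN in uniprot_id, and rows from protein_data have NaN in protein_id. In the joined frame, the unmatched left row has NaN in the columns that come from the right dataset.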

Deduplication Options

keep='first': Keeps the first occurrence of each duplicate identifier.
keep='last': Keeps the last occurrence of each duplicate identifier.
keep='all': Keeps all rows (no deduplication performed).
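
The effect of each keep value can be sketched with pandas drop_duplicates. The helper name deduplicate and the sample identifiers are hypothetical, and the action may implement deduplication differently.

```python
import pandas as pd

# Hypothetical rows after concatenating three match sources in order.
merged = pd.DataFrame(
    {
        "hmdb_id": ["HMDB_A", "HMDB_B", "HMDB_A"],
        "source": ["cts_matches", "hmdb_matches", "manual_matches"],
    }
)


def deduplicate(df: pd.DataFrame, column: str, keep: str = "first") -> pd.DataFrame:
    """Drop duplicate values in `column`; keep='all' returns the rows unchanged."""
    if keep == "all":
        return df.copy()
    if keep not in ("first", "last"):
        raise ValueError(f"unsupported keep value: {keep!r}")
    return df.drop_duplicates(subset=column, keep=keep).reset_index(drop=True)


first = deduplicate(merged, "hmdb_id", keep="first")
last = deduplicate(merged, "hmdb_id", keep="last")
print(first)
print(last)
```

With keep="first" the HMDB_A row from cts_matches survives; with keep="last" the one from manual_matches survives. Which row counts as first or last depends on the order in which the datasets were stacked.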

One-to-Many Handling

handle_one_to_many='keep_all': Preserves all matching rows in one-to-many relationships.
handle_one_to_many='first': Keeps only the first match in one-to-many relationships.
handle_one_to_many='aggregate': Aggregates multiple matches using the function named in aggregate_func.
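
The three modes can be illustrated with a small pandas sketch. The function join_one_to_many is hypothetical, and reducing the right-hand dataset before the join is an assumption; the action may apply the reduction at a different point.

```python
import pandas as pd

left = pd.DataFrame({"id": ["A", "B"]})
# "A" matches two rows on the right: a one-to-many relationship.
right = pd.DataFrame({"id": ["A", "A", "B"], "value": [1.0, 3.0, 5.0]})


def join_one_to_many(left, right, on, mode="keep_all", aggregate_func=None):
    """Left-join `right` onto `left`, reducing repeated keys according to `mode`."""
    if mode == "first":
        # Keep only the first right-hand row per key.
        right = right.drop_duplicates(subset=on, keep="first")
    elif mode == "aggregate":
        if aggregate_func is None:
            raise ValueError("aggregate_func is required when mode='aggregate'")
        # Collapse repeated keys into one row per key.
        right = right.groupby(on, as_index=False).agg(aggregate_func)
    elif mode != "keep_all":
        raise ValueError(f"unsupported mode: {mode!r}")
    return left.merge(right, on=on, how="left")


keep_all = join_one_to_many(left, right, "id")
first_only = join_one_to_many(left, right, "id", mode="first")
averaged = join_one_to_many(left, right, "id", mode="aggregate", aggregate_func="mean")
print(keep_all, first_only, averaged, sep="\n\n")
```

Here keep_all yields three rows (two for "A"), first_only yields one row per id, and averaged replaces the two "A" values with their mean.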

Error Handling

Missing datasets

Warning: Dataset 'missing_data' not found in context

Solution: Verify dataset keys exist in context from previous actions.

Join column missing

Error: join_columns or join_on required when merge_strategy='join'

Solution: Specify either join_on for uniform columns or join_columns for different column names per dataset.

Empty datasets

Warning: Dataset at index 1 is empty or invalid type

Solution: Ensure datasets contain valid data before merging.
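
The conditions above can be checked before a merge is attempted. The following is a sketch only: check_merge_inputs is a hypothetical helper, and the context layout (a "datasets" mapping of key to list of row dicts) is taken from the Output Format example rather than from the implementation.

```python
def check_merge_inputs(context, dataset_keys, merge_strategy="concat",
                       join_on=None, join_columns=None):
    """Return (usable_keys, warnings); raise ValueError for an unusable join config."""
    if merge_strategy == "join" and not (join_on or join_columns):
        raise ValueError("join_columns or join_on required when merge_strategy='join'")

    datasets = context.get("datasets", {})
    usable, warnings = [], []
    for index, key in enumerate(dataset_keys):
        if key not in datasets:
            warnings.append(f"Dataset '{key}' not found in context")
        elif not isinstance(datasets[key], list) or not datasets[key]:
            warnings.append(f"Dataset at index {index} is empty or invalid type")
        else:
            usable.append(key)
    return usable, warnings


# Hypothetical context: one valid dataset, one empty, one missing.
context = {"datasets": {"a": [{"id": 1}], "b": []}}
usable, warnings = check_merge_inputs(context, ["a", "b", "missing_data"])
print(usable)
print(warnings)
```

Collecting warnings instead of raising mirrors the messages shown above, where missing or empty datasets produce warnings and only a misconfigured join is an error.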

Best Practices

Use descriptive output keys such as "merged_proteins" instead of "result".
Choose the merge strategy to fit the task: concat for combining datasets with similar columns, join for adding metadata.
Consider deduplication carefully: keep='first' retains the first occurrence in merge order, which often preserves the original data, so the order of dataset_keys affects which row survives.
Validate that the join columns exist in every dataset before using the join strategy.
Handle missing datasets gracefully by checking that each dataset key is present in the context.

Performance Notes

Large datasets (>100K rows) are processed efficiently using pandas
Memory usage scales with combined dataset size
Join operations may be slower than concatenation for large datasets
Uses UniversalContext for robust context handling across different execution environments
Supports both legacy parameter formats and new standardized formats for backward compatibility

Common Use Cases

Combining Multi-Source Data: Merge datasets from different platforms (UK Biobank, ArraySeq, etc.)
Adding Annotations: Join experimental data with reference annotations or metadata
Result Consolidation: Combine results from multiple matching algorithms with deduplication
Quality Control: Merge datasets while removing duplicates to ensure data integrity

Integration

This action typically follows data loading actions and precedes analysis:

steps:
  # 1. Load datasets
  - name: load_ukbb
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/ukbb_proteins.csv"
        identifier_column: "UniProt"
        output_key: "ukbb_data"

  - name: load_arv
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/arv_proteins.csv"
        identifier_column: "UniProt"
        output_key: "arv_data"

  # 2. Merge datasets
  - name: merge_all
    action:
      type: MERGE_DATASETS
      params:
        dataset_keys: ["ukbb_data", "arv_data"]
        output_key: "combined_proteins"
        deduplication_column: "UniProt"
        keep: "first"

  # 3. Continue with analysis
  - name: analyze_overlap
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        input_key: "combined_proteins"
        source_name: "UKBB"
        target_name: "ARV"
        mapping_combo_id: "UKBB_ARV"
        output_key: "overlap_stats"

Backward Compatibility

The action supports legacy parameter formats for seamless migration:

Legacy Two-Dataset Format

params:
  input_key: "dataset1"         # Mapped to dataset_keys[0]
  dataset2_key: "dataset2"      # Mapped to dataset_keys[1]
  join_column1: "id"            # Mapped to join_columns
  join_column2: "identifier"
  join_type: "outer"            # Mapped to join_how

# Alternative legacy format:
params:
  dataset1_key: "dataset1"      # Alias for input_key
  dataset2_key: "dataset2"
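
The mappings noted in the comments above can be sketched as a small normalization step. The function normalize_params is hypothetical and only illustrates the documented mapping; the actual translation in the action may differ.

```python
def normalize_params(params: dict) -> dict:
    """Translate legacy two-dataset parameters into the multi-dataset form."""
    normalized = dict(params)

    # input_key (or its alias dataset1_key) and dataset2_key become dataset_keys.
    first_key = normalized.pop("input_key", None)
    alias = normalized.pop("dataset1_key", None)
    first_key = first_key or alias
    second_key = normalized.pop("dataset2_key", None)
    if "dataset_keys" not in normalized and first_key and second_key:
        normalized["dataset_keys"] = [first_key, second_key]

    # join_column1 / join_column2 become a join_columns map keyed by dataset.
    col1 = normalized.pop("join_column1", None)
    col2 = normalized.pop("join_column2", None)
    keys = normalized.get("dataset_keys", [])
    if "join_columns" not in normalized and col1 and col2 and len(keys) == 2:
        normalized["join_columns"] = {keys[0]: col1, keys[1]: col2}

    # join_type becomes join_how unless join_how is already set.
    if "join_type" in normalized:
        normalized.setdefault("join_how", normalized.pop("join_type"))
    return normalized


legacy = {
    "input_key": "dataset1",
    "dataset2_key": "dataset2",
    "join_column1": "id",
    "join_column2": "identifier",
    "join_type": "outer",
}
modern = normalize_params(legacy)
print(modern)
```

In this sketch, explicitly supplied modern parameters take precedence over their legacy counterparts, which is a design assumption rather than documented behavior.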

Modern Multi-Dataset Format

params:
  dataset_keys: ["dataset1", "dataset2", "dataset3"]
  join_columns: {
    "dataset1": "id",
    "dataset2": "identifier",
    "dataset3": "uid"
  }
  join_how: "outer"

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

/biomapper/src/actions/merge_datasets.py (actual implementation with flexible parameter format support)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class and StandardActionResult)
/biomapper/src/actions/registry.py (self-registration via @register_action decorator)
/biomapper/src/core/standards/context_handler.py (UniversalContext for unified context access)
/biomapper/CLAUDE.md (2025 standardizations and parameter naming conventions)