MERGE_DATASETS
==============

Merge multiple datasets with optional deduplication and flexible join strategies.

Purpose
-------

This action combines multiple datasets from the execution context into a single unified dataset. It provides:

* Multiple merge strategies (concatenation and join)
* Flexible deduplication options
* Support for different join types
* Comprehensive error handling and validation
* Detailed provenance tracking

Parameters
----------

Required Parameters
~~~~~~~~~~~~~~~~~~~

**dataset_keys** (list of strings)
  List of dataset keys to merge from the execution context. For backward compatibility, also supports `input_key` and `dataset2_key` for two-dataset merges.

**output_key** (string)
  Key name to store the merged dataset in the execution context.

Optional Parameters
~~~~~~~~~~~~~~~~~~~

**deduplication_column** (string)
  Column name to use for deduplication. If not specified, no deduplication is performed.
  Default: None

**keep** (string)
  Which duplicate to keep when deduplicating: 'first', 'last', or 'all'.
  Default: 'first'

**merge_strategy** (string)
  How to merge datasets: 'concat' (stack rows) or 'join' (merge on common column).
  Default: 'concat'

**join_on** (string)
  Column name to join on when using 'join' strategy with uniform columns. Alternative to `join_columns`.
  Default: None

**join_columns** (dict)
  Map of dataset_key to column name for joins when datasets have different column names.
  Example: {"dataset1": "id", "dataset2": "identifier"}
  Default: None

**join_how** (string)
  Type of join to perform: 'inner', 'outer', 'left', 'right'.
  Default: 'outer'

**handle_one_to_many** (string)
  How to handle one-to-many relationships: 'keep_all', 'first', 'aggregate'.
  Default: 'keep_all'

**aggregate_func** (string)
  Aggregation function when handle_one_to_many='aggregate' (e.g., 'mean', 'sum', 'first').
  Default: None

**add_provenance** (boolean)
  Whether to add a provenance column tracking the source dataset for each row.
  Default: false

**provenance_value** (string)
  Custom value for the provenance column when add_provenance=true.
  Default: None (uses dataset key as value)

Example Usage
-------------

Basic Dataset Concatenation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: merge_protein_datasets
      action:
        type: MERGE_DATASETS
        params:
          dataset_keys: ["ukbb_proteins", "arv_proteins", "kg2c_proteins"]
          output_key: "all_proteins"
          merge_strategy: "concat"
          deduplication_column: "uniprot_id"
          keep: "first"
          add_provenance: true

Join-Based Merging
~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: merge_with_metadata
      action:
        type: MERGE_DATASETS
        params:
          dataset_keys: ["protein_data", "protein_annotations"]
          output_key: "annotated_proteins"
          merge_strategy: "join"
          join_columns: {
            "protein_data": "uniprot_id",
            "protein_annotations": "protein_id"
          }
          join_how: "left"

Advanced Deduplication
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: combine_metabolite_results
      action:
        type: MERGE_DATASETS
        params:
          dataset_keys: ["cts_matches", "hmdb_matches", "manual_matches"]
          output_key: "unified_metabolites"
          deduplication_column: "hmdb_id"
          keep: "last"  # Keep most recent match

Output Format
-------------

The action stores the merged dataset in the context under the specified ``output_key``:

.. code-block:: python

    # Context after execution
    {
        "datasets": {
            "all_proteins": [
                {
                    "uniprot_id": "P12345",
                    "gene_name": "EXAMPLE1",
                    "source": "ukbb_proteins"
                },
                {
                    "uniprot_id": "Q67890", 
                    "gene_name": "EXAMPLE2",
                    "source": "arv_proteins"
                }
                # ... merged rows from all datasets
            ]
        }
    }

Merge Strategies
----------------

**Concatenation (concat)**
  Stacks datasets vertically, preserving all columns from all datasets. Missing columns are filled with NaN. Supports provenance tracking to identify source dataset for each row.

**Join (join)**
  Merges datasets horizontally based on common columns. Supports different column names per dataset via `join_columns`. Column conflicts are resolved with suffixes (_x, _y).

Deduplication Options
---------------------

**keep='first'**
  Keeps the first occurrence of each duplicate identifier.

**keep='last'**
  Keeps the last occurrence of each duplicate identifier.

**keep='all'**
  Keeps all rows (no deduplication performed).

One-to-Many Handling
--------------------

**handle_one_to_many='keep_all'**
  Preserves all matching rows in one-to-many relationships.

**handle_one_to_many='first'**
  Keeps only the first match in one-to-many relationships.

**handle_one_to_many='aggregate'**
  Aggregates multiple matches using specified aggregation function.

Error Handling
--------------

**Missing datasets**
  .. code-block::
  
      Warning: Dataset 'missing_data' not found in context
      
  Solution: Verify dataset keys exist in context from previous actions.

**Join column missing**
  .. code-block::
  
      Error: join_columns or join_on required when merge_strategy='join'
      
  Solution: Specify either `join_on` for uniform columns or `join_columns` for different column names per dataset.

**Empty datasets**
  .. code-block::
  
      Warning: Dataset at index 1 is empty or invalid type
      
  Solution: Ensure datasets contain valid data before merging.

Best Practices
--------------

1. **Use descriptive output keys** like "merged_proteins" instead of "result"
2. **Choose appropriate merge strategy** - concat for combining similar datasets, join for adding metadata
3. **Consider deduplication carefully** - first occurrence often preserves original data quality
4. **Validate join columns** exist in all datasets before using join strategy
5. **Handle missing datasets gracefully** by checking dataset availability

Performance Notes
-----------------

* Large datasets (>100K rows) are processed efficiently using pandas
* Memory usage scales with combined dataset size
* Join operations may be slower than concatenation for large datasets
* Uses UniversalContext for robust context handling across different execution environments
* Supports both legacy parameter formats and new standardized formats for backward compatibility

Common Use Cases
----------------

**Combining Multi-Source Data**
  Merge datasets from different platforms (UK Biobank, ArraySeq, etc.)

**Adding Annotations**
  Join experimental data with reference annotations or metadata

**Result Consolidation**
  Combine results from multiple matching algorithms with deduplication

**Quality Control**
  Merge datasets while removing duplicates to ensure data integrity

Integration
-----------

This action typically follows data loading actions and precedes analysis:

.. code-block:: yaml

    steps:
      # 1. Load datasets
      - name: load_ukbb
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/ukbb_proteins.csv"
            identifier_column: "UniProt"
            output_key: "ukbb_data"
      
      - name: load_arv
        action:
          type: LOAD_DATASET_IDENTIFIERS
          params:
            file_path: "/data/arv_proteins.csv"
            identifier_column: "UniProt"
            output_key: "arv_data"
      
      # 2. Merge datasets
      - name: merge_all
        action:
          type: MERGE_DATASETS
          params:
            dataset_keys: ["ukbb_data", "arv_data"]
            output_key: "combined_proteins"
            deduplication_column: "UniProt"
            keep: "first"
      
      # 3. Continue with analysis
      - name: analyze_overlap
        action:
          type: CALCULATE_SET_OVERLAP
          params:
            input_key: "combined_proteins"
            source_name: "UKBB"
            target_name: "ARV"
            mapping_combo_id: "UKBB_ARV"
            output_key: "overlap_stats"

Backward Compatibility
----------------------

The action supports legacy parameter formats for seamless migration:

**Legacy Two-Dataset Format**
  .. code-block:: yaml
  
      params:
        input_key: "dataset1"         # Mapped to dataset_keys[0]
        dataset2_key: "dataset2"      # Mapped to dataset_keys[1]
        join_column1: "id"            # Mapped to join_columns
        join_column2: "identifier"
        join_type: "outer"            # Mapped to join_how
        
      # Alternative legacy format:
      params:
        dataset1_key: "dataset1"      # Alias for input_key
        dataset2_key: "dataset2"

**Modern Multi-Dataset Format**
  .. code-block:: yaml
  
      params:
        dataset_keys: ["dataset1", "dataset2", "dataset3"]
        join_columns: {
          "dataset1": "id",
          "dataset2": "identifier",
          "dataset3": "uid"
        }
        join_how: "outer"

---

## Verification Sources
*Last verified: 2025-08-22*

This documentation was verified against the following project resources:

- `/biomapper/src/actions/merge_datasets.py` (actual implementation with flexible parameter format support)
- `/biomapper/src/actions/typed_base.py` (TypedStrategyAction base class and StandardActionResult)
- `/biomapper/src/actions/registry.py` (self-registration via @register_action decorator)
- `/biomapper/src/core/standards/context_handler.py` (UniversalContext for unified context access)
- `/biomapper/CLAUDE.md` (2025 standardizations and parameter naming conventions)