EXPORT_DATASET
==============

Export datasets from the execution context to files in various formats.

Purpose
-------

This action saves processed datasets to files for external use, sharing, or archival. It provides:

* Multiple output formats (TSV, CSV, JSON, Excel)
* Selective column export
* Automatic directory creation
* Integration with output file tracking
* Flexible path specification

Parameters
----------

Required Parameters
~~~~~~~~~~~~~~~~~~~

**input_key** (string)
  Key of the dataset to export from ``context['datasets']``.

**output_path** (string)
  Full file path where the dataset will be saved. Supports absolute and relative paths.

Optional Parameters
~~~~~~~~~~~~~~~~~~~

**format** (string)
  Export format: 'tsv', 'csv', 'json', or 'xlsx'.
  Default: 'tsv'

**columns** (list of strings)
  Specific columns to export. If not specified, all columns are exported.
  Default: None (export all columns)

Supported Formats
-----------------

**TSV (Tab-Separated Values)**
  * Extension: .tsv, .txt
  * Delimiter: Tab character
  * Headers: Included
  * Best for: Large datasets, programmatic processing

**CSV (Comma-Separated Values)**
  * Extension: .csv
  * Delimiter: Comma
  * Headers: Included
  * Best for: Excel compatibility, general data exchange

**JSON (JavaScript Object Notation)**
  * Extension: .json
  * Format: Array of objects (records orientation)
  * Indented: 2 spaces for readability
  * Best for: Web applications, APIs

**Excel (XLSX)**
  * Extension: .xlsx
  * Format: Excel workbook
  * Headers: Included
  * Best for: Manual analysis, reporting
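
The format settings listed above map naturally onto pandas writers. The following is a minimal sketch rather than the action's actual code: ``write_dataset`` is a hypothetical helper, and omitting the index column (``index=False``) is an assumption made to match the output examples in this document.

.. code-block:: python

    from pathlib import Path

    import pandas as pd

    SUPPORTED_FORMATS = ("tsv", "csv", "json", "xlsx")


    def write_dataset(df, output_path, fmt="tsv"):
        """Hypothetical helper: write ``df`` in one of the documented formats."""
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {fmt}")

        path = Path(output_path)
        # Create missing parent directories before writing.
        path.parent.mkdir(parents=True, exist_ok=True)

        if fmt == "tsv":
            df.to_csv(path, sep="\t", index=False)
        elif fmt == "csv":
            df.to_csv(path, index=False)
        elif fmt == "json":
            # Array of objects, indented by 2 spaces.
            df.to_json(path, orient="records", indent=2)
        else:
            # "xlsx": needs an Excel engine such as openpyxl to be installed.
            df.to_excel(path, index=False)
        return path

Rejecting an unknown format before touching the file system means a typo in ``format`` does not leave empty directories behind.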

Example Usage
-------------

Basic TSV Export
~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: export_results
      action:
        type: EXPORT_DATASET
        params:
          input_key: "final_proteins"
          output_path: "/results/protein_matches.tsv"
          format: "tsv"

Export Specific Columns
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: export_summary
      action:
        type: EXPORT_DATASET
        params:
          input_key: "metabolite_matches"
          output_path: "/output/metabolite_summary.csv"
          format: "csv"
          columns: ["compound_name", "hmdb_id", "confidence", "category"]

JSON Export for Web Use
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: export_api_data
      action:
        type: EXPORT_DATASET
        params:
          input_key: "processed_compounds"
          output_path: "/web/data/compounds.json"
          format: "json"

Excel Export for Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: export_excel_report
      action:
        type: EXPORT_DATASET
        params:
          input_key: "comprehensive_results"
          output_path: "/reports/analysis_${date}.xlsx"
          format: "xlsx"

Multiple Exports
~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: export_tsv
      action:
        type: EXPORT_DATASET
        params:
          input_key: "final_data"
          output_path: "/output/data.tsv"
          format: "tsv"

    - name: export_excel
      action:
        type: EXPORT_DATASET
        params:
          input_key: "final_data"
          output_path: "/output/data.xlsx"
          format: "xlsx"
          columns: ["id", "name", "description", "category"]

Variable Substitution in Paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    - name: export_timestamped
      action:
        type: EXPORT_DATASET
        params:
          input_key: "results"
          output_path: "${OUTPUT_DIR}/results_${timestamp}.csv"
          format: "csv"

Output Format Examples
----------------------

**TSV Format**

.. code-block:: text

    uniprot_id	gene_name	confidence	category
    P12345	EXAMPLE1	0.95	reviewed
    Q67890	EXAMPLE2	0.87	reviewed

**CSV Format**

.. code-block:: text

    uniprot_id,gene_name,confidence,category
    P12345,EXAMPLE1,0.95,reviewed
    Q67890,EXAMPLE2,0.87,reviewed

**JSON Format**

.. code-block:: json

    [
      {
        "uniprot_id": "P12345",
        "gene_name": "EXAMPLE1",
        "confidence": 0.95,
        "category": "reviewed"
      },
      {
        "uniprot_id": "Q67890",
        "gene_name": "EXAMPLE2",
        "confidence": 0.87,
        "category": "reviewed"
      }
    ]

Context Integration
-------------------

The action updates the execution context with output file information:

.. code-block:: python

    # Context after execution
    {
        "output_files": {
            "final_proteins": "/results/protein_matches.tsv"
        }
    }

This enables downstream actions to reference exported files.
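
As an illustration of how a later step might use this entry, the sketch below looks up an exported path in a plain dictionary standing in for the execution context. The ``get_exported_path`` helper is hypothetical and not part of the project.

.. code-block:: python

    from pathlib import Path


    def get_exported_path(context, input_key):
        """Hypothetical helper: return the path recorded for ``input_key``."""
        output_files = context.get("output_files", {})
        if input_key not in output_files:
            raise KeyError(f"No exported file recorded for '{input_key}'")
        return Path(output_files[input_key])


    context = {"output_files": {"final_proteins": "/results/protein_matches.tsv"}}
    path = get_exported_path(context, "final_proteins")
    print(path.name)  # protein_matches.tsv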

Path Handling
-------------

**Absolute Paths**
  Use full file system paths: ``/home/user/data/results.csv``

**Relative Paths**
  Relative to current working directory: ``./output/data.tsv``

**Directory Creation**
  Parent directories are created automatically if they don't exist.

**Path Variables**
  Paths may reference environment variables and strategy parameters:
  
  * ``${OUTPUT_DIR}/results.csv``
  * ``${parameters.output_path}``
  * ``${metadata.timestamp}``
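
The exact substitution rules are not spelled out here, so the following is only a sketch under stated assumptions: ``${NAME}`` placeholders are looked up in a supplied dictionary first and in the process environment second, and dotted names such as ``${metadata.timestamp}`` walk nested dictionaries. Both ``resolve_path`` and ``prepare_output_path`` are hypothetical names.

.. code-block:: python

    import os
    import re
    from pathlib import Path

    _PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


    def resolve_path(template, variables):
        """Hypothetical resolver for ``${NAME}`` and ``${section.name}``."""

        def lookup(match):
            name = match.group(1)
            value = variables
            # Walk nested dictionaries for dotted names.
            for part in name.split("."):
                if isinstance(value, dict) and part in value:
                    value = value[part]
                else:
                    # Assumed fallback: the environment (KeyError if unset).
                    return os.environ[name]
            return str(value)

        return Path(_PLACEHOLDER.sub(lookup, template))


    def prepare_output_path(template, variables):
        """Resolve the template, then create missing parent directories."""
        path = resolve_path(template, variables)
        path.parent.mkdir(parents=True, exist_ok=True)
        return path

Raising on an unresolved placeholder, rather than leaving ``${...}`` in the file name, is a design choice of this sketch; it surfaces configuration mistakes before any file is written.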

Error Handling
--------------

**Dataset not found**
  .. code-block:: text

      Error: Dataset 'missing_data' not found in context

  Solution: Verify that ``input_key`` exists in ``context['datasets']``.

**Unsupported format**
  .. code-block:: text

      Error: Unsupported format: xml

  Solution: Use one of the supported formats: ``tsv``, ``csv``, ``json``, or ``xlsx``.

**Permission denied**
  .. code-block:: text

      Error: Export failed: Permission denied

  Solution: Check write permissions for the output directory.

**Invalid columns**
  .. code-block:: text

      Error: Column 'missing_col' not found in dataset

  Solution: Verify that the column names exist in the dataset.
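
The first, second, and fourth conditions above can be checked before any file is written. The sketch below is illustrative only: ``validate_export`` is a hypothetical function, the exception types are assumptions, and each dataset is modelled as a list of row dictionaries rather than the tabular type the real context holds.

.. code-block:: python

    SUPPORTED_FORMATS = ("tsv", "csv", "json", "xlsx")


    def validate_export(datasets, input_key, fmt="tsv", columns=None):
        """Hypothetical pre-flight checks; returns the column names to export."""
        if input_key not in datasets:
            raise KeyError(f"Dataset '{input_key}' not found in context")
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {fmt}")

        rows = datasets[input_key]
        available = list(rows[0].keys()) if rows else []
        if columns is None:
            # No subset requested: export every column.
            return available
        for column in columns:
            if column not in available:
                raise ValueError(f"Column '{column}' not found in dataset")
        return list(columns)

Permission errors are not covered here because they only surface when the file is actually opened for writing.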

Best Practices
--------------

1. **Use descriptive filenames** including dataset type and timestamp
2. **Choose appropriate formats** for intended use:
   
   * TSV/CSV for data processing
   * JSON for web applications
   * Excel for manual analysis

3. **Specify column subsets** to reduce file size and focus on key data
4. **Use absolute paths** in production environments
5. **Include metadata** in filenames (date, version, parameters)
6. **Plan directory structure** for organized output management
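
One way to follow the filename recommendations above is to build names from their parts. This is a small sketch with an assumed naming scheme (``<dataset>_v<version>_<UTC timestamp>.<format>``); ``build_output_filename`` is a hypothetical helper, not a project function.

.. code-block:: python

    from datetime import datetime, timezone


    def build_output_filename(dataset_type, fmt, version=None, now=None):
        """Hypothetical helper: compose a descriptive, sortable file name."""
        now = now or datetime.now(timezone.utc)
        parts = [dataset_type]
        if version is not None:
            parts.append(f"v{version}")
        # Compact ISO-8601-style timestamp that sorts chronologically.
        parts.append(now.strftime("%Y%m%dT%H%M%SZ"))
        return "_".join(parts) + f".{fmt}"


    fixed = datetime(2025, 8, 22, 12, 0, 0, tzinfo=timezone.utc)
    print(build_output_filename("protein_matches", "tsv", version="2", now=fixed))
    # protein_matches_v2_20250822T120000Z.tsv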

Performance Notes
-----------------

* Export speed depends on dataset size and format complexity
* TSV exports are fastest for large datasets
* Excel exports may be slower due to formatting overhead
* JSON exports with many columns can be memory-intensive
* Column filtering reduces export time and file size

File Size Considerations
------------------------

**Large Datasets (>100K rows)**
  * Prefer TSV format for efficiency
  * Consider column filtering to reduce size
  * Use compression if supported by downstream tools

**Memory Usage**
  * Scales with dataset size
  * JSON format uses more memory during export
  * Excel format may require significant memory for large datasets
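
This page does not say whether the action can write compressed files itself, so the sketch below compresses an already exported file as a separate step using only the Python standard library. ``gzip_file`` is a hypothetical helper.

.. code-block:: python

    import gzip
    import shutil
    from pathlib import Path


    def gzip_file(source):
        """Compress ``source`` to ``<source>.gz`` and return the new path."""
        src = Path(source)
        dst = src.with_name(src.name + ".gz")
        # Stream in chunks so large exports are never held fully in memory.
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        return dst

The original file is left in place; remove it afterwards if only the compressed copy is needed.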

Integration Patterns
--------------------

**End-of-Pipeline Export**

.. code-block:: yaml

    steps:
      # ... processing steps ...
      
      - name: export_final_results
        action:
          type: EXPORT_DATASET
          params:
            input_key: "processed_data"
            output_path: "/results/final_analysis.tsv"

**Multi-Format Export**

.. code-block:: yaml

    steps:
      # ... processing steps ...
      
      - name: export_for_analysis
        action:
          type: EXPORT_DATASET
          params:
            input_key: "results"
            output_path: "/output/analysis.xlsx"
            format: "xlsx"
      
      - name: export_for_api
        action:
          type: EXPORT_DATASET
          params:
            input_key: "results"
            output_path: "/api/data.json"
            format: "json"
            columns: ["id", "name", "value"]

**Conditional Export**

.. code-block:: yaml

    steps:
      # ... processing steps ...
      
      - name: export_if_successful
        action:
          type: EXPORT_DATASET
          params:
            input_key: "validated_results"
            output_path: "/output/success_${date}.tsv"
            format: "tsv"

Verification Sources
--------------------

*Last verified: 2025-08-22*

This documentation was verified against the following project resources:

* ``/biomapper/src/actions/export_dataset.py`` (actual implementation with pandas export and UniversalContext integration)
* ``/biomapper/src/actions/typed_base.py`` (TypedStrategyAction base class)
* ``/biomapper/src/actions/registry.py`` (self-registration via the ``@register_action`` decorator)
* ``/biomapper/src/core/standards/context_handler.py`` (UniversalContext for unified context access)
* ``/biomapper/src/core/standards/base_models.py`` (ActionParamsBase inheritance)
* ``/biomapper/CLAUDE.md`` (2025 standardizations and parameter naming)