EXPORT_DATASET

Export datasets from the execution context to files in various formats.

Purpose

This action saves processed datasets to files for external use, sharing, or archival. It provides:

Multiple output formats (TSV, CSV, JSON, Excel)
Selective column export
Automatic directory creation
Integration with output file tracking
Flexible path specification

Parameters

Required Parameters

input_key (string): Key of the dataset to export from context['datasets'].
output_path (string): Path of the file the dataset will be written to. Both absolute and relative paths are accepted.

Optional Parameters

format (string): Export format: 'tsv', 'csv', 'json', or 'xlsx'. Default: 'tsv'
columns (list of strings): Specific columns to export. Default: None (all columns are exported)
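The parameter contract above can be sketched as a small typed container. This is only an illustration built on the standard library: the class name `ExportDatasetParams` and the constant `SUPPORTED_FORMATS` are hypothetical, and the real action may define its parameters differently (the verification notes mention an `ActionParamsBase` model).

```python
from dataclasses import dataclass
from typing import List, Optional

# Formats listed in this document; the constant name is an assumption.
SUPPORTED_FORMATS = ("tsv", "csv", "json", "xlsx")


@dataclass
class ExportDatasetParams:
    """Hypothetical container mirroring the documented parameters."""

    input_key: str                       # required: key in context['datasets']
    output_path: str                     # required: destination file path
    format: str = "tsv"                  # optional: defaults to 'tsv'
    columns: Optional[List[str]] = None  # optional: None means all columns

    def __post_init__(self) -> None:
        # Reject anything outside the documented format list.
        if self.format not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {self.format}")


# Only the two required parameters are supplied; defaults fill the rest.
params = ExportDatasetParams(
    input_key="final_proteins",
    output_path="/results/protein_matches.tsv",
)
```

With only the required parameters set, `params.format` is `'tsv'` and `params.columns` is `None`, matching the defaults listed above.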

Supported Formats

TSV (Tab-Separated Values)

Extension: .tsv, .txt
Delimiter: Tab character
Headers: Included
Best for: Large datasets, programmatic processing

CSV (Comma-Separated Values)

Extension: .csv
Delimiter: Comma
Headers: Included
Best for: Excel compatibility, general data exchange

JSON (JavaScript Object Notation)

Extension: .json
Format: Array of objects (records orientation)
Indented: 2 spaces for readability
Best for: Web applications, APIs

Excel (XLSX)

Extension: .xlsx
Format: Excel workbook
Headers: Included
Best for: Manual analysis, reporting
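To make the format differences concrete, here is a minimal standard-library sketch that writes the same rows as TSV, CSV, or JSON. It is not the action's implementation (which, per the verification notes, uses pandas). The function name `write_rows` and the list-of-dicts dataset shape are assumptions, and XLSX is left out because it needs a third-party library.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def write_rows(rows: List[Dict[str, object]], path: Path, fmt: str) -> None:
    """Write a list of row dicts as TSV, CSV, or JSON (illustrative only)."""
    if fmt in ("tsv", "csv"):
        delimiter = "\t" if fmt == "tsv" else ","
        fieldnames = list(rows[0].keys()) if rows else []
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter=delimiter)
            writer.writeheader()  # headers are included, as documented
            writer.writerows(rows)
    elif fmt == "json":
        # Array of objects ("records" orientation), indented by 2 spaces.
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    else:
        # xlsx would require a third-party library, so this sketch stops here.
        raise ValueError(f"Unsupported format: {fmt}")
```

The only difference between the TSV and CSV branches is the delimiter, which is why the two formats share the same header and row layout.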

Example Usage

Basic TSV Export

- name: export_results
  action:
    type: EXPORT_DATASET
    params:
      input_key: "final_proteins"
      output_path: "/results/protein_matches.tsv"
      format: "tsv"

Export Specific Columns

- name: export_summary
  action:
    type: EXPORT_DATASET
    params:
      input_key: "metabolite_matches"
      output_path: "/output/metabolite_summary.csv"
      format: "csv"
      columns: ["compound_name", "hmdb_id", "confidence", "category"]

JSON Export for Web Use

- name: export_api_data
  action:
    type: EXPORT_DATASET
    params:
      input_key: "processed_compounds"
      output_path: "/web/data/compounds.json"
      format: "json"

Excel Export for Analysis

- name: export_excel_report
  action:
    type: EXPORT_DATASET
    params:
      input_key: "comprehensive_results"
      output_path: "/reports/analysis_${date}.xlsx"
      format: "xlsx"

Multiple Exports

- name: export_tsv
  action:
    type: EXPORT_DATASET
    params:
      input_key: "final_data"
      output_path: "/output/data.tsv"
      format: "tsv"

- name: export_excel
  action:
    type: EXPORT_DATASET
    params:
      input_key: "final_data"
      output_path: "/output/data.xlsx"
      format: "xlsx"
      columns: ["id", "name", "description", "category"]

Variable Substitution in Paths

- name: export_timestamped
  action:
    type: EXPORT_DATASET
    params:
      input_key: "results"
      output_path: "${OUTPUT_DIR}/results_${timestamp}.csv"
      format: "csv"

Output Format Examples

TSV Format

.. code-block:: tsv

   uniprot_id	gene_name	confidence	category
   P12345	EXAMPLE1	0.95	reviewed
   Q67890	EXAMPLE2	0.87	reviewed

CSV Format

.. code-block:: csv

   uniprot_id,gene_name,confidence,category
   P12345,EXAMPLE1,0.95,reviewed
   Q67890,EXAMPLE2,0.87,reviewed

JSON Format

.. code-block:: json

   [
     {
       "uniprot_id": "P12345",
       "gene_name": "EXAMPLE1",
       "confidence": 0.95,
       "category": "reviewed"
     },
     {
       "uniprot_id": "Q67890",
       "gene_name": "EXAMPLE2",
       "confidence": 0.87,
       "category": "reviewed"
     }
   ]
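One practical difference between these outputs shows up when a consumer reads them back. The sketch below parses one sample row from the TSV and JSON examples above with the standard library. The variable names are illustrative only.

```python
import csv
import io
import json

# One row from the TSV and JSON samples above.
tsv_text = (
    "uniprot_id\tgene_name\tconfidence\tcategory\n"
    "P12345\tEXAMPLE1\t0.95\treviewed\n"
)
json_text = (
    '[{"uniprot_id": "P12345", "gene_name": "EXAMPLE1", '
    '"confidence": 0.95, "category": "reviewed"}]'
)

tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
json_rows = json.loads(json_text)

# Delimited text carries no type information, so every field is a string.
print(type(tsv_rows[0]["confidence"]).__name__)   # str
# JSON keeps numbers as numbers.
print(type(json_rows[0]["confidence"]).__name__)  # float
```

Consumers of TSV or CSV exports therefore need to convert numeric columns themselves, while JSON consumers get typed values directly.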

Context Integration

The action updates the execution context with output file information:

# Context after execution
{
    "output_files": {
        "final_proteins": "/results/protein_matches.tsv"
    }
}

This enables downstream actions to reference exported files.
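A minimal sketch of that bookkeeping, assuming the context behaves like a plain dictionary and that entries are keyed by `input_key` as in the snippet above. The helper name `record_output_file` is hypothetical.

```python
from typing import Any, Dict


def record_output_file(context: Dict[str, Any], input_key: str, output_path: str) -> None:
    """Register an exported file under context['output_files'] (hypothetical helper)."""
    # Create the 'output_files' mapping on first use, then store the path.
    context.setdefault("output_files", {})[input_key] = output_path


context: Dict[str, Any] = {"datasets": {"final_proteins": []}}
record_output_file(context, "final_proteins", "/results/protein_matches.tsv")

# A later step can look the exported path up by dataset key.
exported_path = context["output_files"]["final_proteins"]
```

Under this assumed layout, exporting the same dataset key a second time would replace the earlier entry, which is worth keeping in mind when one dataset is exported to several formats.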

Path Handling

Absolute Paths

Use full file system paths: /home/user/data/results.csv

Relative Paths

Resolved relative to the current working directory: ./output/data.tsv

Directory Creation

Parent directories are created automatically if they don’t exist.
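In Python this behavior is typically a one-liner with `pathlib`. The sketch below shows the idea. The helper name `ensure_parent_dir` is an assumption, not the action's actual code.

```python
from pathlib import Path


def ensure_parent_dir(output_path: str) -> Path:
    """Create any missing parent directories and return the path (illustrative)."""
    path = Path(output_path)
    # parents=True creates intermediate directories;
    # exist_ok=True makes the call a no-op when they already exist.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Calling it twice with the same path is safe, since existing directories are left untouched.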

Path Variables

Paths can reference environment variables and strategy parameters, for example:

${OUTPUT_DIR}/results.csv
${parameters.output_path}
${metadata.timestamp}
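The exact substitution rules are not spelled out here, so the following is only a sketch of how `${...}` placeholders like those above could be resolved. The function name `resolve_path`, the lookup order (nested scope first, then environment), and the error raised for unknown names are all assumptions.

```python
import os
import re
from typing import Any, Mapping

# Matches ${NAME} and dotted forms such as ${metadata.timestamp}.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def resolve_path(
    template: str,
    scope: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Replace ${...} placeholders using a nested scope, then the environment."""

    def lookup(match: "re.Match[str]") -> str:
        name = match.group(1)
        value: Any = scope
        # Walk dotted names through nested mappings (assumed convention).
        for part in name.split("."):
            if isinstance(value, Mapping) and part in value:
                value = value[part]
            else:
                value = None
                break
        if value is None:
            value = env.get(name)  # fall back to environment variables
        if value is None:
            raise KeyError(f"Unresolved path variable: {name}")
        return str(value)

    return _VAR.sub(lookup, template)


scope = {"parameters": {"output_path": "/out"}, "metadata": {"timestamp": "20250822"}}
resolved = resolve_path(
    "${OUTPUT_DIR}/results_${metadata.timestamp}.csv",
    scope,
    env={"OUTPUT_DIR": "/data"},
)
print(resolved)  # /data/results_20250822.csv
```

Failing loudly on an unresolved name is a design choice in this sketch. It avoids silently writing to a path that still contains a literal `${...}`.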

Error Handling

Dataset not found

Error: Dataset 'missing_data' not found in context

Solution: Verify the input_key exists in context['datasets'].

Unsupported format

Error: Unsupported format: xml

Solution: Use supported formats: tsv, csv, json, xlsx.

Permission denied

Error: Export failed: Permission denied

Solution: Check write permissions for output directory.

Invalid columns

Error: Column 'missing_col' not found in dataset

Solution: Verify column names exist in the dataset.
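The three input-related errors above can be reproduced with a small pre-flight check. This is a sketch only: `ExportError`, `validate_export`, and the list-of-dicts dataset shape are hypothetical, and only the message texts are taken from this document.

```python
from typing import Any, Dict, List, Optional

SUPPORTED_FORMATS = {"tsv", "csv", "json", "xlsx"}


class ExportError(Exception):
    """Hypothetical error type used only in this sketch."""


def validate_export(
    context: Dict[str, Any],
    input_key: str,
    fmt: str,
    columns: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Check the dataset key, format, and requested columns before exporting."""
    datasets = context.get("datasets", {})
    if input_key not in datasets:
        raise ExportError(f"Dataset '{input_key}' not found in context")
    if fmt not in SUPPORTED_FORMATS:
        raise ExportError(f"Unsupported format: {fmt}")
    rows = datasets[input_key]
    # Assumes every row has the same keys as the first one.
    available = set(rows[0]) if rows else set()
    for column in columns or []:
        if column not in available:
            raise ExportError(f"Column '{column}' not found in dataset")
    return rows
```

Permission errors are not covered here because they only surface when the file is actually opened for writing.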

Best Practices

- Use descriptive filenames including dataset type and timestamp
- Choose appropriate formats for intended use:

  - TSV/CSV for data processing
  - JSON for web applications
  - Excel for manual analysis

- Specify column subsets to reduce file size and focus on key data
- Use absolute paths in production environments
- Include metadata in filenames (date, version, parameters)
- Plan directory structure for organized output management

Performance Notes

Export speed depends on dataset size and format complexity
TSV exports are fastest for large datasets
Excel exports may be slower due to formatting overhead
JSON exports with many columns can be memory-intensive
Column filtering reduces export time and file size

File Size Considerations

Large Datasets (>100K rows)

Prefer TSV format for efficiency
Consider column filtering to reduce size
Use compression if supported by downstream tools
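This document does not state that the action compresses output itself, so the sketch below shows one way to apply the two size-reduction tips outside the action: writing a gzip-compressed TSV restricted to selected columns, using only the standard library. The function name `write_tsv_gz` and the list-of-dicts dataset shape are assumptions.

```python
import csv
import gzip
from typing import Any, Dict, List, Sequence


def write_tsv_gz(rows: List[Dict[str, Any]], path: str, columns: Sequence[str]) -> None:
    """Write only the chosen columns to a gzip-compressed TSV (illustrative)."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=list(columns),
            delimiter="\t",
            extrasaction="ignore",  # silently drop keys not listed in `columns`
        )
        writer.writeheader()
        writer.writerows(rows)
```

Rows are streamed through the compressor one at a time, so the full uncompressed file never has to exist on disk.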

Memory Usage

Scales with dataset size
JSON format uses more memory during export
Excel format may require significant memory for large datasets

Integration Patterns

End-of-Pipeline Export

.. code-block:: yaml

   steps:
     # ... processing steps ...

     - name: export_final_results
       action:
         type: EXPORT_DATASET
         params:
           input_key: "processed_data"
           output_path: "/results/final_analysis.tsv"

Multi-Format Export

.. code-block:: yaml

   steps:
     # ... processing steps ...

     - name: export_for_analysis
       action:
         type: EXPORT_DATASET
         params:
           input_key: "results"
           output_path: "/output/analysis.xlsx"
           format: "xlsx"

     - name: export_for_api
       action:
         type: EXPORT_DATASET
         params:
           input_key: "results"
           output_path: "/api/data.json"
           format: "json"
           columns: ["id", "name", "value"]

Conditional Export

.. code-block:: yaml

   steps:
     # ... processing steps ...

     - name: export_if_successful
       action:
         type: EXPORT_DATASET
         params:
           input_key: "validated_results"
           output_path: "/output/success_${date}.tsv"
           format: "tsv"

----

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

/biomapper/src/actions/export_dataset.py (actual implementation with pandas export and UniversalContext integration)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class)
/biomapper/src/actions/registry.py (self-registration via @register_action decorator)
/biomapper/src/core/standards/context_handler.py (UniversalContext for unified context access)
/biomapper/src/core/standards/base_models.py (ActionParamsBase inheritance)
/biomapper/CLAUDE.md (2025 standardizations and parameter naming)