EXPORT_DATASET

Export datasets from the execution context to files in various formats.

Purpose

This action saves processed datasets to files for external use, sharing, or archival. It provides:

  • Multiple output formats (TSV, CSV, JSON, Excel)

  • Selective column export

  • Automatic directory creation

  • Integration with output file tracking

  • Flexible path specification

Parameters

Required Parameters

input_key (string)

Key of the dataset to export from context['datasets'].

output_path (string)

Path of the file to write. Both absolute and relative paths are accepted.

Optional Parameters

format (string)

Export format: 'tsv', 'csv', 'json', or 'xlsx'. Default: 'tsv'

columns (list of strings)

Specific columns to export. If not specified, all columns are exported. Default: None (export all columns)
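The parameters above can be pictured as a small validated record. The following is a minimal sketch using a standard-library dataclass. The class name ``ExportDatasetParams`` and the validation in ``__post_init__`` are assumptions for illustration; the project's actual parameter model (described in the verification notes as inheriting from ActionParamsBase) may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

SUPPORTED_FORMATS = ("tsv", "csv", "json", "xlsx")


@dataclass
class ExportDatasetParams:
    """Hypothetical stand-in for the action's parameter model."""

    input_key: str                       # required: key in context['datasets']
    output_path: str                     # required: where the file is written
    format: str = "tsv"                  # optional: defaults to TSV
    columns: Optional[List[str]] = None  # optional: None means all columns

    def __post_init__(self) -> None:
        # Reject formats outside the documented set.
        if self.format not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {self.format}")


params = ExportDatasetParams(
    input_key="final_proteins",
    output_path="/results/protein_matches.tsv",
)
print(params.format)   # tsv
print(params.columns)  # None
```

Only the two required fields need to be supplied; the defaults match the ones documented above.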

Supported Formats

TSV (Tab-Separated Values)
  • Extension: .tsv, .txt

  • Delimiter: Tab character

  • Headers: Included

  • Best for: Large datasets, programmatic processing

CSV (Comma-Separated Values)
  • Extension: .csv

  • Delimiter: Comma

  • Headers: Included

  • Best for: Excel compatibility, general data exchange

JSON (JavaScript Object Notation)
  • Extension: .json

  • Format: Array of objects (records orientation)

  • Indented: 2 spaces for readability

  • Best for: Web applications, APIs

Excel (XLSX)
  • Extension: .xlsx

  • Format: Excel workbook

  • Headers: Included

  • Best for: Manual analysis, reporting
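To make the differences between the text formats concrete, here is a small sketch that serializes the same records as TSV, CSV, and JSON. It deliberately uses only the Python standard library, whereas the verification notes say the real action exports through pandas, so treat the ``render`` helper as hypothetical. XLSX is left out because the standard library has no Excel writer.

```python
import csv
import io
import json

rows = [
    {"uniprot_id": "P12345", "gene_name": "EXAMPLE1", "confidence": 0.95},
    {"uniprot_id": "Q67890", "gene_name": "EXAMPLE2", "confidence": 0.87},
]


def render(rows, fmt, columns=None):
    """Serialize rows as 'tsv', 'csv', or 'json' text (hypothetical helper)."""
    if fmt not in ("tsv", "csv", "json"):
        raise ValueError(f"Unsupported format: {fmt}")
    # Default to every column, in the order of the first row.
    columns = columns or list(rows[0].keys())
    selected = [{c: r[c] for c in columns} for r in rows]
    if fmt == "json":
        # Array of objects with 2-space indentation.
        return json.dumps(selected, indent=2)
    delimiter = "\t" if fmt == "tsv" else ","
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()  # TSV and CSV both include a header row
    writer.writerows(selected)
    return buf.getvalue()


print(render(rows, "tsv"))
print(render(rows, "csv", columns=["uniprot_id", "confidence"]))
print(render(rows, "json"))
```

The ``columns`` argument mirrors the optional ``columns`` parameter: when it is given, only those fields appear in the output.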

Example Usage

Basic TSV Export

- name: export_results
  action:
    type: EXPORT_DATASET
    params:
      input_key: "final_proteins"
      output_path: "/results/protein_matches.tsv"
      format: "tsv"

Export Specific Columns

- name: export_summary
  action:
    type: EXPORT_DATASET
    params:
      input_key: "metabolite_matches"
      output_path: "/output/metabolite_summary.csv"
      format: "csv"
      columns: ["compound_name", "hmdb_id", "confidence", "category"]

JSON Export for Web Use

- name: export_api_data
  action:
    type: EXPORT_DATASET
    params:
      input_key: "processed_compounds"
      output_path: "/web/data/compounds.json"
      format: "json"

Excel Export for Analysis

- name: export_excel_report
  action:
    type: EXPORT_DATASET
    params:
      input_key: "comprehensive_results"
      output_path: "/reports/analysis_${date}.xlsx"
      format: "xlsx"

Multiple Exports

- name: export_tsv
  action:
    type: EXPORT_DATASET
    params:
      input_key: "final_data"
      output_path: "/output/data.tsv"
      format: "tsv"

- name: export_excel
  action:
    type: EXPORT_DATASET
    params:
      input_key: "final_data"
      output_path: "/output/data.xlsx"
      format: "xlsx"
      columns: ["id", "name", "description", "category"]

Variable Substitution in Paths

- name: export_timestamped
  action:
    type: EXPORT_DATASET
    params:
      input_key: "results"
      output_path: "${OUTPUT_DIR}/results_${timestamp}.csv"
      format: "csv"

Output Format Examples

TSV Format

.. code-block:: text

    uniprot_id	gene_name	confidence	category
    P12345	EXAMPLE1	0.95	reviewed
    Q67890	EXAMPLE2	0.87	reviewed

CSV Format

.. code-block:: text

    uniprot_id,gene_name,confidence,category
    P12345,EXAMPLE1,0.95,reviewed
    Q67890,EXAMPLE2,0.87,reviewed

JSON Format

.. code-block:: json

    [
      {
        "uniprot_id": "P12345",
        "gene_name": "EXAMPLE1",
        "confidence": 0.95,
        "category": "reviewed"
      },
      {
        "uniprot_id": "Q67890",
        "gene_name": "EXAMPLE2",
        "confidence": 0.87,
        "category": "reviewed"
      }
    ]

Context Integration

The action updates the execution context with output file information:

# Context after execution
{
    "output_files": {
        "final_proteins": "/results/protein_matches.tsv"
    }
}

This enables downstream actions to reference exported files.
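As a rough illustration of that bookkeeping, the sketch below records an exported path in a plain dictionary and reads it back. The helper name ``record_output_file`` is hypothetical, and a plain ``dict`` stands in for the project's context object, which the verification notes describe as being accessed through UniversalContext.

```python
def record_output_file(context, input_key, output_path):
    """Store the exported path under context['output_files'] (hypothetical helper)."""
    # setdefault creates the 'output_files' mapping on first use.
    context.setdefault("output_files", {})[input_key] = output_path
    return context


context = {"datasets": {"final_proteins": []}}
record_output_file(context, "final_proteins", "/results/protein_matches.tsv")

# A later step can look the path up by dataset key.
exported_path = context["output_files"]["final_proteins"]
print(exported_path)  # /results/protein_matches.tsv
```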

Path Handling

Absolute Paths

Use full file system paths: /home/user/data/results.csv

Relative Paths

Relative to current working directory: ./output/data.tsv

Directory Creation

Parent directories are created automatically if they don’t exist.
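One common way to get this behavior in Python is ``Path.mkdir`` with ``parents=True`` and ``exist_ok=True``. The sketch below shows that idiom; whether the action uses exactly this call is an assumption, and ``ensure_parent_dir`` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(output_path):
    """Create missing parent directories for output_path and return the Path."""
    path = Path(output_path)
    # parents=True builds intermediate levels; exist_ok=True makes repeats harmless.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "reports" / "2025" / "data.tsv")
    target.write_text("id\tname\n")
    created = target.parent.is_dir()

print(created)  # True
```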

Path Variables

Support for environment variables and strategy parameters:

  • ${OUTPUT_DIR}/results.csv

  • ${parameters.output_path}

  • ${metadata.timestamp}
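The placeholder forms listed above can be resolved with a small substitution routine. The sketch below is illustrative only: ``resolve_path`` is a hypothetical helper, and the lookup order (strategy values first, then environment variables) is an assumption rather than documented behavior.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve_path(template, values=None, env=None):
    """Replace ${name} and ${dotted.name} placeholders (illustrative only)."""
    values = values or {}
    env = os.environ if env is None else env

    def lookup(match):
        name = match.group(1)
        # Walk dotted names such as 'parameters.output_path' through nested dicts.
        node = values
        for part in name.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                node = None
                break
        if node is not None:
            return str(node)
        if name in env:
            return str(env[name])
        # Leave unknown placeholders untouched so the problem stays visible.
        return match.group(0)

    return _PLACEHOLDER.sub(lookup, template)


values = {"metadata": {"timestamp": "20250822"}}
env = {"OUTPUT_DIR": "/data/out"}
print(resolve_path("${OUTPUT_DIR}/results_${metadata.timestamp}.csv", values, env))
# /data/out/results_20250822.csv
```

Leaving unresolved placeholders in the returned path is a design choice for the sketch: a literal ``${...}`` in a filename is easier to spot than a silently empty segment.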

Error Handling

Dataset not found
Error: Dataset 'missing_data' not found in context

Solution: Verify the input_key exists in context['datasets'].

Unsupported format
Error: Unsupported format: xml

Solution: Use supported formats: tsv, csv, json, xlsx.

Permission denied
Error: Export failed: Permission denied

Solution: Check write permissions for output directory.

Invalid columns
Error: Column 'missing_col' not found in dataset

Solution: Verify column names exist in the dataset.
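The dataset, format, and column checks above can be run before any file is written. Here is a hedged sketch of such a pre-flight check that produces the same message texts. The function ``validate_export``, the order of the checks, and the list-of-dicts dataset shape are all assumptions made for illustration.

```python
SUPPORTED_FORMATS = {"tsv", "csv", "json", "xlsx"}


def validate_export(context, input_key, fmt, columns=None):
    """Return an error message string, or None if the request looks valid."""
    datasets = context.get("datasets", {})
    if input_key not in datasets:
        return f"Dataset '{input_key}' not found in context"
    if fmt not in SUPPORTED_FORMATS:
        return f"Unsupported format: {fmt}"
    rows = datasets[input_key]
    # Column names are taken from the first row of the (assumed) list of dicts.
    available = set(rows[0]) if rows else set()
    for col in columns or []:
        if col not in available:
            return f"Column '{col}' not found in dataset"
    return None


ctx = {"datasets": {"final_proteins": [{"uniprot_id": "P12345", "confidence": 0.95}]}}
print(validate_export(ctx, "missing_data", "tsv"))
print(validate_export(ctx, "final_proteins", "xml"))
print(validate_export(ctx, "final_proteins", "tsv", ["missing_col"]))
print(validate_export(ctx, "final_proteins", "csv", ["uniprot_id"]))
```

Permission errors are not covered here because they only surface when the file is actually opened for writing.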

Best Practices

  1. Use descriptive filenames including dataset type and timestamp

  2. Choose appropriate formats for intended use:

    • TSV/CSV for data processing

    • JSON for web applications

    • Excel for manual analysis

  3. Specify column subsets to reduce file size and focus on key data

  4. Use absolute paths in production environments

  5. Include metadata in filenames (date, version, parameters)

  6. Plan directory structure for organized output management
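For the filename recommendations (items 1 and 5), a small helper can keep names consistent across exports. The naming scheme and the ``build_filename`` function below are purely illustrative assumptions; the action itself does not prescribe any filename pattern.

```python
from datetime import datetime, timezone


def build_filename(dataset_type, extension, version=None, now=None):
    """Compose '<type>[_v<version>]_<UTC timestamp>.<ext>' (hypothetical scheme)."""
    now = now or datetime.now(timezone.utc)
    parts = [dataset_type]
    if version is not None:
        parts.append(f"v{version}")
    # Compact, sortable timestamp; the trailing Z marks the value as UTC.
    parts.append(now.strftime("%Y%m%dT%H%M%SZ"))
    return "_".join(parts) + "." + extension


fixed = datetime(2025, 8, 22, 12, 0, 0, tzinfo=timezone.utc)
print(build_filename("protein_matches", "tsv", version=2, now=fixed))
# protein_matches_v2_20250822T120000Z.tsv
```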

Performance Notes

  • Export speed depends on dataset size and format complexity

  • TSV exports are fastest for large datasets

  • Excel exports may be slower due to formatting overhead

  • JSON exports with many columns can be memory-intensive

  • Column filtering reduces export time and file size

File Size Considerations

Large Datasets (>100K rows)
  • Prefer TSV format for efficiency

  • Consider column filtering to reduce size

  • Use compression if supported by downstream tools
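Nothing above states that EXPORT_DATASET compresses its own output, so compression would typically be a separate step. As a sketch of what that could look like, the snippet below writes gzip-compressed TSV with the standard library and reads it back; ``write_tsv_gz`` is a hypothetical helper, not part of the action.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def write_tsv_gz(rows, columns, output_path):
    """Write rows as gzip-compressed TSV (hypothetical helper)."""
    # 'wt' opens the gzip stream in text mode; newline='' lets csv control line endings.
    with gzip.open(output_path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
        for row in rows:
            writer.writerow([row[c] for c in columns])


data = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.tsv.gz"
    write_tsv_gz(data, ["id", "name"], path)
    # Read the file back to confirm the round trip.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        round_trip = fh.read()

print(round_trip)
```

Many tabular tools read ``.tsv.gz`` files directly, but check the downstream tool before relying on this.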

Memory Usage
  • Scales with dataset size

  • JSON format uses more memory during export

  • Excel format may require significant memory for large datasets

Integration Patterns

End-of-Pipeline Export

.. code-block:: yaml

    steps:
      # … processing steps …
      - name: export_final_results
        action:
          type: EXPORT_DATASET
          params:
            input_key: "processed_data"
            output_path: "/results/final_analysis.tsv"

Multi-Format Export

.. code-block:: yaml

    steps:
      # … processing steps …
      - name: export_for_analysis
        action:
          type: EXPORT_DATASET
          params:
            input_key: "results"
            output_path: "/output/analysis.xlsx"
            format: "xlsx"

      - name: export_for_api
        action:
          type: EXPORT_DATASET
          params:
            input_key: "results"
            output_path: "/api/data.json"
            format: "json"
            columns: ["id", "name", "value"]

Conditional Export

.. code-block:: yaml

    steps:
      # … processing steps …
      - name: export_if_successful
        action:
          type: EXPORT_DATASET
          params:
            input_key: "validated_results"
            output_path: "/output/success_${date}.tsv"
            format: "tsv"

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

  • /biomapper/src/actions/export_dataset.py (actual implementation with pandas export and UniversalContext integration)

  • /biomapper/src/actions/typed_base.py (TypedStrategyAction base class)

  • /biomapper/src/actions/registry.py (self-registration via @register_action decorator)

  • /biomapper/src/core/standards/context_handler.py (UniversalContext for unified context access)

  • /biomapper/src/core/standards/base_models.py (ActionParamsBase inheritance)

  • /biomapper/CLAUDE.md (2025 standardizations and parameter naming)