LOAD_DATASET_IDENTIFIERS
Load identifiers from CSV/TSV files with flexible column mapping and data validation.
Purpose
This action loads biological identifiers from tabular data files, providing:
Flexible column mapping for different file formats
Data validation and cleaning
Metadata preservation for downstream analysis
Support for various biological entity types
Parameters
Required Parameters
- file_path (string)
Path to the data file (CSV or TSV format). Supports absolute and relative paths.
- identifier_column (string)
Name of the column containing the biological identifiers.
- output_key (string)
Key name to store the loaded data in the execution context.
Optional Parameters
- file_type (string)
File format specification: “csv”, “tsv”, or “auto” for automatic detection. Default: “auto”
- strip_prefix (string)
Prefix to remove from identifiers (e.g., “UniProtKB:”). Default: None
- filter_column (string)
Column name to apply filtering on. Default: None
- filter_values (list of strings)
Values or regex patterns to match for filtering. Default: None
- filter_mode (string)
Filter mode: “include” to keep matches, “exclude” to remove matches. Default: “include”
- drop_empty_ids (boolean)
Whether to remove rows with empty identifier values. Default: true
Example Usage
Basic Protein Loading
- name: load_proteins
action:
type: LOAD_DATASET_IDENTIFIERS
params:
file_path: "/data/proteins/ukbb_proteins.csv"
identifier_column: "UniProt"
output_key: "ukbb_proteins"
Input File Format
The action expects CSV or TSV files with headers:
protein_name,UniProt,panel,description
AARSD1,Q9BTE6,Oncology,Alanyl-tRNA synthetase domain
ABL1,P00519,Oncology,Tyrosine-protein kinase ABL1
ACE,P12821,Cardiology,Angiotensin-converting enzyme
Output Format
The action stores data in the context under the specified output_key:
# Context after execution
{
"datasets": {
"ukbb_proteins": [
{
"protein_name": "AARSD1",
"UniProt": "Q9BTE6",
"panel": "Oncology",
"description": "Alanyl-tRNA synthetase domain"
},
# ... more rows
]
},
"metadata": {
"ukbb_proteins": {
"source_file": "/data/proteins/ukbb_proteins.csv",
"row_count": 1463,
"identifier_column": "UniProt",
"columns": ["protein_name", "UniProt", "panel", "description", "_row_number", "_source_file"],
"filtered": false,
"prefix_stripped": false
}
}
}
Supported File Types
- CSV Files (.csv)
Comma-separated values with headers
- TSV Files (.tsv, .txt)
Tab-separated values with headers
- Auto-Detection
File format is auto-detected based on extension (.tsv files use tab delimiter, others use comma)
Data Validation
The action performs several validation steps:
File existence: Verifies the file exists and is readable
Header validation: Ensures specified columns exist
Empty value handling: Optionally removes rows with empty identifier values
Robust file loading: Uses BiologicalFileLoader for enhanced parsing
Filter validation: Validates filter columns exist before applying filters
Error Handling
Common errors and solutions:
- File not found
Error: File not found: /data/proteins.csv
Solution: Use absolute paths and verify file exists.
- Column not found
Error: Column 'uniprot' not found. Available: ['UniProt', 'protein_name']
Solution: Check column name matches exactly (case-sensitive).
- Empty dataset
Warning: No valid identifiers found in dataset
Solution: Verify identifier column contains data.
Best Practices
Use absolute file paths to avoid path resolution issues
Match column names exactly (case-sensitive)
Clean data beforehand to remove empty rows
Use descriptive output keys like “ukbb_proteins” instead of “data1”
Add dataset names for better logging and debugging
Advanced Features
- Prefix Stripping
Remove common prefixes while preserving original values:
params: strip_prefix: "UniProtKB:" # Transforms "UniProtKB:P12345" to "P12345" # Original saved as "UniProt_original" column
- Regex Filtering
Filter rows based on pattern matching:
params: filter_column: "panel" filter_values: ["Oncology", "Cardiology"] filter_mode: "include"
- Metadata Tracking
Each row gets tracking columns:
_row_number: Original file row number (1-based, accounting for header)_source_file: Absolute path to source file[identifier_column]_original: Original value if prefix stripping is applied
Performance Notes
Uses pandas for reliable file parsing with automatic format detection
Handles various encodings and delimiters based on file extension
Memory efficient for large files (tested with 100K+ rows)
TSV files parse faster than CSV due to simpler delimiter structure
Adds metadata columns for row tracking and provenance
Integration
This action is typically used as the first step in mapping strategies:
steps:
# 1. Load source data
- name: load_source
action:
type: LOAD_DATASET_IDENTIFIERS
params:
file_path: "/data/source.csv"
identifier_column: "id"
output_key: "source_data"
# 2. Load target data
- name: load_target
action:
type: LOAD_DATASET_IDENTIFIERS
params:
file_path: "/data/target.csv"
identifier_column: "uniprot"
output_key: "target_data"
# 3. Process the loaded data
- name: merge_data
action:
type: MERGE_DATASETS
params:
input_key: "source_data"
secondary_key: "target_data"
output_key: "merged_result"
merge_strategy: "union"
—
## Verification Sources Last verified: 2025-08-22
This documentation was verified against the following project resources:
/biomapper/src/actions/load_dataset_identifiers.py (actual implementation using pandas with dual context support)
/biomapper/src/actions/typed_base.py (TypedStrategyAction base class and StandardActionResult)
/biomapper/src/actions/registry.py (self-registration mechanism via @register_action decorator)
/biomapper/CLAUDE.md (2025 standardizations and parameter naming conventions)
/biomapper/pyproject.toml (project dependencies including pandas for file loading)