FILTER_DATASET
Filter datasets by column values using flexible conditions and logical operators.
Purpose
This action provides powerful dataset filtering capabilities with support for:
Multiple filter conditions with AND/OR logic
Rich set of comparison operators
String matching with case sensitivity options
Regex pattern matching
Null value handling
Keep or remove matching rows
Parameters
Required Parameters
- input_key (string)
Key of the dataset to filter from context['datasets'].
- filter_conditions (list of objects)
List of filter conditions to apply. Each condition contains:
column (string): Column name to filter on
operator (string): Filter operator (see operators below)
value (any): Value to compare against (not needed for null checks)
case_sensitive (boolean): Case sensitivity for string operations (default: true)
- output_key (string)
Key where filtered dataset will be stored in context.
Optional Parameters
- logic_operator (string)
How to combine multiple conditions: 'AND' or 'OR'. Default: 'AND'
- keep_or_remove (string)
Whether to 'keep' matching rows or 'remove' matching rows. Default: 'keep'
- add_filter_log (boolean)
Whether to add detailed metadata about filtering. Default: true
Supported Operators
- Comparison Operators
equals: Exact equality match
not_equals: Not equal to value
greater_than: Greater than numeric value
less_than: Less than numeric value
greater_equal: Greater than or equal to
less_equal: Less than or equal to
- String Operators
contains: String contains substring
not_contains: String does not contain substring
regex: Matches regular expression pattern
- List Operators
in_list: Value is in provided list
not_in_list: Value is not in provided list
- Null Operators
is_null: Column value is null/NaN
not_null: Column value is not null/NaN
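The operator semantics listed above can be sketched in plain Python over rows stored as dictionaries. This is an illustration only, not the action's implementation (which is described later as pandas-based). The names `matches`, `is_null`, and `COMPARISONS` are hypothetical, and two behaviours are assumptions: a null cell never satisfies a value-based operator, and `case_sensitive` also applies to `regex`.

```python
import math
import operator as op
import re

# Comparison operators map directly onto Python's built-in comparisons.
COMPARISONS = {
    "equals": op.eq,
    "not_equals": op.ne,
    "greater_than": op.gt,
    "less_than": op.lt,
    "greater_equal": op.ge,
    "less_equal": op.le,
}


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def matches(row, condition):
    """Return True if a row (dict) satisfies one filter condition (dict)."""
    name = condition["operator"]
    cell = row.get(condition["column"])

    # Null operators ignore the "value" field entirely.
    if name == "is_null":
        return is_null(cell)
    if name == "not_null":
        return not is_null(cell)
    if is_null(cell):
        # Assumption: a null cell never satisfies a value-based operator.
        return False

    target = condition.get("value")
    case_sensitive = condition.get("case_sensitive", True)

    if name in ("contains", "not_contains"):
        text, needle = str(cell), str(target)
        if not case_sensitive:
            text, needle = text.lower(), needle.lower()
        return (needle in text) == (name == "contains")
    if name == "regex":
        # Assumption: case_sensitive=False maps to re.IGNORECASE.
        flags = 0 if case_sensitive else re.IGNORECASE
        return re.search(str(target), str(cell), flags) is not None
    if name in ("in_list", "not_in_list"):
        return (cell in target) == (name == "in_list")
    if name in COMPARISONS:
        return COMPARISONS[name](cell, target)
    raise ValueError(f"Unsupported operator: {name}")
```

For example, `matches({"confidence": 0.9}, {"column": "confidence", "operator": "greater_equal", "value": 0.8})` evaluates the first condition from the Basic Filtering example against a single row.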
Example Usage
Basic Filtering
- name: filter_high_confidence
  action:
    type: FILTER_DATASET
    params:
      input_key: "protein_matches"
      filter_conditions:
        - column: "confidence"
          operator: "greater_equal"
          value: 0.8
      output_key: "high_conf_proteins"
Multiple Conditions with AND Logic
- name: filter_quality_proteins
  action:
    type: FILTER_DATASET
    params:
      input_key: "all_proteins"
      filter_conditions:
        - column: "confidence"
          operator: "greater_than"
          value: 0.7
        - column: "category"
          operator: "equals"
          value: "reviewed"
        - column: "uniprot_id"
          operator: "not_null"
      logic_operator: "AND"
      output_key: "quality_proteins"
String Matching with Case Insensitivity
- name: filter_metabolite_names
  action:
    type: FILTER_DATASET
    params:
      input_key: "metabolites"
      filter_conditions:
        - column: "compound_name"
          operator: "contains"
          value: "glucose"
          case_sensitive: false
      output_key: "glucose_related"
List-Based Filtering
- name: filter_target_proteins
  action:
    type: FILTER_DATASET
    params:
      input_key: "protein_data"
      filter_conditions:
        - column: "uniprot_id"
          operator: "in_list"
          value: ["P12345", "Q67890", "O11111"]
      output_key: "target_proteins"
Regex Pattern Matching
- name: filter_uniprot_format
  action:
    type: FILTER_DATASET
    params:
      input_key: "identifiers"
      filter_conditions:
        - column: "protein_id"
          operator: "regex"
          value: "^[A-Z][0-9][A-Z0-9]{3}[0-9]$"
      output_key: "valid_uniprot_ids"
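Before a pattern like the one above goes into a filter, it can be checked against a few known-good and known-bad values. The following is a minimal sketch using Python's standard `re` module. The helper name `check_pattern` and the sample identifiers in `should_not_match` are made up for illustration, and whether the action uses `re.search` semantics internally is an assumption.

```python
import re

# Pattern copied from the example above.
PATTERN = r"^[A-Z][0-9][A-Z0-9]{3}[0-9]$"


def check_pattern(pattern, should_match, should_not_match):
    """Return (value, problem) pairs; an empty list means the pattern behaves as expected."""
    compiled = re.compile(pattern)  # raises re.error if the syntax is invalid
    problems = []
    for value in should_match:
        if compiled.search(value) is None:
            problems.append((value, "expected a match"))
    for value in should_not_match:
        if compiled.search(value) is not None:
            problems.append((value, "expected no match"))
    return problems


problems = check_pattern(
    PATTERN,
    should_match=["P12345", "Q67890", "O11111"],
    # Lowercase, too short, and a 10-character identifier are all rejected.
    should_not_match=["p12345", "P1234", "A0A024R161"],
)
print(problems or "pattern behaves as expected")
```

Note that this pattern accepts only six-character identifiers, so longer values are filtered out when it is used with the default keep behaviour.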
Removing Unwanted Data
- name: remove_low_quality
  action:
    type: FILTER_DATASET
    params:
      input_key: "raw_data"
      filter_conditions:
        - column: "quality_score"
          operator: "less_than"
          value: 0.3
      keep_or_remove: "remove"
      output_key: "filtered_data"
Complex OR Logic
- name: filter_multiple_categories
  action:
    type: FILTER_DATASET
    params:
      input_key: "compounds"
      filter_conditions:
        - column: "category"
          operator: "equals"
          value: "amino_acids"
        - column: "category"
          operator: "equals"
          value: "lipids"
        - column: "hmdb_id"
          operator: "not_null"
      logic_operator: "OR"
      output_key: "target_compounds"
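How `logic_operator` and `keep_or_remove` interact can be sketched as follows. This assumes all conditions are combined flatly with a single operator, as the single `logic_operator` parameter suggests. The function `row_is_retained` and the sample rows are hypothetical.

```python
def row_is_retained(condition_results, logic_operator="AND", keep_or_remove="keep"):
    """Decide whether a row survives, given one boolean per filter condition."""
    if logic_operator == "AND":
        matched = all(condition_results)
    elif logic_operator == "OR":
        matched = any(condition_results)
    else:
        raise ValueError(f"Unknown logic_operator: {logic_operator}")

    if keep_or_remove == "keep":
        return matched
    if keep_or_remove == "remove":
        return not matched
    raise ValueError(f"Unknown keep_or_remove: {keep_or_remove}")


# Made-up sample rows for the OR example above.
compounds = [
    {"name": "a", "category": "amino_acids", "hmdb_id": None},
    {"name": "b", "category": "sugars", "hmdb_id": "HMDB_PLACEHOLDER"},
    {"name": "c", "category": "sugars", "hmdb_id": None},
]


def condition_results(row):
    """Evaluate the three conditions of the OR example for one row."""
    return [
        row["category"] == "amino_acids",
        row["category"] == "lipids",
        row["hmdb_id"] is not None,
    ]


kept = [r["name"] for r in compounds if row_is_retained(condition_results(r), "OR")]
print(kept)
```

Under this reading, row "b" is retained even though its category is neither amino_acids nor lipids, because the hmdb_id condition alone satisfies OR. If the intent is "one of these categories and a non-null hmdb_id", a single in_list condition on category combined with not_null under AND expresses that more directly.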
Output Format
The action stores the filtered dataset in the context under the configured output_key (here, "filtered_proteins"):
# Context after execution
{
    "datasets": {
        "filtered_proteins": [
            {
                "uniprot_id": "P12345",
                "confidence": 0.95,
                "category": "reviewed"
            },
            # ... only rows matching filter conditions
        ]
    }
}
Detailed Statistics
When add_filter_log is true, detailed statistics are included:
{
    "total_input_rows": 1000,
    "total_output_rows": 234,
    "filter_conditions_count": 2,
    "logic_operator": "AND",
    "keep_or_remove": "keep",
    "input_key": "raw_data",
    "output_key": "filtered_data"
}
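A record with these fields could be assembled as in the sketch below. The field names come from the example above. The function `build_filter_log` is hypothetical, and where the action actually stores this record in the context is not specified here.

```python
def build_filter_log(input_rows, output_rows, params):
    """Summarize one filtering run using the field names shown above."""
    return {
        "total_input_rows": len(input_rows),
        "total_output_rows": len(output_rows),
        "filter_conditions_count": len(params["filter_conditions"]),
        # Defaults follow the documented parameter defaults.
        "logic_operator": params.get("logic_operator", "AND"),
        "keep_or_remove": params.get("keep_or_remove", "keep"),
        "input_key": params["input_key"],
        "output_key": params["output_key"],
    }


# Made-up rows and parameters for illustration.
raw = [{"quality_score": 0.1}, {"quality_score": 0.9}, {"quality_score": 0.5}]
filtered = [row for row in raw if not row["quality_score"] < 0.3]
log = build_filter_log(
    raw,
    filtered,
    {
        "input_key": "raw_data",
        "filter_conditions": [
            {"column": "quality_score", "operator": "less_than", "value": 0.3}
        ],
        "keep_or_remove": "remove",
        "output_key": "filtered_data",
    },
)
print(log)
```

Comparing total_input_rows with total_output_rows after each step is a quick way to spot a filter that removes far more, or far fewer, rows than expected.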
Error Handling
- Column not found
Error: Column 'missing_col' not found in dataset
Solution: Verify column names match exactly (case-sensitive).
- Invalid regex pattern
Error: Invalid regex pattern 'unterminated[': bad character
Solution: Use valid regex syntax and test the pattern before running the filter.
- Type mismatch
Error: Cannot compare string with numeric value
Solution: Ensure operator and value types are compatible.
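The three error cases above can be caught before filtering with a small pre-flight check. This is a sketch under assumptions: `validate_conditions` is a hypothetical helper, rows are assumed to be dictionaries, and the message texts imitate the errors shown above rather than reproducing the action's exact output.

```python
import re

# Operators that only make sense with a numeric comparison value.
NUMERIC_OPERATORS = {"greater_than", "less_than", "greater_equal", "less_equal"}


def validate_conditions(rows, filter_conditions):
    """Return a list of error messages; an empty list means no problems were found."""
    # Collect every column name that appears in any row.
    columns = set()
    for row in rows:
        columns.update(row)

    errors = []
    for condition in filter_conditions:
        column = condition["column"]
        operator = condition["operator"]
        value = condition.get("value")

        if column not in columns:
            errors.append(f"Column '{column}' not found in dataset")
            continue
        if operator == "regex":
            try:
                re.compile(str(value))
            except re.error as exc:
                errors.append(f"Invalid regex pattern '{value}': {exc}")
        elif operator in NUMERIC_OPERATORS and not isinstance(value, (int, float)):
            errors.append("Cannot compare string with numeric value")
    return errors


rows = [{"confidence": 0.9, "protein_id": "P12345"}]
errors = validate_conditions(
    rows,
    [
        {"column": "missing_col", "operator": "equals", "value": 1},
        {"column": "protein_id", "operator": "regex", "value": "unterminated["},
        {"column": "confidence", "operator": "greater_than", "value": "high"},
    ],
)
for message in errors:
    print(message)
```

Returning all problems at once, rather than stopping at the first, makes it easier to fix a multi-condition filter in one pass.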
Best Practices
Test regex patterns before using in production filters
Use appropriate operators for data types (numeric vs string)
Consider case sensitivity for string operations
Validate column names exist in dataset before filtering
Use descriptive output keys to track filtering steps
Combine conditions deliberately: AND narrows the result, OR widens it
Handle null values explicitly when data quality varies
Performance Notes
Filtering is performed using pandas operations for efficiency
Large datasets (more than 100K rows) are expected to filter quickly because the operations are vectorized
Regex operations may be slower than simple comparisons
Multiple conditions are optimized with vectorized operations
Memory usage scales with output dataset size
Common Use Cases
- Quality Control
Remove low-confidence matches or invalid identifiers
- Data Subset Selection
Extract specific categories or value ranges for analysis
- Validation Filtering
Keep only records meeting specific format requirements
- Experimental Design
Select target compounds or proteins for focused studies
- Outlier Removal
Filter extreme values or anomalous data points
Integration
This action typically follows data loading and precedes analysis:
steps:
  # 1. Load raw data
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/proteins.csv"
        identifier_column: "uniprot_id"
        output_key: "raw_proteins"
  # 2. Filter for quality
  - name: quality_filter
    action:
      type: FILTER_DATASET
      params:
        input_key: "raw_proteins"
        filter_conditions:
          - column: "confidence"
            operator: "greater_equal"
            value: 0.8
          - column: "uniprot_id"
            operator: "not_null"
        output_key: "quality_proteins"
  # 3. Continue with analysis
  - name: analyze_quality_data
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_key: "quality_proteins"