FILTER_DATASET

Filter datasets by column values using flexible conditions and logical operators.

Purpose

This action provides powerful dataset filtering capabilities with support for:

  • Multiple filter conditions with AND/OR logic

  • Rich set of comparison operators

  • String matching with case sensitivity options

  • Regex pattern matching

  • Null value handling

  • Keep or remove matching rows

Parameters

Required Parameters

input_key (string)

Key of the dataset to filter from context[‘datasets’].

filter_conditions (list of objects)

List of filter conditions to apply. Each condition contains:

  • column (string): Column name to filter on

  • operator (string): Filter operator (see operators below)

  • value (any): Value to compare against (not needed for null checks)

  • case_sensitive (boolean): Case sensitivity for string operations (default: true)

output_key (string)

Key where filtered dataset will be stored in context.

Optional Parameters

logic_operator (string)

How to combine multiple conditions: ‘AND’ or ‘OR’. Default: ‘AND’

keep_or_remove (string)

Whether to ‘keep’ matching rows or ‘remove’ matching rows. Default: ‘keep’

add_filter_log (boolean)

Whether to add detailed metadata about filtering. Default: true

Supported Operators

Comparison Operators
  • equals - Exact equality match

  • not_equals - Not equal to value

  • greater_than - Greater than numeric value

  • less_than - Less than numeric value

  • greater_equal - Greater than or equal to

  • less_equal - Less than or equal to

String Operators
  • contains - String contains substring

  • not_contains - String does not contain substring

  • regex - Matches regular expression pattern

List Operators
  • in_list - Value is in provided list

  • not_in_list - Value is not in provided list

Null Operators
  • is_null - Column value is null/NaN

  • not_null - Column value is not null/NaN

Example Usage

Basic Filtering

- name: filter_high_confidence
  action:
    type: FILTER_DATASET
    params:
      input_key: "protein_matches"
      filter_conditions:
        - column: "confidence"
          operator: "greater_equal"
          value: 0.8
      output_key: "high_conf_proteins"

Multiple Conditions with AND Logic

- name: filter_quality_proteins
  action:
    type: FILTER_DATASET
    params:
      input_key: "all_proteins"
      filter_conditions:
        - column: "confidence"
          operator: "greater_than"
          value: 0.7
        - column: "category"
          operator: "equals"
          value: "reviewed"
        - column: "uniprot_id"
          operator: "not_null"
      logic_operator: "AND"
      output_key: "quality_proteins"

String Matching with Case Insensitivity

- name: filter_metabolite_names
  action:
    type: FILTER_DATASET
    params:
      input_key: "metabolites"
      filter_conditions:
        - column: "compound_name"
          operator: "contains"
          value: "glucose"
          case_sensitive: false
      output_key: "glucose_related"

List-Based Filtering

- name: filter_target_proteins
  action:
    type: FILTER_DATASET
    params:
      input_key: "protein_data"
      filter_conditions:
        - column: "uniprot_id"
          operator: "in_list"
          value: ["P12345", "Q67890", "O11111"]
      output_key: "target_proteins"

Regex Pattern Matching

- name: filter_uniprot_format
  action:
    type: FILTER_DATASET
    params:
      input_key: "identifiers"
      filter_conditions:
        - column: "protein_id"
          operator: "regex"
          value: "^[A-Z][0-9][A-Z0-9]{3}[0-9]$"
      output_key: "valid_uniprot_ids"

Removing Unwanted Data

- name: remove_low_quality
  action:
    type: FILTER_DATASET
    params:
      input_key: "raw_data"
      filter_conditions:
        - column: "quality_score"
          operator: "less_than"
          value: 0.3
      keep_or_remove: "remove"
      output_key: "filtered_data"

Complex OR Logic

- name: filter_multiple_categories
  action:
    type: FILTER_DATASET
    params:
      input_key: "compounds"
      filter_conditions:
        - column: "category"
          operator: "equals"
          value: "amino_acids"
        - column: "category"
          operator: "equals"
          value: "lipids"
        - column: "hmdb_id"
          operator: "not_null"
      logic_operator: "OR"
      output_key: "target_compounds"

Output Format

The action stores the filtered dataset in the context:

# Context after execution
{
    "datasets": {
        "filtered_proteins": [
            {
                "uniprot_id": "P12345",
                "confidence": 0.95,
                "category": "reviewed"
            },
            # ... only rows matching filter conditions
        ]
    }
}

Detailed Statistics

When add_filter_log is true, detailed statistics are included:

{
    "total_input_rows": 1000,
    "total_output_rows": 234,
    "filter_conditions_count": 2,
    "logic_operator": "AND",
    "keep_or_remove": "keep",
    "input_key": "raw_data",
    "output_key": "filtered_data"
}

Error Handling

Column not found
Error: Column 'missing_col' not found in dataset

Solution: Verify column names match exactly (case-sensitive).

Invalid regex pattern
Error: Invalid regex pattern 'unterminated[': bad character

Solution: Use valid regex syntax and test patterns.

Type mismatch
Error: Cannot compare string with numeric value

Solution: Ensure operator and value types are compatible.

Best Practices

  1. Test regex patterns before using in production filters

  2. Use appropriate operators for data types (numeric vs string)

  3. Consider case sensitivity for string operations

  4. Validate column names exist in dataset before filtering

  5. Use descriptive output keys to track filtering steps

  6. Combine conditions logically - AND for restrictive, OR for inclusive

  7. Handle null values explicitly when data quality varies

Performance Notes

  • Filtering is performed using pandas operations for efficiency

  • Large datasets (>100K rows) filter quickly

  • Regex operations may be slower than simple comparisons

  • Multiple conditions are optimized with vectorized operations

  • Memory usage scales with output dataset size

Common Use Cases

Quality Control

Remove low-confidence matches or invalid identifiers

Data Subset Selection

Extract specific categories or value ranges for analysis

Validation Filtering

Keep only records meeting specific format requirements

Experimental Design

Select target compounds or proteins for focused studies

Outlier Removal

Filter extreme values or anomalous data points

Integration

This action typically follows data loading and precedes analysis:

steps:
  # 1. Load raw data
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "/data/proteins.csv"
        identifier_column: "uniprot_id"
        output_key: "raw_proteins"

  # 2. Filter for quality
  - name: quality_filter
    action:
      type: FILTER_DATASET
      params:
        input_key: "raw_proteins"
        filter_conditions:
          - column: "confidence"
            operator: "greater_equal"
            value: 0.8
          - column: "uniprot_id"
            operator: "not_null"
        output_key: "quality_proteins"

  # 3. Continue with analysis
  - name: analyze_quality_data
    action:
      type: CALCULATE_SET_OVERLAP
      params:
        dataset_key: "quality_proteins"