Sync to Google Drive

Overview

The Google Drive sync actions (SYNC_TO_GOOGLE_DRIVE_V2 and SYNC_TO_GOOGLE_DRIVE_V3) enable uploading biomapper results directly to Google Drive for sharing and archival. Both actions support OAuth2 and Service Account authentication methods.

V3 is recommended for new implementations as it includes enhanced error handling, chunked upload support, and better progress tracking.

Key Features

Multiple Authentication: OAuth2 (personal) and Service Account (automated)
Chunked Upload: Efficient handling of large files
Progress Tracking: Real-time upload progress
Automatic Organization: Creates organized folder structures
Error Recovery: Robust handling of network issues and timeouts

Authentication Methods

OAuth2 (Recommended for Personal Use)

Benefits: - Upload to your personal Drive storage - No quota limitations (uses your personal storage) - Full ownership and control of uploaded files - Easy sharing using Drive’s native sharing features

Setup: See Google Drive Integration for complete OAuth2 setup guide.

Service Account (For Automation)

Benefits: - Programmatic access without user interaction - Suitable for automated pipelines - Can be shared across team members

Limitations: - Cannot store files (no storage quota) - Must upload to folders shared with the service account - More complex permission management

Parameters

SYNC_TO_GOOGLE_DRIVE_V3

Parameter	Type	Required	Description
`input_key`	string	Yes	Key for the dataset to upload
`output_key`	string	Yes	Key for the output with upload metadata
`file_path`	string	Yes	Local file path to upload
`drive_folder_name`	string	Yes	Google Drive folder name for organization
`strategy_name`	string	No	Strategy name for folder organization
`auth_method`	string	No	‘oauth2’ or ‘service_account’ (default: auto-detect)
`chunk_size`	integer	No	Upload chunk size in bytes (default: 10MB)

SYNC_TO_GOOGLE_DRIVE_V2

Legacy version with similar parameters but without chunked upload support.

Example Usage

YAML Strategy (V3)

steps:
  - name: upload_results
    action:
      type: SYNC_TO_GOOGLE_DRIVE_V3
      params:
        input_key: final_results
        output_key: upload_status
        file_path: "/tmp/biomapper_results.csv"
        drive_folder_name: "Biomapper Results"
        strategy_name: "metabolomics_pipeline_v3"
        auth_method: "oauth2"
        chunk_size: 5242880  # 5MB chunks

Python Client

from src.client.client_v2 import BiomapperClient

client = BiomapperClient(base_url="http://localhost:8000")

result = await client.run_action(
    action_type="SYNC_TO_GOOGLE_DRIVE_V3",
    params={
        "input_key": "results_dataset",
        "output_key": "drive_upload",
        "file_path": "/results/metabolite_mapping.csv",
        "drive_folder_name": "Metabolomics Analysis",
        "strategy_name": "arivale_harmonization",
        "auth_method": "oauth2"
    },
    context=context
)

# Access upload metadata
upload_info = result.context["datasets"]["drive_upload"]
print(f"Uploaded to: {upload_info['drive_url']}")

Folder Organization

The actions create organized folder structures on Google Drive:

Biomapper Results/
├── metabolomics_pipeline_v3/
│   ├── 2025-08-22_14-30-15/
│   │   ├── mapping_results.csv
│   │   ├── statistics.json
│   │   └── visualizations.png
│   └── latest/  # Symlink to most recent
└── protein_harmonization/
    ├── 2025-08-22_15-45-30/
    └── latest/

Output Format

The action returns metadata about the upload:

Field	Description
`file_id`	Google Drive file ID
`drive_url`	Direct link to view the file
`folder_id`	Parent folder ID
`upload_size`	File size in bytes
`upload_duration`	Time taken for upload
`chunks_uploaded`	Number of chunks (V3 only)

Performance Considerations

Upload Speed

Typical performance metrics: - Small files (<10MB): 2-5 seconds - Medium files (10-100MB): 10-30 seconds - Large files (>100MB): 1-3 minutes

Factors affecting speed: - Network bandwidth - File size and type - Chunk size configuration - Google Drive API limits

Optimization Tips

Adjust chunk size: Larger chunks for faster networks
Use compression: Pre-compress large CSV files
Batch uploads: Group related files together
Monitor quotas: Be aware of daily API limits

Authentication Setup

OAuth2 Setup

Create OAuth2 credentials in Google Cloud Console
Download client configuration JSON file

Run setup script:

poetry run python scripts/setup_oauth2_drive.py

Follow browser authorization flow

Test upload:

python scripts/verify_google_drive_setup.py

Service Account Setup

Create service account in Google Cloud Console
Download credentials JSON file

Set environment variable:

export GOOGLE_SERVICE_ACCOUNT_PATH="/path/to/credentials.json"

Create shared folder and share with service account email

Error Handling

Common Issues

Error	Solution
`Authentication failed`	Re-run OAuth2 setup or check service account credentials
`Quota exceeded`	Wait for quota reset or reduce upload frequency
`Folder not found`	Ensure folder exists and is shared with service account
`Network timeout`	Retry upload or reduce chunk size
`Permission denied`	Verify folder sharing permissions

Retry Logic

V3 includes automatic retry logic: - Exponential backoff for rate limiting - Automatic retry on network errors - Chunked resume for interrupted uploads - Progress preservation across retries

Best Practices

File Naming: Use timestamp and strategy name for uniqueness
Folder Organization: Create logical folder hierarchies
Authentication: Prefer OAuth2 for personal use, Service Account for automation
Error Handling: Always check upload status in workflows
Cleanup: Regularly clean up old result files

Security Considerations

Credentials Protection: Never commit OAuth2 tokens or service account keys
Access Control: Limit folder sharing to necessary users only
Data Sensitivity: Be aware of data privacy when uploading to cloud storage
Audit Trail: Monitor upload logs for security compliance

Integration Examples

Complete Pipeline with Upload

name: "metabolomics_with_drive_sync"
description: "Complete metabolomics pipeline with Google Drive upload"

steps:
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      # ... data loading parameters

  - name: progressive_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      # ... matching parameters

  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: final_results
        file_path: "/tmp/metabolomics_results.csv"

  - name: sync_to_drive
    action:
      type: SYNC_TO_GOOGLE_DRIVE_V3
      params:
        input_key: final_results
        output_key: drive_metadata
        file_path: "/tmp/metabolomics_results.csv"
        drive_folder_name: "Metabolomics Analysis"
        strategy_name: "progressive_pipeline_v3"