Sync to Google Drive

Overview

The Google Drive sync actions (SYNC_TO_GOOGLE_DRIVE_V2 and SYNC_TO_GOOGLE_DRIVE_V3) enable uploading biomapper results directly to Google Drive for sharing and archival. Both actions support OAuth2 and Service Account authentication methods.

V3 is recommended for new implementations as it includes enhanced error handling, chunked upload support, and better progress tracking.

Key Features

  • Multiple Authentication: OAuth2 (personal) and Service Account (automated)

  • Chunked Upload: Efficient handling of large files

  • Progress Tracking: Real-time upload progress

  • Automatic Organization: Creates organized folder structures

  • Error Recovery: Robust handling of network issues and timeouts

Authentication Methods

Service Account (For Automation)

Benefits:

  • Programmatic access without user interaction

  • Suitable for automated pipelines

  • Can be shared across team members

Limitations:

  • Cannot store files (no storage quota)

  • Must upload to folders shared with the service account

  • More complex permission management

Parameters

SYNC_TO_GOOGLE_DRIVE_V3

Parameter           Type      Required   Description
-----------------   -------   --------   ---------------------------------------------------
input_key           string    Yes        Key for the dataset to upload
output_key          string    Yes        Key for the output with upload metadata
file_path           string    Yes        Local file path to upload
drive_folder_name   string    Yes        Google Drive folder name for organization
strategy_name       string    No         Strategy name for folder organization
auth_method         string    No         'oauth2' or 'service_account' (default: auto-detect)
chunk_size          integer   No         Upload chunk size in bytes (default: 10MB)
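The parameter table can be mirrored in a small validation helper that catches missing or invalid values before a strategy runs. This is a minimal sketch, not the action's actual parameter model: the SyncToDriveParams class and its to_params method are hypothetical names, None is used to stand for auto-detect, and the 10MB default is assumed to mean 10 * 1024 * 1024 bytes.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Assumption: the documented "10MB" default means 10 MiB.
DEFAULT_CHUNK_SIZE = 10 * 1024 * 1024
VALID_AUTH_METHODS = ("oauth2", "service_account")
REQUIRED_FIELDS = ("input_key", "output_key", "file_path", "drive_folder_name")


@dataclass(frozen=True)
class SyncToDriveParams:
    """Hypothetical mirror of the SYNC_TO_GOOGLE_DRIVE_V3 parameter table."""

    input_key: str
    output_key: str
    file_path: str
    drive_folder_name: str
    strategy_name: Optional[str] = None
    auth_method: Optional[str] = None  # None stands for auto-detect
    chunk_size: int = DEFAULT_CHUNK_SIZE

    def __post_init__(self) -> None:
        # Required parameters must be non-empty strings.
        for name in REQUIRED_FIELDS:
            if not getattr(self, name):
                raise ValueError(f"{name} is required")
        if self.auth_method is not None and self.auth_method not in VALID_AUTH_METHODS:
            raise ValueError(f"auth_method must be one of {VALID_AUTH_METHODS}")
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be a positive number of bytes")

    def to_params(self) -> dict:
        """Return a params dict, leaving out optional values that were not set."""
        return {key: value for key, value in asdict(self).items() if value is not None}
```

The resulting dict has the same shape as the params mapping used in the examples that follow.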

SYNC_TO_GOOGLE_DRIVE_V2

The legacy version. It takes similar parameters to V3 but does not support chunked uploads.

Example Usage

YAML Strategy (V3)

steps:
  - name: upload_results
    action:
      type: SYNC_TO_GOOGLE_DRIVE_V3
      params:
        input_key: final_results
        output_key: upload_status
        file_path: "/tmp/biomapper_results.csv"
        drive_folder_name: "Biomapper Results"
        strategy_name: "metabolomics_pipeline_v3"
        auth_method: "oauth2"
        chunk_size: 5242880  # 5MB chunks

Python Client

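import asyncio

from src.client.client_v2 import BiomapperClient


async def main() -> None:
    client = BiomapperClient(base_url="http://localhost:8000")

    # Assumption: in practice this is an execution context that already holds
    # "results_dataset" from earlier steps; an empty placeholder is shown here.
    context = {"datasets": {}}

    result = await client.run_action(
        action_type="SYNC_TO_GOOGLE_DRIVE_V3",
        params={
            "input_key": "results_dataset",
            "output_key": "drive_upload",
            "file_path": "/results/metabolite_mapping.csv",
            "drive_folder_name": "Metabolomics Analysis",
            "strategy_name": "arivale_harmonization",
            "auth_method": "oauth2",
        },
        context=context,
    )

    # Access upload metadata stored under the output_key
    upload_info = result.context["datasets"]["drive_upload"]
    print(f"Uploaded to: {upload_info['drive_url']}")


# run_action is a coroutine, so it must be awaited inside an event loop
asyncio.run(main())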

Folder Organization

The actions create organized folder structures on Google Drive:

Biomapper Results/
├── metabolomics_pipeline_v3/
│   ├── 2025-08-22_14-30-15/
│   │   ├── mapping_results.csv
│   │   ├── statistics.json
│   │   └── visualizations.png
│   └── latest/  # Symlink to most recent
└── protein_harmonization/
    ├── 2025-08-22_15-45-30/
    └── latest/
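The layout above nests a timestamped run folder under the strategy name, which in turn sits under the top-level Drive folder. A minimal sketch of how such a path could be derived is shown below. The run_folder_path function is a hypothetical helper, and the timestamp format is inferred from the folder names in the tree rather than taken from the implementation.

```python
from datetime import datetime
from typing import List, Optional

# Inferred from folder names such as "2025-08-22_14-30-15".
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def run_folder_path(
    drive_folder_name: str,
    strategy_name: Optional[str] = None,
    when: Optional[datetime] = None,
) -> List[str]:
    """Return the folder names from the top-level folder down to the run folder."""
    when = when or datetime.now()
    parts = [drive_folder_name]
    if strategy_name:
        # strategy_name is optional, so the strategy level may be absent.
        parts.append(strategy_name)
    parts.append(when.strftime(TIMESTAMP_FORMAT))
    return parts


if __name__ == "__main__":
    path = run_folder_path(
        "Biomapper Results",
        "metabolomics_pipeline_v3",
        datetime(2025, 8, 22, 14, 30, 15),
    )
    print("/".join(path))
```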

Output Format

The action returns metadata about the upload:

Field             Description
---------------   ----------------------------
file_id           Google Drive file ID
drive_url         Direct link to view the file
folder_id         Parent folder ID
upload_size       File size in bytes
upload_duration   Time taken for upload
chunks_uploaded   Number of chunks (V3 only)
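A workflow can check that the expected metadata fields are present before relying on them. The sketch below assumes the metadata is available as a plain mapping keyed by the field names in the table. The missing_upload_fields helper is hypothetical and not part of biomapper.

```python
from typing import Any, List, Mapping

COMMON_FIELDS = ("file_id", "drive_url", "folder_id", "upload_size", "upload_duration")
V3_ONLY_FIELDS = ("chunks_uploaded",)


def missing_upload_fields(upload_info: Mapping[str, Any], v3: bool = True) -> List[str]:
    """List documented metadata fields that are absent or None."""
    expected = COMMON_FIELDS + (V3_ONLY_FIELDS if v3 else ())
    return [name for name in expected if upload_info.get(name) is None]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    info = {
        "file_id": "example-file-id",
        "drive_url": "https://drive.google.com/file/d/example-file-id/view",
        "folder_id": "example-folder-id",
        "upload_size": 1024,
        "upload_duration": 2.5,
    }
    print(missing_upload_fields(info))
```

An empty list means every documented field was found, and a non-empty list names the fields to investigate.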

Performance Considerations

Upload Speed

Typical performance metrics:

  • Small files (<10MB): 2-5 seconds

  • Medium files (10-100MB): 10-30 seconds

  • Large files (>100MB): 1-3 minutes

Factors affecting speed:

  • Network bandwidth

  • File size and type

  • Chunk size configuration

  • Google Drive API limits
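Chunk size determines how many requests a file needs. Google's resumable upload protocol expects every chunk except the last to be a multiple of 256 KiB. Whether the V3 action adjusts chunk_size itself is not stated here, so the sketch below shows one way to align a value beforehand and estimate the chunk count. Both function names are hypothetical.

```python
# Resumable uploads to Google Drive use chunks in multiples of 256 KiB.
CHUNK_GRANULARITY = 256 * 1024


def align_chunk_size(requested: int) -> int:
    """Round a requested chunk size down to a multiple of 256 KiB (minimum one unit)."""
    if requested < CHUNK_GRANULARITY:
        return CHUNK_GRANULARITY
    return (requested // CHUNK_GRANULARITY) * CHUNK_GRANULARITY


def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of chunks needed to send file_size bytes (ceiling division)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (file_size + chunk_size - 1) // chunk_size


if __name__ == "__main__":
    size = align_chunk_size(5_000_000)
    print(size, chunk_count(100 * 1024 * 1024, size))
```

For example, a 100 MiB file sent with the 5242880-byte chunks from the YAML example would need 20 chunks.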

Optimization Tips

  1. Adjust chunk size: Larger chunks for faster networks

  2. Use compression: Pre-compress large CSV files

  3. Batch uploads: Group related files together

  4. Monitor quotas: Be aware of daily API limits
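The compression tip can be applied with the standard library before the upload step. This is a minimal sketch. The gzip_file helper is hypothetical, and it assumes the upload action accepts whatever file file_path points to, including a .gz archive.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a gzip-compressed copy next to source and return its path."""
    target = source.with_name(source.name + ".gz")
    # Stream the file in blocks so large CSVs are never fully held in memory.
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

The returned path can then be passed as file_path in place of the original CSV.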

Authentication Setup

OAuth2 Setup

  1. Create OAuth2 credentials in Google Cloud Console

  2. Download client configuration JSON file

  3. Run setup script:

    poetry run python scripts/setup_oauth2_drive.py
    
  4. Follow browser authorization flow

  5. Test upload:

    poetry run python scripts/verify_google_drive_setup.py
    

Service Account Setup

  1. Create service account in Google Cloud Console

  2. Download credentials JSON file

  3. Set environment variable:

    export GOOGLE_SERVICE_ACCOUNT_PATH="/path/to/credentials.json"
    
  4. Create shared folder and share with service account email
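Step 4 needs the service account's email address, which is stored in the client_email field of the downloaded key file. The sketch below reads it from the path in GOOGLE_SERVICE_ACCOUNT_PATH and doubles as a quick check that the variable points at a valid key file. The service_account_email helper is hypothetical and not part of biomapper.

```python
import json
import os
from pathlib import Path
from typing import Mapping

ENV_VAR = "GOOGLE_SERVICE_ACCOUNT_PATH"


def service_account_email(env: Mapping[str, str] = os.environ) -> str:
    """Return the client_email from the key file named by GOOGLE_SERVICE_ACCOUNT_PATH."""
    path = env.get(ENV_VAR)
    if not path:
        raise RuntimeError(f"{ENV_VAR} is not set")
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"No credentials file at {key_file}")
    data = json.loads(key_file.read_text(encoding="utf-8"))
    # Service account key files identify themselves with this type marker.
    if data.get("type") != "service_account":
        raise ValueError(f"{key_file} is not a service account key file")
    return data["client_email"]


if __name__ == "__main__":
    print(f"Share the Drive folder with: {service_account_email()}")
```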

Error Handling

Common Issues

Error                   Solution
---------------------   --------------------------------------------------------
Authentication failed   Re-run OAuth2 setup or check service account credentials
Quota exceeded          Wait for quota reset or reduce upload frequency
Folder not found        Ensure folder exists and is shared with service account
Network timeout         Retry upload or reduce chunk size
Permission denied       Verify folder sharing permissions

Retry Logic

V3 includes automatic retry logic:

  • Exponential backoff for rate limiting

  • Automatic retry on network errors

  • Chunked resume for interrupted uploads

  • Progress preservation across retries
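Exponential backoff can be illustrated with a small generic helper. This is a sketch of the general technique, not V3's actual retry code: the retry_with_backoff name, the default delays, and the choice of retryable exceptions are all assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the delay schedule can be tested without waiting. Production code commonly adds random jitter to each delay so that concurrent clients do not retry in lockstep.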

Best Practices

  1. File Naming: Use timestamp and strategy name for uniqueness

  2. Folder Organization: Create logical folder hierarchies

  3. Authentication: Prefer OAuth2 for personal use, Service Account for automation

  4. Error Handling: Always check upload status in workflows

  5. Cleanup: Regularly clean up old result files

Security Considerations

  • Credentials Protection: Never commit OAuth2 tokens or service account keys

  • Access Control: Limit folder sharing to necessary users only

  • Data Sensitivity: Be aware of data privacy when uploading to cloud storage

  • Audit Trail: Monitor upload logs for security compliance

Integration Examples

Complete Pipeline with Upload

name: "metabolomics_with_drive_sync"
description: "Complete metabolomics pipeline with Google Drive upload"

steps:
  - name: load_data
    action:
      type: LOAD_DATASET_IDENTIFIERS
      # ... data loading parameters

  - name: progressive_matching
    action:
      type: PROGRESSIVE_SEMANTIC_MATCH
      # ... matching parameters

  - name: export_results
    action:
      type: EXPORT_DATASET
      params:
        input_key: final_results
        file_path: "/tmp/metabolomics_results.csv"

  - name: sync_to_drive
    action:
      type: SYNC_TO_GOOGLE_DRIVE_V3
      params:
        input_key: final_results
        output_key: drive_metadata
        file_path: "/tmp/metabolomics_results.csv"
        drive_folder_name: "Metabolomics Analysis"
        strategy_name: "progressive_pipeline_v3"

Verification Sources

Last verified: 2025-08-22

This documentation was verified against the following project resources:

  • /biomapper/src/actions/io/sync_to_google_drive_v2.py (actual implementation of SYNC_TO_GOOGLE_DRIVE_V2)

  • /biomapper/src/actions/io/sync_to_google_drive_v3.py (enhanced V3 implementation with improved error handling)

  • /biomapper/src/actions/typed_base.py (TypedStrategyAction base class)

  • /biomapper/src/actions/registry.py (self-registration via @register_action decorator)

  • /biomapper/CLAUDE.md (2025 standardizations and integration patterns)

  • /biomapper/pyproject.toml (Google API client dependencies)