Orchestrate jobs by running dsub pipelines on Batch

This tutorial explains how to run a dsub pipeline on Batch. Specifically, the example dsub pipeline processes DNA sequencing data in a Binary Alignment Map (BAM) file to create a BAM index (BAI) file.

This tutorial is intended for Batch users who want to use dsub with Batch. dsub is an open source job scheduler for orchestrating batch-processing workflows on Google Cloud. To learn more about how to use Batch with dsub, see the dsub documentation for Batch.
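
dsub is distributed as a Python package. If you don't already have dsub installed, a minimal installation sketch (assuming Python 3 and pip are available) is the following:

pip install dsub

You can verify the installation by running dsub --help. For the recommended installation steps, such as using a virtual environment, see the dsub documentation.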

Create a Cloud Storage bucket

To create a Cloud Storage bucket for storing the output files from the sample dsub pipeline using the gcloud CLI, run the gcloud storage buckets create command:

gcloud storage buckets create gs://BUCKET_NAME \
    --project PROJECT_ID

Replace the following:

  • BUCKET_NAME: a globally unique name for your Cloud Storage bucket.

  • PROJECT_ID: the project ID of your Google Cloud project.

The output is similar to the following:

Creating gs://BUCKET_NAME/...
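
Optionally, to confirm that the bucket was created, you can describe it by running the gcloud storage buckets describe command:

gcloud storage buckets describe gs://BUCKET_NAME \
    --project PROJECT_ID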

Run the dsub pipeline

The sample dsub pipeline indexes a BAM file from the 1,000 Genomes Project and outputs the results to a Cloud Storage bucket.

To run the sample dsub pipeline, run the following dsub command:

dsub \
    --provider google-batch \
    --project PROJECT_ID \
    --logging gs://BUCKET_NAME/WORK_DIRECTORY/logs \
    --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
    --output BAI=gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
    --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
    --command 'samtools index ${BAM} ${BAI}' \
    --wait

Replace the following:

  • PROJECT_ID: the project ID of your Google Cloud project.

  • BUCKET_NAME: the name of the Cloud Storage bucket that you created.

  • WORK_DIRECTORY: the name for a new directory that the pipeline can use to store logs and outputs. For example, enter workDir.

The dsub pipeline runs a Batch job that writes the BAI file and logs to the directory that you specified in your Cloud Storage bucket. Specifically, the pipeline uses the prebuilt Docker image from the --image flag, which provides samtools, to index the BAM file that you specified in the --input flag.
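
For reference, the following sketch shows approximately what the pipeline runs inside the container. You don't run this command yourself, because dsub copies the --input and --output files between Cloud Storage and the container for you; the sketch assumes that Docker is installed locally and uses a hypothetical local BAM file named input.bam:

docker run --rm -v "$PWD":/data \
    quay.io/cancercollaboratory/dockstore-tool-samtools-index \
    samtools index /data/input.bam /data/input.bam.bai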

The command doesn't finish until the dsub pipeline finishes running, and the total time varies based on when the Batch job is scheduled. Typically, the whole process takes about 10 minutes: Batch usually starts running the job within a few minutes, and the job itself runs for about 8 minutes.

While the command is still running, the output is similar to the following:

Job properties:
  job-id: JOB_NAME
  job-name: samtools
  user-id: USERNAME
Provider internal-id (operation): projects/PROJECT_ID/locations/us-central1/jobs/JOB_NAME
Launched job-id: JOB_NAME
To check the status, run:
  dstat --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME' --status '*'
To cancel the job, run:
  ddel --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME'
Waiting for job to complete...
Waiting for: JOB_NAME.

Then, after the job finishes successfully, the command exits and the output is similar to the following:

  JOB_NAME: SUCCESS
JOB_NAME

This output includes the following values:

  • JOB_NAME: the name of the job.

  • USERNAME: your Google Cloud username.

  • PROJECT_ID: the project ID of your Google Cloud project.
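
Optionally, because the google-batch provider runs the pipeline as a Batch job, you can also list your project's Batch jobs by using the gcloud CLI. The following sketch assumes the job runs in the us-central1 location; the job name that dsub generates appears in the list:

gcloud batch jobs list \
    --location us-central1 \
    --project PROJECT_ID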

View the output files

To view the output files created by the sample dsub pipeline using the gcloud CLI, run the gcloud storage ls command:

gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY \
    --project PROJECT_ID

Replace the following:

  • BUCKET_NAME: the name of the Cloud Storage bucket that you created.

  • WORK_DIRECTORY: the directory that you specified in the dsub command.

  • PROJECT_ID: the project ID of your Google Cloud project.

The output is similar to the following:

gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
gs://BUCKET_NAME/WORK_DIRECTORY/logs/

This output includes the BAI file and a directory containing the job's logs.
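
Optionally, to download the BAI file to your local machine for inspection, you can use the gcloud storage cp command, for example:

gcloud storage cp \
    gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai . \
    --project PROJECT_ID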