This tutorial explains how to run a dsub pipeline on Batch. Specifically, the example dsub pipeline processes DNA sequencing data in a Binary Alignment Map (BAM) file to create a BAM index (BAI) file.

This tutorial is intended for Batch users who want to use dsub with Batch.

dsub is an open source job scheduler for orchestrating batch-processing workflows on Google Cloud. To learn more about how to use Batch with dsub, see the dsub documentation for Batch.
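Before you begin, make sure that dsub is installed. If it isn't, one common approach is to install it from PyPI into a Python virtual environment. In the following sketch, the virtual environment name dsub_env is only an example:

python3 -m venv dsub_env
source dsub_env/bin/activate
pip install dsub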
Create a Cloud Storage bucket
To create a Cloud Storage bucket for storing the output files from the sample dsub pipeline using the gcloud CLI, run the gcloud storage buckets create command:
gcloud storage buckets create gs://BUCKET_NAME \
--project PROJECT_ID
Replace the following:
BUCKET_NAME: a globally unique name for your bucket.
PROJECT_ID: the project ID of your Google Cloud project.
The output is similar to the following:
Creating gs://BUCKET_NAME/...
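Optionally, to confirm that the bucket was created, you can describe it with the gcloud storage buckets describe command, for example:

gcloud storage buckets describe gs://BUCKET_NAME \
    --project PROJECT_ID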
Run the dsub pipeline
The sample dsub pipeline indexes a BAM file from the 1,000 Genomes Project and outputs the results to a Cloud Storage bucket.

To run the sample dsub pipeline, run the following dsub command:
dsub \
--provider google-batch \
--project PROJECT_ID \
--logging gs://BUCKET_NAME/WORK_DIRECTORY/logs \
--input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
--output BAI=gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
--image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
--command 'samtools index ${BAM} ${BAI}' \
--wait
Replace the following:
PROJECT_ID: the project ID of your Google Cloud project.
BUCKET_NAME: the name of the Cloud Storage bucket that you created.
WORK_DIRECTORY: the name for a new directory that the pipeline can use to store logs and outputs. For example, enter workDir.
The dsub pipeline runs a Batch job that writes the BAI file and logs to the specified directory in your Cloud Storage bucket. Specifically, the dsub repository contains a prebuilt Docker image that uses samtools to index the BAM file that you specified in the --input flag.

The command doesn't finish until the dsub pipeline finishes running, and the total time varies based on when the Batch job is scheduled. Usually, the command takes about 10 minutes: Batch typically starts running the job within a few minutes, and the job's runtime is about 8 minutes.

While the command is still running, the output is similar to the following:
Job properties:
job-id: JOB_NAME
job-name: samtools
user-id: USERNAME
Provider internal-id (operation): projects/PROJECT_ID/locations/us-central1/jobs/JOB_NAME
Launched job-id: JOB_NAME
To check the status, run:
dstat --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME' --status '*'
To cancel the job, run:
ddel --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME'
Waiting for job to complete...
Waiting for: JOB_NAME.
Then, after the job has successfully finished, the command ends and the output is similar to the following:
JOB_NAME: SUCCESS
JOB_NAME
This output includes the following values:
JOB_NAME: the name of the job.
USERNAME: your Google Cloud username.
PROJECT_ID: the project ID of your Google Cloud project.
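Because the google-batch provider submits a standard Batch job, you can also inspect the job directly with the gcloud CLI if you want more detail than dstat provides. For example, assuming the job runs in the us-central1 location shown in the dsub output, you can run the gcloud batch jobs describe command:

gcloud batch jobs describe JOB_NAME \
    --location us-central1 \
    --project PROJECT_ID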
View the output files
To view the output files created by the sample dsub pipeline using the gcloud CLI, run the gcloud storage ls command:
gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY \
--project PROJECT_ID
Replace the following:
BUCKET_NAME: the name of the Cloud Storage bucket that you created.
WORK_DIRECTORY: the directory that you specified in the dsub command.
PROJECT_ID: the project ID of your Google Cloud project.
The output is similar to the following:
gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
gs://BUCKET_NAME/WORK_DIRECTORY/logs/
This output includes the BAI file and a directory containing the job's logs.
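Optionally, if you want to work with the BAI file locally, you can copy it out of the bucket with the gcloud storage cp command, for example:

gcloud storage cp gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai . \
    --project PROJECT_ID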