Training ResNet-50 on Cloud TPU with PyTorch

This tutorial shows you how to train the ResNet-50 model on a Cloud TPU device with PyTorch. You can apply the same pattern to other TPU-optimized image classification models that use PyTorch and the ImageNet dataset.

The model in this tutorial is based on Deep Residual Learning for Image Recognition, the paper that first introduced the residual network (ResNet) architecture. This tutorial uses the 50-layer variant, ResNet-50, and demonstrates training the model using PyTorch/XLA.
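The defining idea of the ResNet architecture is the residual (skip) connection: each block learns a correction F(x) that is added back to its input, output = F(x) + x. The following is a minimal sketch of the 1x1-3x3-1x1 "bottleneck" pattern that ResNet-50 stacks; the channel sizes here are illustrative, not the exact ResNet-50 configuration.

```python
import torch
from torch import nn

class Bottleneck(nn.Module):
    """Sketch of a ResNet bottleneck block with a skip connection."""

    def __init__(self, channels: int, reduced: int):
        super().__init__()
        # 1x1 reduce -> 3x3 convolve -> 1x1 expand (the bottleneck pattern)
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The residual connection: add the block's input to its output
        return self.relu(self.body(x) + x)

block = Bottleneck(channels=64, reduced=16)
out = block(torch.randn(1, 64, 8, 8))
print(tuple(out.shape))  # (1, 64, 8, 8) -- shape is preserved by the skip connection
```

Because the skip connection requires adding x to F(x), the block preserves the input's shape; the full ResNet-50 additionally uses strided, projection-based blocks to change resolution and channel count between stages.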

Create a TPU VM

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create a TPU VM:

    gcloud compute tpus tpu-vm create your-tpu-name \
    --accelerator-type=v3-8 \
    --version=tpu-ubuntu2204-base \
    --zone=us-central1-a \
    --project=your-project
  3. Connect to your TPU VM using SSH:

    gcloud compute tpus tpu-vm ssh your-tpu-name --zone=us-central1-a
  4. Install PyTorch/XLA on your TPU VM:

    (vm)$ pip install torch torch_xla[tpu] torchvision -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
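After installation, you can optionally sanity-check the setup before training. The snippet below (a sketch, not part of the tutorial's scripts) confirms that torch_xla imports and reports the default XLA device; run it on the TPU VM, where the device should be a TPU. Elsewhere it simply reports that torch_xla is missing.

```python
def xla_status() -> str:
    """Return a short status string describing XLA availability."""
    try:
        import torch_xla.core.xla_model as xm
    except ImportError:
        # torch_xla is only expected on the TPU VM, not on your workstation
        return "torch_xla is not installed"
    # xm.xla_device() returns the default XLA device (the TPU on a TPU VM)
    return f"default XLA device: {xm.xla_device()}"

print(xla_status())
```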
  5. Clone the PyTorch/XLA GitHub repository:

    (vm)$ git clone --depth=1 https://github.com/pytorch/xla.git
  6. Run the training script with fake data. The --fake_data flag trains on randomly generated data, so you can verify the setup without downloading the ImageNet dataset:

    (vm)$ PJRT_DEVICE=TPU python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=256 --num_epochs=1
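A quick back-of-envelope on the flags above, under two assumptions worth stating: that --batch_size in this multiprocessing script is the per-core batch size, and that a v3-8 TPU exposes 8 cores for data-parallel training. Under those assumptions, the effective global batch and per-epoch step count (for the real ImageNet-1k training set rather than fake data) work out as follows:

```python
# Assumption: --batch_size is the per-core batch in the multiprocessing script.
per_core_batch = 256            # the --batch_size=256 flag above
num_cores = 8                   # a v3-8 TPU exposes 8 cores
global_batch = per_core_batch * num_cores

# ImageNet-1k has 1,281,167 training images (only relevant without --fake_data)
steps_per_epoch = 1_281_167 // global_batch

print(global_batch, steps_per_epoch)  # → 2048 625
```

If you change the accelerator type or batch size, the global batch scales accordingly, which is worth keeping in mind when comparing learning-rate schedules across configurations.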