This post will provide an overview of multi-GPU training in Pytorch, including:

  • training on one GPU;
  • training on multiple GPUs;
  • use of data parallelism to accelerate training by processing more examples at once;
  • use of model parallelism to enable training models that require more memory than available on one GPU;
  • use of DataLoaders with num_workers > 0 to enable multi-process data loading;
  • training on only a subset of available devices.

Training on One GPU

Let’s say you have 3 GPUs available and you want to train a model on one of them. You can tell Pytorch which GPU to use by specifying the device:

  • device = torch.device('cuda:0') for GPU 0
  • device = torch.device('cuda:1') for GPU 1
  • device = torch.device('cuda:2') for GPU 2
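
For example, here’s a minimal sketch of a single training step on GPU 0 (the tiny linear model, optimizer settings, and random data are hypothetical placeholders, just to show the device-placement pattern):

import torch
import torch.nn as nn

device = torch.device('cuda:0')           # train on GPU 0 only

# Hypothetical model and data, just to illustrate moving everything to one device.
model = nn.Linear(128, 10).to(device)     # model parameters live on GPU 0
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 128).to(device)          # move each batch to the same device
targets = torch.randint(0, 10, (32,)).to(device)  # labels go to that device too

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()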

Training on Multiple GPUs

To allow Pytorch to “see” all available GPUs, use:

device = torch.device('cuda')

There are a few different ways to use multiple GPUs, including data parallelism and model parallelism.

Data Parallelism

Data parallelism refers to using multiple GPUs to increase the number of examples processed simultaneously. For example, if a batch size of 256 fits on one GPU, you can use data parallelism to increase the batch size to 512 by using two GPUs, and Pytorch will automatically assign ~256 examples to one GPU and ~256 examples to the other GPU.

Data parallelism can be implemented easily with nn.DataParallel. For example, let’s say you have a model called “custom_net” that is currently initialized as follows:

import torch
import torch.nn as nn

model = custom_net(**custom_net_args).to(device)

Now, all you have to do to use data parallelism is wrap the custom_net in DataParallel:

model = nn.DataParallel(custom_net(**custom_net_args)).to(device)

You’ll also want to increase the batch size to make use of all your available devices to their fullest extent.
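
Putting the two pieces together, one hedged sketch is to scale the batch size by the number of visible GPUs, so that each device still processes roughly 256 examples per batch (custom_net, custom_net_args, and dataset_train are the hypothetical names used elsewhere in this post):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device('cuda')
n_gpus = torch.cuda.device_count()

# Wrap the (hypothetical) model in DataParallel and scale the batch size
# so that each GPU still sees roughly 256 examples per batch.
model = nn.DataParallel(custom_net(**custom_net_args)).to(device)
train_dataloader = DataLoader(dataset_train, batch_size=256 * n_gpus, shuffle=True)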

For more information on data parallelism, see this article.

Model Parallelism

You can use model parallelism to train a model that requires more memory than is available on one GPU. Model parallelism allows you to distribute different parts of the model across different devices.

There are two steps to using model parallelism. The first step is to specify in your model definition which parts of the model should go on which device. Here’s an example from the Pytorch documentation:

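The sketch below captures the spirit of that example: a toy two-layer model that puts its first layer on 'cuda:0' and its second layer on 'cuda:1', moving the intermediate activation between devices in forward (the layer sizes are arbitrary):

import torch
import torch.nn as nn

class ToyModelParallel(nn.Module):
    def __init__(self):
        super().__init__()
        # First part of the model lives on GPU 0.
        self.net1 = nn.Linear(10, 10).to('cuda:0')
        self.relu = nn.ReLU()
        # Second part of the model lives on GPU 1.
        self.net2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        # Run the first part on cuda:0, then move the intermediate
        # representation to cuda:1 for the second part.
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))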

The second step is to ensure that the labels are on the same device as the model’s outputs when you call the loss function.

For example, you may want to start out by moving your labels to device 'cuda:1' and your data to device 'cuda:0'. Then you can process your data with the first part of the model on 'cuda:0', move the intermediate representation to 'cuda:1', and produce the final predictions on 'cuda:1'. Because your labels are already on 'cuda:1', Pytorch will be able to calculate the loss and perform backpropagation without any further modifications.
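
Continuing the sketch above, a training step might look like this; note that the labels are created on 'cuda:1', the same device as the model’s outputs (the loss function, optimizer settings, and random data are placeholders):

model = ToyModelParallel()
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))        # outputs end up on cuda:1
labels = torch.randn(20, 5).to('cuda:1')    # labels on the same device as the outputs
loss = loss_fn(outputs, labels)
loss.backward()
optimizer.step()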

For more information on model parallelism, see this article.

Loading Data Faster with Num_Workers

Pytorch’s DataLoader provides an efficient way to automatically load and batch your data. You can use it for any data set, no matter how complicated. All you need to do is first define your own Dataset that inherits from Pytorch’s Dataset class:

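Here’s a minimal sketch of what that can look like (the underlying list of (example, label) pairs is a hypothetical stand-in for your real data; the class name matches the one used in the DataLoader example below):

from torch.utils.data import Dataset

class MyComplicatedCustomDataset(Dataset):
    def __init__(self, samples):
        # samples: a list of (example, label) pairs -- a stand-in for
        # whatever complicated data source you actually have.
        self.samples = samples

    def __len__(self):
        # Total number of examples in the dataset.
        return len(self.samples)

    def __getitem__(self, idx):
        # Return a single example (and its label) for an integer index.
        example, label = self.samples[idx]
        return example, label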

The only requirements on your Dataset are that it defines the methods __len__ and __getitem__.

  • The __len__ method must return the total number of examples in your dataset.
  • The __getitem__ method must return a single example based on an integer index.

How you actually prepare the examples and what the examples are is entirely up to you.

Once you’ve created a Dataset, you need to wrap it in Pytorch’s DataLoader as follows:

from torch.utils.data import Dataset, DataLoader

dataset_train = MyComplicatedCustomDataset(**dataset_args)

train_dataloader = DataLoader(dataset_train, batch_size=256, shuffle=True, num_workers=4)

In order to get batches, all you have to do is iterate through the DataLoader:

for batch_idx, batch in enumerate(train_dataloader):
    # do stuff with the batch here

If you want to accelerate data loading, you can use more than one worker. Notice in the call to DataLoader you specify a number of workers:

train_dataloader = DataLoader(dataset_train, batch_size=256, shuffle=True, num_workers=4)

By default, num_workers is set to 0. Setting num_workers to a positive integer turns on multi-process data loading in which data will be loaded using the specified number of loader worker processes. (Note that this isn’t really multi-GPU, as these loader worker processes are different processes on the CPU, but since it’s related to accelerating model training I decided to put it in the same article).

Note that using more worker processes is not always better: if you set num_workers too high, it can actually slow down your data loading. There are also no hard-and-fast rules for choosing the optimal number of workers. There are numerous online discussions about it (e.g. here), but no conclusive answers, because the optimal number of workers depends on what kind of machine you are using, what kind of data set you are using, and how much on-the-fly pre-processing your data requires.

A good way to choose a number of workers is to run some small experiments on your data set in which you time how long it takes to load a fixed number of examples using different numbers of workers. As you increase num_workers up from 0, you will first see an increase in data loading speed, followed by a decrease in data loading speed once you hit “too many workers.”
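
For instance, a rough timing experiment might look like the following sketch (the specific worker counts and the number of batches to time are arbitrary choices):

import time
from torch.utils.data import DataLoader

# Time how long it takes to load a fixed number of batches
# for several different values of num_workers.
for num_workers in [0, 1, 2, 4, 8, 16]:
    loader = DataLoader(dataset_train, batch_size=256, shuffle=True,
                        num_workers=num_workers)
    start = time.time()
    for batch_idx, batch in enumerate(loader):
        if batch_idx == 50:
            break
    print(f'num_workers={num_workers}: {time.time() - start:.1f} seconds')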

For more information see “Multi-process data loading” on this page.

Model Parallelism and Data Parallelism Simultaneously

If you want to use both model parallelism and data parallelism at the same time, then the data parallelism will have to be implemented in a slightly different way, using DistributedDataParallel instead of DataParallel. For more information, see “Getting Started with Distributed Data Parallel.”
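
As a rough orientation, a minimal DistributedDataParallel setup (without the model-parallel part) looks something like the sketch below: one process is launched per GPU, and each process wraps its own copy of the (hypothetical) custom_net in DDP. See the linked tutorial for the full details, including how to combine this with model parallelism.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Each process controls one GPU, identified by its rank.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    # custom_net / custom_net_args are the hypothetical model from earlier.
    model = custom_net(**custom_net_args).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # ... build a DataLoader (typically with a DistributedSampler)
    # and run the usual training loop on ddp_model here ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)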

Training on a Subset of Available Devices

What if you want to use model parallelism or data parallelism, but you don’t want to take up all available devices for a single model? In that case, you can restrict which devices Pytorch can see for each model. Within your code, you’ll set the device as if you want to use all GPUs (i.e. using device = torch.device('cuda')), but when you run the code you’ll restrict which GPUs can be seen. Note that inside each script, Pytorch renumbers the visible devices starting from zero, so 'cuda:0' refers to the first GPU in that script’s CUDA_VISIBLE_DEVICES list.

Let’s say you have 6 GPUs and you want to train Model A on 2 of them and Model B on 4 of them. You can do that as follows:

CUDA_VISIBLE_DEVICES=0,1 python model_A.py

CUDA_VISIBLE_DEVICES=2,3,4,5 python model_B.py

Or, if you have 3 GPUs and you want to train Model A on 1 of them and Model B on 2 of them, you could do this:

CUDA_VISIBLE_DEVICES=1 python model_A.py

CUDA_VISIBLE_DEVICES=0,2 python model_B.py

Happy multi-GPU training!

About the Featured Image

The featured image is a painting called “Harvesters” by Anna Ancher. It is in the public domain. Source: Wikipedia.