Frequently Asked Questions and Troubleshooting

DataLoader sometimes hangs when num_workers > 0

Call torch.multiprocessing.set_start_method("forkserver") before creating any DataLoader. The default "fork" start method is error-prone by design. For more information, see the PyTorch documentation and StackOverflow.
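
A minimal sketch of where the call belongs, assuming a typical training script (the dataset and loader below are placeholders):

    import torch
    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader, TensorDataset

    if __name__ == "__main__":
        # Set the start method once, before any DataLoader workers are created.
        mp.set_start_method("forkserver")

        dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
        loader = DataLoader(dataset, batch_size=8, num_workers=4)
        for batch, labels in loader:
            pass  # training step goes here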

Error when installing Rust

If you see an error like the message below, clean up the previous installation record with rm -rf /root/.rustup first, then reinstall.

error: could not rename component file from '/root/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/cargo' to '/root/.rustup/tmp/m74fkrv0gv6708f6_dir/bk'
error: caused by: other os error

Hang when running a distributed program

Check whether the machine has multiple network interfaces, and set NCCL_SOCKET_IFNAME=<interface name> (such as eth01) to specify the one you want to use (usually a physical one). The available interfaces can be listed with ls /sys/class/net/.
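
If you prefer to set the variable from Python, here is a sketch; the interface name eth01 and the torchrun-style launcher environment are assumptions, and exporting the variable in the launching shell works equally well:

    import os

    # Must be set before NCCL initializes.
    os.environ["NCCL_SOCKET_IFNAME"] = "eth01"

    import torch.distributed as dist

    # Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are
    # already set in the environment, e.g. by torchrun.
    dist.init_process_group(backend="nccl")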

NCCL WARN Reduce: invalid reduction operation 4

Bagua requires NCCL version >= 2.10 to support the AVG reduction operation. If you encounter the error NCCL WARN Reduce: invalid reduction operation 4 during training, run import bagua_core; bagua_core.install_deps() in your Python interpreter, or run the bagua_install_deps.py script, to install the latest libraries.

To check the NCCL version Bagua is using, run import bagua_core; bagua_core.show_version().
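
Combining the two commands above into a short snippet you can paste into an interpreter:

    import bagua_core

    # Print the NCCL version Bagua is using; if it is below 2.10,
    # install the latest supported libraries.
    bagua_core.show_version()
    bagua_core.install_deps()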

Model accuracy drops

Using a different algorithm or more GPUs has a similar effect to using a different optimizer, so you need to retune your hyperparameters. Some tricks you can try:

  1. Train for more epochs, increasing the number of training iterations by 0.2-0.3 times the original count.
  2. Scale the learning rate. If the total batch size of distributed training is increased by k times, the learning rate should also be increased by k times, to k times the original learning rate (see the sketch after this list).
  3. Performing a gradual learning rate warmup for several epochs often helps (see also Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour).
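
A sketch of tricks 2 and 3 combined, following the linear scaling rule and gradual warmup from the paper above; base_lr, base_batch_size, total_batch_size, and warmup_epochs are illustrative values, not recommendations:

    import torch

    base_lr = 0.1            # learning rate tuned for the original batch size
    base_batch_size = 256    # batch size that base_lr was tuned for
    total_batch_size = 2048  # per-GPU batch size * number of GPUs
    k = total_batch_size / base_batch_size
    target_lr = k * base_lr  # linear scaling: k times the batch, k times the lr

    model = torch.nn.Linear(10, 10)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=target_lr)

    warmup_epochs = 5

    def lr_lambda(epoch):
        # Ramp linearly from base_lr to target_lr over the warmup epochs,
        # then hold at target_lr (attach your decay schedule afterwards).
        if epoch < warmup_epochs:
            start = base_lr / target_lr
            return start + (1.0 - start) * epoch / warmup_epochs
        return 1.0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)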