Frequently Asked Questions and Troubleshooting
Dataloader sometimes hangs when `num_workers > 0`
Add `torch.multiprocessing.set_start_method("forkserver")` at the start of your training script. The default `"fork"` strategy is error-prone by design. For more information, see the PyTorch documentation and StackOverflow.
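A minimal sketch of setting the start method before any worker processes are created, using the standard-library `multiprocessing` (whose `set_start_method` API `torch.multiprocessing` mirrors):

```python
import multiprocessing as mp

def configure_start_method():
    # "forkserver" launches workers from a clean server process instead of
    # fork()-ing the (often multi-threaded) parent, which is what makes the
    # default "fork" strategy deadlock-prone.
    if "forkserver" in mp.get_all_start_methods():
        mp.set_start_method("forkserver", force=True)
    return mp.get_start_method()

if __name__ == "__main__":
    # Call this at the very top of the training entry point, before any
    # DataLoader with num_workers > 0 is constructed.
    print(configure_start_method())
```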
Error when installing Rust
If you see an error like the message below, clean up the original installation record first with `rm -rf /root/.rustup` and reinstall.

```
error: could not rename component file from '/root/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/cargo' to '/root/.rustup/tmp/m74fkrv0gv6708f6_dir/bk'
error: caused by: other os error
```
Hang when running a distributed program
Check whether the machine has multiple network interfaces, and use `NCCL_SOCKET_IFNAME=<network card name>` (such as `eth01`) to specify the one you want to use (usually a physical one). Card information can be obtained with `ls /sys/class/net/`.
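As a sketch, the interface names can also be read programmatically; `eth0` below is a placeholder for whichever physical interface your machine actually reports:

```python
import os

def list_network_interfaces():
    # On Linux, each network interface appears as an entry under
    # /sys/class/net/ (the same information `ls /sys/class/net/` prints).
    path = "/sys/class/net"
    return sorted(os.listdir(path)) if os.path.isdir(path) else []

if __name__ == "__main__":
    print(list_network_interfaces())
    # Equivalent to prefixing the launch command with NCCL_SOCKET_IFNAME=eth0;
    # set it before the distributed process group is initialized.
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```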
NCCL WARN Reduce: invalid reduction operation 4
Bagua requires NCCL version >= 2.10 to support the `AVG` reduction operation. If you encounter the error `NCCL WARN Reduce: invalid reduction operation 4` during training, you can run `import bagua_core; bagua_core.install_deps()` in your Python interpreter, or run the `bagua_install_deps.py` script, to install the latest libraries. To check the NCCL version Bagua is using, run `import bagua_core; bagua_core.show_version()`.
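The number 4 in the message is NCCL's internal code for the averaging op: reduction operations are enumerated `ncclSum = 0`, `ncclProd = 1`, `ncclMax = 2`, `ncclMin = 3`, `ncclAvg = 4`, and `ncclAvg` only exists from NCCL 2.10 onward. A minimal sketch of the version condition:

```python
def supports_avg(nccl_version):
    # nccl_version as a (major, minor) tuple, e.g. (2, 10); ncclAvg
    # (reduction op number 4) was introduced in NCCL 2.10.
    return tuple(nccl_version) >= (2, 10)

print(supports_avg((2, 8)))   # False -> "invalid reduction operation 4"
print(supports_avg((2, 12)))  # True
```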
Model accuracy drops
Using a different algorithm or using more GPUs has an effect similar to using a different optimizer, so you need to retune your hyperparameters. Some tricks you can try:
- Train more epochs: increase the number of training iterations by 0.2-0.3 times the original.
- Scale the learning rate: if the total batch size of distributed training is increased k times, multiply the learning rate by k as well, so that it becomes k * lr.
- Performing a gradual learning rate warmup for several epochs often helps (see also Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour).
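The last two tricks can be combined. A hypothetical sketch, assuming the total batch size grew by a factor `scale` and the ramp length `warmup_steps` is a tunable choice:

```python
def warmup_lr(base_lr, scale, step, warmup_steps):
    # Linear scaling rule: the target learning rate is base_lr * scale.
    # During the first warmup_steps steps, ramp linearly from base_lr up
    # to the target instead of jumping to it immediately.
    target = base_lr * scale
    if step >= warmup_steps:
        return target
    return base_lr + (target - base_lr) * step / warmup_steps

print(warmup_lr(1.0, 8, 0, 100))    # 1.0 at the first step
print(warmup_lr(1.0, 8, 50, 100))   # 4.5 halfway through warmup
print(warmup_lr(1.0, 8, 200, 100))  # 8.0 once warmup is over
```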