If you see an error like the message below, first remove the existing installation record with
`rm -rf /root/.rustup`, then reinstall.
error: could not rename component file from '/root/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/cargo' to '/root/.rustup/tmp/m74fkrv0gv6708f6_dir/bk'
error: caused by: other os error.
Check whether the machine has multiple network interfaces, and set the environment variable
`NCCL_SOCKET_IFNAME=<network card name>` (such as `eth01`) to specify
the one you want to use (usually a physical one). Card information can be
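As a rough illustration of the interface selection described above, the following Python sketch lists the interfaces the operating system reports and exports `NCCL_SOCKET_IFNAME` before any distributed initialization happens. The helper names and the list of prefixes treated as virtual interfaces are assumptions made for illustration; they are not part of Bagua or NCCL, and the prefix check is only a heuristic that should be verified against your own machine.

```python
import os
import socket

# Hypothetical prefixes for loopback/virtual interfaces; adjust for your host.
VIRTUAL_PREFIXES = ("lo", "docker", "veth", "br-", "virbr")


def list_interfaces():
    """Return the interface names reported by the operating system."""
    return [name for _, name in socket.if_nameindex()]


def pick_physical(names, skip_prefixes=VIRTUAL_PREFIXES):
    """Return the first name that does not look virtual, or None."""
    for name in names:
        if not name.startswith(skip_prefixes):
            return name
    return None


if __name__ == "__main__":
    chosen = pick_physical(list_interfaces())
    if chosen is not None:
        # Must be set before the process initializes NCCL.
        os.environ["NCCL_SOCKET_IFNAME"] = chosen
    print("NCCL_SOCKET_IFNAME =", os.environ.get("NCCL_SOCKET_IFNAME"))
```

Setting the variable inside the training script only works if it happens before the communication backend starts; exporting it in the shell that launches the job is the more common choice.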
Bagua requires NCCL version >= 2.10 to support the
AVG reduction operation. If you encounter the error
`NCCL WARN Reduce: invalid reduction operation 4` during training, run
`import bagua_core; bagua_core.install_deps()` in your Python interpreter, or run the
`bagua_install_deps.py` script, to install the latest libraries.
To check the NCCL version Bagua is using, run
`import bagua_core; bagua_core.show_version()`.
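Once you know the version number, comparing it against the 2.10 minimum is easy to get wrong if the versions are compared as strings. The sketch below shows a numeric comparison; `parse_version` and `supports_avg` are hypothetical helpers written for this example, and the version string is assumed to be typed in by hand in plain `major.minor.patch` form rather than read from any Bagua API.

```python
def parse_version(text):
    """Turn '2.10.3' into (2, 10, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_avg(nccl_version, minimum=(2, 10)):
    """True if the given NCCL version string is at least `minimum`."""
    return parse_version(nccl_version) >= minimum


print(supports_avg("2.10.3"))  # True
print(supports_avg("2.9.6"))   # False, although "2.9.6" > "2.10.3" as strings
```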
Using a different algorithm or more GPUs has a similar effect to using a different optimizer, so you need to retune your hyperparameters. Some tricks you can try:
- Train for more epochs: increase the number of training iterations by 20-30% over the original.
- Scale the learning rate. If the total batch size of distributed training is increased by $N$ times, the learning rate should also be increased by $N$ times, to $lr \times N$.
- Performing a gradual learning rate warmup for several epochs often helps (see also Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour).
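
The learning rate scaling and warmup tricks above can be sketched in a few lines of Python. This is a minimal illustration, not Bagua's implementation: the function names, the linear warmup shape, and all numbers (batch sizes, base learning rate, warmup length) are assumptions chosen for the example.

```python
def scaled_lr(base_lr, base_batch, total_batch):
    """Linear scaling: multiply the learning rate by the batch-size ratio."""
    return base_lr * (total_batch / base_batch)


def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Linearly ramp from start_lr to target_lr over warmup_steps steps."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Hypothetical numbers: 1 GPU at batch 256 -> 8 GPUs at total batch 2048.
target = scaled_lr(0.1, 256, 2048)  # batch grew 8x, so the rate grows 8x
for step in (0, 250, 500, 1000):
    print(step, warmup_lr(step, 500, 0.1, target))
```

Starting the warmup from the original single-GPU rate and ending at the scaled rate is one common choice; the starting value and warmup length are themselves hyperparameters worth tuning.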