
Multi-GPU Training

Cold Start Training with Multiple GPUs

For training a model from scratch using multiple GPUs, you can use a SLURM script like the following:

#!/bin/bash
#SBATCH -p GPU-shared
#SBATCH --output=/path/to/your/log/file.txt
#SBATCH -N 1
#SBATCH -t 18:00:00
#SBATCH --gpus=v100-32:4

module load openmpi/4.0.5-gcc10.2.0

singularity exec --nv /path/to/your/isciml.sif mpirun -np 4 isciml train \
    --sample_folder /path/to/your/samples/*Adj*Data \
    --target_folder /path/to/your/targets/*Sus*Models \
    --n_gpus 4 \
    --batch_size 25 \
    --max_epochs 12 \
    --save_top_k 4 \
    --strategy ddp_find_unused_parameters_true \
    --n_blocks 5 \
    --start_filters 128

Key points:

  • Adjust the SLURM parameters (-p, --output, -N, -t, --gpus) according to your HPC environment.

  • The --nv flag in the singularity command enables NVIDIA GPU support.

  • The -np 4 in the mpirun command should match the number of GPUs you're using.

  • Adjust the paths for --sample_folder, --target_folder, and the Singularity image file.

  • The --strategy ddp_find_unused_parameters_true option enables distributed data parallel (DDP) training while tolerating model parameters that do not receive gradients on every step.
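The GPU count appears in three places (--gpus, -np, and --n_gpus) and they must agree. A minimal sketch of keeping them in sync by defining the count once (NGPUS and the echoed command line are illustrative placeholders; inside a running Slurm job you could instead read the allocated count from the SLURM_GPUS_ON_NODE environment variable):

```shell
# Illustrative only: define the GPU count once so the Slurm request,
# mpirun -np, and --n_gpus cannot drift apart.
NGPUS=4
# In a real job script: NGPUS=${SLURM_GPUS_ON_NODE}
echo "mpirun -np ${NGPUS} isciml train --n_gpus ${NGPUS}"
```

Using a single variable like this means a change to the Slurm GPU request only has to be made in one place.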

Warm Start Training

To continue training from a previously saved model (warm start), use a script like this:

#!/bin/bash
#SBATCH -p GPU-shared
#SBATCH --output=/path/to/your/log/file.txt
#SBATCH -N 1
#SBATCH -t 18:00:00
#SBATCH --gpus=v100-32:4

module load openmpi/4.0.5-gcc10.2.0

singularity exec --nv /path/to/your/isciml.sif isciml train \
    --sample_folder /path/to/your/samples/ \
    --target_folder /path/to/your/targets/ \
    --max_epochs 31 \
    --save_top_k 4 \
    --batch_size 32 \
    --load_model ./lightning_checkpoint_folder/epoch=6-step=49.ckpt \
    --checkpoint_folder ./lightning_checkpoint_folder_2

Key points for warm start training:

  • Use --load_model to specify the checkpoint to start from.

  • Ensure --max_epochs is greater than the epoch number of the loaded checkpoint.

  • Specify a new --checkpoint_folder to store the new checkpoints.

  • The --save_top_k 4 option will save the best 4 models based on validation loss.
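As a concrete illustration of the warm-start arithmetic, the sketch below extracts the epoch number from a Lightning-style checkpoint name (following the epoch=N-step=M.ckpt pattern used in the script above; the variable names are hypothetical) and computes how many additional epochs the resumed run will train:

```shell
# Illustrative sketch: parse the epoch out of the checkpoint filename
# and compare it against --max_epochs.
CKPT="epoch=6-step=49.ckpt"
EPOCH="${CKPT#epoch=}"   # strip the leading "epoch=" -> "6-step=49.ckpt"
EPOCH="${EPOCH%%-*}"     # strip from the first "-"    -> "6"
MAX_EPOCHS=31
echo "resuming at epoch ${EPOCH}; $((MAX_EPOCHS - EPOCH)) epochs remain until max_epochs=${MAX_EPOCHS}"
```

If --max_epochs were set at or below the checkpoint's epoch, the resumed run would have nothing left to train, which is why it must be strictly greater.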

Important Notes

  1. Adjust all paths and parameters to match your specific setup and requirements.

  2. The --save_top_k option saves the best K models based on validation loss. Choose the checkpoint that makes the most sense for your use case.

  3. In warm start training, the new training will begin from the epoch specified in the loaded checkpoint and continue until max_epochs is reached.

  4. Always ensure that the module versions (e.g., openmpi) are compatible with your system and isciml version.

  5. The exact GPU specification (--gpus=v100-32:4) may vary depending on your HPC system. Consult your system's documentation for the correct format.

For more detailed information on training options and best practices, refer to the full isciml documentation or consult with your HPC support team.
