Multi-GPU Training
Cold Start Training with Multiple GPUs
For training a model from scratch using multiple GPUs, you can use a SLURM script like the following:
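The script below is a minimal sketch rather than a verified example: the partition name, time limit, GPU specification, openmpi module, image filename (`isciml.sif`), the `isciml train` entry point, and the placeholder paths are assumptions that you will need to replace with the correct values for your HPC system and isciml installation.

```bash
#!/bin/bash
#SBATCH -p GPU                      # partition; adjust to your HPC environment
#SBATCH --output=train_%j.log       # SLURM log file (%j expands to the job ID)
#SBATCH -N 1                        # number of nodes
#SBATCH -t 08:00:00                 # wall-clock time limit
#SBATCH --gpus=v100-32:4            # GPU type and count; format varies by system

# Load an MPI stack compatible with your system and isciml version (module name is an assumption).
module load openmpi

# Launch one MPI rank per GPU: -np must match the number of GPUs requested above.
# "isciml train" and "isciml.sif" are placeholders; substitute your actual entry point and image file.
mpirun -np 4 singularity exec --nv isciml.sif \
    isciml train \
    --sample_folder /path/to/samples \
    --target_folder /path/to/targets \
    --max_epochs 100 \
    --checkpoint_folder /path/to/checkpoints \
    --strategy ddp_find_unused_parameters_true
```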
Key points:
- Adjust the SLURM parameters (`-p`, `--output`, `-N`, `-t`, `--gpus`) according to your HPC environment.
- The `--nv` flag in the singularity command enables NVIDIA GPU support.
- The `-np 4` in the mpirun command should match the number of GPUs you are using.
- Adjust the paths for `--sample_folder`, `--target_folder`, and the Singularity image file.
- The `--strategy ddp_find_unused_parameters_true` option enables distributed data parallel (DDP) training.
Warm Start Training
To continue training from a previously saved model (warm start), use a script like this:
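As with the cold start example, this is an illustrative sketch under assumptions: the SLURM resources, module name, image filename, `isciml train` entry point, and placeholder paths are not taken from this guide and must be adapted; only the flags described in the key points below are.

```bash
#!/bin/bash
#SBATCH -p GPU                      # partition; adjust to your HPC environment
#SBATCH --output=warm_start_%j.log  # SLURM log file (%j expands to the job ID)
#SBATCH -N 1                        # number of nodes
#SBATCH -t 08:00:00                 # wall-clock time limit
#SBATCH --gpus=v100-32:4            # GPU type and count; format varies by system

module load openmpi                 # MPI stack; module name/version is an assumption

# Resume from a saved checkpoint: --max_epochs must exceed the checkpoint's epoch,
# and --checkpoint_folder should point to a new directory for the new checkpoints.
# "isciml train" and "isciml.sif" are placeholders; substitute your actual entry point and image file.
mpirun -np 4 singularity exec --nv isciml.sif \
    isciml train \
    --sample_folder /path/to/samples \
    --target_folder /path/to/targets \
    --load_model /path/to/old_checkpoints/best_model.ckpt \
    --max_epochs 200 \
    --checkpoint_folder /path/to/new_checkpoints \
    --save_top_k 4 \
    --strategy ddp_find_unused_parameters_true
```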
Key points for warm start training:
- Use `--load_model` to specify the checkpoint to start from.
- Ensure `--max_epochs` is greater than the epoch number of the loaded checkpoint.
- Specify a new `--checkpoint_folder` to store the new checkpoints.
- The `--save_top_k 4` option saves the best 4 models based on validation loss.
Important Notes
- Adjust all paths and parameters to match your specific setup and requirements.
- The `--save_top_k` option saves the best K models based on validation loss. Choose the checkpoint that makes the most sense for your use case.
- In warm start training, the new training will begin from the epoch specified in the loaded checkpoint and continue until `max_epochs` is reached.
- Always ensure that the module versions (e.g., openmpi) are compatible with your system and isciml version.
- The exact GPU specification (`--gpus=v100-32:4`) may vary depending on your HPC system. Consult your system's documentation for the correct format.
- For more detailed information on training options and best practices, refer to the full isciml documentation or consult your HPC support team.