isciml: train

Overview

The train subcommand in isciml trains machine learning models on the data produced by the generate command. It allows you to configure various aspects of the training process, including the model architecture, training parameters, and data handling.

Usage

singularity exec isciml.sif isciml train [OPTIONS]

Options

| Option | Description | Default | Required |
| --- | --- | --- | --- |
| `--sample_folder PATH` | Folder containing sample files | - | Yes |
| `--target_folder PATH` | Folder containing target files | - | Yes |
| `--n_blocks INTEGER` | Number of blocks in the UNet | 4 | No |
| `--start_filters INTEGER` | Number of start filters | 32 | No |
| `--batch_size INTEGER` | Batch size for training | 1 | No |
| `--max_epochs INTEGER` | Maximum number of epochs | 1 | No |
| `--learning_rate FLOAT` | Adam optimizer learning rate | 0.001 | No |
| `--save_model PATH` | File name for the checkpoint saved at the end of training | pytorch_model.ckpt | No |
| `--load_model PATH` | Checkpoint file to load at the beginning of training | - | No |
| `--train_size FLOAT` | Proportion of the data used for training | 0.8 | No |
| `--num_workers INTEGER` | Number of workers for the data loader | 1 | No |
| `--n_gpus INTEGER` | Number of GPUs used for training | 1 | No |
| `--strategy TEXT` | Distributed Data Parallel strategy | auto | No |
| `--checkpoint_folder PATH` | Checkpoint folder | ./lightning_checkpoint_folder | No |
| `--every_n_epochs INTEGER` | Number of epochs between checkpoints | - | No |
| `--save_top_k INTEGER` | Number of best models to save | 1 | No |
| `--reshape_base [two\|eight]` | Reshape 1D data to 2D using base 2 or 8 | eight | No |
| `--dim INTEGER` | Dimension of the solution grid | 2 | No |
| `--help` | Show the help message and exit | - | No |
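
To print this list of options directly from the container, run the subcommand with --help:

singularity exec isciml.sif isciml train --help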

Description

The train subcommand trains a UNet model on your generated data. The options above control the model architecture (n_blocks, start_filters), the core training hyperparameters (batch_size, max_epochs, learning_rate), and how the data is loaded, split, and checkpointed.

Example Usage

To train a model with default settings:

singularity exec isciml.sif isciml train \
    --sample_folder /path/to/samples \
    --target_folder /path/to/targets

To train a more complex model with custom settings:

singularity exec isciml.sif isciml train \
    --sample_folder /path/to/samples \
    --target_folder /path/to/targets \
    --n_blocks 5 \
    --start_filters 64 \
    --batch_size 32 \
    --max_epochs 100 \
    --learning_rate 0.0001 \
    --save_model custom_model.ckpt \
    --train_size 0.7 \
    --n_gpus 2 \
    --strategy ddp \
    --every_n_epochs 5 \
    --save_top_k 3

Notes

  1. The sample and target folders should contain the input and output data generated using the generate command.

  2. The UNet architecture is defined by n_blocks and start_filters. Increasing these values will create a more complex model.

  3. batch_size, max_epochs, and learning_rate are key training parameters that affect the learning process.

  4. The save_model option specifies where the final trained model will be saved.

  5. Use load_model to continue training from a previously saved checkpoint; see the first example after this list.

  6. train_size determines the split between training and validation data.

  7. n_gpus and strategy are used for distributed training across multiple GPUs.

  8. The checkpointing system writes models to checkpoint_folder at the interval set by every_n_epochs and keeps the save_top_k best models based on validation performance.

  9. The reshape_base and dim options reshape flat 1D input data onto a grid, which may be necessary depending on your data format; see the second example after this list.
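
As a minimal sketch of resuming training (the folder paths and the continued checkpoint name are illustrative), load a previously saved checkpoint with --load_model and write the continued run to a new file with --save_model:

# illustrative paths and checkpoint names
singularity exec isciml.sif isciml train \
    --sample_folder /path/to/samples \
    --target_folder /path/to/targets \
    --load_model pytorch_model.ckpt \
    --save_model pytorch_model_continued.ckpt

If your samples are stored as flat 1D arrays, a similar sketch (again with illustrative paths) reshapes them onto a 2D grid using base 2:

# illustrative paths
singularity exec isciml.sif isciml train \
    --sample_folder /path/to/samples \
    --target_folder /path/to/targets \
    --reshape_base two \
    --dim 2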

Related Commands

  • generate-models: Used to create the physical models.

  • generate: Used to create the training data.

  • inference: Used to apply the trained model to new data.

See Also

For more information on model architectures, training strategies, and hyperparameter tuning, refer to the isciml documentation on machine learning models and training processes.
