Training Startup Script

After installing InternEvo, users need to write their own training startup script. For an example, refer to: train.py

The script runs in three steps: parameter parsing, initialization, and starting training. For details on how parameter parsing and initialization work, please refer to: Training Initialization

Configuration Parameter Parsing

args = parse_args()

Call the parse_args function to parse the parameters set in the configuration file when starting the training. For more details, see: Argument Parsing
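To make the parsing step concrete, here is a minimal sketch of what such a function could look like using Python's standard argparse module. It is a stand-in, not InternEvo's actual parse_args: the option names simply mirror the attributes used later in this page (args.config, args.launcher, args.port, args.seed), and the defaults and the example config path are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical stand-in for the parse_args used by a training script."""
    parser = argparse.ArgumentParser(description="Training startup arguments (illustrative)")
    # Path to the training configuration file.
    parser.add_argument("--config", type=str, required=True, help="path to the config file")
    # Launcher that started the processes; the text mentions Slurm and Torch.
    parser.add_argument("--launcher", type=str, default="slurm", choices=["slurm", "torch"])
    # Port used by the master process (default value is an assumption).
    parser.add_argument("--port", type=int, default=8888)
    # Random seed for the process (default value is an assumption).
    parser.add_argument("--seed", type=int, default=1024)
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
args = parse_args(["--config", "configs/example.py", "--launcher", "torch"])
print(args.config, args.launcher, args.port, args.seed)
```

Accepting an optional argv list makes the function easy to exercise in tests; when it is None, argparse falls back to sys.argv.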

Initialization Process

  • Initialize Distributed Training Environment

initialize_distributed_env(config=args.config, launcher=args.launcher, master_port=args.port, seed=args.seed)

Call the initialize_distributed_env function, which supports launching the training script through either Slurm or Torch. It takes the configuration file, the port number, and the random seed for the process.
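One practical difference between the two launchers is where each process finds its rank information. The sketch below is not InternEvo's implementation; it only illustrates the idea using the environment variables that Slurm (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS) and torchrun (RANK, LOCAL_RANK, WORLD_SIZE) set for each process. The helper name resolve_rank_info is hypothetical.

```python
import os


def resolve_rank_info(launcher, env=None):
    """Return (rank, local_rank, world_size) from launcher-specific variables.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    if launcher == "slurm":
        # Variables Slurm sets for each task started with srun.
        return int(env["SLURM_PROCID"]), int(env["SLURM_LOCALID"]), int(env["SLURM_NTASKS"])
    if launcher == "torch":
        # Variables torchrun sets for each worker process.
        return int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])
    raise ValueError(f"unsupported launcher: {launcher}")


# A fake environment stands in for a real Slurm job here.
fake_slurm_env = {"SLURM_PROCID": "3", "SLURM_LOCALID": "1", "SLURM_NTASKS": "8"}
print(resolve_rank_info("slurm", fake_slurm_env))  # (3, 1, 8)
```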

  • Initialize Model

model = initialize_model_and_parallel_communicator()

For a detailed introduction, refer to: Model Initialization

  • Initialize Training Dataloader

train_dl, dataset_types = build_train_loader_with_data_type()

For a detailed introduction, refer to: Dataloader Initialization

  • Initialize Validation Dataloader

val_dls = build_valid_loader_with_data_type()

Initialize the validation data loader, which has a loading process similar to that of the training data. The path to the validation dataset is set through the VALID_FOLDER field in the configuration file.
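As a rough illustration, assuming the configuration file is written in Python, the field might be set as in the fragment below. Only the VALID_FOLDER name comes from the text above; the placeholder path and the valid_folder key inside the data dictionary are assumptions.

```python
# Hypothetical excerpt of a Python-style configuration file.
VALID_FOLDER = "/path/to/valid_dataset"  # placeholder path, replace with a real one

data = dict(
    valid_folder=VALID_FOLDER,  # key name is an assumption for illustration
)
```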

  • Initialize Trainer

trainer = TrainerBuilder(model, train_dl, val_dls, **kwargs)

TrainerBuilder inherits from the Trainer class, and InternEvo's training API is managed by internlm.core.trainer.Trainer. Once the training engine and scheduler are defined, the Trainer API can be called to perform model training, evaluation, gradient clearing, parameter updates, and so on.

For detailed usage, please refer to the Trainer API documentation and examples.

Start Training Process

trainer.fit()

First, the self.train() method sets the model to training mode.

In each training step, load_new_batch loads the next batch of data. The execute_schedule scheduler then launches training, and forward_backward_step runs the forward and backward passes. Next, self.step() updates the parameters and returns the gradient values. When the step count reaches the validation interval, the model is evaluated with evaluate_on_val_dls. Finally, if checkpoint saving is enabled, try_save_checkpoint saves the intermediate training state and the final training results.
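The order of operations in this loop can be sketched with a toy example in plain Python. It is not InternEvo code: ToyTrainer, evaluate, and fit are hypothetical stand-ins that fit y = w * x with gradient descent, and the valid_every and save_every intervals are assumed parameters. The sketch only mirrors the sequence described above: set training mode, load a batch, run forward and backward, update parameters, validate periodically, and save checkpoints periodically.

```python
class ToyTrainer:
    """Toy stand-in for a trainer; fits y = w * x with plain gradient descent."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.grad = 0.0
        self.lr = lr
        self.training = False

    def train(self):
        # Switch to training mode.
        self.training = True

    def zero_grad(self):
        # Clear the gradient left over from the previous step.
        self.grad = 0.0

    def forward_backward(self, batch):
        # Mean squared error loss and its gradient with respect to w.
        n = len(batch)
        loss = sum((self.w * x - y) ** 2 for x, y in batch) / n
        self.grad = sum(2 * (self.w * x - y) * x for x, y in batch) / n
        return loss

    def step(self):
        # Update the parameter and return the gradient magnitude.
        self.w -= self.lr * self.grad
        return abs(self.grad)


def evaluate(trainer, val_batch):
    # Mean squared error on held-out data, without touching the gradient.
    return sum((trainer.w * x - y) ** 2 for x, y in val_batch) / len(val_batch)


def fit(trainer, batches, val_batch, total_steps, valid_every, save_every):
    checkpoints = []
    trainer.train()
    for step in range(1, total_steps + 1):
        batch = batches[(step - 1) % len(batches)]  # load a new batch
        trainer.zero_grad()
        trainer.forward_backward(batch)             # forward and backward pass
        trainer.step()                              # parameter update
        if step % valid_every == 0:
            print(f"step {step}: val_loss={evaluate(trainer, val_batch):.6f}")
        if step % save_every == 0:
            checkpoints.append((step, trainer.w))   # stand-in for saving a checkpoint
    return checkpoints


trainer = ToyTrainer(lr=0.1)
batches = [[(1.0, 2.0), (2.0, 4.0)]]  # samples of y = 2 * x
checkpoints = fit(trainer, batches, val_batch=[(3.0, 6.0)],
                  total_steps=20, valid_every=5, save_every=10)
```

Keeping validation and checkpointing as interval checks inside the loop means both can be switched off by choosing an interval larger than the total number of steps.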