Training Initialization

The training process of InternEvo consists of two steps:

Initialization
- Initialize model, optimizer, dataloader, trainer, and create different types of process groups to prepare for iterative steps of hybrid parallel training.
- Initialize logger, checkpoint manager, monitor manager, and profiler to watch, alert, and record the iterative training steps.
Iterative training steps
- Load the training engine and scheduler for hybrid parallel training according to the configuration such as tensor parallel size, pipeline parallel size, and data parallel size.
- In each iterative training step, the Trainer API is called to zero the gradients, run forward-loss-backward, and update the parameters.
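
The three operations in an iterative step can be sketched in plain Python. This is a minimal illustration on a one-parameter model with a hand-written gradient, not InternEvo's Trainer API; the names ToyModel, zero_grad, forward_loss, backward, step, and train are hypothetical stand-ins chosen for this example.

```python
class ToyModel:
    """Hypothetical one-parameter model y = w * x with a manual gradient."""

    def __init__(self, w: float = 0.0) -> None:
        self.w = w
        self.grad = 0.0

    def zero_grad(self) -> None:
        # Step 1: clear the gradient left over from the previous iteration.
        self.grad = 0.0

    def forward_loss(self, x: float, target: float) -> float:
        # Step 2a: forward pass followed by a squared-error loss.
        return (self.w * x - target) ** 2

    def backward(self, x: float, target: float) -> None:
        # Step 2b: accumulate d(loss)/dw = 2 * (w * x - target) * x.
        self.grad += 2.0 * (self.w * x - target) * x

    def step(self, lr: float) -> None:
        # Step 3: plain SGD parameter update.
        self.w -= lr * self.grad


def train(model, batches, lr=0.05, total_steps=200):
    """Run the zero-grad, forward-loss-backward, update cycle."""
    losses = []
    for i in range(total_steps):
        x, target = batches[i % len(batches)]
        model.zero_grad()
        losses.append(model.forward_loss(x, target))
        model.backward(x, target)
        model.step(lr)
    return losses


model = ToyModel()
losses = train(model, batches=[(1.0, 2.0), (2.0, 4.0)])
```

Both toy batches are consistent with w = 2, so the loop drives model.w toward that value. In a real hybrid parallel run, each of the three calls would also involve communication across process groups.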

Figure: InternEvo training process (_images/hybrid_parallel_training.png)

Argument Parsing

InternEvo uses the argparse library to supply command-line configuration to the InternEvo runtime.

Use internlm.initialize.get_default_parser() to get InternEvo’s default parser, which comes with some built-in arguments. Users can add custom arguments to this parser.

import internlm.initialize

# Get InternEvo default parser
parser = internlm.initialize.get_default_parser()
# Add a new argument
parser.add_argument("--user_arg", type=int, default=-1, help="argument added by the user.")
cmd_args = parser.parse_args()
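
The same pattern can be tried without InternEvo installed. The sketch below uses only the standard argparse module; get_default_parser_stub is a hypothetical stand-in for internlm.initialize.get_default_parser(), and the --config argument it defines is an assumption made for illustration.

```python
import argparse


def get_default_parser_stub() -> argparse.ArgumentParser:
    """Hypothetical stand-in for internlm.initialize.get_default_parser()."""
    parser = argparse.ArgumentParser()
    # Assumed built-in argument, for illustration only.
    parser.add_argument("--config", type=str, default=None, help="path to the config file")
    return parser


parser = get_default_parser_stub()
# A user-defined argument is added exactly as with any argparse parser.
parser.add_argument("--user_arg", type=int, default=-1, help="argument added by the user")

# Passing an explicit list avoids reading sys.argv, which is convenient in tests.
cmd_args = parser.parse_args(["--config", "configs/demo.py", "--user_arg", "3"])
defaults = parser.parse_args([])
```

Here cmd_args.user_arg holds the parsed integer, while defaults.user_arg falls back to the declared default of -1.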

Model Initialization

InternEvo uses the fields model_type and model in the config file to control the model initialization process. An example model initialization configuration is shown below:

model_type = "INTERNLM"  # default is "INTERNLM", used to register classes and modules for model initialization
NUM_ATTENTION_HEAD = 32
VOCAB_SIZE = 103168
HIDDEN_SIZE = 4096
NUM_LAYER = 32
MLP_RATIO = 8 / 3
model = dict(
    checkpoint=False,  # The proportion of layers for activation checkpointing; valid values are True/False/[0-1]
    num_attention_heads=NUM_ATTENTION_HEAD,
    embed_split_hidden=True,
    vocab_size=VOCAB_SIZE,
    embed_grad_scale=1,
    parallel_output=True,
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYER,
    mlp_ratio=MLP_RATIO,
    apply_post_layer_norm=False,
    dtype="torch.bfloat16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
    norm_type="rmsnorm",
    layer_norm_epsilon=1e-5,
    use_flash_attn=True,
    num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
)

The field model_type specifies which registered model type is to be initialized.
The parameters in the field model specify the configuration settings used during model initialization.
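
A few quantities implied by the example configuration can be worked out directly. The sketch below is illustrative arithmetic only: checkpointed_layers is a hypothetical helper showing one plausible reading of checkpoint=True/False/[0-1], and the feed-forward width is shown unrounded, whereas an implementation may round it to a hardware-friendly multiple.

```python
NUM_ATTENTION_HEAD = 32
HIDDEN_SIZE = 4096
NUM_LAYER = 32
MLP_RATIO = 8 / 3

# Per-head dimension: the hidden size must divide evenly among the heads.
assert HIDDEN_SIZE % NUM_ATTENTION_HEAD == 0
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEAD

# Unrounded feed-forward width implied by mlp_ratio.
mlp_hidden = int(HIDDEN_SIZE * MLP_RATIO)


def checkpointed_layers(checkpoint, num_layers: int) -> int:
    """Hypothetical helper: map checkpoint=True/False/[0-1] to a layer count."""
    # bool is checked first because True and False are also valid numbers in Python.
    if isinstance(checkpoint, bool):
        return num_layers if checkpoint else 0
    if not 0.0 <= checkpoint <= 1.0:
        raise ValueError("checkpoint must be a bool or a ratio in [0, 1]")
    return int(num_layers * checkpoint)
```

With these values, each attention head has dimension 128, and checkpointed_layers(0.5, NUM_LAYER) selects half of the 32 layers.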

Users can define new model types and register their model initialization functions via register_module. An example is shown as follows:

model_initializer = Registry("model_initializer")

def register_model_initializer() -> None:
    model_initializer.register_module("INTERNLM", InternLM1)

In this context, “INTERNLM” is the new model type, and InternLM1 is the entry function for the new model.
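
To make the registration mechanism concrete, here is a minimal registry sketch in plain Python. SimpleRegistry, get_module, build_my_model, and the model type "MY_MODEL" are hypothetical names for this illustration; InternEvo's own Registry class may differ in its interface and behavior.

```python
from typing import Any, Callable, Dict


class SimpleRegistry:
    """Hypothetical minimal registry; not InternEvo's Registry class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._modules: Dict[str, Callable[..., Any]] = {}

    def register_module(self, module_name: str, func: Callable[..., Any]) -> None:
        # Refuse duplicates so one model type always maps to one entry function.
        if module_name in self._modules:
            raise KeyError(f"{module_name} is already registered in {self.name}")
        self._modules[module_name] = func

    def get_module(self, module_name: str) -> Callable[..., Any]:
        if module_name not in self._modules:
            raise KeyError(f"{module_name} is not registered in {self.name}")
        return self._modules[module_name]


def build_my_model(**model_kwargs: Any) -> Dict[str, Any]:
    """Hypothetical entry function; a real one would return a model object."""
    return {"type": "MY_MODEL", **model_kwargs}


registry = SimpleRegistry("model_initializer")
registry.register_module("MY_MODEL", build_my_model)

# Look up the entry function by model_type, then call it with the model dict.
model_type = "MY_MODEL"
model_cfg = dict(hidden_size=4096, num_layers=32)
model = registry.get_module(model_type)(**model_cfg)
```

The lookup by string key is what lets the model_type field in a config file select the entry function at run time.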

Dataloader Initialization

InternEvo uses the field data in the configuration file to control the initialization process of the data loader. The example configuration for initializing the data loader is defined as follows:

TRAIN_FOLDER = None  # "/path/to/dataset"
VALID_FOLDER = None  # "/path/to/dataset"
data = dict(
    seq_len=SEQ_LEN,
    # micro_num means the number of micro_batch contained in one gradient update
    micro_num=4,
    # packed_length = micro_bsz * SEQ_LEN
    micro_bsz=2,
    # defaults to the value of micro_num
    valid_micro_num=4,
    # defaults to 0, means disable evaluate
    valid_every=50,
    pack_sample_into_one=False,
    total_steps=50000,
    skip_batches="",
    # rampup_batch_size (str): A string with three space-separated integers representing the
    #       starting batch size, the increment, and the number of steps between
    #       each increment. For example, "192 24 8" means that the batch size (micro_num)
    #       starts at 192 and increases by 24 every 8 steps. Defaults to None.
    #       (IMPORTANT): The interval step size is 'micro_bsz'.
    rampup_batch_size="",
    # Datasets with fewer than 50 rows will be discarded
    min_length=50,
    train_folder=TRAIN_FOLDER,
    valid_folder=VALID_FOLDER,
    empty_cache_and_diag_interval=200,
    diag_outlier_ratio=1.1,
    # whether to use shared memory to load meta files
    use_shm=False,
    # when shm is used, the default shm_path is "/dev/shm/metacache"
    # shm_path="/dev/shm/metacache"
)
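
The batch-related fields combine by simple arithmetic. The sketch below follows the comments in the configuration; SEQ_LEN = 2048 and data_parallel_size = 8 are assumed values, and rampup_value is a hypothetical reading of the "start increment interval" string, not InternEvo's implementation.

```python
SEQ_LEN = 2048  # assumed value; the excerpt does not define SEQ_LEN
micro_num = 4
micro_bsz = 2
data_parallel_size = 8  # assumed for illustration

# From the config comment: packed_length = micro_bsz * SEQ_LEN
packed_length = micro_bsz * SEQ_LEN

# Tokens consumed per gradient update on one data-parallel rank, and overall.
tokens_per_rank = micro_num * packed_length
tokens_per_step = tokens_per_rank * data_parallel_size


def parse_rampup(spec: str):
    """Parse "start increment interval"; an empty string disables ramp-up."""
    if not spec:
        return None
    start, increment, interval = (int(part) for part in spec.split())
    return start, increment, interval


def rampup_value(spec: str, step: int, target: int) -> int:
    """Hypothetical schedule: grow by increment every interval steps, capped at target."""
    parsed = parse_rampup(spec)
    if parsed is None:
        return target
    start, increment, interval = parsed
    return min(target, start + increment * (step // interval))
```

For the string "192 24 8" from the config comment, this schedule stays at 192 for steps 0 to 7, moves to 216 at step 8, and keeps growing until it reaches the target.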

This section supports the initialization of three types of datasets, including dummy datasets, tokenized datasets, and streaming datasets.

dummy dataset

If TRAIN_FOLDER is set to None, a dummy dataset is randomly generated. With the same random seed, the generated dataset is identical across runs.

tokenized dataset

If TRAIN_FOLDER is set to a local path that contains the .bin and .meta files produced by tokenization, a tokenized dataset is loaded.

streaming dataset

If TRAIN_FOLDER is set to a local path that contains a dataset downloaded from HuggingFace, and the new fields type and tokenizer_path are added to the data configuration, a streaming dataset is loaded.

type="streaming",
tokenizer_path="/path/to/tokenizer",

For detailed instructions on the formats of tokenized datasets and streaming datasets, please refer to the User Guide.
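
The choice among the three dataset types can be summarized as a small decision function. select_dataset_kind is a hypothetical helper written for this illustration; it only mirrors the rules described in this section and does not load any data.

```python
from typing import Optional


def select_dataset_kind(train_folder: Optional[str], data_type: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the three cases described in this section."""
    if train_folder is None:
        # No folder given: fall back to a randomly generated dummy dataset.
        return "dummy"
    if data_type == "streaming":
        # Folder plus type="streaming": raw data tokenized while loading.
        return "streaming"
    # Folder without a streaming type: pre-tokenized .bin and .meta files.
    return "tokenized"


kinds = [
    select_dataset_kind(None),
    select_dataset_kind("/path/to/dataset"),
    select_dataset_kind("/path/to/dataset", data_type="streaming"),
]
```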

Parallel Communication Initialization

Initialize the communication status under different parallel modes using the initialize_parallel_communicator function.

The registered communication depends on the parallel mode:
- ISP: handle overlap optimization and register the all-gather communication for linear layers.
- MTP: register communication functions for weights that are row-wise and column-wise partitioned.
- MSP and FSP: register communication functions for sequence parallelism.
- MoE models: register the MoE serialized parallel communication function.
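
The per-mode behavior described here can be laid out as a lookup table. The sketch below is a hypothetical summary for orientation only: communication_setup, its mode keys, and the treatment of MoE as an addition on top of the chosen mode are assumptions, and the function does not reflect the signature of initialize_parallel_communicator.

```python
from typing import List

# Hypothetical mode keys and descriptions, taken from the text of this section.
COMMUNICATION_BY_MODE = {
    "isp": ["overlap optimization", "all-gather for linear layers"],
    "mtp": ["row-wise weight communication", "column-wise weight communication"],
    "msp": ["sequence parallel communication"],
    "fsp": ["sequence parallel communication"],
}


def communication_setup(mode: str, is_moe: bool = False) -> List[str]:
    """Return the communication pieces this section lists for a parallel mode."""
    if mode not in COMMUNICATION_BY_MODE:
        raise ValueError(f"unknown parallel mode: {mode}")
    steps = list(COMMUNICATION_BY_MODE[mode])
    if is_moe:
        # Assumption: MoE communication is registered in addition to the mode's own.
        steps.append("MoE communication")
    return steps
```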

Optimizer Initialization

InternEvo utilizes the fields grad_scaler, hybrid_zero_optimizer, adam, lr_scheduler, and beta2_scheduler in the configuration file to control the initialization process of the optimizer. An example configuration for initializing the optimizer is defined as follows:

grad_scaler = dict(
    fp16=dict(
        # the initial loss scale, defaults to 2**16
        initial_scale=2**16,
        # the minimum loss scale, defaults to None
        min_scale=1,
        # the number of steps to increase loss scale when no overflow occurs
        growth_interval=1000,
    ),
    # the multiplication factor for increasing loss scale, defaults to 2
    growth_factor=2,
    # the multiplication factor for decreasing loss scale, defaults to 0.5
    backoff_factor=0.5,
    # the maximum loss scale, defaults to None
    max_scale=2**24,
    # the number of overflows before decreasing loss scale, defaults to 2
    hysteresis=2,
)

hybrid_zero_optimizer = dict(
    # Enable low_level_optimizer overlap_communication
    overlap_sync_grad=True,
    overlap_sync_param=False,
    # bucket size for nccl communication params
    reduce_bucket_size=512 * 1024 * 1024,
    # grad clipping
    clip_grad_norm=1.0,
    # whether to use the new split-tensor optimizer
    use_split_tensor_optim=False,
    # when the split-tensor optimizer is used,
    # perform all-gather with a set of parameters of all_gather_size
    all_gather_size=512 * 1024 * 1024,
)

adam = dict(
    lr=1e-4,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_beta2_c=0,
    adam_eps=1e-8,
    weight_decay=0.01,
)

lr_scheduler = dict(
    total_steps=data["total_steps"],
    init_steps=0,  # optimizer_warmup_step
    warmup_ratio=0.01,
    eta_min=1e-5,
    last_epoch=-1,
)

beta2_scheduler = dict(
    init_beta2=adam["adam_beta2"],
    c=adam["adam_beta2_c"],
    cur_iter=-1,
)
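
The grad_scaler fields describe dynamic loss scaling. The sketch below shows one common way such fields interact; DynamicLossScaler and its update method are hypothetical names for this illustration, and the exact bookkeeping in InternEvo's scaler may differ.

```python
class DynamicLossScaler:
    """Hypothetical dynamic loss scaler driven by the grad_scaler fields."""

    def __init__(self, initial_scale=2**16, min_scale=1, max_scale=2**24,
                 growth_interval=1000, growth_factor=2, backoff_factor=0.5,
                 hysteresis=2):
        self.scale = float(initial_scale)
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.growth_interval = growth_interval
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.hysteresis = hysteresis
        self._good_steps = 0  # consecutive steps without overflow
        self._overflows = 0   # overflows seen since the last scale change

    def update(self, overflow: bool) -> float:
        """Adjust the scale after one training step and return the new value."""
        if overflow:
            self._good_steps = 0
            self._overflows += 1
            # Back off only after `hysteresis` overflows have been observed.
            if self._overflows >= self.hysteresis:
                self.scale = max(self.min_scale, self.scale * self.backoff_factor)
                self._overflows = 0
        else:
            self._good_steps += 1
            # Grow after `growth_interval` consecutive clean steps.
            if self._good_steps >= self.growth_interval:
                self.scale = min(self.max_scale, self.scale * self.growth_factor)
                self._good_steps = 0
                self._overflows = 0
        return self.scale


scaler = DynamicLossScaler(growth_interval=3)
history = [scaler.update(flag) for flag in (True, True, False, False, False)]
```

In this run the first overflow leaves the scale unchanged, the second halves it, and three clean steps double it again. The min_scale and max_scale bounds keep the scale from collapsing or growing without limit.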

Users initialize the optimizer with the initialize_optimizer function, passing in the isp_communicator parameter to handle communication in the ISP parallel mode.
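
The lr_scheduler fields in the configuration (total_steps, warmup_ratio, eta_min) suggest a warmup phase followed by decay toward a minimum learning rate. The sketch below assumes linear warmup and cosine annealing; lr_at_step is a hypothetical function for illustration and is not taken from InternEvo's scheduler.

```python
import math


def lr_at_step(step, base_lr=1e-4, total_steps=50000, warmup_ratio=0.01, eta_min=1e-5):
    """Hypothetical linear-warmup plus cosine-annealing schedule."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to eta_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points: start, end of warmup, midpoint, end.
samples = [lr_at_step(s) for s in (0, 499, 25250, 50000)]
```

With the example values, warmup covers the first 500 steps, after which the learning rate decays from lr=1e-4 toward eta_min=1e-5.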

Trainer Initialization

The initialize_trainer function initializes the trainer that drives the training process. It requires parameters such as the created model, the initialized optimizer, the scheduler, and other related parameters.