Monitor and Alert

Monitoring

InternEvo uses internlm.monitor.initialize_monitor_manager() to initialize the monitoring context. Within that context, a singleton internlm.monitor.monitor.MonitorManager manages the monitoring thread and tracks training status through internlm.monitor.monitor.MonitorTracker.
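The pattern described above (a singleton manager that owns a background monitoring thread for the duration of a context) can be sketched as follows. This is only an illustration: SketchMonitorManager and sketch_initialize_monitor are hypothetical names, not InternEvo's classes or functions, and the real MonitorManager may be structured differently.

```python
import threading
import time
from contextlib import contextmanager


class SketchMonitorManager:
    """Hypothetical stand-in for a singleton monitor manager (not InternEvo's class)."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Create the single shared instance on first use; reuse it afterwards.
        with cls._lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance._thread = None
                instance._stop = threading.Event()
                instance.checks = 0
                cls._instance = instance
        return cls._instance

    def start(self, interval=0.01):
        # Launch the background thread that performs periodic checks.
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, args=(interval,), daemon=True
        )
        self._thread.start()

    def _run(self, interval):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(interval):
            self.checks += 1  # placeholder for a real status check

    def stop(self):
        # Signal the thread to exit and wait for it to finish.
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None


@contextmanager
def sketch_initialize_monitor():
    """Hypothetical context manager: monitoring runs only inside the with-block."""
    manager = SketchMonitorManager()
    manager.start()
    try:
        yield manager
    finally:
        manager.stop()
```

Under these assumptions, training code would run inside `with sketch_initialize_monitor() as manager:` so that the monitoring thread is stopped even if training raises an exception.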

Alerting

The InternEvo monitoring thread periodically checks for loss spikes, potential stuck conditions, runtime exceptions, and SIGTERM signals. When any of these occurs, an alert is triggered and a message is sent to the Feishu webhook address by calling internlm.monitor.alert.send_feishu_msg_with_webhook().
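To make the alert flow concrete, here is a minimal sketch of a loss-spike check followed by a text message posted to a Feishu custom-bot webhook. The function names, the spike rule, and the ratio threshold are assumptions for illustration; this is not the body of send_feishu_msg_with_webhook(), whose signature is not shown in this document. The payload shape (msg_type "text" with a content.text field) follows the Feishu custom bot text-message format.

```python
import json
import urllib.request


def is_loss_spike(previous_loss, current_loss, ratio=2.0):
    """Hypothetical rule: flag a spike when the loss grows by more than `ratio`."""
    return previous_loss > 0 and current_loss / previous_loss > ratio


def build_text_payload(message):
    """Build a plain-text message body for a Feishu custom-bot webhook."""
    return {"msg_type": "text", "content": {"text": message}}


def send_alert(webhook_url, message, timeout=5):
    """POST the message to the webhook and return the HTTP status code."""
    data = json.dumps(build_text_payload(message)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A caller might combine the two as `if is_loss_spike(prev, cur): send_alert(url, "loss spike detected")`, where `url` is the configured webhook address.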

Light Monitoring

The InternEvo light monitoring tool uses a heartbeat mechanism to monitor metrics in real time during training, such as loss, grad_norm, and the duration of each training phase. InternEvo can also present these metrics on a Grafana dashboard, so users can analyze a training run in more depth and in a more intuitive way.
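The heartbeat idea can be illustrated with a small tracker that records the latest metrics on each beat and reports when no beat has arrived within a timeout. This is a generic sketch, not InternEvo's light monitoring code: HeartbeatTracker, its timeout parameter, and the injectable clock are all hypothetical.

```python
import time


class HeartbeatTracker:
    """Hypothetical heartbeat tracker: stores the latest metrics and beat time."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the tracker can be tested without sleeping
        self.last_beat = clock()
        self.metrics = {}

    def beat(self, **metrics):
        # Record that training is alive and remember the newest metric values.
        self.last_beat = self.clock()
        self.metrics.update(metrics)

    def is_stale(self):
        # True when no heartbeat has been received within the timeout.
        return self.clock() - self.last_beat > self.timeout_s
```

A training loop would call something like `tracker.beat(loss=loss, grad_norm=grad_norm)` once per step, while a monitor polls `tracker.is_stale()` to detect a run that may be stuck.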

Light monitoring is configured through the monitor field of the configuration file; users change monitoring settings by editing that field. Here is an example of a monitoring configuration:

monitor = dict(
    alert=dict(
        enable_feishu_alert=False,
        feishu_alert_address=None,
        light_monitor_address=None,
        alert_file_path=f"llm_alter/{JOB_NAME}_alert.log",
    ),
)
  • enable_feishu_alert: Whether to enable Feishu alerts. Default: False.

  • feishu_alert_address: The webhook address for Feishu alerts. Default: None.

  • light_monitor_address: The address for light monitoring. Default: None.

  • alert_file_path: The path of the alert log file. Default: None.
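As an illustration of how these fields and their defaults fit together, the sketch below merges a user-supplied monitor dict with the defaults listed above. The helper resolve_alert_config is hypothetical, and the check that a webhook address must be present when Feishu alerts are enabled is an assumption, not documented InternEvo behavior.

```python
# Defaults taken from the field list above.
DEFAULT_ALERT = {
    "enable_feishu_alert": False,
    "feishu_alert_address": None,
    "light_monitor_address": None,
    "alert_file_path": None,
}


def resolve_alert_config(monitor):
    """Hypothetical helper: fill in missing alert fields with their defaults."""
    alert = dict(DEFAULT_ALERT)
    alert.update((monitor or {}).get("alert", {}))

    # Reject misspelled or unsupported keys early.
    unknown = set(alert) - set(DEFAULT_ALERT)
    if unknown:
        raise KeyError(f"unknown alert options: {sorted(unknown)}")

    # Assumed rule: enabling Feishu alerts without an address is a config error.
    if alert["enable_feishu_alert"] and not alert["feishu_alert_address"]:
        raise ValueError("feishu_alert_address is required when alerts are enabled")
    return alert
```

For example, `resolve_alert_config(dict(alert=dict(alert_file_path="alerts.log")))` keeps Feishu alerts disabled and only sets the log path.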