[TOOLS] Monitor training abnormalities and send reminders #410

Open: wants to merge 5 commits into base: main
111 changes: 111 additions & 0 deletions tools/monitor/README.md
@@ -0,0 +1,111 @@
# Training Log Monitor

This tool monitors training logs on remote servers, checks them for anomalies, and sends reminders by email or through a Feishu robot. It aims to surface problems during training promptly, including training that has stalled or slowed down.

## NOTE

For email reminders:
This program asks for the source email password at runtime, so run it only in a secure environment to avoid leaking the password.

For Feishu robot reminders:
This program asks for the Feishu robot URL at runtime, so run it only in a secure environment to avoid leaking the URL. To configure the robot, follow https://open.feishu.cn/document/client-docs/bot-v3/add-custom-bot and set the keyword to "monitor".

Training anomaly detection relies on statistical analysis of historical training data. Before relying on it, manually observe the logs for a while to confirm that at least the first 10 iterations are normal.
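
The statistical methods used by the tool are not described here, so the following is only a minimal sketch of one possible approach: comparing the newest iteration time against the median of earlier iterations, scaled by the median absolute deviation. The function name, the threshold, and the 5% fallback are assumptions made for illustration, not part of `monitor.py`.

```python
import statistics

WARMUP_ITERATIONS = 10  # assumption: mirrors the "first 10 iterations" note above


def is_slow_iteration(history, latest, threshold=3.0):
    """Return True if `latest` is unusually slow relative to `history`.

    history: elapsed times (seconds) of earlier iterations assumed to be normal.
    latest: elapsed time (seconds) of the newest iteration.
    threshold: allowed deviation above the median, in units of scaled MAD.
    """
    if len(history) < WARMUP_ITERATIONS:
        return False  # not enough baseline data to judge yet
    median = statistics.median(history)
    # Median absolute deviation: robust to a few outliers in the history.
    mad = statistics.median(abs(t - median) for t in history)
    # Fall back to 5% of the median when the history is almost constant.
    scale = max(1.4826 * mad, 0.05 * median)
    return latest - median > threshold * scale
```

This is why the first iterations need to be verified by hand: they form the baseline against which later iterations are judged.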

## Features

- Monitors a remote log file for training status.
- Sends alert notifications that describe the detected anomaly, based on the log analysis results, and includes sample log content for clarity.
- Configurable check interval.
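
To illustrate how these features could fit together, here is a hedged sketch of a polling loop that raises an alert when the log stops changing between two checks. `fetch_tail` and `on_alert` are hypothetical callables introduced for this example; the actual loop in `monitor.py` may be structured differently.

```python
import time


def watch_log(fetch_tail, on_alert, check_interval, max_checks=None, sleep=time.sleep):
    """Poll a log and alert when it stops changing between checks.

    fetch_tail: hypothetical callable returning the latest log lines as a string.
    on_alert: hypothetical callable that receives an alert message.
    check_interval: seconds to wait between checks (as in the config file).
    max_checks: stop after this many checks; None means run until interrupted.
    """
    previous = None
    checks = 0
    while max_checks is None or checks < max_checks:
        current = fetch_tail()
        # Identical output on two consecutive checks suggests a stalled job.
        if previous is not None and current == previous:
            on_alert(f"monitor: no new log output in the last {check_interval} seconds")
        previous = current
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(check_interval)
```

Passing `sleep` and `max_checks` as parameters keeps the loop easy to exercise without waiting for real intervals.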

## Prerequisites

Before running the script, ensure that password-free (key-based) SSH login to the remote host is set up.
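
One way to verify this prerequisite is to run `ssh` with the OpenSSH option `BatchMode=yes`, which makes the client fail instead of prompting for a password. The sketch below assumes an OpenSSH client is on `PATH`; the helper names are hypothetical and not part of the tool.

```python
import subprocess


def build_ssh_check_command(remote_user, remote_host, remote_port=22):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt; fail if a password is needed
        "-o", "ConnectTimeout=10",  # give up quickly on unreachable hosts
        "-p", str(remote_port),
        f"{remote_user}@{remote_host}",
        "true",                     # trivial remote command
    ]


def can_login_without_password(remote_user, remote_host, remote_port=22):
    """Return True if the key-based login succeeds."""
    result = subprocess.run(
        build_ssh_check_command(remote_user, remote_host, remote_port),
        capture_output=True,
    )
    return result.returncode == 0
```

For example, `can_login_without_password("example_user", "192.0.2.1")` uses the same placeholder values as the configuration files.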

## Installation

```bash
git clone https://github.com/FlagOpen/FlagScale.git
cd FlagScale/tools/monitor
pip install -r requirements.txt
```

## Configuration

1. For Email:

Edit the provided example configuration file 'config-email.yaml' and set actual values:

```yaml
# Target email address for receiving alerts
target_email: [email protected] # The email address that will receive alerts

# SMTP server setup for sending emails
smtp_server: smtp.example.com # The SMTP server used for sending emails

# Email address used to send alerts
source_email: [email protected] # The email address to send alerts from

# Remote host IP address for accessing log files
remote_host: 192.0.2.1 # The IP address of the remote host where logs are stored

# Username for SSH login to the remote host
remote_user: example_user # The username for SSH login

# Port number for SSH access
remote_port: 22 # Standard SSH port

# Path to the log file on the remote host
remote_log_path: /path/example_log_file.log # Path to the log file

# Interval in seconds for log checking
check_interval: 1200 # Check logs every 1200 seconds
```

2. For Feishu robot:

Edit the provided example configuration file 'config-feishu.yaml' and set actual values:

```yaml
# Remote host IP address for accessing log files
remote_host: 192.0.2.1 # The IP address of the remote host where logs are stored

# Username for SSH login to the remote host
remote_user: example_user # The username for SSH login

# Port number for SSH access
remote_port: 22 # Standard SSH port

# Path to the log file on the remote host
remote_log_path: /path/example_log_file.log # Path to the log file

# Interval in seconds for log checking
check_interval: 1200 # Check logs every 1200 seconds
```
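
Because a missing or mistyped key would otherwise only surface once monitoring has started, it can help to validate the loaded configuration up front. The following sketch works on a plain mapping and assumes the file has already been parsed, for example with PyYAML's `yaml.safe_load`. The key names come from the example files above; the function itself is hypothetical.

```python
# Keys shared by both example files, plus the email-only keys.
COMMON_KEYS = {"remote_host", "remote_user", "remote_port", "remote_log_path", "check_interval"}
EMAIL_KEYS = {"target_email", "smtp_server", "source_email"}


def validate_config(config, notice):
    """Check a loaded config mapping; `notice` is "email" or "feishu"."""
    if notice not in ("email", "feishu"):
        raise ValueError(f"unknown notice type: {notice!r}")
    required = COMMON_KEYS | (EMAIL_KEYS if notice == "email" else set())
    missing = sorted(required - set(config))
    if missing:
        raise ValueError(f"missing config keys: {', '.join(missing)}")
    port = config["remote_port"]
    if not isinstance(port, int) or not 0 < port < 65536:
        raise ValueError("remote_port must be an integer between 1 and 65535")
    interval = config["check_interval"]
    if not isinstance(interval, (int, float)) or interval <= 0:
        raise ValueError("check_interval must be a positive number of seconds")
    return config
```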


## Usage

1. For Email:

```bash
python monitor.py --notice email
```

You will then be prompted to enter your source email's password.

2. For Feishu robot:

```bash
python monitor.py --notice feishu
```

You will then be prompted to enter the Feishu robot URL.
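
As a rough sketch of what happens after the prompt, the code below reads a secret without echoing it and builds the two kinds of notification. It uses only the Python standard library. The text-message payload follows the Feishu custom bot format, and the keyword "monitor" is prepended when the message does not already contain it. All function names and the email subject are assumptions for illustration; the implementation in `monitor.py` may differ.

```python
import getpass
import json
import urllib.request
from email.message import EmailMessage

KEYWORD = "monitor"  # the Feishu custom-bot keyword from the NOTE section


def prompt_secret(label):
    """Read a secret (password or robot URL) without echoing it."""
    return getpass.getpass(f"{label}: ")


def build_email(source_email, target_email, body):
    """Build an alert email; the subject line is a hypothetical choice."""
    msg = EmailMessage()
    msg["Subject"] = "Training log monitor alert"
    msg["From"] = source_email
    msg["To"] = target_email
    msg.set_content(body)
    return msg


def build_feishu_request(webhook_url, body):
    """Build a POST request carrying a Feishu text message."""
    # Custom bots with a keyword filter only accept messages containing it.
    text = body if KEYWORD in body else f"{KEYWORD}: {body}"
    payload = {"msg_type": "text", "content": {"text": text}}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The email could then be sent with `smtplib` (for example `SMTP_SSL(smtp_server)` followed by `login` and `send_message`, assuming the server accepts implicit TLS), and the Feishu request with `urllib.request.urlopen`.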

## Next steps

We plan to add further monitoring capabilities, including:
- Notify when training ends.
- Perform communication group-based monitoring.
- Monitor hardware utilization anomalies.
- Other features based on user needs.
23 changes: 23 additions & 0 deletions tools/monitor/config-email.yaml
@@ -0,0 +1,23 @@
# Target email address for receiving alerts
target_email: [email protected] # This is the email address that will receive alerts

# SMTP server for sending emails; look it up in your email provider's settings
smtp_server: smtp.example.com # This is the SMTP server used for sending emails

# Email address used to send alerts
source_email: [email protected] # This is the email address used to send alerts

# Remote host IP address for accessing log files
remote_host: 192.0.2.1 # This IP address identifies the remote host where logs are stored

# Username for SSH login to the remote host
remote_user: example_user # This is the username used to log in to the remote host via SSH

# Port number used for SSH access
remote_port: 22 # This is the standard SSH port used to connect to the remote host

# Path to the log file on the remote host
remote_log_path: /path/example_log_file.log

# Interval in seconds for checking the logs
check_interval: 1200
14 changes: 14 additions & 0 deletions tools/monitor/config-feishu.yaml
@@ -0,0 +1,14 @@
# Remote host IP address for accessing log files
remote_host: 192.0.2.1 # This IP address identifies the remote host where logs are stored

# Username for SSH login to the remote host
remote_user: example_user # This is the username used to log in to the remote host via SSH

# Port number used for SSH access
remote_port: 22 # This is the standard SSH port used to connect to the remote host

# Path to the log file on the remote host
remote_log_path: /path/example_log_file.log

# Interval in seconds for checking the logs
check_interval: 1200