Commit 1f17e0c

Merge pull request #193 from amd/development
dev -> main
2 parents ca37412 + fba3bc8 commit 1f17e0c

36 files changed

Lines changed: 2474 additions & 1220 deletions

README.md

Lines changed: 23 additions & 11 deletions
@@ -78,7 +78,7 @@ usage: cli.py [-h] [--version] [--sys-name STRING]
 [--sys-location {LOCAL,REMOTE}]
 [--sys-interaction-level {PASSIVE,INTERACTIVE,DISRUPTIVE}]
 [--sys-sku STRING] [--sys-platform STRING]
-[--plugin-configs [STRING ...]] [--system-config STRING]
+[--plugin-configs LIST] [--system-config STRING]
 [--connection-config STRING] [--log-path STRING]
 [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
 [--no-console-log] [--gen-reference-config] [--skip-sudo]
@@ -112,10 +112,11 @@ options:
 --sys-sku STRING Manually specify SKU of system (default: None)
 --sys-platform STRING
 Specify system platform (default: None)
---plugin-configs [STRING ...]
-built-in config names or paths to plugin config JSONs.
-Available built-in configs: NodeStatus, AllPlugins
-(default: None)
+--plugin-configs LIST
+Comma-separated built-in names and/or plugin config
+JSON paths (e.g. --plugin-
+configs=NodeStatus,/path/c.json). Built-ins:
+NodeStatus, AllPlugins (default: None)
 --system-config STRING
 Path to system config json (default: None)
 --connection-config STRING
@@ -337,6 +338,16 @@ You can extend the built-in error detection with custom regex patterns. Create a
 "event_category": "SW_DRIVER",
 "event_priority": 4
 }
+],
+"priority_override_rules": [
+{
+"message": "Application Crash",
+"new_priority": "ERROR"
+},
+{
+"event_category": "SW_DRIVER",
+"new_priority": "WARNING"
+}
 ]
 }
 }
@@ -348,7 +359,7 @@ You can extend the built-in error detection with custom regex patterns. Create a
 Save this to `dmesg_custom_config.json` and run:
 
 ```sh
-node-scraper --plugin-configs dmesg_custom_config.json run-plugins DmesgPlugin
+node-scraper --plugin-configs=dmesg_custom_config.json run-plugins DmesgPlugin
 ```
 
 #### **'compare-runs' subcommand**
@@ -539,8 +550,9 @@ Built-in configs include **NodeStatus** (a subset of plugins) and **AllPlugins**
 registered plugin with default arguments—useful for generating a reference config from the full system).
 
 **NodeStatus plus additional plugins** — built-in configs merge with plugins named after `run-plugins`.
-Use **`--plugin-configs=<name>`** (equals form): with a space
-after `--plugin-configs`. See below for examples:
+Values are comma-separated; pass as **`--plugin-configs=…`** or **`--plugin-configs` …** (same as other
+optional flags), e.g. `--plugin-configs=NodeStatus,/path/extra.json`.
+Examples:
 ```sh
 node-scraper --plugin-configs=NodeStatus run-plugins PciePlugin
 ```
@@ -551,7 +563,7 @@ node-scraper --log-path ./logs --plugin-configs=NodeStatus run-plugins PciePlugi
 
 Using a JSON file:
 ```sh
-node-scraper --plugin-configs plugin_config.json
+node-scraper --plugin-configs=plugin_config.json
 ```
 Here is an example of a comprehensive plugin config that specifies analyzer args for each plugin:
 ```json
@@ -613,7 +625,7 @@ data.
 
 **Run all registered plugins (AllPlugins config):**
 ```sh
-node-scraper --plugin-config AllPlugins
+node-scraper --plugin-configs=AllPlugins
 
 ```
 
@@ -647,7 +659,7 @@ This will generate the following config:
 ```
 This config can later be used on a different platform for comparison, using the steps at #2:
 ```sh
-node-scraper --plugin-configs reference_config.json
+node-scraper --plugin-configs=reference_config.json
 
 ```
 
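
The README hunks above change `--plugin-configs` from a multi-value option (`[STRING ...]`) to a single comma-separated `LIST` value. The snippet below is a minimal, hypothetical sketch of how such an option could be declared with Python's standard `argparse`; it is not taken from the node-scraper source. The helper `comma_separated`, the function `build_parser`, and the `prog` name are assumptions made for illustration only.

```python
import argparse


def comma_separated(value: str) -> list[str]:
    """Split a comma-separated string into a list, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only mirrors the option shape shown in the README diff.
    parser = argparse.ArgumentParser(prog="node-scraper-sketch")
    parser.add_argument(
        "--plugin-configs",
        type=comma_separated,  # one value, split on commas
        metavar="LIST",
        default=None,
        help="Comma-separated built-in names and/or plugin config JSON paths",
    )
    subparsers = parser.add_subparsers(dest="subcmd")
    run_plugins = subparsers.add_parser("run-plugins")
    run_plugins.add_argument("plugins", nargs="*")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--plugin-configs=NodeStatus,/path/extra.json", "run-plugins", "PciePlugin"]
    )
    print(args.plugin_configs, args.subcmd, args.plugins)
```

Because the option takes exactly one value in this sketch, a following subcommand such as `run-plugins` is not absorbed into the option's values, which is a known hazard of greedy multi-value (`nargs="*"`) options in `argparse`. Whether that was the motivation for the change in this commit is an assumption; the diff itself does not say.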

docs/PLUGIN_DOC.md

Lines changed: 3 additions & 93 deletions
@@ -4,7 +4,7 @@
 
 | Plugin | Collection | Analyzer Args | Collection Args | DataModel | Collector | Analyzer |
 | --- | --- | --- | --- | --- | --- | --- |
-| AmdSmiPlugin | bad-pages<br>firmware --json<br>list --json<br>metric -g all<br>partition --json<br>process --json<br>ras --cper --folder={folder}<br>ras --afid --cper-file {cper_file}<br>static -g all --json<br>static -g {gpu_id} --json<br>topology<br>version --json<br>xgmi -l<br>xgmi -m | **Analyzer Args:**<br>- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).<br>- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.<br>- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).<br>- `expected_driver_version`: Optional[str] — Expected AMD driver version string.<br>- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).<br>- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.<br>- `expected_firmware_versions`: Optional[dict[str, str]] — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).<br>- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.<br>- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.<br>- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).<br>- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.<br>- `devid_ep`: Optional[str] — Expected endpoint device ID.<br>- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.<br>- `sku_name`: Optional[str] — Expected SKU name string for GPU.<br>- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).<br>- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.<br>- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**<br>- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) |
+| AmdSmiPlugin | bad-pages<br>firmware --json<br>list --json<br>metric -g all<br>partition --json<br>process --json<br>ras --cper --folder={folder}<br>ras --afid --cper-file {cper_file}<br>static -g all --json<br>static -g {gpu_id} --json<br>topology<br>version --json<br>xgmi -l<br>xgmi -m | **Analyzer Args:**<br>- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).<br>- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.<br>- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).<br>- `expected_driver_version`: Optional[str] — Expected AMD driver version string.<br>- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).<br>- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.<br>- `expected_firmware_versions`: Optional[dict[str, str]] — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).<br>- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.<br>- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.<br>- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).<br>- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.<br>- `devid_ep`: Optional[str] — Expected endpoint device ID.<br>- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.<br>- `sku_name`: Optional[str] — Expected SKU name string for GPU.<br>- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).<br>- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.<br>- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**<br>- `analysis_firmware_ids`: Optional[list[str]] — amd-smi fw_id values to record in analysis_ref.firmware_versions<br>- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) |
 | BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'<br>wmic bios get SMBIOSBIOSVersion /Value | **Analyzer Args:**<br>- `exp_bios_version`: list[str] — Expected BIOS version(s) to match against collected value (str or list).<br>- `regex_match`: bool — If True, match exp_bios_version as regex; otherwise exact match. | - | [BiosDataModel](#BiosDataModel-Model) | [BiosCollector](#Collector-Class-BiosCollector) | [BiosAnalyzer](#Data-Analyzer-Class-BiosAnalyzer) |
 | CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**<br>- `required_cmdline`: Union[str, List] — Command-line parameters that must be present (e.g. 'pci=bfsort').<br>- `banned_cmdline`: Union[str, List] — Command-line parameters that must not be present.<br>- `os_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-OS overrides for required_cmdline and banned_cmdline (keyed by OS identifier).<br>- `platform_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-platform overrides for required_cmdline and banned_cmdline (keyed by platform). | - | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) |
 | DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor &#124; Measure-Object).Count"<br>lspci -d {vendorid_ep}: &#124; grep -i 'VGA\&#124;Display\&#124;3D' &#124; wc -l<br>powershell -Command "(wmic path win32_VideoController get name &#124; findstr AMD &#124; Measure-Object).Count"<br>lscpu<br>lshw<br>lspci -d {vendorid_ep}: &#124; grep -i 'Virtual Function' &#124; wc -l<br>powershell -Command "(Get-VMHostPartitionableGpu &#124; Measure-Object).Count" | **Analyzer Args:**<br>- `cpu_count`: Optional[list[int]] — Expected CPU count(s); pass as int or list of ints. Analysis passes if actual is in list.<br>- `gpu_count`: Optional[list[int]] — Expected GPU count(s); pass as int or list of ints. Analysis passes if actual is in list.<br>- `vf_count`: Optional[list[int]] — Expected virtual function count(s); pass as int or list of ints. Analysis passes if actual is in list. | - | [DeviceEnumerationDataModel](#DeviceEnumerationDataModel-Model) | [DeviceEnumerationCollector](#Collector-Class-DeviceEnumerationCollector) | [DeviceEnumerationAnalyzer](#Data-Analyzer-Class-DeviceEnumerationAnalyzer) |
@@ -970,6 +970,8 @@ Data model for amd-smi data.
 - **xgmi_link**: `Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.XgmiLinks]]`
 - **cper_data**: `Optional[list[nodescraper.models.datamodel.FileModel]]`
 - **cper_afids**: `dict[str, int]`
+- **analysis_firmware_ids**: `Optional[list[str]]`
+- **analysis_ref**: `Optional[nodescraper.plugins.inband.amdsmi.amdsmidata.AmdSmiAnalysisRef]`
 
 ## BiosDataModel Model
 
@@ -1691,98 +1693,6 @@ Check RDMA statistics for errors (RoCE and other RDMA error counters).
 
 **Link to code**: [rdma_analyzer.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/rdma/rdma_analyzer.py)
 
-### Class Variables
-
-- **ERROR_FIELDS**: `[
-recoverable_errors,
-tx_roce_errors,
-tx_roce_discards,
-rx_roce_errors,
-rx_roce_discards,
-local_ack_timeout_err,
-packet_seq_err,
-max_retry_exceeded,
-rnr_nak_retry_err,
-implied_nak_seq_err,
-unrecoverable_err,
-bad_resp_err,
-local_qp_op_err,
-local_protection_err,
-mem_mgmt_op_err,
-req_remote_invalid_request,
-req_remote_access_errors,
-remote_op_err,
-duplicate_request,
-res_exceed_max,
-resp_local_length_error,
-res_exceeds_wqe,
-res_opcode_err,
-res_rx_invalid_rkey,
-res_rx_domain_err,
-res_rx_no_perm,
-res_rx_range_err,
-res_tx_invalid_rkey,
-res_tx_domain_err,
-res_tx_no_perm,
-res_tx_range_err,
-res_irrq_oflow,
-res_unsup_opcode,
-res_unaligned_atomic,
-res_rem_inv_err,
-res_mem_err,
-res_srq_err,
-res_cmp_err,
-res_invalid_dup_rkey,
-res_wqe_format_err,
-res_cq_load_err,
-res_srq_load_err,
-res_tx_pci_err,
-res_rx_pci_err,
-out_of_buffer,
-out_of_sequence,
-req_cqe_error,
-req_cqe_flush_error,
-resp_cqe_error,
-resp_cqe_flush_error,
-resp_remote_access_errors,
-req_rx_pkt_seq_err,
-req_rx_rnr_retry_err,
-req_rx_rmt_acc_err,
-req_rx_rmt_req_err,
-req_rx_oper_err,
-req_rx_impl_nak_seq_err,
-req_rx_cqe_err,
-req_rx_cqe_flush,
-req_rx_dup_response,
-req_rx_inval_pkts,
-req_tx_loc_acc_err,
-req_tx_loc_oper_err,
-req_tx_mem_mgmt_err,
-req_tx_retry_excd_err,
-req_tx_loc_sgl_inv_err,
-resp_rx_dup_request,
-resp_rx_outof_buf,
-resp_rx_outouf_seq,
-resp_rx_cqe_err,
-resp_rx_cqe_flush,
-resp_rx_loc_len_err,
-resp_rx_inval_request,
-resp_rx_loc_oper_err,
-resp_rx_outof_atomic,
-resp_tx_pkt_seq_err,
-resp_tx_rmt_inval_req_err,
-resp_tx_rmt_acc_err,
-resp_tx_rmt_oper_err,
-resp_tx_rnr_retry_err,
-resp_tx_loc_sgl_inv_err,
-resp_rx_s0_table_err,
-resp_rx_ccl_cts_outouf_seq,
-tx_rdma_ack_timeout,
-tx_rdma_ccl_cts_ack_timeout,
-rx_rdma_mtu_discard_pkts
-]`
-- **CRITICAL_ERROR_FIELDS**: `['unrecoverable_err', 'res_tx_pci_err', 'res_rx_pci_err', 'res_mem_err']`
-
 ## Data Analyzer Class RocmAnalyzer
 
 ### Description

nodescraper/cli/__init__.py

Lines changed: 20 additions & 2 deletions
@@ -2,7 +2,7 @@
 #
 # MIT License
 #
-# Copyright (c) 2025 Advanced Micro Devices, Inc.
+# Copyright (C) 2026 Advanced Micro Devices, Inc.
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
@@ -24,6 +24,24 @@
 #
 ###############################################################################
 
+from .cli import get_cli_top_level_subcommands
 from .cli import main as cli_entry
+from .embed import CLI_TOP_LEVEL_SUBCOMMANDS, run_cli_return_code, run_main_return_code
+from .invocation import (
+    PluginRunInvocation,
+    get_plugin_run_invocation,
+    plugin_run_invocation_scope,
+    run_plugin_queue_with_invocation,
+)
 
-__all__ = ["cli_entry"]
+__all__ = [
+    "CLI_TOP_LEVEL_SUBCOMMANDS",
+    "cli_entry",
+    "get_cli_top_level_subcommands",
+    "run_cli_return_code",
+    "run_main_return_code",
+    "PluginRunInvocation",
+    "get_plugin_run_invocation",
+    "plugin_run_invocation_scope",
+    "run_plugin_queue_with_invocation",
+]
