diff --git a/CHANGELOG.md b/CHANGELOG.md index 7b789cbf8..05862d49f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Fixed +- **Firmware: SPI flash cache crash under high CSI callback pressure** (RuView#396, #397) — ESP32-S3 nodes crashed in `cache_ll_l1_resume_icache` / `wDev_ProcessFiq` after ~2400 callbacks when the promiscuous filter admitted DATA frames at 100–500 Hz. Fixed by narrowing the filter mask to `WIFI_PROMIS_FILTER_MASK_MGMT` (~10 Hz beacons), adding a 50 Hz early callback rate gate (`CSI_MIN_PROCESS_INTERVAL_US`) that drops excess callbacks before any processing work, and enabling `CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y` as defense-in-depth. Stability validated with a 4-min-per-node soak. +- **Firmware: `filter_mac` / `node_id` clobber by WiFi driver init** (#232, #375, #385, #386, #390, #397) — `g_nvs_config` can be corrupted during `wifi_init_sta()` on some devices (confirmed on `80:b5:4e:c1:be:b8`), reverting `node_id` to the Kconfig default and producing garbage MAC-filter reads in the CSI callback (100–500 Hz). New `csi_collector_set_node_id()` API called from `app_main()` **before** `wifi_init_sta()` captures both fields into module-local statics (`s_node_id`, `s_filter_mac`, `s_filter_mac_set`). `csi_collector_init()` now runs a canary that distinguishes "early≠g_nvs_config" (corruption confirmed) from a no-op match. All CSI runtime paths use the defensive copies exclusively. +- **Firmware: `edge_processing` sample rate mismatch** (#397) — `estimate_bpm_zero_crossing()` was called with a hard-coded `sample_rate = 20.0f`, but MGMT-only promiscuous delivers ~10 Hz. Breathing and heart-rate reports were 2× too high. Corrected to `10.0f` with an explicit comment tying it to the callback rate. +- **`provision.py` esptool command form** (#391, #397) — ESP-IDF v5.4 bundles `esptool 4.10.0`, which only accepts `write_flash` (underscore). Standalone `pip install esptool` v5.x accepts both forms but prefers `write-flash`. #391 switched to `write-flash` which broke the documented ESP-IDF Python venv flow; #397 reverts to `write_flash` (works with both esptool 4.x and 5.x) with an inline comment warning future maintainers not to "re-fix" it. +- **`provision.py` esptool v5 dry-run hint** (#391) — Stale `write_flash` (underscore) syntax in the dry-run manual-flash hint now uses `write-flash` (hyphenated) for esptool >= 5.x. The primary flash command was already correct. +- **`provision.py` silent NVS wipe** (#391) — The script replaces the entire `csi_cfg` NVS namespace on every run, so partial invocations were silently erasing WiFi credentials and causing `Retrying WiFi connection (10/10)` in the field. Now refuses to run without `--ssid`, `--password`, and `--target-ip` unless `--force-partial` is passed. `--force-partial` prints a warning listing which keys will be wiped. +- **Firmware: defensive `node_id` capture** (#232, #375, #385, #386, #390) — Users on multi-node deployments reported `node_id` reverting to the Kconfig default (`1`) in UDP frames and in the `csi_collector` init log, despite NVS loading the correct value. The root cause (memory corruption of `g_nvs_config`) has not been definitively isolated, but the UDP frame header is now tamper-proof: `csi_collector_init()` captures `g_nvs_config.node_id` into a module-local `s_node_id` once, and `csi_serialize_frame()` plus all other consumers (`edge_processing.c`, `wasm_runtime.c`, `display_ui.c`, `swarm_bridge_init`) read it via the new `csi_collector_get_node_id()` accessor. A canary logs `WARN` if `g_nvs_config.node_id` diverges from `s_node_id` at end-of-init, helping isolate the upstream corruption path. Validated on attached ESP32-S3 (COM8): NVS `node_id=2` propagates through boot log, capture log, init log, and byte[4] of every UDP frame. + +### Docs +- **CHANGELOG catch-up** (#367) — Added missing entries for v0.5.5, v0.6.0, and v0.7.0 releases. + ## [v0.6.2-esp32] — 2026-04-20 Firmware release cutting ADR-081 and the Timer Svc stack fix discovered during @@ -121,14 +133,6 @@ firing cleanly, HEALTH mesh packets sent. conservative (`enable_channel_switch` and `enable_role_change` off); Kconfig surface added under "Adaptive Controller (ADR-081)". -### Fixed -- **`provision.py` esptool v5 compat** (#391) — Stale `write_flash` (underscore) syntax in the dry-run manual-flash hint now uses `write-flash` (hyphenated) for esptool >= 5.x. The primary flash command was already correct. -- **`provision.py` silent NVS wipe** (#391) — The script replaces the entire `csi_cfg` NVS namespace on every run, so partial invocations were silently erasing WiFi credentials and causing `Retrying WiFi connection (10/10)` in the field. Now refuses to run without `--ssid`, `--password`, and `--target-ip` unless `--force-partial` is passed. `--force-partial` prints a warning listing which keys will be wiped. -- **Firmware: defensive `node_id` capture** (#232, #375, #385, #386, #390) — Users on multi-node deployments reported `node_id` reverting to the Kconfig default (`1`) in UDP frames and in the `csi_collector` init log, despite NVS loading the correct value. The root cause (memory corruption of `g_nvs_config`) has not been definitively isolated, but the UDP frame header is now tamper-proof: `csi_collector_init()` captures `g_nvs_config.node_id` into a module-local `s_node_id` once, and `csi_serialize_frame()` plus all other consumers (`edge_processing.c`, `wasm_runtime.c`, `display_ui.c`, `swarm_bridge_init`) read it via the new `csi_collector_get_node_id()` accessor. A canary logs `WARN` if `g_nvs_config.node_id` diverges from `s_node_id` at end-of-init, helping isolate the upstream corruption path. Validated on attached ESP32-S3 (COM8): NVS `node_id=2` propagates through boot log, capture log, init log, and byte[4] of every UDP frame. - -### Docs -- **CHANGELOG catch-up** (#367) — Added missing entries for v0.5.5, v0.6.0, and v0.7.0 releases. - ## [v0.7.0] — 2026-04-06 Model release (no new firmware binary). Firmware remains at v0.6.0-esp32. diff --git a/firmware/esp32-csi-node/main/csi_collector.c b/firmware/esp32-csi-node/main/csi_collector.c index 7a13e5b7d..c8d5eb7de 100644 --- a/firmware/esp32-csi-node/main/csi_collector.c +++ b/firmware/esp32-csi-node/main/csi_collector.c @@ -25,13 +25,20 @@ /* ADR-060: Access the global NVS config for MAC filter and channel override. */ extern nvs_config_t g_nvs_config; -/* Defensive fix (#232, #375, #385, #386, #390): capture node_id at init-time - * into a module-local static. Using the global g_nvs_config.node_id directly - * at every callback is vulnerable to any memory corruption that clobbers the - * struct (which users have reported reverting node_id to the Kconfig default - * of 1). The local copy is set once at csi_collector_init() and then used - * exclusively by csi_serialize_frame(). */ +/* Defensive fix (#232, #375, #385, #386, #390): capture NVS config fields into + * module-local statics BEFORE wifi_init_sta() runs, because WiFi driver init + * can corrupt g_nvs_config (confirmed on device 80:b5:4e:c1:be:b8). + * main.c calls csi_collector_set_node_id() immediately after nvs_config_load(), + * and all runtime paths use the local copies exclusively. */ static uint8_t s_node_id = 1; +static bool s_node_id_early_set = false; + +/* Defensive copy of MAC filter config — the CSI callback fires at 100-500 Hz + * and reads filter_mac_set + filter_mac on every invocation. If wifi_init_sta() + * corrupts g_nvs_config, the callback would read garbage, potentially causing + * LoadProhibited panics (observed: Core 0 panic after ~2400 callbacks). */ +static uint8_t s_filter_mac[6] = {0}; +static bool s_filter_mac_set = false; /* ADR-057: Build-time guard — fail early if CSI is not enabled in sdkconfig. * Without this, the firmware compiles but crashes at runtime with: @@ -60,6 +67,24 @@ static uint32_t s_rate_skip = 0; #define CSI_MIN_SEND_INTERVAL_US (20 * 1000) static int64_t s_last_send_us = 0; +/** + * Minimum interval between processing ANY CSI callback in microseconds. + * Promiscuous MGMT+DATA can fire 100-500+ times/sec. At rates above ~50 Hz, + * the WiFi FIQ handler (wDev_ProcessFiq) races with SPI flash cache operations, + * causing Core 0 LoadProhibited panics in cache_ll_l1_resume_icache. + * + * This early gate drops excess callbacks BEFORE any processing (serialization, + * UDP, edge enqueue), keeping the effective callback rate at ~50 Hz while + * preserving the full MGMT+DATA promiscuous filter and HT-LTF/STBC CSI quality. + * + * The WiFi hardware still captures all frames and the CSI data is generated, + * but we simply discard the excess in software. This reduces the time spent + * in callback context per second, giving the WiFi ISR more headroom. + */ +#define CSI_MIN_PROCESS_INTERVAL_US (20 * 1000) /* 50 Hz */ +static int64_t s_last_process_us = 0; +static uint32_t s_early_drop = 0; + /* ---- ADR-029: Channel-hop state ---- */ /** Channel hop table (populated from NVS at boot or via set_hop_table). */ @@ -165,9 +190,20 @@ static void wifi_csi_callback(void *ctx, wifi_csi_info_t *info) { (void)ctx; - /* ADR-060: MAC address filtering — drop frames from non-matching sources. */ - if (g_nvs_config.filter_mac_set) { - if (memcmp(info->mac, g_nvs_config.filter_mac, 6) != 0) { + /* Early rate gate: drop excess callbacks to ~50 Hz to prevent + * SPI flash cache crash in WiFi ISR (wDev_ProcessFiq). */ + int64_t now_us = esp_timer_get_time(); + if ((now_us - s_last_process_us) < CSI_MIN_PROCESS_INTERVAL_US) { + s_early_drop++; + return; + } + s_last_process_us = now_us; + + /* ADR-060: MAC address filtering — drop frames from non-matching sources. + * Uses defensively-copied s_filter_mac instead of g_nvs_config (which can + * be corrupted by wifi_init_sta — same root cause as the node_id clobber). */ + if (s_filter_mac_set) { + if (memcmp(info->mac, s_filter_mac, 6) != 0) { return; /* Source MAC doesn't match filter — skip frame. */ } } @@ -222,14 +258,60 @@ static void wifi_promiscuous_cb(void *buf, wifi_promiscuous_pkt_type_t type) (void)type; } +void csi_collector_set_node_id(uint8_t node_id) +{ + s_node_id = node_id; + s_node_id_early_set = true; + ESP_LOGI(TAG, "Early capture node_id=%u (before WiFi init, #232/#390)", + (unsigned)node_id); + + /* Also capture MAC filter config now — same struct, same corruption risk. + * The CSI callback reads filter_mac_set on every invocation (100-500 Hz), + * so a corrupted value could cause erratic filtering or crash. */ + s_filter_mac_set = (g_nvs_config.filter_mac_set != 0); + if (s_filter_mac_set) { + memcpy(s_filter_mac, g_nvs_config.filter_mac, 6); + ESP_LOGI(TAG, "Early capture filter_mac=%02x:%02x:%02x:%02x:%02x:%02x", + s_filter_mac[0], s_filter_mac[1], s_filter_mac[2], + s_filter_mac[3], s_filter_mac[4], s_filter_mac[5]); + } +} + void csi_collector_init(void) { - /* Capture node_id into module-local static at init time. After this point - * csi_serialize_frame() uses s_node_id exclusively, isolating the UDP - * frame node_id field from any memory corruption of g_nvs_config. */ - s_node_id = g_nvs_config.node_id; - ESP_LOGI(TAG, "Captured node_id=%u at init (defensive copy for #232/#375/#385/#390)", - (unsigned)s_node_id); + if (!s_node_id_early_set) { + /* Fallback: no early capture — use current g_nvs_config (may be clobbered). */ + s_node_id = g_nvs_config.node_id; + ESP_LOGW(TAG, "Late capture node_id=%u (no early set_node_id call)", + (unsigned)s_node_id); + } else if (g_nvs_config.node_id != s_node_id) { + /* Canary: early capture disagrees with current g_nvs_config — corruption + * happened between nvs_config_load() and here (likely wifi_init_sta). */ + ESP_LOGW(TAG, "node_id clobber CONFIRMED: early=%u g_nvs_config=%u " + "(WiFi init likely corrupted struct, using early value)", + (unsigned)s_node_id, (unsigned)g_nvs_config.node_id); + } else { + ESP_LOGI(TAG, "node_id=%u verified (early capture matches g_nvs_config)", + (unsigned)s_node_id); + } + + /* Canary for filter_mac: check if WiFi init corrupted the filter fields. */ + if (s_node_id_early_set) { + bool mac_set_now = (g_nvs_config.filter_mac_set != 0); + if (mac_set_now != s_filter_mac_set) { + ESP_LOGW(TAG, "filter_mac_set clobber CONFIRMED: early=%d g_nvs_config=%d", + (int)s_filter_mac_set, (int)mac_set_now); + } else if (s_filter_mac_set && + memcmp(s_filter_mac, g_nvs_config.filter_mac, 6) != 0) { + ESP_LOGW(TAG, "filter_mac clobber CONFIRMED: bytes differ after WiFi init"); + } + } else { + /* No early capture — grab filter config now (may already be corrupted). */ + s_filter_mac_set = (g_nvs_config.filter_mac_set != 0); + if (s_filter_mac_set) { + memcpy(s_filter_mac, g_nvs_config.filter_mac, 6); + } + } /* ADR-060: Determine the CSI channel. * Priority: 1) NVS override (--channel), 2) connected AP channel, 3) Kconfig default. */ @@ -260,12 +342,19 @@ void csi_collector_init(void) ESP_ERROR_CHECK(esp_wifi_set_promiscuous(true)); ESP_ERROR_CHECK(esp_wifi_set_promiscuous_rx_cb(wifi_promiscuous_cb)); + /* MGMT-only promiscuous filter + active probe injection (RuView#396). + * + * DATA frames cause 100-500+ WiFi HW interrupts/sec which crashes Core 0 + * in wDev_ProcessFiq (SPI flash cache race in ESP-IDF WiFi blob). + * MGMT-only gives ~10 Hz (beacons). Probe request injection at 10 Hz + * adds ~10 Hz probe responses from APs → ~20 Hz total, matching the + * edge processing designed sample rate of 20 Hz. */ wifi_promiscuous_filter_t filt = { - .filter_mask = WIFI_PROMIS_FILTER_MASK_MGMT | WIFI_PROMIS_FILTER_MASK_DATA, + .filter_mask = WIFI_PROMIS_FILTER_MASK_MGMT, }; ESP_ERROR_CHECK(esp_wifi_set_promiscuous_filter(&filt)); - ESP_LOGI(TAG, "Promiscuous mode enabled for CSI capture"); + ESP_LOGI(TAG, "Promiscuous mode enabled (MGMT-only, RuView#396)"); wifi_csi_config_t csi_config = { .lltf_en = true, @@ -290,16 +379,6 @@ void csi_collector_init(void) ESP_LOGI(TAG, "CSI collection initialized (node_id=%u, channel=%u)", (unsigned)s_node_id, (unsigned)csi_channel); - - /* Clobber-detection canary: if g_nvs_config.node_id no longer matches the - * value we captured, something corrupted the struct between nvs_config_load - * and here. This is the historic #232/#375 symptom. */ - if (g_nvs_config.node_id != s_node_id) { - ESP_LOGW(TAG, "node_id clobber detected: captured=%u but g_nvs_config=%u " - "(frames will use captured value %u). Please report to #390.", - (unsigned)s_node_id, (unsigned)g_nvs_config.node_id, - (unsigned)s_node_id); - } } /* Accessor for other modules that need the authoritative runtime node_id. */ diff --git a/firmware/esp32-csi-node/main/csi_collector.h b/firmware/esp32-csi-node/main/csi_collector.h index 6033ab4c9..dfb3f20e9 100644 --- a/firmware/esp32-csi-node/main/csi_collector.h +++ b/firmware/esp32-csi-node/main/csi_collector.h @@ -30,14 +30,24 @@ void csi_collector_init(void); /** - * Get the runtime node_id captured at csi_collector_init(). + * Capture node_id BEFORE wifi_init_sta() or any other heavy init. * - * This is a defensive copy of g_nvs_config.node_id taken at init time. Other - * modules (edge_processing, wasm_runtime, display_ui) should prefer this - * accessor over reading g_nvs_config.node_id directly, because the global - * struct can be clobbered by memory corruption (see #232, #375, #385, #390). + * Must be called from app_main() immediately after nvs_config_load(). + * WiFi driver initialization can corrupt g_nvs_config.node_id (confirmed + * on device 80:b5:4e:c1:be:b8, NVS=3 but post-WiFi reads as 1). + * This early capture shields s_node_id from that corruption window. * - * @return Node ID (0-255) as loaded from NVS or Kconfig default at boot. + * @param node_id Value from g_nvs_config.node_id, read right after NVS load. + */ +void csi_collector_set_node_id(uint8_t node_id); + +/** + * Get the runtime node_id (early capture if available, otherwise init-time). + * + * Other modules (edge_processing, wasm_runtime, display_ui) should prefer + * this accessor over reading g_nvs_config.node_id directly. + * + * @return Node ID (0-255) as loaded from NVS at boot. */ uint8_t csi_collector_get_node_id(void); diff --git a/firmware/esp32-csi-node/main/edge_processing.c b/firmware/esp32-csi-node/main/edge_processing.c index ad5c87951..94680e528 100644 --- a/firmware/esp32-csi-node/main/edge_processing.c +++ b/firmware/esp32-csi-node/main/edge_processing.c @@ -714,8 +714,11 @@ static void process_frame(const edge_ring_slot_t *slot) s_frame_count++; s_latest_rssi = slot->rssi; - /* Assumed CSI sample rate (~20 Hz for typical ESP32 CSI). */ - const float sample_rate = 20.0f; + /* CSI sample rate. MGMT-only promiscuous filter (RuView#396, csi_collector.c) + * yields ~10 Hz from beacons; keep this value aligned with csi_collector's + * effective callback rate or estimate_bpm_zero_crossing() reports the wrong + * BPM (2× rate mismatch → 2× wrong breathing/HR). */ + const float sample_rate = 10.0f; /* --- Step 1-2: Phase extraction + unwrapping per subcarrier --- */ float phases[EDGE_MAX_SUBCARRIERS]; diff --git a/firmware/esp32-csi-node/main/main.c b/firmware/esp32-csi-node/main/main.c index 9deef344b..b80b0f830 100644 --- a/firmware/esp32-csi-node/main/main.c +++ b/firmware/esp32-csi-node/main/main.c @@ -140,6 +140,11 @@ void app_main(void) /* Load runtime config (NVS overrides Kconfig defaults) */ nvs_config_load(&g_nvs_config); + /* Capture node_id IMMEDIATELY — before wifi_init_sta() can corrupt + * g_nvs_config. See #232/#375/#390: WiFi driver init clobbers the struct + * on some devices, reverting node_id to the Kconfig default of 1. */ + csi_collector_set_node_id(g_nvs_config.node_id); + const esp_app_desc_t *app_desc = esp_app_get_description(); ESP_LOGI(TAG, "ESP32-S3 CSI Node (ADR-018) — v%s — Node ID: %d", app_desc->version, g_nvs_config.node_id); diff --git a/firmware/esp32-csi-node/provision.py b/firmware/esp32-csi-node/provision.py index e574fba48..d6a0e2f0a 100644 --- a/firmware/esp32-csi-node/provision.py +++ b/firmware/esp32-csi-node/provision.py @@ -155,7 +155,10 @@ def flash_nvs(port, baud, nvs_bin): "--chip", "esp32s3", "--port", port, "--baud", str(baud), - "write-flash", + # Keep underscore form — ESP-IDF v5.4 bundles esptool 4.10.0 which only + # accepts "write_flash". pip's esptool >=5.x accepts both (hyphenated + # form preferred) but keeps underscore working. Do not "correct" this. + "write_flash", hex(NVS_PARTITION_OFFSET), bin_path, ] print(f"Flashing NVS partition ({len(nvs_bin)} bytes) to {port}...") diff --git a/firmware/esp32-csi-node/sdkconfig.defaults b/firmware/esp32-csi-node/sdkconfig.defaults index a7732c192..c1823d499 100644 --- a/firmware/esp32-csi-node/sdkconfig.defaults +++ b/firmware/esp32-csi-node/sdkconfig.defaults @@ -32,6 +32,9 @@ CONFIG_LWIP_SO_RCVBUF=y # FreeRTOS: increase task stack for CSI processing CONFIG_ESP_MAIN_TASK_STACK_SIZE=8192 +# Extra WiFi IRAM placement (defense-in-depth for RuView#396 SPI cache race) +CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y + # ADR-081: adaptive_controller runs emit_feature_state + stream_sender # network I/O inside Timer Svc callbacks, exceeding the 2 KiB default. CONFIG_FREERTOS_TIMER_TASK_STACK_DEPTH=8192 diff --git a/firmware/esp32-csi-node/sdkconfig.defaults.4mb b/firmware/esp32-csi-node/sdkconfig.defaults.4mb index 0ef6d26a1..28a279e10 100644 --- a/firmware/esp32-csi-node/sdkconfig.defaults.4mb +++ b/firmware/esp32-csi-node/sdkconfig.defaults.4mb @@ -28,6 +28,9 @@ CONFIG_LOG_DEFAULT_LEVEL_INFO=y CONFIG_LWIP_SO_RCVBUF=y CONFIG_ESP_MAIN_TASK_STACK_SIZE=8192 +# Extra WiFi IRAM placement (defense-in-depth for RuView#396 SPI cache race) +CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y + # ADR-081: adaptive_controller runs emit_feature_state + stream_sender # network I/O inside Timer Svc callbacks, exceeding the 2 KiB default. CONFIG_FREERTOS_TIMER_TASK_STACK_DEPTH=8192 diff --git a/firmware/esp32-csi-node/sdkconfig.defaults.template b/firmware/esp32-csi-node/sdkconfig.defaults.template index a7732c192..c1823d499 100644 --- a/firmware/esp32-csi-node/sdkconfig.defaults.template +++ b/firmware/esp32-csi-node/sdkconfig.defaults.template @@ -32,6 +32,9 @@ CONFIG_LWIP_SO_RCVBUF=y # FreeRTOS: increase task stack for CSI processing CONFIG_ESP_MAIN_TASK_STACK_SIZE=8192 +# Extra WiFi IRAM placement (defense-in-depth for RuView#396 SPI cache race) +CONFIG_ESP_WIFI_EXTRA_IRAM_OPT=y + # ADR-081: adaptive_controller runs emit_feature_state + stream_sender # network I/O inside Timer Svc callbacks, exceeding the 2 KiB default. CONFIG_FREERTOS_TIMER_TASK_STACK_DEPTH=8192 diff --git a/firmware/esp32-csi-node/version.txt b/firmware/esp32-csi-node/version.txt index b61604874..844f6a91a 100644 --- a/firmware/esp32-csi-node/version.txt +++ b/firmware/esp32-csi-node/version.txt @@ -1 +1 @@ -0.6.2 +0.6.3