BUILD #01 // MAIN PROJECT

CRASH-RESILIENT DUAL-MCU
TELEMETRY FAILOVER NODE

STM32 + ESP32 architecture with FreeRTOS tasks, CRC-framed telemetry, and a graded watchdog policy tree

STM32 · ESP32 · FreeRTOS · UART · CRC16 · SPI Flash
IN PROGRESS — ACTIVE BUILD
2 MCU Nodes
3 Fault Tiers
4 RTOS Tasks
NV Persisted Log

What This Is

A two-node embedded system where one MCU can crash or hang and the other keeps things running. The STM32 handles sensor acquisition and control timing under FreeRTOS. The ESP32 acts as a supervisor — it watches heartbeat signals, monitors sequence continuity, and drives recovery based on what it observes. The two nodes talk over UART with CRC-framed packets.

The point was not "build a system that doesn't crash." The point was "build a system where crashes are survivable and auditable." Those are different problems.

STM32 CONTROL NODE // ESP32 SUPERVISOR
[UART TELEMETRY LINK] // [WATCHDOG POLICY TREE]

Architecture

The split between the two nodes is intentional: the supervisor needs to be independent from the thing it is supervising. If both roles lived on the same MCU, a runaway task or stack corruption that took out the control node would also take out recovery. Separate silicon gives the supervisor a real chance to intervene.

// NODE ROLES
STM32 (control): Sensor read, control output, FreeRTOS scheduling
ESP32 (supervisor): Heartbeat monitor, recovery dispatch, event log
Link: UART @ 115200, CRC16-CCITT framed, ACK/NACK windowed
Recovery storage: SPI NOR flash, ring buffer, boot-readable context

The STM32 runs four FreeRTOS tasks: sensor acquisition, control output, telemetry TX, and a local watchdog kick. Task priorities are assigned so the watchdog kick cannot be starved by sensor or output tasks. If the watchdog misses its kick deadline, that event itself gets logged before reset.
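The priority rule above can be written down as a table and checked. This is a minimal sketch, not the project's actual configuration: the task names, the numeric values, and the `outranks_all` helper are all hypothetical. The only fact it leans on is that in FreeRTOS a numerically higher priority preempts a lower one. Placing the watchdog kick above telemetry TX as well is an assumption; the text only requires it to outrank the sensor and output tasks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical priority table. In FreeRTOS a numerically higher value
 * means higher priority; the specific numbers here are assumptions. */
typedef struct {
    const char *name;
    unsigned    priority;
} task_cfg_t;

static const task_cfg_t TASKS[] = {
    { "telemetry_tx",  1u },
    { "sensor_acq",    2u },
    { "control_out",   3u },
    { "watchdog_kick", 4u },
};
#define TASK_COUNT (sizeof TASKS / sizeof TASKS[0])

/* Returns 1 if the named task has strictly higher priority than every
 * other task in the table, 0 otherwise (including unknown names). */
static int outranks_all(const char *name)
{
    const task_cfg_t *self = NULL;
    for (size_t i = 0; i < TASK_COUNT; i++) {
        if (strcmp(TASKS[i].name, name) == 0) {
            self = &TASKS[i];
        }
    }
    if (self == NULL) {
        return 0;
    }
    for (size_t i = 0; i < TASK_COUNT; i++) {
        if (&TASKS[i] != self && TASKS[i].priority >= self->priority) {
            return 0;
        }
    }
    return 1;
}
```

A check like `outranks_all("watchdog_kick")` could run once at startup, so a later priority change that lets another task starve the kick is caught early.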

Failure Policy Tree

Failures are not all the same. Treating every fault as a hard reset burns the reboot budget and makes logs useless. The policy tree grades response by how bad the fault actually is.

TIER 1 — MINOR
Task-level recovery. The supervisor detects a heartbeat that is delayed but still within the warning window. It sends a nudge packet. The STM32 task scheduler self-corrects. No reset is triggered.
TIER 2 — REPEATED
Subsystem reset. Fault count threshold exceeded within a rolling window. Supervisor issues a targeted peripheral reset. Full MCU reset only if the subsystem reset fails.
TIER 3 — CRITICAL
Controlled degrade. STM32 stops control output, supervisor assumes safe-state hold, full fault context written to NV flash before anything resets. Next boot reads that context before resuming normal operation.
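
One way to picture the grading logic on the supervisor side is a small classifier that runs once per heartbeat check. This is a sketch under stated assumptions: every threshold below is illustrative, the names (`fault_window_t`, `classify`) are hypothetical, and treating "late beyond the warning window" as the tier-3 trigger is a guess, since the text above does not say what makes a fault critical.

```c
#include <stdint.h>

/* All thresholds are illustrative assumptions, not project values. */
#define HB_WARN_WINDOW_MS   150u   /* lateness still treated as minor  */
#define FAULT_WINDOW_MS    5000u   /* rolling window for repeat faults */
#define FAULT_THRESHOLD       3u   /* faults in window that escalate   */

typedef enum {
    TIER_NONE,
    TIER_1_MINOR,
    TIER_2_REPEATED,
    TIER_3_CRITICAL
} fault_tier_t;

typedef struct {
    uint32_t fault_times[FAULT_THRESHOLD]; /* ring of recent fault timestamps */
    uint8_t  next;                         /* slot the next fault goes into   */
    uint8_t  count;                        /* number of valid slots           */
} fault_window_t;

/* Call once per heartbeat check. late_by_ms is how far past its expected
 * arrival time the heartbeat was (0 means on time). */
static fault_tier_t classify(fault_window_t *w, uint32_t now_ms,
                             uint32_t late_by_ms)
{
    if (late_by_ms == 0u) {
        return TIER_NONE;
    }
    if (late_by_ms > HB_WARN_WINDOW_MS) {
        return TIER_3_CRITICAL;   /* assumed trigger for controlled degrade */
    }

    /* Late but inside the warning window: record the fault. */
    w->fault_times[w->next] = now_ms;
    w->next = (uint8_t)((w->next + 1u) % FAULT_THRESHOLD);
    if (w->count < FAULT_THRESHOLD) {
        w->count++;
    }

    /* Once the ring is full, w->next indexes the oldest stored fault.
     * Unsigned subtraction keeps the comparison valid across tick wrap. */
    if (w->count == FAULT_THRESHOLD &&
        (uint32_t)(now_ms - w->fault_times[w->next]) <= FAULT_WINDOW_MS) {
        return TIER_2_REPEATED;
    }
    return TIER_1_MINOR;
}
```

With these numbers, three slightly late heartbeats inside five seconds escalate from a nudge to a subsystem reset, while the same three faults spread over a minute each stay at tier 1.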

Packet Framing

Raw UART without framing is unreliable under electrical noise or partial writes. Every packet uses a fixed two-byte sync header, a payload type byte, a two-byte length field, the payload, a 16-bit CRC, and an end delimiter. The supervisor ACKs packets whose CRC checks out and NACKs those that fail. Packets that get no response within the timeout window are retransmitted up to three times before the link is marked degraded.
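
The retransmit rule can be sketched as a small sender-side state machine. The type and function names here are hypothetical. Two behaviours are assumptions rather than anything stated above: a NACK is treated the same as a timeout (resend), and the degraded flag stays latched until something else clears it.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_RETRANSMITS 3u   /* from the text: up to three retransmissions */

typedef enum { LINK_EVT_ACK, LINK_EVT_NACK, LINK_EVT_TIMEOUT } link_event_t;
typedef enum { TX_DONE, TX_RESEND, TX_LINK_DEGRADED } tx_action_t;

typedef struct {
    uint8_t retransmits;  /* retransmissions already sent for this packet */
    bool    degraded;     /* latched once the retry budget is exhausted   */
} tx_state_t;

/* Decide what the sender does after an ACK, a NACK, or a timeout. */
static tx_action_t tx_on_event(tx_state_t *s, link_event_t evt)
{
    if (evt == LINK_EVT_ACK) {
        s->retransmits = 0;          /* packet delivered, reset the budget */
        return TX_DONE;
    }

    /* NACK or timeout: resend while budget remains (NACK handling assumed). */
    if (s->retransmits < MAX_RETRANSMITS) {
        s->retransmits++;
        return TX_RESEND;
    }

    /* Original send plus three retransmissions all failed. */
    s->degraded = true;
    s->retransmits = 0;
    return TX_LINK_DEGRADED;
}
```

Keeping the decision in a pure function like this means the retry policy can be unit tested without a UART attached.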

// PACKET STRUCTURE
Header: 0xAA 0x55 (sync bytes)
Type: 1B (heartbeat | telemetry | event | command)
Length: 2B, big-endian payload length
Payload: Variable
CRC: CRC16-CCITT over type + length + payload
End: 0xFF delimiter
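
A minimal sketch of encoding and validating a frame with this layout follows. Several details are assumptions because the table does not pin them down: the CRC variant is taken to be CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF, no reflection), the CRC is sent high byte first, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_OVERHEAD 8u  /* 2 sync + 1 type + 2 length + 2 CRC + 1 end */

/* CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection.
 * The variant is an assumption; the source only says "CRC16-CCITT". */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Builds a frame into out. Returns the frame size, or 0 if out is too small. */
static size_t frame_encode(uint8_t type, const uint8_t *payload,
                           uint16_t payload_len, uint8_t *out, size_t out_cap)
{
    size_t total = (size_t)payload_len + FRAME_OVERHEAD;
    if (out_cap < total) {
        return 0;
    }
    out[0] = 0xAA;                            /* sync bytes */
    out[1] = 0x55;
    out[2] = type;
    out[3] = (uint8_t)(payload_len >> 8);     /* length, big-endian */
    out[4] = (uint8_t)(payload_len & 0xFFu);
    if (payload_len > 0u) {
        memcpy(&out[5], payload, payload_len);
    }
    /* CRC covers type + length + payload, i.e. everything after the sync. */
    uint16_t crc = crc16_ccitt(&out[2], (size_t)payload_len + 3u);
    out[5u + payload_len] = (uint8_t)(crc >> 8);   /* CRC byte order assumed */
    out[6u + payload_len] = (uint8_t)(crc & 0xFFu);
    out[7u + payload_len] = 0xFF;             /* end delimiter */
    return total;
}

/* Returns 1 if f holds exactly one well-formed frame with a matching CRC. */
static int frame_is_valid(const uint8_t *f, size_t len)
{
    if (len < FRAME_OVERHEAD || f[0] != 0xAA || f[1] != 0x55) {
        return 0;
    }
    size_t payload_len = ((size_t)f[3] << 8) | f[4];
    if (len != payload_len + FRAME_OVERHEAD || f[len - 1] != 0xFF) {
        return 0;
    }
    uint16_t want = crc16_ccitt(&f[2], payload_len + 3u);
    uint16_t got  = (uint16_t)(((uint16_t)f[len - 3] << 8) | f[len - 2]);
    return want == got;
}
```

The receiver would ACK when `frame_is_valid` returns 1 and NACK otherwise. Since 0xFF can also appear inside a payload, the length field, not the end delimiter, decides where the frame stops; the delimiter only serves as an extra sanity check.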

What's Left

The architecture is stable and the tier-1 and tier-2 paths work in isolation. Tier-3 degrade mode and the full NV write-on-fault sequence are still being validated. The flash ring buffer logic passes its unit tests but hasn't been stress-tested against power loss mid-write. That's the active front right now.
