
Observability Belongs on the PC, Not in the Production Binary
Part 7 covered host-first testing. Part 8 added hardware-in-the-loop testing with an IoTest image and a Python harness. Part 9 is about what you do when the system is running and you need to understand behavior without turning the firmware into a logging framework.
The guiding principle is simple:
- Keep target observability minimal and deterministic.
- Move heavy analysis, visualization, and introspection to the host.
This avoids firmware bloat, keeps timing predictable, and makes debugging better rather than noisier.
The boundary: on-target telemetry vs off-target analysis
Most embedded observability failures come from mixing these concerns:
- On-target code tries to format rich logs, allocate strings, and emit verbose traces.
- Those logs change timing, overflow buffers, and create new failure modes.
- Developers then debug the logging system instead of the firmware.
A better split:
- On target: emit small, fixed-format events and counters.
- Off target: decode, correlate, visualize, and analyze.
The target should produce data. The host should produce insight.
What “minimal” looks like on the target
Minimal does not mean “no observability.” It means “observability that cannot break determinism.”
A good target-side observability set:
- Counters
  - loop slip count
  - queue overflow counts
  - parser error counts
  - watchdog resets, brownout events
- State snapshots
  - current mode/state id
  - last fault code
  - a small set of key inputs and outputs
- Event stream (optional)
  - fixed-size event records in a ring buffer
  - drained periodically, not emitted from ISRs unless absolutely necessary
Avoid:
- dynamic formatting
- iostreams
- variable-length strings
- “log everything” builds shipped as production candidates
Unsolicited advice: if your observability changes the system’s behavior, it is not observability. It is a new subsystem.
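To make the counter and snapshot set concrete, here is a minimal sketch of a fixed-size health block. All field names and widths are illustrative assumptions, not a prescribed layout:

```cpp
#include <cstdint>

// Illustrative fixed-size counter/snapshot block. Every field is a plain
// integer, so copying or transmitting a snapshot costs a fixed, known amount
// and never allocates or formats anything on the target.
struct HealthCounters final {
    std::uint32_t loop_slips{0U};      // control loop deadline misses
    std::uint32_t queue_overflows{0U}; // records dropped at full queues
    std::uint32_t parser_errors{0U};   // malformed frames on the wire
    std::uint16_t watchdog_resets{0U}; // resets observed since power-on
    std::uint16_t last_fault_code{0U}; // 0 means "no fault"
    std::uint8_t mode_id{0U};          // current state machine mode
};

static_assert(sizeof(HealthCounters) <= 32U, "snapshot must stay small");
```

Because the struct is trivially copyable, taking a snapshot is a single fixed-cost copy, which keeps the read side deterministic as well.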
Use fixed-size event records
If you need traces, use fixed-size records so storage and bandwidth are predictable.
A typical record:
- timestamp or tick count
- event id
- a small number of integral parameters
Keep it boring. Boring is debuggable.
One tight C++ example:
```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct TraceEvent final {
    std::uint32_t ticks{0U}; // timestamp in system ticks
    std::uint16_t id{0U};    // event id, mapped to a name on the host
    std::int32_t a{0};       // event-specific parameters
    std::int32_t b{0};
};

template <std::size_t N>
class TraceBuffer final {
public:
    // Fixed-capacity ring buffer: when full, the oldest record is
    // overwritten and the loss is counted explicitly.
    void push(const TraceEvent& e) noexcept {
        this->buf_[this->write_] = e;
        this->write_ = (this->write_ + 1U) % N;
        if(this->count_ < N) {
            ++this->count_;
        } else {
            ++this->drop_count_;
        }
    }

    [[nodiscard]] std::uint32_t drop_count() const noexcept { return this->drop_count_; }

private:
    std::array<TraceEvent, N> buf_{};
    std::size_t write_{0U};
    std::size_t count_{0U};
    std::uint32_t drop_count_{0U};
};
```
Notes:
- This is deterministic: fixed memory, fixed record size, explicit drop behavior.
- You can flush it on demand via a command or periodically in a non-hot path.
If you are using ETL for target containers, the same pattern applies. The principle is fixed capacity and explicit overflow behavior.
Prefer binary on the wire, decode on the host
Human-readable ASCII is great for IoTest bring-up and a small set of status queries. But for ongoing observability, binary is usually the right default:
- predictable size
- lower bandwidth
- less time spent formatting on the MCU
- easier to version and evolve
You can still keep it debuggable by decoding on the host into human-readable form.
A practical pattern:
- On target: emit compact records with ids and integers.
- On host: map ids to names, apply scaling, and render rich views.
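As a sketch of that pattern, assuming a hypothetical 8-byte record (the field names, id table, and scaling are invented for illustration, not a fixed protocol):

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical 8-byte wire record. The layout is only a sketch; a real
// protocol must also pin down endianness (memcpy reproduces host byte order).
struct WireRecord final {
    std::uint32_t ticks{0U};     // tick count at emission
    std::uint16_t id{0U};        // numeric event/signal id
    std::uint16_t raw_value{0U}; // e.g. millivolts; the host applies scaling
};

// Target side: serialize with a fixed-size copy. No formatting, no allocation.
inline std::array<std::uint8_t, sizeof(WireRecord)> encode(const WireRecord& r) noexcept {
    std::array<std::uint8_t, sizeof(WireRecord)> out{};
    std::memcpy(out.data(), &r, sizeof(r));
    return out;
}

// Host side: recover the record from raw bytes.
inline WireRecord decode(const std::array<std::uint8_t, sizeof(WireRecord)>& bytes) noexcept {
    WireRecord r{};
    std::memcpy(&r, bytes.data(), sizeof(r));
    return r;
}

// Host side: map the numeric id to a human-readable name.
inline std::string describe(const WireRecord& r) {
    const char* name = (r.id == 1U) ? "vbus_mv" : "unknown"; // host-side id table
    return std::string{name} + "=" + std::to_string(r.raw_value);
}
```

The expensive, flexible parts (the id table, scaling, string formatting) live entirely on the host, where they cannot perturb target timing.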
Make “versioning” part of your protocol
Observability that cannot evolve becomes a liability.
Include:
- firmware build id
- protocol version
- record schema version for trace events
This avoids silent mismatches where tooling decodes the wrong format and produces nonsense.
Unsolicited advice: schema mismatch bugs waste days. Version everything.
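One way to sketch this is a small header emitted once per stream; the field names and magic value below are hypothetical:

```cpp
#include <cstdint>

// Hypothetical stream header emitted once per connection or trace file.
// Decoders reject data whose versions they do not understand instead of
// silently decoding the wrong format.
struct StreamHeader final {
    std::uint32_t magic{0x54524143U};   // arbitrary sanity marker
    std::uint32_t build_id{0U};         // firmware build identifier
    std::uint16_t protocol_version{1U}; // framing/transport version
    std::uint16_t schema_version{1U};   // trace record layout version
};

// Host-side gate: refuse loudly rather than produce nonsense.
inline bool decoder_accepts(const StreamHeader& h,
                            std::uint16_t supported_schema) noexcept {
    return (h.magic == 0x54524143U) && (h.schema_version == supported_schema);
}
```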
Host-side tooling: where insight should live
If you keep the target signal clean, host tooling can be as rich as you want:
- trace decoding into JSON or CSV
- timeline views
- state transition diagrams
- slip and overflow dashboards
- correlation with test scenarios
This is also where you can afford heavier dependencies: parsers, GUI libraries, plotting libraries, data processing pipelines.
If your workflow is built around GitLab pipelines, host tooling also becomes a first-class artifact:
- collected traces as pipeline artifacts
- automatic decoding jobs
- visual reports attached to merge requests
CI support: make observability actionable, not just available
Observability data is useful only if it is used.
Good pipeline patterns:
- HIL jobs upload trace artifacts.
- A decode job turns traces into readable reports.
- Thresholds fail the pipeline when they indicate regressions:
  - slip count increased beyond a limit
  - overflow counters non-zero
  - unexpected fault codes
- Reports are retained for comparison across releases.
Unsolicited advice: treat overflow counters like failed assertions. If you see them, you are already outside your design envelope.
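A sketch of such a gate as a small host-side check run after decoding (the threshold values and field names are illustrative):

```cpp
#include <cstdint>

// Counters extracted from a decoded trace for one HIL run.
struct RunStats final {
    std::uint32_t slip_count{0U};
    std::uint32_t overflow_count{0U};
    std::uint16_t fault_code{0U};
};

// CI gate: overflows and faults get zero tolerance, like failed assertions;
// slips are allowed only up to an explicit budget.
inline bool within_envelope(const RunStats& s, std::uint32_t max_slips) noexcept {
    return (s.overflow_count == 0U) &&
           (s.fault_code == 0U) &&
           (s.slip_count <= max_slips);
}
```

Wiring this into a pipeline is then just a process exit code: the decode job returns nonzero when `within_envelope` is false, and the pipeline fails.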
What not to do
Avoid these common traps:
- Shipping verbose logging in production builds “just in case.”
- Printing from ISRs.
- Allocating memory to build log strings on target.
- Using observability that depends on timing-sensitive host reads.
- Adding “temporary debug code” that becomes permanent.
If you need deep introspection, build a separate debug or IoTest variant and keep production deterministic.
Minimal checklist
- On target: counters, minimal snapshots, fixed-size event records, explicit drop behavior.
- On wire: prefer compact binary for traces, decode on host.
- Version everything: build id, protocol version, schema version.
- On host: rich analysis, visualization, automated report generation.
- In CI: store trace artifacts, decode automatically, and enforce regression thresholds.
Part 10 will tie everything together: a GitLab pipeline blueprint and an incremental migration checklist, including how to structure jobs, enforce quality gates, and keep “deterministic firmware discipline” from becoming optional under schedule pressure.
The Complete "Modern C++ Firmware: Proven Strategies for Tiny, Critical Systems" Series:
- Part 1/10: The Case for Modern C++ on Tiny, Safety Critical Targets
- Part 2/10: Choosing C++20 Today, C++23 on a Short Leash
- Part 3/10: Deterministic By Construction: The Rules You Do Not Cross
- Part 4/10: Time and Scheduling Without Footguns
- Part 5/10: Concepts for Hardware Platforms, Not Vtables
- Part 6/10: No Allocation in the Loop: Memory Rules That Survive CI
- Part 7/10: Test the Firmware Without the Board: Host First Strategy
- Part 8/10: Python and ASCII Protocols for Hardware in the Loop
- Part 9/10: Observability Belongs on the PC, Not in the Production Binary
- Part 10/10: GitLab Pipeline Blueprint and a Migration Checklist
Need professional firmware development help? Engage with Polyrhythm
Discover more from John Farrier