Buffer Sanity Inspector
The mpisppy.debug_utils package provides a passive content-check
utility for the MPI-RMA send and receive buffers used by the hub-and-
spoke system. It is intended as a debugging aid when you suspect that
a buffer is being written to from somewhere unexpected (for example,
when a spoke sees a shutdown signal that the hub did not send).
The inspector does not modify producer code, does not introduce new MPI traffic, and costs nothing when it is not invoked.
When to Use This
A spoke is acting on data it should not have received. A spurious SHUTDOWN is the canonical example, but the same idea applies to any field: nonants outside their bounds, a write_id that went backwards, NaN data on a buffer the hub claims to have published, etc.
A new field/cylinder is being introduced and you want a cheap invariant check during development.
Reproducing intermittent buffer-content bugs where adding a print in the hot path is too noisy unless gated.
What the Inspector Checks
Generic checks (run for every field):
Trailing write_id slot is a finite, non-negative, integer-valued double.
Send buffers: the trailing slot equals buf.id().
Receive buffers: the trailing slot is not less than buf.id() (the last id that get_receive_buffer accepted). An optional ctx.last_write_id provides an additional, stricter baseline.
Data region: no inf values; no NaN values once write_id >= 1.
Padding region (between logical_len and padded_len) remains NaN, its canonical state from communicator_array. A finite value anywhere in padding is a write that ran past the field's logical length.
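The generic checks above can be sketched on a plain sequence of floats. This is a minimal illustration, not the inspector's code: the function name generic_findings, the min_write_id parameter, and the assumed layout [data | NaN padding | write_id] are hypothetical stand-ins chosen for the example.

```python
import math


def generic_findings(raw, logical_len, min_write_id=0):
    """Return findings for one buffer, assuming (for illustration only)
    the layout [data | NaN padding | trailing write_id]."""
    write_id = raw[-1]
    data = raw[:logical_len]
    padding = raw[logical_len:-1]

    # Trailing slot must be finite, non-negative, and integer-valued.
    # The isfinite test short-circuits before int() can see NaN or inf.
    if not math.isfinite(write_id) or write_id < 0 or write_id != int(write_id):
        return [f"bad write_id: {write_id!r}"]

    findings = []
    # Receive-side baseline: the id must not go backwards.
    if write_id < min_write_id:
        findings.append(f"write_id {write_id} is below baseline {min_write_id}")
    # Data region: inf is never valid; NaN is valid only before first publish.
    if any(math.isinf(x) for x in data):
        findings.append("inf in data region")
    if write_id >= 1 and any(math.isnan(x) for x in data):
        findings.append("NaN in data region after first publish")
    # Padding must still hold its initial NaN fill.
    if not all(math.isnan(x) for x in padding):
        findings.append("non-NaN value in padding region")
    return findings
```

An empty list means the buffer passed. Because the function only iterates and indexes, it accepts a list or a one-dimensional NumPy array alike.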
Per-field checks:
SHUTDOWN: only two legitimate states: NaN data with write_id == 0 (initial, no publish yet), or data[0] == 1.0 with write_id >= 1 (Hub.send_terminate has fired). Anything else, including data[0] == 0.0, is treated as corruption.
NONANT: data length is a positive multiple of ctx.get_nonant_count() (the publisher may hold several local scenarios, so the buffer can be wider than one scenario's worth); componentwise bounds against [ctx.nonant_lower, ctx.nonant_upper] are only checked when the bound arrays match the buffer length.
NONANT_LOWER_BOUNDS / NONANT_UPPER_BOUNDS: length check; consistency with the counterpart bound when supplied via ctx.
OBJECTIVE_INNER_BOUND / OBJECTIVE_OUTER_BOUND: length 1.
BEST_XHAT: length at least ctx.get_nonant_count(); nonant prefix within bounds when supplied.
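The SHUTDOWN rule is small enough to write out as a classifier. The sketch below assumes the caller has already extracted data[0] and the write_id; the function name shutdown_state and its string return values are hypothetical and are not part of mpisppy.

```python
import math


def shutdown_state(data0, write_id):
    """Classify a SHUTDOWN buffer from its first data slot and write_id."""
    # Initial state: nothing has been published yet.
    if math.isnan(data0) and write_id == 0:
        return "initial"
    # Terminate has been published at least once.
    if data0 == 1.0 and write_id >= 1:
        return "terminated"
    # Every other combination, including data0 == 0.0, is corruption.
    return "corrupt"
```

Note that a value of 1.0 paired with write_id == 0, or NaN paired with a positive write_id, also falls through to "corrupt": both the data and the id have to agree.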
Manual Use
from mpisppy.debug_utils import inspect_buffer, InspectContext
from mpisppy.cylinders.spwindow import Field
ctx = InspectContext(nonant_count=spbase.nonant_length)
rep = inspect_buffer(some_recv_buf, Field.NONANT, ctx, verbose=True)
if not rep.ok:
    print(rep)
Report is a small dataclass with ok, findings (list of
strings), severity ("warn" or "error"), and an optional
dump populated when verbose=True. The inspector never raises;
the caller decides whether to log, raise, or treat the read as stale.
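Since the inspector never raises, the policy lives in the caller. The sketch below shows one possible policy using a stand-in Report that mirrors the fields described above. The stand-in class, the handle function, and its strict parameter are assumptions for illustration; the real Report may be implemented differently (for example, ok might be a stored field and not a property).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Report:
    """Stand-in with the fields described above; not the mpisppy class."""
    findings: List[str] = field(default_factory=list)
    severity: str = "warn"
    dump: Optional[str] = None

    @property
    def ok(self):
        # In this sketch a report is ok exactly when it has no findings.
        return not self.findings

    def add(self, message, severity="warn"):
        self.findings.append(message)
        if severity == "error":
            self.severity = "error"


def handle(report, strict=False):
    """Return True to use the read, False to treat it as stale."""
    if report.ok:
        return True
    for line in report.findings:
        print(f"[{report.severity}] {line}")
    # Hypothetical policy: escalate errors only when the caller asks for it.
    if strict and report.severity == "error":
        raise RuntimeError("buffer failed inspection")
    return False
```

A development run might call handle(rep, strict=True) to stop at the first corrupted buffer, while a long production run logs the findings and skips the read.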
Command-Line Trigger at Cylinder Shutdown
The --inspect-buffers-on-shutdown flag, exposed through the
standard Config system (popular_args), causes each spoke to
run the inspector on its SHUTDOWN receive buffer at the moment a
shutdown is detected (inside got_kill_signal, only when the
signal fires — not on every poll). Findings, with rank info, are
printed when the report is not ok:
mpiexec -np N python my_driver.py --inspect-buffers-on-shutdown
When the flag is unset (the default), the inspector is never called and the shutdown-poll cost is unchanged.
Choice of trigger point: a spurious SHUTDOWN is most diagnostic at
the moment of detection — the relevant buffer state has just arrived
and has not yet been overwritten by later activity. The check fires
once per spoke per cylinder shutdown, regardless of whether the
signal was legitimate; legitimate shutdowns produce an empty
findings list and print nothing.
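The gating described above (flag off means no call; flag on means one call at the moment of detection) can be sketched as follows. SpokeSketch and its constructor arguments are hypothetical stand-ins and do not reproduce the actual spoke class or the body of got_kill_signal.

```python
class SpokeSketch:
    """Hypothetical stand-in for a spoke; not the mpisppy class."""

    def __init__(self, inspect_on_shutdown, poll_shutdown, inspect):
        self.inspect_on_shutdown = inspect_on_shutdown  # value of the CLI flag
        self.poll_shutdown = poll_shutdown  # callable: True once shutdown is seen
        self.inspect = inspect  # callable that runs the inspector
        self._inspected = False

    def got_kill_signal(self):
        fired = self.poll_shutdown()
        # Inspect only when the signal fires, only if the flag is set,
        # and only once per shutdown.
        if fired and self.inspect_on_shutdown and not self._inspected:
            self._inspected = True
            self.inspect()
        return fired
```

With the flag unset, the only extra work per poll in this sketch is one boolean test.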
Extending: Adding a Field Checker
Each per-field check is a function with the signature
(buf, report, ctx) -> None that appends findings to report
when invariants are violated. Register it in the
CHECKERS dict in mpisppy/debug_utils/buffer_inspect.py:
def _check_my_field(buf, report, ctx):
    data = buf.value_array()
    if len(data) != some_expected_length:
        report.add(f"MY_FIELD wrong length: {len(data)}", severity="error")
CHECKERS[Field.MY_FIELD] = _check_my_field
Producers are intentionally left untouched; any context the checker
needs (lengths, bounds, scenario tree info) is passed in via
InspectContext.
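To make the registration pattern concrete, here is a self-contained sketch of a registry and its dispatch. Everything in it is a stand-in: the Field enum, ListReport, run_field_check, and ctx.expected_length are hypothetical, and the checker receives the data sequence directly where the real signature receives a buffer object.

```python
from enum import Enum, auto
from types import SimpleNamespace


class Field(Enum):
    """Stand-in for mpisppy.cylinders.spwindow.Field."""
    SHUTDOWN = auto()
    MY_FIELD = auto()


class ListReport:
    """Minimal stand-in report that just collects findings."""

    def __init__(self):
        self.findings = []

    def add(self, message, severity="warn"):
        self.findings.append((severity, message))


CHECKERS = {}


def _check_my_field(data, report, ctx):
    # The expected length arrives through the context, not from the producer.
    if len(data) != ctx.expected_length:
        report.add(f"MY_FIELD wrong length: {len(data)}", severity="error")


CHECKERS[Field.MY_FIELD] = _check_my_field


def run_field_check(field, data, ctx):
    """Run the registered per-field checker, if any, and return the report."""
    report = ListReport()
    checker = CHECKERS.get(field)
    if checker is not None:
        checker(data, report, ctx)
    return report


ctx = SimpleNamespace(expected_length=3)
print(run_field_check(Field.MY_FIELD, [1.0, 2.0], ctx).findings)
```

A field with no registered checker produces an empty report in this sketch, which matches the idea that only the generic checks apply to it.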
See Also
The internal design document is at
doc/designs/async_buffer_sanity_design.md, including the
invariants the inspector relies on and explicit non-goals
(cross-cylinder consensus, module-level history state).