Buffer Sanity Inspector

The mpisppy.debug_utils package provides a passive content-check utility for the MPI-RMA send and receive buffers used by the hub-and- spoke system. It is intended as a debugging aid when you suspect that a buffer is being written to from somewhere unexpected (for example, when a spoke sees a shutdown signal that the hub did not send).

The inspector does not modify producer code, does not introduce new MPI traffic, and is no-cost when not invoked.

When to Use This

  • A spoke is acting on data it should not have received. A spurious SHUTDOWN is the canonical example, but the same idea applies to any field — nonants outside their bounds, a write_id that went backwards, NaN data on a buffer the hub claims to have published, etc.

  • A new field/cylinder is being introduced and you want a cheap invariant check during development.

  • Reproducing intermittent buffer-content bugs where adding a print in the hot path is too noisy unless gated.

What the Inspector Checks

Generic checks (run for every field):

  • Trailing write_id slot is a finite, non-negative, integer-valued double.

  • Send buffers: the trailing slot equals buf.id().

  • Receive buffers: the trailing slot is not less than buf.id() (the last id that get_receive_buffer accepted). An optional ctx.last_write_id provides an additional, stricter baseline.

  • Data region: no inf values; no NaN values once write_id >= 1.

  • Padding region (between logical_len and padded_len) remains NaN — its canonical state from communicator_array. A finite value anywhere in padding is a write that ran past the field’s logical length.

Per-Field checks:

  • SHUTDOWN: only two legitimate states — NaN data with write_id == 0 (initial, no publish yet) or data[0] == 1.0 with write_id >= 1 (Hub.send_terminate has fired). Anything else, including data[0] == 0.0, is treated as corruption.

  • NONANT: data length is a positive multiple of ctx.get_nonant_count() (the publisher may hold several local scenarios, so the buffer can be wider than one scenario’s worth); componentwise bounds against [ctx.nonant_lower, ctx.nonant_upper] are only checked when the bound arrays match the buffer length.

  • NONANT_LOWER_BOUNDS / NONANT_UPPER_BOUNDS: length check; consistency with the counterpart bound when supplied via ctx.

  • OBJECTIVE_INNER_BOUND / OBJECTIVE_OUTER_BOUND: length 1.

  • BEST_XHAT: length at least ctx.get_nonant_count(); nonant prefix within bounds when supplied.

Manual Use

from mpisppy.debug_utils import inspect_buffer, InspectContext
from mpisppy.cylinders.spwindow import Field

ctx = InspectContext(nonant_count=spbase.nonant_length)
rep = inspect_buffer(some_recv_buf, Field.NONANT, ctx, verbose=True)
if not rep.ok:
    print(rep)

Report is a small dataclass with ok, findings (list of strings), severity ("warn" or "error"), and an optional dump populated when verbose=True. The inspector never raises; the caller decides whether to log, raise, or treat the read as stale.

Command-Line Trigger at Cylinder Shutdown

The --inspect-buffers-on-shutdown flag, exposed through the standard Config system (popular_args), causes each spoke to run the inspector on its SHUTDOWN receive buffer at the moment a shutdown is detected (inside got_kill_signal, only when the signal fires — not on every poll). Findings, with rank info, are printed when the report is not ok:

mpiexec -np N python my_driver.py --inspect-buffers-on-shutdown

When the flag is unset (the default), the inspector is never called and the shutdown-poll cost is unchanged.

Choice of trigger point: a spurious SHUTDOWN is most diagnostic at the moment of detection — the relevant buffer state has just arrived and has not yet been overwritten by later activity. The check fires once per spoke per cylinder shutdown, regardless of whether the signal was legitimate; legitimate shutdowns produce an empty findings list and print nothing.

Extending: Adding a Field Checker

Each per-field check is a function with the signature (buf, report, ctx) -> None that appends findings to report when invariants are violated. Register it in the CHECKERS dict in mpisppy/debug_utils/buffer_inspect.py:

def _check_my_field(buf, report, ctx):
    data = buf.value_array()
    if len(data) != some_expected_length:
        report.add(f"MY_FIELD wrong length: {len(data)}", severity="error")

CHECKERS[Field.MY_FIELD] = _check_my_field

Producers are intentionally left untouched; any context the checker needs (lengths, bounds, scenario tree info) is passed in via InspectContext.

See Also

The internal design document is at doc/designs/async_buffer_sanity_design.md, including the invariants the inspector relies on and explicit non-goals (cross-cylinder consensus, module-level history state).