public inbox for netdev@vger.kernel.org
* [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
@ 2026-04-24 22:38 Lukasz Raczylo
From: Lukasz Raczylo @ 2026-04-24 22:38 UTC (permalink / raw)
  To: netdev
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel,
	linux-arm-kernel, linux-rpi-kernel

Hi netdev, Nicolas, Claudiu, linux-rpi,

This series proposes three candidate fixes for the silent TX stall
observed on Raspberry Pi 5 (BCM2712 SoC, Cadence GEM via RP1 PCIe
south bridge).  The bug has been reported, with reproducers, at:

  * https://github.com/cilium/cilium/issues/43198
  * https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877

Cilium #43198 reports reproduction on linux-raspi 6.17.0-1004, and
explicitly notes reproduction with both Cilium/eBPF and
Calico/nftables dataplanes (i.e. not CNI-specific).  We observe the
same failure mode on kernel 6.18.24 built from raspberrypi/linux
rpi-6.18.y @ f2f68e79f16f, across a 24-node Raspberry Pi 5 fleet.
The 6.17/6.18 commonality and the two-CNI reproduction together
point to a root cause below the packet-scheduling layer, in the
macb driver or the RP1 PCIe path.

Observed symptoms (our side, 6.18.24; consistent with the linked
reports):

  * queue->tx_tail stops advancing abruptly, from one second to the
    next;
  * /sys/class/net/<iface>/statistics/tx_packets stops incrementing;
  * qdisc backlog grows past zero; netif_stop_subqueue() is called;
  * RX counters continue advancing; the MAC IRQ line continues to
    fire (RX completions are handled);
  * no kernel log line is produced for the duration of the stall;
  * dev_watchdog does not fire: macb_netdev_ops has no
    .ndo_tx_timeout, and our reading is that trans_start is kept
    fresh by successful xmit prior to the ring filling;
  * recovery on our side has required `ip link set <iface> down/up`
    via an out-of-band watchdog DaemonSet.
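
As an illustration of how the symptoms above combine into a stall
signature, here is a minimal userspace sketch in C.  It is not the
detector we run: every model_* name is hypothetical, the samples are
assumed to be taken once per second from the interface counters, and
the three-sample threshold is an assumption borrowed from the 3-second
threshold mentioned under testing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-second snapshot of the counters listed above. */
struct model_sample {
	uint64_t tx_packets;	/* statistics/tx_packets */
	uint64_t rx_packets;	/* statistics/rx_packets */
	uint64_t backlog;	/* qdisc backlog, in packets */
};

struct model_detector {
	struct model_sample prev;	/* previous snapshot */
	unsigned int frozen_secs;	/* consecutive stalled samples */
};

#define MODEL_STALL_THRESHOLD 3	/* assumed, in samples */

/*
 * Feed one snapshot.  A sample counts as stalled when TX did not move,
 * RX did move, and packets are waiting in the qdisc.  Returns true once
 * the stalled streak reaches the threshold.
 */
static bool model_detector_feed(struct model_detector *d,
				const struct model_sample *s)
{
	assert(d && s);

	bool tx_frozen = s->tx_packets == d->prev.tx_packets;
	bool rx_alive = s->rx_packets != d->prev.rx_packets;
	bool queued = s->backlog > 0;

	if (tx_frozen && rx_alive && queued)
		d->frozen_secs++;
	else
		d->frozen_secs = 0;

	d->prev = *s;
	return d->frozen_secs >= MODEL_STALL_THRESHOLD;
}
```

Requiring RX progress and a non-empty backlog keeps an idle link from
being counted as stalled.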

Reading the current driver we identified three plausible races
between driver and hardware, each of which could independently
produce the observed behaviour.  We did not determine which is the
actual root cause -- that likely requires either BCM2712/RP1
documentation we do not have, or dynamic tracing of the driver
during an in-situ stall.  The series therefore attempts to close
all three, with each commit message stating which specific race
that patch is targeting.

  Patch 1/3 -- flush PCIe posted write after TSTART doorbell.
  Writes to NCR are posted PCIe writes and may not reach the MAC
  before the driver returns.  If the TSTART doorbell is lost, no
  TX starts, no TCOMP arrives, and the ring goes quiescent.  A
  read-back of NCR after the write is a standard read-after-write
  PCIe flush.
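
The effect of the read-back can be shown with a small userspace
model in C.  This is a sketch, not the patch: the model_* names are
hypothetical, the link is assumed to buffer at most one posted
write, and MODEL_TSTART only stands in for the real TSTART bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TSTART (1u << 9)	/* stand-in for the TSTART bit in NCR */

/* Hypothetical link: one posted write may sit between CPU and MAC. */
struct model_link {
	uint32_t dev_ncr;	/* value the modelled MAC actually sees */
	uint32_t posted_val;	/* write still in flight, if any */
	bool posted_pending;
};

/* Posted write: returns before the device has seen the value. */
static void model_writel(struct model_link *l, uint32_t val)
{
	assert(l);
	l->posted_val = val;
	l->posted_pending = true;
}

/*
 * Non-posted read: in this model, any earlier posted write is
 * delivered to the device before the read returns its value.
 */
static uint32_t model_readl(struct model_link *l)
{
	assert(l);
	if (l->posted_pending) {
		l->dev_ncr = l->posted_val;
		l->posted_pending = false;
	}
	return l->dev_ncr;
}

/* Doorbell followed by a read-back that flushes the posted write. */
static void model_ring_doorbell(struct model_link *l)
{
	model_writel(l, model_readl(l) | MODEL_TSTART);
	(void)model_readl(l);
}
```

After model_writel() alone the modelled MAC still sees the old NCR
value; after model_ring_doorbell() it sees TSTART set.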

  Patch 2/3 -- re-check ISR after IER re-enable in macb_tx_poll().
  An existing comment in macb_tx_poll() notes that completions
  raised while TCOMP is masked do not re-fire when IER is
  re-enabled, and mitigates the window with macb_tx_complete_pending(),
  which inspects driver-visible ring state only (after rmb()).  On
  PCIe-attached parts the descriptor DMA write that sets TX_USED
  can remain in flight when that check runs; the rmb() orders CPU
  reads but does not retire peripheral DMA.  Reading ISR directly
  after IER re-enable addresses this in two ways: (a) the MMIO read
  is an architected PCIe read barrier for prior DMA writes, so a
  subsequent macb_tx_complete_pending() sees up-to-date TX_USED
  state; (b) it directly observes a pending TCOMP bit if the
  hardware has one set.  Either signal reschedules NAPI.
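
The difference between the ring-only check and the ISR re-check can
be sketched with a userspace model in C.  All model_* names are
hypothetical, MODEL_TCOMP only stands in for the real TCOMP bit, and
the model assumes that an MMIO read makes an in-flight descriptor
write visible in memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TCOMP (1u << 7)	/* stand-in for the TCOMP bit in ISR */

struct model_dev {
	uint32_t isr;		/* pending interrupt status bits */
	bool desc_used_in_ram;	/* TX_USED as the CPU sees it in memory */
	bool dma_in_flight;	/* device wrote TX_USED, not yet in RAM */
};

/* MMIO read of ISR: in this model it lands any in-flight DMA first. */
static uint32_t model_read_isr(struct model_dev *d)
{
	assert(d);
	if (d->dma_in_flight) {
		d->desc_used_in_ram = true;
		d->dma_in_flight = false;
	}
	return d->isr;
}

/* Ring-state check: looks at memory only. */
static bool model_tx_complete_pending(const struct model_dev *d)
{
	assert(d);
	return d->desc_used_in_ram;
}

/* Ring-only re-check, as in the existing mitigation. */
static bool model_recheck_ring_only(struct model_dev *d)
{
	return model_tx_complete_pending(d);
}

/* ISR read first, then ring state; either signal asks for a re-poll. */
static bool model_recheck_with_isr(struct model_dev *d)
{
	uint32_t isr = model_read_isr(d);

	return (isr & MODEL_TCOMP) || model_tx_complete_pending(d);
}
```

With a descriptor write still in flight and no TCOMP bit set, the
ring-only check reports nothing pending, while the ISR variant does.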

  Patch 3/3 -- TX stall watchdog.  Defence-in-depth.  If patches
  1 and 2 close the races we identified, this patch performs a
  single spin_lock_irqsave/unlock and a branch per queue per
  second with no other effect.  If a further race remains that we
  have not identified, it invokes the driver's own existing
  macb_tx_restart(), which already verifies that TBQP is behind
  tx_head before re-asserting TSTART.  We include this patch
  because we have empirically observed multi-minute stalls on this
  hardware; we are willing to drop it if the preference is for
  1 and 2 to stand alone.
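
The per-queue check can be sketched as a userspace model in C.  This
illustrates the idea only and is not the patch: the model_* names
are hypothetical, locking is omitted, and the stall criterion (work
pending and tx_tail unchanged since the previous tick) is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>

struct model_queue {
	unsigned int tx_head;	/* next descriptor the driver will fill */
	unsigned int tx_tail;	/* next descriptor awaiting completion */
	unsigned int last_tail;	/* tx_tail as seen at the previous tick */
	unsigned int restarts;	/* times the model asked for a restart */
};

/*
 * One watchdog tick for one queue.  Returns true when work is pending
 * and tx_tail has not moved since the previous tick.  In the driver
 * this is where macb_tx_restart() would be called.
 */
static bool model_watchdog_tick(struct model_queue *q)
{
	assert(q);

	bool pending = q->tx_head != q->tx_tail;
	bool stalled = pending && q->tx_tail == q->last_tail;

	q->last_tail = q->tx_tail;
	if (stalled)
		q->restarts++;
	return stalled;
}
```

On an idle or progressing queue the tick only updates last_tail,
which matches the "one branch per queue per second" cost described
above.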

Status and testing:

  * Apply-tested against Linux net-next HEAD (this series is
    generated from it) and against raspberrypi/linux rpi-6.18.y @
    f2f68e79f16f (the fork our fleet runs): all three apply
    cleanly on net-next; the rpi fork carries an additional local
    `bool tx_pending` field on `struct macb_queue` that is not in
    mainline, so we maintain a small rebased patch 3 hunk for it.
  * Build-tested: the series compiles cleanly as part of our Talos
    image build pipeline on arm64.
  * Runtime-tested, early signal: ~4 h 20 min of post-patch uptime
    on the canary node, ~3 h 15 min on the slowest (last master to
    upgrade), ~95 node-hours cumulative across the 24-node fleet at
    the time this cover letter was written.  During that window the
    fleet-wide counts are zero RECOVER events, zero `[tx-stall]`
    partial markers (an out-of-band userspace detector that records
    even transient one-second freezes that recover before the
    3-second threshold), and zero ping-failure markers.  The
    pre-patch reference window (2026-04-24 14:00-18:10 UTC, when
    proper monitoring was in place) showed multiple stalls per hour
    at fleet level; at that rate we would expect on the order of 50
    stalls in 95 node-hours, whereas the observed count is zero.  We
    will follow up with 24 h and 1-week data points as the same
    observability runs forward; the results so far are consistent
    with patches 1 and 2 closing the underlying race(s) and patch 3
    correctly being a no-op on healthy hardware.

The series does not depend on any other in-flight work we are
aware of.  Happy to split, rebase, or drop individual patches on
feedback.  All three are independently revertable.

Lukasz Raczylo (3):
  net: macb: flush PCIe posted write after TSTART doorbell
  net: macb: re-check ISR after IER re-enable in macb_tx_poll
  net: macb: add TX stall watchdog as defence-in-depth safety net

 drivers/net/ethernet/cadence/macb.h      |  5 ++
 drivers/net/ethernet/cadence/macb_main.c | 99 +++++++++++++++++++++---
 2 files changed, 94 insertions(+), 10 deletions(-)

-- 
2.53.0



Thread overview: 5+ messages
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable in macb_tx_poll Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Lukasz Raczylo
2026-04-25 21:48 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
