From: Lukasz Raczylo <lukasz@raczylo.com>
To: netdev@vger.kernel.org
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>,
Claudiu Beznea <claudiu.beznea@tuxon.dev>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S . Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-rpi-kernel@lists.infradead.org
Subject: [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
Date: Fri, 24 Apr 2026 23:38:30 +0100
Message-ID: <cover.1777064117.git.lukasz@raczylo.com>

Hi netdev, Nicolas, Claudiu, linux-rpi,

This series proposes three candidate fixes for the silent TX stall
observed on Raspberry Pi 5 (BCM2712 SoC, Cadence GEM via RP1 PCIe
south bridge). The bug has been reported, with reproducers, at:

  * https://github.com/cilium/cilium/issues/43198
  * https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877

Cilium #43198 reports reproduction on linux-raspi 6.17.0-1004, and
explicitly notes reproduction with both Cilium/eBPF and
Calico/nftables dataplanes (i.e. not CNI-specific). We observe the
same failure mode on kernel 6.18.24 built from raspberrypi/linux
rpi-6.18.y @ f2f68e79f16f, across a 24-node Raspberry Pi 5 fleet.
The 6.17/6.18 commonality and the two-CNI reproduction together
point to a root cause below the packet-scheduling layer, in the
macb driver or the RP1 PCIe path.

Observed symptoms (our side, 6.18.24; consistent with the linked
reports):

  * queue->tx_tail stops advancing abruptly, from one second to
    the next;
  * /sys/class/net/<iface>/statistics/tx_packets stops incrementing;
  * qdisc backlog grows past zero; netif_stop_subqueue() is called;
  * RX counters continue advancing; the MAC IRQ line continues to
    fire (RX completions are handled);
  * no kernel log line is produced for the duration of the stall;
  * dev_watchdog does not fire: macb_netdev_ops has no
    .ndo_tx_timeout, and our reading is that trans_start is kept
    fresh by successful xmit prior to the ring filling;
  * recovery on our side has required `ip link set <iface> down/up`
    via an out-of-band watchdog DaemonSet.

Reading the current driver, we identified three plausible races
between driver and hardware, each of which could independently
produce the observed behaviour. We did not determine which is the
actual root cause -- that likely requires either BCM2712/RP1
documentation we do not have, or dynamic tracing of the driver
during an in-situ stall. The series therefore attempts to close
all three, with each commit message stating which specific race
that patch is targeting.

Patch 1/3 -- flush PCIe posted write after TSTART doorbell.
  Writes to NCR are posted PCIe writes and may not reach the MAC
  before the driver returns. If the TSTART doorbell is lost, no
  TX starts, no TCOMP arrives, and the ring goes quiescent. A
  read-back of NCR after the write is a standard read-after-write
  PCIe flush.
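
The read-back pattern can be illustrated with a small userspace toy
model. This is a sketch under stated assumptions, not the patch
itself: toy_dev, toy_writel(), toy_readl() and toy_kick_tx() are
hypothetical names, and TOY_TSTART is only a stand-in for the TSTART
bit in NCR. The model also exaggerates: it holds a posted write until
a read drains it, whereas real hardware merely delays it. In the
driver the equivalent would presumably be a macb_readl() of NCR
issued right after the macb_writel() that sets TSTART.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_TSTART		(1u << 9)	/* stand-in for the TSTART bit in NCR */
#define TOY_POSTED_DEPTH	4

/* Hypothetical device behind a bus that buffers ("posts") writes. */
struct toy_dev {
	uint32_t ncr;				/* value the device actually sees */
	uint32_t posted[TOY_POSTED_DEPTH];	/* writes still in flight */
	size_t nposted;
};

/* Apply every in-flight write to the device register, oldest first. */
static void toy_drain(struct toy_dev *d)
{
	for (size_t i = 0; i < d->nposted; i++)
		d->ncr = d->posted[i];
	d->nposted = 0;
}

/* Posted write: returns before the device has seen the value. */
static void toy_writel(struct toy_dev *d, uint32_t val)
{
	if (d->nposted == TOY_POSTED_DEPTH)
		toy_drain(d);
	d->posted[d->nposted++] = val;
}

/* Non-posted read: cannot complete until earlier writes have landed. */
static uint32_t toy_readl(struct toy_dev *d)
{
	toy_drain(d);
	return d->ncr;
}

/* Ring the TX doorbell; with flush set, read back to push the write out. */
static void toy_kick_tx(struct toy_dev *d, uint32_t shadow_ncr, int flush)
{
	toy_writel(d, shadow_ncr | TOY_TSTART);
	if (flush)
		(void)toy_readl(d);
}

/* Device-side view: has the doorbell arrived yet? */
static int toy_dev_sees_tstart(const struct toy_dev *d)
{
	return (d->ncr & TOY_TSTART) != 0;
}
```

Calling toy_kick_tx() with flush == 0 leaves the doorbell in the
posted queue; with flush != 0 the device-side register has TSTART
set by the time the call returns.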

Patch 2/3 -- re-check ISR after IER re-enable in macb_tx_poll().
  An existing comment in macb_tx_poll() notes that completions
  raised while TCOMP is masked do not re-fire when IER is
  re-enabled; the code mitigates the window with
  macb_tx_complete_pending(), which inspects driver-visible ring
  state only (after rmb()). On PCIe-attached parts the descriptor
  DMA write that sets TX_USED can remain in flight when that check
  runs; the rmb() orders CPU reads but does not retire in-flight
  peripheral DMA. Reading ISR directly after IER re-enable
  addresses this in two ways: (a) the MMIO read is an architected
  PCIe read barrier for prior DMA writes, so a subsequent
  macb_tx_complete_pending() sees up-to-date TX_USED state; (b) it
  directly observes a pending TCOMP bit if the hardware has one
  set. Either signal reschedules NAPI.
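
The masked-completion window can be sketched with a toy interrupt
block. This is a minimal model, assuming the behaviour the driver
comment describes (a status bit latched while masked does not raise
an interrupt when it is later unmasked). toy_intc, toy_raise(),
toy_enable() and toy_poll_finish() are hypothetical names, TOY_TCOMP
is a stand-in for the TCOMP status bit, and the model covers only
effect (b); the DMA-flushing side effect (a) of the MMIO read is not
modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_TCOMP	(1u << 7)	/* stand-in for the TX-complete status bit */

/* Hypothetical interrupt block: status always latches, but an
 * interrupt is delivered only if the source is unmasked at that time. */
struct toy_intc {
	uint32_t isr;		/* latched status bits */
	uint32_t enabled;	/* bits currently unmasked */
	unsigned int irqs;	/* interrupts delivered to the CPU */
};

/* Hardware side: a completion is signalled. */
static void toy_raise(struct toy_intc *c, uint32_t bit)
{
	c->isr |= bit;
	if (c->enabled & bit)
		c->irqs++;
}

/* Unmask a source; an already-latched bit does not re-fire. */
static void toy_enable(struct toy_intc *c, uint32_t bit)
{
	c->enabled |= bit;
}

/*
 * End of a poll pass: unmask TCOMP, then look at the status register.
 * Returns true when polling must be rescheduled because a completion
 * slipped in while the source was masked.
 */
static bool toy_poll_finish(struct toy_intc *c)
{
	toy_enable(c, TOY_TCOMP);
	return (c->isr & TOY_TCOMP) != 0;
}
```

Without the status check after toy_enable(), a completion raised
while masked would leave irqs at zero and nothing would restart
polling, which is the quiescent state described above.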

Patch 3/3 -- TX stall watchdog. Defence-in-depth.
  If patches 1 and 2 close the races we identified, this patch
  performs a single spin_lock_irqsave/unlock and a branch per
  queue per second with no other effect. If a further race remains
  that we have not identified, it invokes the driver's own existing
  macb_tx_restart(), which already verifies that TBQP is behind
  tx_head before re-asserting TSTART. We include this patch
  because we have empirically observed multi-minute stalls on this
  hardware; we are willing to drop it if the preference is for
  1 and 2 to stand alone.
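
The per-queue check can be sketched as follows. This is an
illustration only: the stall criterion used here (work pending on two
consecutive ticks with no movement of the tail index) is an
assumption, not necessarily what patch 3 implements. toy_txq and
toy_watchdog_tick() are hypothetical names, and the restarts counter
stands in for the call to macb_tx_restart(). Locking is omitted.

```c
#include <stdbool.h>

/* Hypothetical per-queue state; head/tail mirror tx_head/tx_tail. */
struct toy_txq {
	unsigned int head;		/* next descriptor the driver will fill */
	unsigned int tail;		/* next descriptor awaiting completion */
	unsigned int last_tail;		/* tail seen at the previous tick */
	bool last_pending;		/* ring was non-empty at the previous tick */
	unsigned int restarts;		/* stand-in for macb_tx_restart() calls */
};

/*
 * Called once per second per queue. Reports a stall only when the
 * ring held work on both this tick and the previous one and the
 * tail has not moved in between.
 */
static bool toy_watchdog_tick(struct toy_txq *q)
{
	bool pending = q->head != q->tail;
	bool stalled = pending && q->last_pending &&
		       q->tail == q->last_tail;

	q->last_tail = q->tail;
	q->last_pending = pending;

	if (stalled)
		q->restarts++;	/* the real driver would restart TX here */
	return stalled;
}
```

On a healthy queue the tail moves between ticks, or the ring is
empty, so the tick reduces to a comparison and a branch.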

Status and testing:

  * Apply-tested against Linux net-next HEAD (this series is
    generated from it) and against raspberrypi/linux rpi-6.18.y @
    f2f68e79f16f (the fork our fleet runs). All three apply
    cleanly on net-next; the rpi fork carries an additional local
    `bool tx_pending` field on `struct macb_queue` that is not in
    mainline, so we maintain a small rebased patch 3 hunk for it.
  * Build-tested: the series compiles cleanly as part of our Talos
    image build pipeline on arm64.
  * Runtime-tested, early signal: ~4 h 20 min of post-patch uptime
    on the canary node, ~3 h 15 min on the slowest (last master to
    upgrade), ~95 node-hours cumulative across the 24-node fleet at
    the time this cover letter was written. During that window the
    fleet-wide counts are zero RECOVER events, zero `[tx-stall]`
    partial markers (from an out-of-band userspace detector that
    records even transient one-second freezes that recover before
    the 3-second threshold), and zero ping-failure markers. The
    pre-patch reference window (2026-04-24 14:00-18:10 UTC, when
    proper monitoring was in place) showed multiple stalls per hour
    at fleet level; at that rate we would expect on the order of 50
    stalls in 95 node-hours, and the actual count is zero. We will
    follow up with 24 h and 1-week data points as the same
    observability runs forward; the direction so far is consistent
    with patches 1 and 2 closing the underlying race(s) and patch 3
    correctly being a no-op on healthy hardware.
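
For readers unfamiliar with the markers above, one way such a
userspace detector could classify one-second samples is sketched
below. The detector is not part of this series and its logic is not
described here, so everything in the sketch is an assumption: the
names toy_detector and toy_detector_sample(), the use of qdisc
backlog to tell a stall from an idle link, and the mapping of
sub-threshold freezes to a partial marker and of three consecutive
frozen samples to a RECOVER event.

```c
#include <stdint.h>

#define TOY_STALL_THRESHOLD	3	/* consecutive frozen one-second samples */

enum toy_verdict { TOY_OK, TOY_PARTIAL, TOY_RECOVER };

/* Hypothetical detector state for one interface. */
struct toy_detector {
	uint64_t last_tx_packets;	/* tx_packets at the previous sample */
	unsigned int frozen;		/* consecutive samples without TX progress */
	int primed;			/* set once a baseline sample exists */
};

/*
 * Feed one sample per second. A sample counts as frozen when
 * tx_packets has not moved since the previous sample while the
 * qdisc still holds a backlog.
 */
static enum toy_verdict toy_detector_sample(struct toy_detector *d,
					    uint64_t tx_packets,
					    uint64_t backlog)
{
	int frozen_now = d->primed &&
			 tx_packets == d->last_tx_packets &&
			 backlog > 0;

	d->last_tx_packets = tx_packets;
	d->primed = 1;

	if (!frozen_now) {
		d->frozen = 0;
		return TOY_OK;
	}
	if (++d->frozen >= TOY_STALL_THRESHOLD) {
		d->frozen = 0;	/* caller would bounce the link here */
		return TOY_RECOVER;
	}
	return TOY_PARTIAL;
}
```

An idle interface (no backlog) never counts as frozen in this model,
which is why the backlog is sampled alongside tx_packets.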

The series does not depend on any other in-flight work we are
aware of. Happy to split, rebase, or drop individual patches on
feedback. All three are independently revertable.

Lukasz Raczylo (3):
  net: macb: flush PCIe posted write after TSTART doorbell
  net: macb: re-check ISR after IER re-enable in macb_tx_poll
  net: macb: add TX stall watchdog as defence-in-depth safety net

 drivers/net/ethernet/cadence/macb.h      |  5 ++
 drivers/net/ethernet/cadence/macb_main.c | 99 +++++++++++++++++++++---
 2 files changed, 94 insertions(+), 10 deletions(-)

--
2.53.0