From: Lukasz Raczylo <lukasz@raczylo.com>
To: netdev@vger.kernel.org
Cc: Theo Lebrun <theo.lebrun@bootlin.com>,
Andrea della Porta <andrea.porta@suse.com>,
Nicolas Ferre <nicolas.ferre@microchip.com>,
Claudiu Beznea <claudiu.beznea@tuxon.dev>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S . Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-rpi-kernel@lists.infradead.org
Subject: [PATCH net-next v2 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
Date: Thu, 14 May 2026 22:54:56 +0100
Message-ID: <20260514215459.36109-1-lukasz@raczylo.com>
In-Reply-To: <cover.1777064117.git.lukasz@raczylo.com>
Hi netdev, Théo, Andrea, linux-rpi,
v2 of the silent TX stall series. The v1 RFC sits at:
https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/
Reframing first. The v1 cover claimed "zero events post-patch";
that was true only at the level visible to the user-space watchdog.
A dmesg sweep prompted by Andrea's review -- with patch 3's warn
made unconditional, per his ask -- revealed kernel-level evidence
that patches 1 and 2 are partial at best. Patch 3 is empirically
the load-bearing fix on this platform: it caught and recovered a
real lost-TCOMP stall on pi-data-02 at 2026-05-05T13:24:09Z
(queue 0, tail=259564431 head=259564433 after ~260M TX, HW
ETHS tx_frames counter advancing through the event while driver
tx_tail did not) without user-space involvement.
So the v2 narrative reads:
* Patch 1 (PCIe posted-write flush) and patch 2 (PCIe read
barrier before descriptor check) close two specific
candidate races in the TSTART / TX_USED paths. Plausible
and well-motivated, but I cannot prove either fires in
isolation on this hardware -- my 1 Hz trace shows TX
freezes, not which mechanism caused them.
* Patch 3 (TX stall watchdog) is the safety net that
empirically does the recovery work. 13 days of production
runtime on 24 nodes since 2026-05-02 in the same form
(anchored against the rpi-6.18.y vendor fork, in
raspberrypi/linux#7340 -- merged 2026-05-08 after review
feedback from pelwell that this v2 incorporates).
The v1 cover's "zero stalls in 95 node-hours of post-patch
uptime" framing was misleading. Apologies for that.
## What changed in v2
Patch 1 (PCIe posted-write flush after TSTART doorbell):
* Gates the readback behind a new MACB_CAPS_PCIE_POSTED_WRITES
capability, set only on raspberrypi_rp1_config. v1
applied the readback to every macb variant; SoC-integrated
parts (Atmel, Microchip, SiFive, Xilinx) have no posted-write
fabric and were paying the readback latency for no benefit.
* Commit message notes that the readback also flushes the
preceding macb_tx_lpi_wake() NCR write on the same path --
not just TSTART -- since it functions as a PCIe read barrier
for all prior posted writes by the same requester.
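To make the gating concrete, here is a minimal user-space sketch of
the idea, not the patch itself. Every name in it (model_dev,
model_ring_tstart, MODEL_CAPS_PCIE_POSTED_WRITES, ...) is
hypothetical, and the write FIFO merely stands in for a posted-write
fabric. The model gives both variants a FIFO purely so the effect of
the flag is visible; it assumes a read cannot complete until earlier
writes from the same requester have landed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; this models the idea, not macb itself. */
#define MODEL_CAPS_PCIE_POSTED_WRITES (1u << 0)
#define MODEL_NCR_TSTART              (1u << 9)
#define MODEL_FIFO_DEPTH              8

struct model_dev {
	uint32_t caps;                   /* capability flags for this variant */
	uint32_t ncr;                    /* NCR as the device currently sees it */
	uint32_t fifo[MODEL_FIFO_DEPTH]; /* NCR writes still in flight */
	unsigned int pending;            /* number of in-flight writes */
	unsigned int reads;              /* readbacks issued (the latency cost) */
};

/* A posted write returns at once; the device has not seen it yet. */
static void model_writel_ncr(struct model_dev *d, uint32_t val)
{
	assert(d->pending < MODEL_FIFO_DEPTH);
	d->fifo[d->pending++] = val;
}

/* A read is non-posted: every earlier write lands before it completes. */
static uint32_t model_readl_ncr(struct model_dev *d)
{
	for (unsigned int i = 0; i < d->pending; i++)
		d->ncr = d->fifo[i];
	d->pending = 0;
	d->reads++;
	return d->ncr;
}

/* Ring the doorbell; pay for the readback only where the flag is set. */
static void model_ring_tstart(struct model_dev *d)
{
	model_writel_ncr(d, MODEL_NCR_TSTART);
	if (d->caps & MODEL_CAPS_PCIE_POSTED_WRITES)
		(void)model_readl_ncr(d);
}

static bool model_device_saw_tstart(const struct model_dev *d)
{
	return (d->ncr & MODEL_NCR_TSTART) != 0;
}
```

With the flag set, model_ring_tstart() leaves no write in flight and
costs one read; with it clear, no read is issued at all, which is the
latency saving the gating is after.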
Patch 2 (PCIe read barrier before TX completion descriptor check):
* Dropped the ISR read. v1 read ISR in macb_tx_poll() with
`queue_readl(queue, ISR) & MACB_BIT(TCOMP)`; that's
destructive on RP1 silicon (MACB_CAPS_ISR_CLEAR_ON_WRITE
is not set on raspberrypi_rp1_config; the existing handler
assumes read-clear semantics and processes every bit
returned from queue_readl(queue, ISR) in one pass). v1's
masked-and-discarded read silently consumed any other bit
set in ISR at that instant -- RCOMP being the worst case
(RX completion never scheduled until the line re-asserts).
* v2 substitutes `(void)queue_readl(queue, IMR)` -- IMR is
the read-only interrupt mask mirror, no side effects, still
flushes prior peripheral DMA writes via PCIe completion
ordering. Loses the "directly sample latched TCOMP" half
of v1's claim; keeps the PCIe-barrier half, which is the
half that addresses the documented race in the existing
macb_tx_complete_pending() rmb() comment.
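The difference between the two reads can be sketched in user space as
follows. This is an illustrative model under stated assumptions, not
driver code: model_queue and the model_* helpers are hypothetical
names, ISR is modelled as clear-on-read as described above, and IMR
as a side-effect-free mirror. The bit positions are only
placeholders for the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, used only inside this model. */
#define MODEL_ISR_RCOMP (1u << 1)
#define MODEL_ISR_TCOMP (1u << 7)

struct model_queue {
	uint32_t isr; /* latched interrupt status, clear-on-read */
	uint32_t imr; /* interrupt mask mirror, read-only */
};

/* Reading ISR returns the latched bits and clears all of them. */
static uint32_t model_read_isr(struct model_queue *q)
{
	uint32_t val = q->isr;

	q->isr = 0;
	return val;
}

/* Reading IMR has no side effects on the queue state. */
static uint32_t model_read_imr(const struct model_queue *q)
{
	return q->imr;
}

/* v1-style probe: masks for TCOMP and discards every other bit read. */
static bool model_v1_tcomp_latched(struct model_queue *q)
{
	return (model_read_isr(q) & MODEL_ISR_TCOMP) != 0;
}

/* v2-style barrier: a read used purely for its ordering effect. */
static void model_v2_read_barrier(const struct model_queue *q)
{
	(void)model_read_imr(q);
}
```

If RCOMP and TCOMP are both latched when the v1-style probe runs, the
probe reports TCOMP and leaves ISR empty, so RCOMP is gone before any
handler sees it. The v2-style barrier leaves ISR untouched.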
Patch 3 (TX stall watchdog):
* Tail movement is tracked via a `bool tx_stall_tail_moved`
set by macb_tx_complete() under tx_ptr_lock when tail
advances, and cleared by the watchdog tick on the same lock.
v1 snapshotted tx_tail and compared between ticks; while
that worked correctly given tx_tail is free-running u32,
the bool form is unambiguously cleaner, doesn't depend on
counter behaviour, and is what pelwell asked for when he
reviewed the same series on the rpi side
(raspberrypi/linux#7340).
* netif_carrier_ok() gate added at the top of the watchdog
tick. Eliminates the boot-time false positive seen in v1
where, between macb_open() and link-autoneg-completion,
queue->tx_head can advance from kernel-queued packets while
tx_tail stays at 0 (no TCOMPs yet), tripping the snapshot
check. Observed 6 such fires during a 2026-05-02 fleet
rolling reboot.
* netdev_warn_once -> netdev_warn_ratelimited. v1's
netdev_warn_once made operational accounting impossible
after the first fire on a given netdev; ratelimited keeps
bounded log noise but lets operators count events. Andrea
asked for this directly.
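The detection logic described in the three points above can be
sketched as a small user-space model. It is a simplification under
stated assumptions rather than the patch: model_txq and the model_*
functions are hypothetical names, the carrier state is passed in as a
plain bool instead of calling netif_carrier_ok(), the locking under
tx_ptr_lock is omitted because the model is single-threaded, and the
recovery action itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one TX queue; indices are free-running u32. */
struct model_txq {
	uint32_t tx_head;   /* producer index, advanced on transmit */
	uint32_t tx_tail;   /* consumer index, advanced on completion */
	bool tail_moved;    /* set on completion, cleared by the tick */
};

/* Completion path: retire n descriptors and record that tail moved. */
static void model_tx_complete(struct model_txq *q, uint32_t n)
{
	/* Unsigned subtraction stays correct across u32 wrap-around. */
	assert(n <= q->tx_head - q->tx_tail);
	if (n) {
		q->tx_tail += n;
		q->tail_moved = true;
	}
}

/* Watchdog tick: returns true when the queue looks stalled. */
static bool model_watchdog_tick(struct model_txq *q, bool carrier_ok)
{
	bool moved, pending;

	/* Gate first: without a link, queued frames cannot complete. */
	if (!carrier_ok)
		return false;

	moved = q->tail_moved;
	q->tail_moved = false;
	pending = q->tx_head != q->tx_tail;

	/* Work is outstanding, yet nothing completed since the last tick. */
	return pending && !moved;
}
```

Fed the indices from the event quoted earlier (tail=259564431,
head=259564433), the first tick after a completion reports nothing,
and the next tick with no tail movement reports a stall. With the
carrier down the tick never fires, whatever the indices say.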
Patches 1 and 3 are independently revertable. Patch 2 v2 is a
two-line readback before an existing check; trivially revertable
in isolation, semantically dependent on the existing
macb_tx_complete_pending() recovery path that it strengthens.
## What I haven't done
* TSO+SG-off canary. rtheobald (cilium#43198 #4188846955)
and the launchpad #2133877 commenter (#34) both report
TSO+SG-off *together* mask the stall; my test matrix covers
TSO+GSO off, not TSO+SG off. Happy to canary-test this on one
node if reviewers want the data point before deciding which
of patches 1/2 the SG path actually exercises.
* Per-patch isolation testing. All three deployed
simultaneously on the 24-node fleet; I cannot independently
prove patch 1 or patch 2 does anything on its own. Patch 3
has direct production evidence (lost-TCOMP recovery
described above). If reviewers want a bisection-style
canary I can stagger one-patch / two-patch / three-patch
nodes for >=1 week each.
## Status and testing
* Mainline-anchored: v2 applies cleanly to current net-next
HEAD and builds clean. Boot-tested and briefly
sanity-checked in a canary build before sending.
* raspberrypi/linux rpi-6.18.y anchored equivalents: in
production on 24 nodes since 2026-05-02 (now 13 days); in
raspberrypi/linux master since 2026-05-08 (6 days).
* The v2 patch 2 IMR-barrier form was rolled to all 24 Pi
nodes earlier today (2026-05-14, ~14:00 UTC) as a
vendor-fork-anchored update. ~120 cumulative node-hours
of runtime since: zero mid-runtime TX stalls; zero user-space
watchdog RECOVER events. A reply with details accompanies
this series in the cover-letter thread.
The series does not depend on any other in-flight work. Happy
to split, rebase, drop, or restructure on feedback.
Lukasz Raczylo (3):
net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only)
net: macb: insert PCIe read barrier before TX completion descriptor
check
net: macb: add TX stall watchdog to recover from lost TCOMP interrupts
drivers/net/ethernet/cadence/macb.h | 14 ++++
drivers/net/ethernet/cadence/macb_main.c | 95 ++++++++++++++++++++++++
2 files changed, 109 insertions(+)
--
2.54.0
Thread overview: 13+ messages
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Lukasz Raczylo
2026-05-05 13:17 ` Andrea della Porta
2026-04-24 22:38 ` [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable in macb_tx_poll Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Lukasz Raczylo
2026-05-05 13:30 ` Andrea della Porta
2026-04-25 21:48 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-05-14 10:31 ` Théo Lebrun
2026-05-14 21:51 ` Lukasz Raczylo
2026-05-14 21:54 ` Lukasz Raczylo [this message]
2026-05-14 21:54 ` [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only) Lukasz Raczylo
2026-05-14 21:54 ` [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX completion descriptor check Lukasz Raczylo
2026-05-14 21:54 ` [PATCH net-next v2 3/3] net: macb: add TX stall watchdog to recover from lost TCOMP interrupts Lukasz Raczylo