From: Lukasz Raczylo <lukasz@raczylo.com>
To: netdev@vger.kernel.org
Cc: Theo Lebrun <theo.lebrun@bootlin.com>,
	Andrea della Porta <andrea.porta@suse.com>,
	Nicolas Ferre <nicolas.ferre@microchip.com>,
	Claudiu Beznea <claudiu.beznea@tuxon.dev>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-rpi-kernel@lists.infradead.org
Subject: [PATCH net-next v2 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
Date: Thu, 14 May 2026 22:54:56 +0100
Message-ID: <20260514215459.36109-1-lukasz@raczylo.com>
In-Reply-To: <cover.1777064117.git.lukasz@raczylo.com>

Hi netdev, Théo, Andrea, linux-rpi,

v2 of the silent TX stall series.  The v1 RFC sits at:

  https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/

Reframing first.  The v1 cover claimed "zero events post-patch";
that was true only at the level visible to the user-space watchdog.
A dmesg sweep prompted by Andrea's review -- with patch 3's warn
made unconditional, per his ask -- revealed kernel-level evidence
that patches 1 and 2 are partial at best.  Patch 3 is empirically
the load-bearing fix on this platform: it caught and recovered a
real lost-TCOMP stall on pi-data-02 at 2026-05-05T13:24:09Z
(queue 0, tail=259564431 head=259564433 after ~260M TX, HW
ETHS tx_frames counter advancing through the event while driver
tx_tail did not) without user-space involvement.

So the v2 narrative reads:

  * Patch 1 (PCIe posted-write flush) and patch 2 (PCIe read
    barrier before descriptor check) close two specific
    candidate races in the TSTART / TX_USED paths.  Plausible
    and well-motivated, but I cannot prove either fires in
    isolation on this hardware -- my 1 Hz trace shows TX
    freezes, not which mechanism caused them.

  * Patch 3 (TX stall watchdog) is the safety net that
    empirically does the recovery work.  It has 13 days of
    production runtime on 24 nodes (since 2026-05-02) in the
    same form, anchored against the rpi-6.18.y vendor fork in
    raspberrypi/linux#7340, which was merged 2026-05-08 after
    review feedback from pelwell that this v2 incorporates.

The v1 cover's "zero stalls in 95 node-hours of post-patch
uptime" framing was misleading.  Apologies for that.

## What changed in v2

Patch 1 (PCIe posted-write flush after TSTART doorbell):
  * Gates the readback behind a new MACB_CAPS_PCIE_POSTED_WRITES
    capability, set only on raspberrypi_rp1_config.  v1
    applied the readback to every macb variant; SoC-integrated
    parts (Atmel, Microchip, SiFive, Xilinx) have no posted-write
    fabric and were paying the readback latency for no benefit.
  * Commit message notes that the readback also flushes the
    preceding macb_tx_lpi_wake() NCR write on the same path --
    not just TSTART -- since it functions as a PCIe read barrier
    for all prior posted writes by the same requester.
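
For illustration only (this is not the patch): a user-space sketch of
why a capability-gated readback after the doorbell write matters on a
posted-write fabric.  Everything here is a model.  struct model_dev,
the model_* helpers, REG_SCRATCH, the FIFO depth and the bit values are
assumptions made for the example; only the idea of gating the readback
behind a capability flag comes from the description above.

```c
/* User-space model of a posted-write fabric.  Not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MACB_CAPS_PCIE_POSTED_WRITES. */
#define CAPS_PCIE_POSTED_WRITES (1u << 0)
/* Illustrative doorbell bit; the value is an assumption of this model. */
#define NCR_TSTART (1u << 9)
#define FIFO_DEPTH 8

enum { REG_NCR, REG_SCRATCH, NUM_REGS };

struct posted_write {
	unsigned int reg;
	uint32_t val;
};

struct model_dev {
	uint32_t caps;
	uint32_t regs[NUM_REGS];              /* what the device has seen */
	struct posted_write fifo[FIFO_DEPTH]; /* writes still in flight */
	unsigned int pending;
};

/* A posted write returns immediately; the device has not seen it yet. */
static void model_writel(struct model_dev *d, unsigned int reg, uint32_t val)
{
	assert(d->pending < FIFO_DEPTH && reg < NUM_REGS);
	d->fifo[d->pending++] = (struct posted_write){ reg, val };
}

/* A read is non-posted: every earlier posted write lands first, in order. */
static uint32_t model_readl(struct model_dev *d, unsigned int reg)
{
	assert(reg < NUM_REGS);
	for (unsigned int i = 0; i < d->pending; i++)
		d->regs[d->fifo[i].reg] = d->fifo[i].val;
	d->pending = 0;
	return d->regs[reg];
}

/* Ring the doorbell; read back only when the capability flag is set. */
static void model_start_tx(struct model_dev *d)
{
	model_writel(d, REG_NCR, NCR_TSTART);
	if (d->caps & CAPS_PCIE_POSTED_WRITES)
		(void)model_readl(d, REG_NCR);
}

static bool model_device_saw_tstart(const struct model_dev *d)
{
	return (d->regs[REG_NCR] & NCR_TSTART) != 0;
}
```

In this model, a device with the flag set has seen TSTART (and any
earlier posted write) by the time model_start_tx() returns, while the
same fabric without the flag still has the doorbell in flight.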

Patch 2 (PCIe read barrier before TX completion descriptor check):
  * Dropped the ISR read.  v1 read ISR in macb_tx_poll() with
    `queue_readl(queue, ISR) & MACB_BIT(TCOMP)`; that's
    destructive on RP1 silicon (MACB_CAPS_ISR_CLEAR_ON_WRITE
    is not set on raspberrypi_rp1_config; the existing handler
    assumes read-clear semantics and processes every bit
    returned from queue_readl(queue, ISR) in one pass).  v1's
    masked-and-discarded read silently consumed any other bit
    set in ISR at that instant -- RCOMP being the worst case
    (RX completion never scheduled until the line re-asserts).
  * v2 substitutes `(void)queue_readl(queue, IMR)` -- IMR is
    the read-only interrupt mask mirror, no side effects, still
    flushes prior peripheral DMA writes via PCIe completion
    ordering.  Loses the "directly sample latched TCOMP" half
    of v1's claim; keeps the PCIe-barrier half, which is the
    half that addresses the documented race in the existing
    macb_tx_complete_pending() rmb() comment.
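
For illustration only: a user-space sketch of the difference between
the v1 masked ISR read and the v2 IMR read on a read-to-clear status
register.  struct model_irq, the helper names and the bit positions are
assumptions made for this example, not driver definitions.

```c
/* User-space model of a read-to-clear interrupt status register. */
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions; treat the values as assumptions. */
#define INT_RCOMP (1u << 1)
#define INT_TCOMP (1u << 7)

struct model_irq {
	uint32_t isr; /* latched status; a read returns and clears it */
	uint32_t imr; /* mask mirror; reading it has no side effects */
};

static uint32_t model_read_isr(struct model_irq *m)
{
	uint32_t v = m->isr;

	m->isr = 0; /* read-to-clear: every latched bit is consumed */
	return v;
}

static uint32_t model_read_imr(const struct model_irq *m)
{
	return m->imr;
}

/* v1 style: sample one bit and drop the rest of the read value. */
static bool model_v1_tcomp_check(struct model_irq *m)
{
	return (model_read_isr(m) & INT_TCOMP) != 0;
}

/* v2 style: the read is issued only for its ordering effect. */
static void model_v2_barrier(const struct model_irq *m)
{
	(void)model_read_imr(m);
}
```

With both RCOMP and TCOMP latched, model_v1_tcomp_check() reports TCOMP
but leaves the status register empty, so the RCOMP event is gone;
model_v2_barrier() leaves both bits latched for the normal handler.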

Patch 3 (TX stall watchdog):
  * Tail movement is tracked via a `bool tx_stall_tail_moved`
    set by macb_tx_complete() under tx_ptr_lock when tail
    advances, and cleared by the watchdog tick on the same lock.
    v1 snapshotted tx_tail and compared between ticks; while
    that worked correctly given tx_tail is free-running u32,
    the bool form is unambiguously cleaner, doesn't depend on
    counter behaviour, and is what pelwell asked for when he
    reviewed the same series on the rpi side
    (raspberrypi/linux#7340).
  * netif_carrier_ok() gate added at the top of the watchdog
    tick.  Eliminates the boot-time false positive seen in v1
    where, between macb_open() and link-autoneg-completion,
    queue->tx_head can advance from kernel-queued packets while
    tx_tail stays at 0 (no TCOMPs yet), tripping the snapshot
    check.  Observed 6 such fires during a 2026-05-02 fleet
    rolling reboot.
  * netdev_warn_once -> netdev_warn_ratelimited.  v1's
    netdev_warn_once made operational accounting impossible
    after the first fire on a given netdev; ratelimited keeps
    bounded log noise but lets operators count events.  Andrea
    asked for this directly.
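
For illustration only: a single-threaded user-space sketch of the
moved-flag watchdog logic described above.  struct model_queue, the
model_* helpers and enum wd_result are hypothetical names; locking
(tx_ptr_lock in the driver) is omitted, and the model assumes the tick
period is long compared with normal completion latency.  What the
carrier gate does with the flag is also an assumption here: the model
returns early and leaves the state untouched.

```c
/* User-space model of a TX stall watchdog using a "tail moved" flag. */
#include <stdbool.h>
#include <stdint.h>

enum wd_result { WD_IDLE, WD_PROGRESS, WD_STALL };

struct model_queue {
	uint32_t tx_head; /* free-running producer index */
	uint32_t tx_tail; /* free-running consumer index */
	bool tail_moved;  /* set on completion, cleared by the tick */
};

/* Unsigned subtraction stays correct across u32 wraparound. */
static uint32_t model_outstanding(const struct model_queue *q)
{
	return (uint32_t)(q->tx_head - q->tx_tail);
}

/* Completion path: advance the tail and record that it moved. */
static void model_tx_complete(struct model_queue *q, uint32_t n)
{
	if (n) {
		q->tx_tail += n;
		q->tail_moved = true;
	}
}

/* One watchdog tick.  WD_STALL means "work pending, no progress". */
static enum wd_result model_watchdog_tick(struct model_queue *q,
					  bool carrier_ok)
{
	bool moved;

	if (!carrier_ok)
		return WD_IDLE; /* link down: no completions expected */

	moved = q->tail_moved;
	q->tail_moved = false;

	if (model_outstanding(q) == 0)
		return WD_IDLE;
	return moved ? WD_PROGRESS : WD_STALL;
}
```

Feeding the model the indices from the pi-data-02 event
(tail=259564431, head=259564433) with no tail movement since the last
tick yields WD_STALL with two descriptors outstanding; the same state
with carrier down yields WD_IDLE.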

Patches 1 and 3 are independently revertable.  Patch 2 v2 is a
two-line readback before an existing check; trivially revertable
in isolation, semantically dependent on the existing
macb_tx_complete_pending() recovery path that it strengthens.

## What I haven't done

  * TSO+SG-off canary.  rtheobald (cilium#43198 #4188846955)
    and the launchpad #2133877 commenter (#34) both report
    TSO+SG-off *together* mask the stall; my matrix has TSO+GSO
    tested off, not TSO+SG.  Happy to canary-test this on one
    node if reviewers want the data point before deciding which
    of patches 1/2 the SG path actually exercises.

  * Per-patch isolation testing.  All three deployed
    simultaneously on the 24-node fleet; I cannot independently
    prove patch 1 or patch 2 does anything on its own.  Patch 3
    has direct production evidence (lost-TCOMP recovery
    described above).  If reviewers want a bisection-style
    canary I can stagger one-patch / two-patch / three-patch
    nodes for >=1 week each.

## Status and testing

  * Mainline-anchored:  v2 applies and builds cleanly against
    current net-next HEAD.  Boot-tested and briefly sanity-checked
    in a canary build before sending.
  * raspberrypi/linux rpi-6.18.y anchored equivalents:  in
    production on 24 nodes since 2026-05-02 (now 13 days); in
    raspberrypi/linux master since 2026-05-08 (6 days).
  * The v2 patch 2 IMR-barrier form was rolled to all 24 Pi
    nodes earlier today (2026-05-14, ~14:00 UTC) as a
    vendor-fork-anchored update.  ~120 cumulative node-hours
    of runtime since: zero mid-runtime TX stalls; zero user-space
    watchdog RECOVER events.  A reply in the cover-letter thread
    with the details accompanies this series.

The series does not depend on any other in-flight work.  Happy
to split, rebase, drop, or restructure on feedback.

Lukasz Raczylo (3):
  net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only)
  net: macb: insert PCIe read barrier before TX completion descriptor
    check
  net: macb: add TX stall watchdog to recover from lost TCOMP interrupts

 drivers/net/ethernet/cadence/macb.h      | 14 ++++
 drivers/net/ethernet/cadence/macb_main.c | 95 ++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

-- 
2.54.0


Thread overview: 13+ messages
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Lukasz Raczylo
2026-05-05 13:17   ` Andrea della Porta
2026-04-24 22:38 ` [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable in macb_tx_poll Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Lukasz Raczylo
2026-05-05 13:30   ` Andrea della Porta
2026-04-25 21:48 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-05-14 10:31 ` Théo Lebrun
2026-05-14 21:51 ` Lukasz Raczylo
2026-05-14 21:54 ` Lukasz Raczylo [this message]
2026-05-14 21:54   ` [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only) Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX completion descriptor check Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 3/3] net: macb: add TX stall watchdog to recover from lost TCOMP interrupts Lukasz Raczylo
