Netdev List
 help / color / mirror / Atom feed
From: Lukasz Raczylo <lukasz@raczylo.com>
To: netdev@vger.kernel.org
Cc: Theo Lebrun <theo.lebrun@bootlin.com>,
	Andrea della Porta <andrea.porta@suse.com>,
	Nicolas Ferre <nicolas.ferre@microchip.com>,
	Claudiu Beznea <claudiu.beznea@tuxon.dev>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-rpi-kernel@lists.infradead.org
Subject: Re: [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
Date: Thu, 14 May 2026 22:51:29 +0100	[thread overview]
Message-ID: <20260514215129.33559-1-lukasz@raczylo.com> (raw)
In-Reply-To: <cover.1777064117.git.lukasz@raczylo.com>

Andrea, Théo --

Thanks both.  Replying to multiple points in one mail since they
intersect.

First a correction to the v1 cover: the "zero events post-patch"
claim was true only at the user-space watchdog visibility level.
With patch 3's warn made unconditional (which is what Andrea's
review re-tested with on v6.19.10), kernel-level evidence on my
side now matches what Andrea saw -- patches 1 and 2 are partial,
patch 3 is empirically the load-bearing fix on this platform.
The v2 cover acknowledges this and reframes.


## Andrea -- patches 1 and 3

Concrete data from a dmesg sweep across the 24-node fleet
(macb-v2 image, 2026-05-02 to 2026-05-05, 6.18.24 + this series
with netdev_warn_once swapped to netdev_warn):

  * 1 mid-runtime real stall, pi-data-02 2026-05-05T13:24:09Z,
    queue 0, tail=259564431 head=259564433 (2 stuck descriptors
    after ~260M TX).  HW ETHS tx_frames counter advanced through
    the event (~1535 fps mid-stall); driver tx_tail did not.
    Patch 3 watchdog re-kicked TSTART; user-space watchdog
    logged zero RECOVER events for that window -- kernel-level
    recovery completed before user-space noticed.  Consistent
    with the lost-TCOMP pattern.

  * 6 boot-time false positives during the 2026-05-02 rolling
    reboot, all tail=0 head=5-7.  Patch 3 v1 has a real bug:
    tx_stall_last_tail is initialised to tx_tail (=0) at
    macb_open(), first tick fires at +1000 ms; if autoneg
    hasn't completed by then, tail is still 0 from no-TCOMPs
    while head moves from kernel-queued packets, false stall
    fires.  v2 fixes this (details below).


## Théo -- the four questions on the cover

Welcome to macb maintenance.  Quick answers; raw 1 Hz traces,
dmesg dumps, and event logs available on request for any of
these.

  1. Tx-only or Tx+Rx broken?

     Tx-only at the MAC counter level.  Per-node 1 Hz trace
     captures the same signature on every wedge:
     /sys/class/net/end0/statistics/tx_packets freezes at a
     single second, rx_packets continues to grow, MAC IRQ
     counter (on the assigned CPU) continues to advance,
     NET_RX softirq counter continues to advance.  The
     "broken Tx & Rx" reports (lexfrei et al on cilium#43198)
     are the user-space symptom: TX dead -> TCP can't ACK ->
     nothing comes back -> looks bidirectionally dead from
     applications.  At the MAC counters the asymmetry is
     unambiguous.

  2. Recovery: link down/up vs module reload vs power cycle?

     On any system that's still otherwise responsive, link
     down/up suffices and is the lightest known recovery.  My
     user-space watchdog has run that primitive on 24 nodes
     for ~3 weeks; ~6-10 s wedge-to-Ready, every time.
     `modprobe -r macb && modprobe macb` (rtheobald) is a
     heavier-weight equivalent of the same thing -- works but
     unnecessary if link-toggle alone fixes the silicon state.
     "Power cycle only" (lexfrei) was on Ubuntu raspi 6.17 with
     the ondemand governor + frequent RCU stalls preceding the
     wedge (per launchpad bug body's "KEY OBSERVATION" section);
     my read is that case escalated past clean NIC freeze into
     kernel-wide unresponsiveness, so a NIC-level recovery
     couldn't reach it.

  3. TSO / SG / EEE disabling helped some reporters?

     Mixed picture, with one gap in my matrix.

     Tested fleet-wide since 2026-04-24 as baseline before this
     series:
       * EEE off (--set-eee end0 eee off + advertise 0x0)
       * TSO off (-K tso off)
       * GSO off (-K gso off)
       * RX/TX rings 4096/2048 (default 512)
       * IRQ affinity moved off CPU0 to CPU3
       * CPU governor schedutil (default) -- earlier tested
         performance, no change
       * qdisc fq -> pfifo_fast

     With all of those, the stall still fires.  Pre-patch
     fleet rate was multiple per hour with these knobs already
     applied.

     Untested by me: TSO + SG (scatter-gather, NETIF_F_SG) off
     *together*.  That is the specific combination rtheobald
     (cilium#43198 comment 4188846955) and the launchpad
     commenter (#34) report as masking the stall -- "must be
     both, not just one".  I tested TSO + GSO, not TSO + SG.

     Both patch 1 (PCIe-fabric race on TSTART) and patch 2
     (peripheral DMA retirement race on TX_USED) are
     consistent with descriptor-fragment-path interactions
     that SG-off would mask, so the workaround being real
     isn't surprising.  Closing the race rather than masking
     it should still be the right thing for mainline.  Happy
     to canary-test TSO+SG-off on one node if you want the
     data point before v2 review.

  4. cdns,*-max-pipe DT props -- lead dead?

     Yes, dead, per the commenter who pursued it (launchpad
     #2133877 comment #30, April 2026): they patched mainline
     6.18.18 to add the device_property_read calls for
     cdns,ar2r-max-pipe / cdns,aw2w-max-pipe / cdns,use-aw2b-fill
     *and* added those properties to bcm2712-rpi-5-b.dts; ran
     a few hours; their own follow-up: "this is a dud" -- node
     hung anyway.

     For reference, my build (raspberrypi/linux rpi-6.18.y @
     f2f68e79f16f) has the driver-side reads present but the
     DTS does not set the properties, so the AMP register
     defaults are in effect on my fleet and the bug still
     fires.  Two independent confirmations the DT props are
     not the fix.


## My own audit -- two issues v2 fixes

Worth disclosing before someone else catches them.

  * Patch 2 (v1) reads ISR in macb_tx_poll(), masks off
    everything except TCOMP, and discards the rest.
    raspberrypi_rp1_config does not set
    MACB_CAPS_ISR_CLEAR_ON_WRITE -- the driver assumes
    read-clear ISR semantics on RP1, and macb_interrupt()
    relies on processing every bit it reads in one pass for
    that case to be correct.  My v1 patch breaks that contract:
    any RCOMP / ROVR / TXUBR / etc. set in ISR at the moment of
    my read is silently consumed and never processed.  RCOMP
    being lost is the worst case (RX completion never scheduled
    until something else asserts the line).  Race window is
    ~200-500 ns per macb_tx_poll completion; significant at
    moderate-to-heavy RX load on a level-triggered IRQ where
    the consumed bit drops the line before GIC delivery.

    v2 patch 2 drops the ISR read entirely and substitutes
    (void)queue_readl(queue, IMR).  IMR is the read-only mask
    mirror, no side effects, still flushes prior peripheral
    DMA writes via PCIe ordering.  Loses the "directly sample
    latched TCOMP" half of v1's claim; keeps the PCIe-barrier
    half (which is the half that actually addresses the
    documented race in the existing macb_tx_complete_pending()
    rmb() comment).

  * Patch 3 (v1) boot-time false positive described above to
    Andrea.  v2 fixes:

      - netif_carrier_ok() gate -- no carrier, no completion is
        possible, don't fire.

      - tracks tail movement via a bool tx_stall_tail_moved set
        by macb_tx_complete() under tx_ptr_lock when tail
        advances, cleared by the watchdog tick on the same
        lock.  Form suggested by pelwell when he reviewed the
        same series anchored against rpi-6.18.y at
        raspberrypi/linux#7340 (merged 2026-05-08); 13 days of
        fleet runtime in this form since 2026-05-02 across 24
        nodes.  Folded into mainline v2 patch 3 directly rather
        than carried as a fix-up.

      - netdev_warn_once -> netdev_warn_ratelimited per
        Andrea's ask -- operator visibility doesn't disappear
        after the first fire.


## v2 follows

Sending [PATCH net-next v2 0/3] threaded under the v1 cover
shortly.  Plan to drop "RFC" from the subject prefix (~3 weeks
production runtime + the rpi in-tree merge tip the balance
toward a regular PATCH); happy to revert to RFC if you'd
prefer the iterative-review cadence.

Tested-on context for v2:
  * mainline-anchored:  builds clean against current net-next
    HEAD, applies cleanly.  Boot-tested in a canary build of
    my Talos image before this reply.
  * rpi-6.18.y-anchored equivalents:  in production on 24 nodes
    since 2026-05-02; in raspberrypi/linux master tip since
    2026-05-08.
  * The v2 patch 2 IMR-barrier form (the one that fixes the
    destructive ISR-read regression described above) was rolled
    to all 24 Pi nodes earlier today (2026-05-14, ~14:00 UTC) as
    a vendor-fork-anchored update.  ~120 cumulative node-hours
    of runtime since:  zero mid-runtime TX stalls; zero user-space
    watchdog RECOVER events; the 5 dmesg "TX stall detected" lines
    are all boot-time false positives of the form `tail=0 head=N`
    that the new `netif_carrier_ok()` gate in patch 3 eliminates.
    Pre-patch baseline rate (~0.5 stall/node-hour at fleet level)
    would have predicted on the order of 60 mid-runtime stalls in
    that window; observed is 0.

Thanks again to both for the review.

--
Lukasz Raczylo

  parent reply	other threads:[~2026-05-14 21:51 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Lukasz Raczylo
2026-05-05 13:17   ` Andrea della Porta
2026-04-24 22:38 ` [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable in macb_tx_poll Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Lukasz Raczylo
2026-05-05 13:30   ` Andrea della Porta
2026-04-25 21:48 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-05-14 10:31 ` Théo Lebrun
2026-05-14 21:51 ` Lukasz Raczylo [this message]
2026-05-14 21:54 ` [PATCH net-next v2 " Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only) Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX completion descriptor check Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 3/3] net: macb: add TX stall watchdog to recover from lost TCOMP interrupts Lukasz Raczylo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260514215129.33559-1-lukasz@raczylo.com \
    --to=lukasz@raczylo.com \
    --cc=andrea.porta@suse.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=claudiu.beznea@tuxon.dev \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rpi-kernel@lists.infradead.org \
    --cc=netdev@vger.kernel.org \
    --cc=nicolas.ferre@microchip.com \
    --cc=pabeni@redhat.com \
    --cc=theo.lebrun@bootlin.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox