Re: [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* Re: [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell
       [not found] ` <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
@ 2026-05-05 13:17   ` Andrea della Porta
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea della Porta @ 2026-05-05 13:17 UTC (permalink / raw)
  To: Lukasz Raczylo
  Cc: netdev, Nicolas Ferre, Claudiu Beznea, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, linux-arm-kernel, linux-rpi-kernel

Hi Lukasz,

On 23:38 Fri 24 Apr     , Lukasz Raczylo wrote:
> macb_start_xmit() and macb_tx_restart() kick transmission by
> OR-ing MACB_BIT(TSTART) into NCR.  On PCIe-attached macb instances
> (BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
> have in front of us), writes to NCR are posted PCIe writes: they
> are not guaranteed to reach the device before the issuing CPU
> returns.  If the TSTART doorbell does not reach the MAC, no TX
> begins, no TCOMP completion arrives, and the ring remains
> quiescent without any kernel-visible indication.
> 
> Note that the raspberrypi/linux vendor fork carries a local patch
> around the TSTART site (a queue->tx_pending breadcrumb that is
> promoted to queue->txubr_pending by the next TCOMP interrupt,
> triggering macb_tx_restart()).  That workaround makes the loss
> recoverable under traffic, but it cannot help if TCOMP itself is
> not raised because no TX started -- which is exactly the case we
> are targeting here.  The handshake is not present in mainline.
> 
> Add a read-back of NCR after each TSTART write in macb_start_xmit()
> and macb_tx_restart().  The read is an architected PCIe read
> barrier for earlier posted writes on the same path; it ensures the
> doorbell has reached the MAC before the functions return.
> 
> We do not yet have direct hardware evidence that TSTART is being
> lost on the RP1 path (that would require a PCIe protocol analyser,
> or at minimum a before/after counter on queue->tx_stall_last_tail
> with and without this patch applied in isolation).  This patch is
> one of a three-patch series ("candidate fixes for silent TX stall
> on BCM2712/RP1"); see the cover letter for context.  We have
> verified the series compiles and applies cleanly against mainline
> HEAD and against raspberrypi/linux rpi-6.18.y @ f2f68e79f16f;
> runtime verification is pending.
> 
> Link: https://github.com/cilium/cilium/issues/43198
> Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
> Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
> ---
>  drivers/net/ethernet/cadence/macb_main.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index a12aa2124..b6cca55ad 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -1922,6 +1922,13 @@ static void macb_tx_restart(struct macb_queue *queue)
>  
>  	spin_lock(&bp->lock);
>  	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
> +	/*
> +	 * Flush the PCIe posted-write queue so the TSTART doorbell
> +	 * reliably reaches the MAC.  Without this, the write can sit
> +	 * in the fabric and the MAC never advances, causing a silent
> +	 * TX stall.
> +	 */
> +	(void)macb_readl(bp, NCR);

Do you expect it to be a temporal issue (in the comment you stated that the write should be delivered
before the function returns) or is it just a dropped packet error? In the former case can you please
explain why? If it's the latter, I'd expect the dropped packet to increment the counter in the AER,
and I'm pretty sure it didn't when I performed my tests.

Many thanks,
Andrea

>  	spin_unlock(&bp->lock);
>  
>  out_tx_ptr_unlock:
> @@ -2560,6 +2567,11 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	spin_lock(&bp->lock);
>  	macb_tx_lpi_wake(bp);
>  	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
> +	/*
> +	 * Flush the PCIe posted-write queue; see the comment in
> +	 * macb_tx_restart() for the reasoning.
> +	 */
> +	(void)macb_readl(bp, NCR);
>  	spin_unlock(&bp->lock);
>  
>  	if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1)
> -- 
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>]

* Re: [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net
       [not found] ` <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>
@ 2026-05-05 13:30   ` Andrea della Porta
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea della Porta @ 2026-05-05 13:30 UTC (permalink / raw)
  To: Lukasz Raczylo
  Cc: netdev, Nicolas Ferre, Claudiu Beznea, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, linux-arm-kernel, linux-rpi-kernel

Hi Lukasz,

On 23:38 Fri 24 Apr     , Lukasz Raczylo wrote:
> Patches 1/3 and 2/3 address two candidate races that could lead
> to a TCOMP completion being missed on PCIe-attached macb
> instances.  This patch adds a defence-in-depth safety net, in
> case a further race remains that we have not identified.
> 
> The watchdog is a per-queue delayed_work that runs once per
> second.  It snapshots queue->tx_tail; if the ring is non-empty
> (queue->tx_head != queue->tx_tail) and tx_tail has not advanced
> since the previous tick, it calls macb_tx_restart().
> 
> No new recovery logic is introduced.  macb_tx_restart() already
> exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
> and verifies that the hardware's TBQP is behind the driver's
> head index before re-asserting TSTART.  On a healthy ring it is
> a no-op at the hardware level; the watchdog only supplies the
> missing trigger.
> 
> On a healthy queue the per-tick cost is one spin_lock_irqsave()
> / spin_unlock_irqrestore() and one branch.  The delayed_work is
> only scheduled between macb_open() and macb_close(), and is
> cancelled synchronously on close.
> 
> Context for submission: on our 24-node Raspberry Pi 5 fleet,
> before this series, an out-of-band user-space watchdog
> (monitoring tx_packets from /sys/class/net/.../statistics and
> toggling the link down/up when it froze) was required to keep
> nodes usable.  We include this kernel-side watchdog as a cleaner
> in-kernel equivalent for any residual stall that patches 1 and
> 2 do not cover.  We are willing to drop this patch if the view
> is that 1 and 2 should stand alone.
> 
> Link: https://github.com/cilium/cilium/issues/43198
> Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
> Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
> ---
>  drivers/net/ethernet/cadence/macb.h      |  5 ++
>  drivers/net/ethernet/cadence/macb_main.c | 59 ++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index 2de56017e..9115f2b47 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -1278,6 +1278,11 @@ struct macb_queue {
>  	dma_addr_t		tx_ring_dma;
>  	struct work_struct	tx_error_task;
>  	bool			txubr_pending;
> +
> +	/* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c */
> +	struct delayed_work	tx_stall_watchdog_work;
> +	unsigned int		tx_stall_last_tail;
> +
>  	struct napi_struct	napi_tx;
>  
>  	dma_addr_t		rx_ring_dma;
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index ea231b1c5..ea2306ef7 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -2002,6 +2002,59 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
>  	return work_done;
>  }
>  
> +#define MACB_TX_STALL_INTERVAL_MS	1000
> +
> +/*
> + * TX stall watchdog.
> + *
> + * Defence-in-depth against lost TCOMP interrupts.  macb already has a
> + * recovery chain (tx_pending -> txubr_pending -> macb_tx_restart())
> + * that fires on TCOMP; if TCOMP itself is lost the TX ring stalls
> + * silently until something else kicks TSTART.  This watchdog runs
> + * once per second per queue, snapshots tx_tail, and calls
> + * macb_tx_restart() if the ring is non-empty and tx_tail has not
> + * advanced since the previous tick.
> + *
> + * macb_tx_restart() already checks the hardware's TBQP against the
> + * driver's head index before re-asserting TSTART, so on a healthy
> + * ring this is a no-op at the hardware level.  The watchdog only
> + * adds the missing trigger.
> + */
> +static void macb_tx_stall_watchdog(struct work_struct *work)
> +{
> +	struct macb_queue *queue = container_of(to_delayed_work(work),
> +						struct macb_queue,
> +						tx_stall_watchdog_work);
> +	struct macb *bp = queue->bp;
> +	unsigned int cur_tail, cur_head;
> +	bool stalled = false;
> +	unsigned long flags;
> +
> +	if (!netif_running(bp->dev))
> +		return;
> +
> +	spin_lock_irqsave(&queue->tx_ptr_lock, flags);
> +	cur_tail = queue->tx_tail;
> +	cur_head = queue->tx_head;
> +	if (cur_head != cur_tail &&
> +	    cur_tail == queue->tx_stall_last_tail)
> +		stalled = true;
> +	else
> +		queue->tx_stall_last_tail = cur_tail;
> +	spin_unlock_irqrestore(&queue->tx_ptr_lock, flags);
> +
> +	if (stalled) {
> +		netdev_warn_once(bp->dev,
> +				 "TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n",
> +				 (unsigned int)(queue - bp->queues),
> +				 cur_tail, cur_head);
> +		macb_tx_restart(queue);
> +	}
> +
> +	schedule_delayed_work(&queue->tx_stall_watchdog_work,
> +			      msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
> +}
> +
>  static void macb_hresp_error_task(struct work_struct *work)
>  {
>  	struct macb *bp = from_work(bp, work, hresp_err_bh_work);
> @@ -3190,6 +3243,9 @@ static int macb_open(struct net_device *dev)
>  	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>  		napi_enable(&queue->napi_rx);
>  		napi_enable(&queue->napi_tx);
> +		queue->tx_stall_last_tail = queue->tx_tail;
> +		schedule_delayed_work(&queue->tx_stall_watchdog_work,
> +				      msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
>  	}
>  
>  	macb_init_hw(bp);
> @@ -3240,6 +3296,7 @@ static int macb_close(struct net_device *dev)
>  	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>  		napi_disable(&queue->napi_rx);
>  		napi_disable(&queue->napi_tx);
> +		cancel_delayed_work_sync(&queue->tx_stall_watchdog_work);
>  		netdev_tx_reset_queue(netdev_get_tx_queue(dev, q));
>  	}
>  
> @@ -4802,6 +4859,8 @@ static int macb_init_dflt(struct platform_device *pdev)
>  		}
>  
>  		INIT_WORK(&queue->tx_error_task, macb_tx_error_task);
> +		INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work,
> +				  macb_tx_stall_watchdog);
>  		q++;
>  	}
>  
> -- 
> 2.53.0
>

I've applied all three patches to v6.19.10 changing netdev_warn_once() from this one to
netdev_warn() and it the "TX stall" warning appears several time. So it seems that there
could be another cause escaping the filtering in the first two patches.

Interestingly enough, running the same tests after substituing the entire macb driver with
the downstream version works ok.

Not sure how to interpret these results since they seem to be the opposite of yours.
More investigation is ongoing from my side.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
       [not found] <cover.1777064117.git.lukasz@raczylo.com>
       [not found] ` <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
       [not found] ` <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>
@ 2026-05-14 10:31 ` Théo Lebrun
  2026-05-14 21:51 ` Lukasz Raczylo
  2026-05-14 21:54 ` [PATCH net-next v2 " Lukasz Raczylo
  4 siblings, 0 replies; 8+ messages in thread
From: Théo Lebrun @ 2026-05-14 10:31 UTC (permalink / raw)
  To: Lukasz Raczylo, netdev
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel,
	linux-arm-kernel, linux-rpi-kernel

Hello Lukasz,

On Sat Apr 25, 2026 at 12:38 AM CEST, Lukasz Raczylo wrote:
> This series proposes three candidate fixes for the silent TX stall
> observed on Raspberry Pi 5 (BCM2712 SoC, Cadence GEM via RP1 PCIe
> south bridge).  The bug has been reported, with reproducers, at:

I've taken over the MACB driver maintenance following Clandiu & Nicolas'
work. I have read with curiosity your series and links attached
(though I skimmed over parts because there's been a lot of discussion).

Have you moved forward since your initial post? I've seen your
still-working-on-it message from May 8th on the rpi kernel PR. I have a
few remarks and/or questions:

 - You still think the two fix patches solve it? Any clearer picture of
   which of the two patches inbetween [1/3] or[2/3] fixes it? Does
   [3/3] ever trigger on your targets?

 - Can you clarify the exact symptoms? I've seen a few contradictory
   facts. Two I remember:

    - You say here it is a Tx stall but I've seen messages in the linked
      threads that say explicitely broken Tx & Rx.
      https://github.com/cilium/cilium/issues/43198#issue-3706713821

    - You say here link down/up fixes it, but there is a comment that
      says they unload/reload the module (rtheobald). They don't say
      explicitely that link down/up doesn't work for them though, but
      someone before in the thread recommended link down/up. Another
      one says "Only power cycle recovers the node" (lexfrei).

    - Also, some messages point out disabling TSO / SG / EEE helped it.
      Any comment on that? It would help point fingers.

    - Some comments are about DT props missing. Is that lead dead now?

 - I've seen no mention of the bug having been reproduced on upstream
   kernel (?). What does the rpi kernel bring to the table that makes
   everyone use it?

 - Anything was found to increase the reproducibility of the bug? If it
   was then a bisect could be made possible, as I've seen mentions that
   it didn't appear on some older kernels.

Now about the patches:

> Reading the current driver we identified three plausible races
> between driver and hardware, each of which could independently
> produce the observed behaviour.  We did not determine which is the
> actual root cause -- that likely requires either BCM2712/RP1
> documentation we do not have, or dynamic tracing of the driver
> during an in-situ stall.  The series therefore attempts to close
> all three, with each commit message stating which specific race
> that patch is targeting.
>
>   Patch 1/3 -- flush PCIe posted write after TSTART doorbell.
>   Writes to NCR are posted PCIe writes and may not reach the MAC
>   before the driver returns.  If the TSTART doorbell is lost, no
>   TX starts, no TCOMP arrives, and the ring goes quiescent.  A
>   read-back of NCR after the write is a standard read-after-write
>   PCIe flush.

 - Makes sense, but only on MACB mounted over PCI, which is not the
   majority.
 - IDK if we can do better than a readl(NCR) on all platforms.
 - I am surprised it is the only writel() that needs to be flushed?

>   Patch 2/3 -- re-check ISR after IER re-enable in macb_tx_poll().
>   An existing comment in macb_tx_poll() notes that completions
>   raised while TCOMP is masked do not re-fire when IER is
>   re-enabled, and mitigates the window with macb_tx_complete_pending(),
>   which inspects driver-visible ring state only (after rmb()).  On
>   PCIe-attached parts the descriptor DMA write that sets TX_USED
>   can remain in flight when that check runs; the rmb() orders CPU
>   writes but does not retire peripheral DMA.  Reading ISR directly
>   after IER re-enable addresses this in two ways: (a) the MMIO read
>   is an architected PCIe read barrier for prior DMA writes, so a
>   subsequent macb_tx_complete_pending() sees up-to-date TX_USED
>   state; (b) it directly observes a pending TCOMP bit if the
>   hardware has one set.  Either signal reschedules NAPI.

This will not fly because ISR might be read-to-clear.
See macb_queue_isr_clear() and how it is used. So we cannot re-read ISR
safely on those platforms.

>   Patch 3/3 -- TX stall watchdog.  Defence-in-depth.  If patches
>   1 and 2 close the races we identified, this patch performs a
>   single spin_lock_irqsave/unlock and a branch per queue per
>   second with no other effect.  If a further race remains that we
>   have not identified, it invokes the driver's own existing
>   macb_tx_restart(), which already verifies that TBQP is behind
>   tx_head before re-asserting TSTART.  We include this patch
>   because we have empirically observed multi-minute stalls on this
>   hardware; we are willing to drop it if the preference is for
>   1 and 2 to stand alone.

Good idea, but that is what ndo_tx_timeout() is meant for no? It is a
mechanism that is not specific to our HW so that should be implemented
at the subsystem level, and it looks like it already is. :-)

We are aware of a few software scheduling races that we plan on fixing.
If your above patches ended up not fixing the issue, you could look
into those.
https://lore.kernel.org/netdev/DHIT9TPJQJ46.21A89R5UAFXVH@bootlin.com/

Thanks!

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
       [not found] <cover.1777064117.git.lukasz@raczylo.com>
                   ` (2 preceding siblings ...)
  2026-05-14 10:31 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Théo Lebrun
@ 2026-05-14 21:51 ` Lukasz Raczylo
  2026-05-14 21:54 ` [PATCH net-next v2 " Lukasz Raczylo
  4 siblings, 0 replies; 8+ messages in thread
From: Lukasz Raczylo @ 2026-05-14 21:51 UTC (permalink / raw)
  To: netdev
  Cc: Theo Lebrun, Andrea della Porta, Nicolas Ferre, Claudiu Beznea,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel, linux-arm-kernel, linux-rpi-kernel

Andrea, Théo --

Thanks both.  Replying to multiple points in one mail since they
intersect.

First a correction to the v1 cover: the "zero events post-patch"
claim was true only at the user-space watchdog visibility level.
With patch 3's warn made unconditional (which is what Andrea's
review re-tested with on v6.19.10), kernel-level evidence on my
side now matches what Andrea saw -- patches 1 and 2 are partial,
patch 3 is empirically the load-bearing fix on this platform.
The v2 cover acknowledges this and reframes.


## Andrea -- patches 1 and 3

Concrete data from a dmesg sweep across the 24-node fleet
(macb-v2 image, 2026-05-02 to 2026-05-05, 6.18.24 + this series
with netdev_warn_once swapped to netdev_warn):

  * 1 mid-runtime real stall, pi-data-02 2026-05-05T13:24:09Z,
    queue 0, tail=259564431 head=259564433 (2 stuck descriptors
    after ~260M TX).  HW ETHS tx_frames counter advanced through
    the event (~1535 fps mid-stall); driver tx_tail did not.
    Patch 3 watchdog re-kicked TSTART; user-space watchdog
    logged zero RECOVER events for that window -- kernel-level
    recovery completed before user-space noticed.  Consistent
    with the lost-TCOMP pattern.

  * 6 boot-time false positives during the 2026-05-02 rolling
    reboot, all tail=0 head=5-7.  Patch 3 v1 has a real bug:
    tx_stall_last_tail is initialised to tx_tail (=0) at
    macb_open(), first tick fires at +1000 ms; if autoneg
    hasn't completed by then, tail is still 0 from no-TCOMPs
    while head moves from kernel-queued packets, false stall
    fires.  v2 fixes this (details below).


## Théo -- the four questions on the cover

Welcome to macb maintenance.  Quick answers; raw 1 Hz traces,
dmesg dumps, and event logs available on request for any of
these.

  1. Tx-only or Tx+Rx broken?

     Tx-only at the MAC counter level.  Per-node 1 Hz trace
     captures the same signature on every wedge:
     /sys/class/net/end0/statistics/tx_packets freezes at a
     single second, rx_packets continues to grow, MAC IRQ
     counter (on the assigned CPU) continues to advance,
     NET_RX softirq counter continues to advance.  The
     "broken Tx & Rx" reports (lexfrei et al on cilium#43198)
     are the user-space symptom: TX dead -> TCP can't ACK ->
     nothing comes back -> looks bidirectionally dead from
     applications.  At the MAC counters the asymmetry is
     unambiguous.

  2. Recovery: link down/up vs module reload vs power cycle?

     On any system that's still otherwise responsive, link
     down/up suffices and is the lightest known recovery.  My
     user-space watchdog has run that primitive on 24 nodes
     for ~3 weeks; ~6-10 s wedge-to-Ready, every time.
     `modprobe -r macb && modprobe macb` (rtheobald) is a
     heavier-weight equivalent of the same thing -- works but
     unnecessary if link-toggle alone fixes the silicon state.
     "Power cycle only" (lexfrei) was on Ubuntu raspi 6.17 with
     the ondemand governor + frequent RCU stalls preceding the
     wedge (per launchpad bug body's "KEY OBSERVATION" section);
     my read is that case escalated past clean NIC freeze into
     kernel-wide unresponsiveness, so a NIC-level recovery
     couldn't reach it.

  3. TSO / SG / EEE disabling helped some reporters?

     Mixed picture, with one gap in my matrix.

     Tested fleet-wide since 2026-04-24 as baseline before this
     series:
       * EEE off (--set-eee end0 eee off + advertise 0x0)
       * TSO off (-K tso off)
       * GSO off (-K gso off)
       * RX/TX rings 4096/2048 (default 512)
       * IRQ affinity moved off CPU0 to CPU3
       * CPU governor schedutil (default) -- earlier tested
         performance, no change
       * qdisc fq -> pfifo_fast

     With all of those, the stall still fires.  Pre-patch
     fleet rate was multiple per hour with these knobs already
     applied.

     Untested by me: TSO + SG (scatter-gather, NETIF_F_SG) off
     *together*.  That is the specific combination rtheobald
     (cilium#43198 comment 4188846955) and the launchpad
     commenter (#34) report as masking the stall -- "must be
     both, not just one".  I tested TSO + GSO, not TSO + SG.

     Both patch 1 (PCIe-fabric race on TSTART) and patch 2
     (peripheral DMA retirement race on TX_USED) are
     consistent with descriptor-fragment-path interactions
     that SG-off would mask, so the workaround being real
     isn't surprising.  Closing the race rather than masking
     it should still be the right thing for mainline.  Happy
     to canary-test TSO+SG-off on one node if you want the
     data point before v2 review.

  4. cdns,*-max-pipe DT props -- lead dead?

     Yes, dead, per the commenter who pursued it (launchpad
     #2133877 comment #30, April 2026): they patched mainline
     6.18.18 to add the device_property_read calls for
     cdns,ar2r-max-pipe / cdns,aw2w-max-pipe / cdns,use-aw2b-fill
     *and* added those properties to bcm2712-rpi-5-b.dts; ran
     a few hours; their own follow-up: "this is a dud" -- node
     hung anyway.

     For reference, my build (raspberrypi/linux rpi-6.18.y @
     f2f68e79f16f) has the driver-side reads present but the
     DTS does not set the properties, so the AMP register
     defaults are in effect on my fleet and the bug still
     fires.  Two independent confirmations the DT props are
     not the fix.


## My own audit -- two issues v2 fixes

Worth disclosing before someone else catches them.

  * Patch 2 (v1) reads ISR in macb_tx_poll(), masks off
    everything except TCOMP, and discards the rest.
    raspberrypi_rp1_config does not set
    MACB_CAPS_ISR_CLEAR_ON_WRITE -- the driver assumes
    read-clear ISR semantics on RP1, and macb_interrupt()
    relies on processing every bit it reads in one pass for
    that case to be correct.  My v1 patch breaks that contract:
    any RCOMP / ROVR / TXUBR / etc. set in ISR at the moment of
    my read is silently consumed and never processed.  RCOMP
    being lost is the worst case (RX completion never scheduled
    until something else asserts the line).  Race window is
    ~200-500 ns per macb_tx_poll completion; significant at
    moderate-to-heavy RX load on a level-triggered IRQ where
    the consumed bit drops the line before GIC delivery.

    v2 patch 2 drops the ISR read entirely and substitutes
    (void)queue_readl(queue, IMR).  IMR is the read-only mask
    mirror, no side effects, still flushes prior peripheral
    DMA writes via PCIe ordering.  Loses the "directly sample
    latched TCOMP" half of v1's claim; keeps the PCIe-barrier
    half (which is the half that actually addresses the
    documented race in the existing macb_tx_complete_pending()
    rmb() comment).

  * Patch 3 (v1) boot-time false positive described above to
    Andrea.  v2 fixes:

      - netif_carrier_ok() gate -- no carrier, no completion is
        possible, don't fire.

      - tracks tail movement via a bool tx_stall_tail_moved set
        by macb_tx_complete() under tx_ptr_lock when tail
        advances, cleared by the watchdog tick on the same
        lock.  Form suggested by pelwell when he reviewed the
        same series anchored against rpi-6.18.y at
        raspberrypi/linux#7340 (merged 2026-05-08); 13 days of
        fleet runtime in this form since 2026-05-02 across 24
        nodes.  Folded into mainline v2 patch 3 directly rather
        than carried as a fix-up.

      - netdev_warn_once -> netdev_warn_ratelimited per
        Andrea's ask -- operator visibility doesn't disappear
        after the first fire.


## v2 follows

Sending [PATCH net-next v2 0/3] threaded under the v1 cover
shortly.  Plan to drop "RFC" from the subject prefix (~3 weeks
production runtime + the rpi in-tree merge tip the balance
toward a regular PATCH); happy to revert to RFC if you'd
prefer the iterative-review cadence.

Tested-on context for v2:
  * mainline-anchored:  builds clean against current net-next
    HEAD, applies cleanly.  Boot-tested in a canary build of
    my Talos image before this reply.
  * rpi-6.18.y-anchored equivalents:  in production on 24 nodes
    since 2026-05-02; in raspberrypi/linux master tip since
    2026-05-08.
  * The v2 patch 2 IMR-barrier form (the one that fixes the
    destructive ISR-read regression described above) was rolled
    to all 24 Pi nodes earlier today (2026-05-14, ~14:00 UTC) as
    a vendor-fork-anchored update.  ~120 cumulative node-hours
    of runtime since:  zero mid-runtime TX stalls; zero user-space
    watchdog RECOVER events; the 5 dmesg "TX stall detected" lines
    are all boot-time false positives of the form `tail=0 head=N`
    that the new `netif_carrier_ok()` gate in patch 3 eliminates.
    Pre-patch baseline rate (~0.5 stall/node-hour at fleet level)
    would have predicted on the order of 60 mid-runtime stalls in
    that window; observed is 0.

Thanks again to both for the review.

--
Lukasz Raczylo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH net-next v2 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
       [not found] <cover.1777064117.git.lukasz@raczylo.com>
                   ` (3 preceding siblings ...)
  2026-05-14 21:51 ` Lukasz Raczylo
@ 2026-05-14 21:54 ` Lukasz Raczylo
  2026-05-14 21:54   ` [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only) Lukasz Raczylo
                     ` (2 more replies)
  4 siblings, 3 replies; 8+ messages in thread
From: Lukasz Raczylo @ 2026-05-14 21:54 UTC (permalink / raw)
  To: netdev
  Cc: Theo Lebrun, Andrea della Porta, Nicolas Ferre, Claudiu Beznea,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel, linux-arm-kernel, linux-rpi-kernel

Hi netdev, Théo, Andrea, linux-rpi,

v2 of the silent TX stall series.  The v1 RFC sits at:

  https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/

Reframing first.  The v1 cover claimed "zero events post-patch";
that was true at the user-space watchdog visibility level only.
A dmesg sweep prompted by Andrea's review -- with patch 3's warn
made unconditional, per his ask -- revealed kernel-level evidence
that patches 1 and 2 are partial at best.  Patch 3 is empirically
the load-bearing fix on this platform: it caught and recovered a
real lost-TCOMP stall on pi-data-02 at 2026-05-05T13:24:09Z
(queue 0, tail=259564431 head=259564433 after ~260M TX, HW
ETHS tx_frames counter advancing through the event while driver
tx_tail did not) without user-space involvement.

So the v2 narrative reads:

  * Patch 1 (PCIe posted-write flush) and patch 2 (PCIe read
    barrier before descriptor check) close two specific
    candidate races in the TSTART / TX_USED paths.  Plausible
    and well-motivated, but I cannot prove either fires in
    isolation on this hardware -- my 1 Hz trace shows TX
    freezes, not which mechanism caused them.

  * Patch 3 (TX stall watchdog) is the safety net that
    empirically does the recovery work.  13 days of production
    runtime on 24 nodes since 2026-05-02 in the same form
    (anchored against the rpi-6.18.y vendor fork, in
    raspberrypi/linux#7340 -- merged 2026-05-08 after review
    feedback from pelwell that this v2 incorporates).

The v1 cover's "zero stalls in 95 node-hours of post-patch
uptime" framing was misleading.  Apologies for that.

## What changed in v2

Patch 1 (PCIe posted-write flush after TSTART doorbell):
  * Gates the readback behind a new MACB_CAPS_PCIE_POSTED_WRITES
    capability, set only on raspberrypi_rp1_config.  v1
    applied the readback to every macb variant; SoC-integrated
    parts (Atmel, Microchip, SiFive, Xilinx) have no posted-write
    fabric and were paying the readback latency for no benefit.
  * Commit message notes that the readback also flushes the
    preceding macb_tx_lpi_wake() NCR write on the same path --
    not just TSTART -- since it functions as a PCIe read barrier
    for all prior posted writes by the same requester.

Patch 2 (PCIe read barrier before TX completion descriptor check):
  * Dropped the ISR read.  v1 read ISR in macb_tx_poll() with
    `queue_readl(queue, ISR) & MACB_BIT(TCOMP)`; that's
    destructive on RP1 silicon (MACB_CAPS_ISR_CLEAR_ON_WRITE
    is not set on raspberrypi_rp1_config; the existing handler
    assumes read-clear semantics and processes every bit
    returned from queue_readl(queue, ISR) in one pass).  v1's
    masked-and-discarded read silently consumed any other bit
    set in ISR at that instant -- RCOMP being the worst case
    (RX completion never scheduled until the line re-asserts).
  * v2 substitutes `(void)queue_readl(queue, IMR)` -- IMR is
    the read-only interrupt mask mirror, no side effects, still
    flushes prior peripheral DMA writes via PCIe completion
    ordering.  Loses the "directly sample latched TCOMP" half
    of v1's claim; keeps the PCIe-barrier half, which is the
    half that addresses the documented race in the existing
    macb_tx_complete_pending() rmb() comment.

Patch 3 (TX stall watchdog):
  * Tail movement is tracked via a `bool tx_stall_tail_moved`
    set by macb_tx_complete() under tx_ptr_lock when tail
    advances, and cleared by the watchdog tick on the same lock.
    v1 snapshotted tx_tail and compared between ticks; while
    that worked correctly given tx_tail is free-running u32,
    the bool form is unambiguously cleaner, doesn't depend on
    counter behaviour, and is what pelwell asked for when he
    reviewed the same series on the rpi side
    (raspberrypi/linux#7340).
  * netif_carrier_ok() gate added at the top of the watchdog
    tick.  Eliminates the boot-time false positive seen in v1
    where, between macb_open() and link-autoneg-completion,
    queue->tx_head can advance from kernel-queued packets while
    tx_tail stays at 0 (no TCOMPs yet), tripping the snapshot
    check.  Observed 6 such fires during a 2026-05-02 fleet
    rolling reboot.
  * netdev_warn_once -> netdev_warn_ratelimited.  v1's
    netdev_warn_once made operational accounting impossible
    after the first fire on a given netdev; ratelimited keeps
    bounded log noise but lets operators count events.  Andrea
    asked for this directly.

Patches 1 and 3 are independently revertable.  Patch 2 v2 is a
two-line readback before an existing check; trivially revertable
in isolation, semantically dependent on the existing
macb_tx_complete_pending() recovery path that it strengthens.

## What I haven't done

  * TSO+SG-off canary.  rtheobald (cilium#43198 #4188846955)
    and the launchpad #2133877 commenter (#34) both report
    TSO+SG-off *together* mask the stall; my matrix has TSO+GSO
    tested off, not TSO+SG.  Happy to canary-test this on one
    node if reviewers want the data point before deciding which
    of patches 1/2 the SG path actually exercises.

  * Per-patch isolation testing.  All three deployed
    simultaneously on the 24-node fleet; I cannot independently
    prove patch 1 or patch 2 does anything on its own.  Patch 3
    has direct production evidence (lost-TCOMP recovery
    described above).  If reviewers want a bisection-style
    canary I can stagger one-patch / two-patch / three-patch
    nodes for >=1 week each.

## Status and testing

  * Mainline-anchored:  v2 builds clean against current net-next
    HEAD, applies cleanly.  Boot-tested and brief-sanity in a
    canary build before this send.
  * raspberrypi/linux rpi-6.18.y anchored equivalents:  in
    production on 24 nodes since 2026-05-02 (now 13 days); in
    raspberrypi/linux master since 2026-05-08 (6 days).
  * The v2 patch 2 IMR-barrier form was rolled to all 24 Pi
    nodes earlier today (2026-05-14, ~14:00 UTC) as a
    vendor-fork-anchored update.  ~120 cumulative node-hours
    of runtime since: zero mid-runtime TX stalls; zero user-space
    watchdog RECOVER events.  Cover-letter-thread reply with
    detail accompanies this series.

The series does not depend on any other in-flight work.  Happy
to split, rebase, drop, or restructure on feedback.

Lukasz Raczylo (3):
  net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only)
  net: macb: insert PCIe read barrier before TX completion descriptor
    check
  net: macb: add TX stall watchdog to recover from lost TCOMP interrupts

 drivers/net/ethernet/cadence/macb.h      | 14 ++++
 drivers/net/ethernet/cadence/macb_main.c | 95 ++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only)
  2026-05-14 21:54 ` [PATCH net-next v2 " Lukasz Raczylo
@ 2026-05-14 21:54   ` Lukasz Raczylo
  2026-05-14 21:54   ` [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX completion descriptor check Lukasz Raczylo
  2026-05-14 21:54   ` [PATCH net-next v2 3/3] net: macb: add TX stall watchdog to recover from lost TCOMP interrupts Lukasz Raczylo
  2 siblings, 0 replies; 8+ messages in thread
From: Lukasz Raczylo @ 2026-05-14 21:54 UTC (permalink / raw)
  To: netdev
  Cc: Theo Lebrun, Andrea della Porta, Nicolas Ferre, Claudiu Beznea,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel, linux-arm-kernel, linux-rpi-kernel

macb_start_xmit() and macb_tx_restart() kick transmission by
OR-ing MACB_BIT(TSTART) into NCR.  On PCIe-attached macb
instances (BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is
the case I have in front of me), writes to NCR are posted PCIe
writes: they are not guaranteed to reach the device before the
issuing CPU returns.  If the TSTART doorbell does not reach the
MAC, no TX begins, no TCOMP completion arrives, and the ring
remains quiescent without any kernel-visible indication.

Add a read-back of NCR after each TSTART write.  The read is an
architected PCIe read barrier for earlier posted writes on the
same path; it ensures the doorbell has reached the MAC before
the function returns.  As a side effect on macb_start_xmit() it
also flushes the preceding macb_tx_lpi_wake() NCR write -- not
just TSTART -- since the barrier applies to all prior posted
writes by the same requester.

The cost is one non-posted PCIe read per TSTART.  To avoid
imposing this on SoC-integrated macb variants (Atmel, Microchip,
SiFive, Xilinx), where NCR is on-chip MMIO and no fabric
posted-write concern exists, gate the readback behind a new
MACB_CAPS_PCIE_POSTED_WRITES capability set only on
raspberrypi_rp1_config.

Note that the raspberrypi/linux vendor fork carries a local
patch around the TSTART site (a queue->tx_pending breadcrumb
that is promoted to queue->txubr_pending by the next TCOMP
interrupt, triggering macb_tx_restart()).  That workaround makes
the loss recoverable under traffic, but it cannot help if TCOMP
itself is not raised because no TX started -- which is exactly
the case targeted here.  The handshake is not present in
mainline.

Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
---
 drivers/net/ethernet/cadence/macb.h      |  4 ++++
 drivers/net/ethernet/cadence/macb_main.c | 15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 2de56017e..ce9037f9e 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -791,6 +791,10 @@
 #define MACB_CAPS_USRIO_HAS_MII			BIT(26)
 #define MACB_CAPS_USRIO_HAS_REFCLK_SOURCE	BIT(27)
 #define MACB_CAPS_USRIO_HAS_TSUCLK_SOURCE	BIT(28)
+/* Register writes are posted on the parent fabric and need a non-posted
+ * read-back to guarantee delivery.  Currently set only on RP1.
+ */
+#define MACB_CAPS_PCIE_POSTED_WRITES		BIT(29)

 /* LSO settings */
 #define MACB_LSO_UFO_ENABLE			0x01
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index a12aa2124..6879f3458 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1922,6 +1922,14 @@ static void macb_tx_restart(struct macb_queue *queue)

 	spin_lock(&bp->lock);
 	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
+	/*
+	 * On PCIe-attached parts, flush the posted-write queue so the
+	 * TSTART doorbell reliably reaches the MAC.  Without this the
+	 * write can sit in the fabric and the MAC never advances,
+	 * causing a silent TX stall.
+	 */
+	if (bp->caps & MACB_CAPS_PCIE_POSTED_WRITES)
+		(void)macb_readl(bp, NCR);
 	spin_unlock(&bp->lock);

 out_tx_ptr_unlock:
@@ -2560,6 +2568,12 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	spin_lock(&bp->lock);
 	macb_tx_lpi_wake(bp);
 	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
+	/*
+	 * Flush PCIe posted-write queue; see comment in macb_tx_restart().
+	 * Also flushes the preceding macb_tx_lpi_wake() NCR write.
+	 */
+	if (bp->caps & MACB_CAPS_PCIE_POSTED_WRITES)
+		(void)macb_readl(bp, NCR);
 	spin_unlock(&bp->lock);

 	if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1)
@@ -5674,6 +5688,7 @@ static const struct macb_config raspberrypi_rp1_config = {
 	.caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_CLK_HW_CHG |
 		MACB_CAPS_JUMBO |
 		MACB_CAPS_GEM_HAS_PTP |
+		MACB_CAPS_PCIE_POSTED_WRITES |
 		MACB_CAPS_EEE |
 		MACB_CAPS_USRIO_HAS_MII,
 	.dma_burst_length = 16,
-- 
2.54.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX completion descriptor check
  2026-05-14 21:54 ` [PATCH net-next v2 " Lukasz Raczylo
  2026-05-14 21:54   ` [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only) Lukasz Raczylo
@ 2026-05-14 21:54   ` Lukasz Raczylo
  2026-05-14 21:54   ` [PATCH net-next v2 3/3] net: macb: add TX stall watchdog to recover from lost TCOMP interrupts Lukasz Raczylo
  2 siblings, 0 replies; 8+ messages in thread
From: Lukasz Raczylo @ 2026-05-14 21:54 UTC (permalink / raw)
  To: netdev
  Cc: Theo Lebrun, Andrea della Porta, Nicolas Ferre, Claudiu Beznea,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel, linux-arm-kernel, linux-rpi-kernel

macb_tx_poll() runs with TCOMP masked, drains the TX ring, then
calls napi_complete_done() and re-enables TCOMP via IER.  An
existing comment in the function notes that completions raised
while TCOMP is masked do not re-fire on IER re-enable, and
mitigates this by calling macb_tx_complete_pending(), which
inspects driver-visible ring state (descriptor->ctrl, after
rmb()) and reschedules NAPI if a completion is observable in
memory.

On PCIe-attached parts (BCM2712 + RP1 PCIe south bridge on
Raspberry Pi 5 is the case I have in front of me), the
descriptor DMA write that sets TX_USED may not have retired to
system memory at the point macb_tx_complete_pending() runs.  The
rmb() synchronises the CPU view of earlier CPU writes; it is
not sufficient to retire an in-flight peripheral DMA write.
Under that ordering the in-memory descriptor can still read
TX_USED=0 when the hardware has in fact completed the frame;
the check returns false; NAPI exits; the quirk above prevents
the re-enabled IER from re-firing; the ring goes quiescent.

Add a side-effect-free MMIO read between the IER write and the
macb_tx_complete_pending() check.  The read functions as an
architected PCIe read barrier for earlier peripheral-originated
DMA writes on the same path, so any in-flight TX_USED update
retires to system memory before the descriptor read.

The register chosen is IMR (the read-only interrupt mask
mirror); reading it has no side effects on either read-clear or
W1C ISR silicon (it is not the ISR), and the read still flushes
prior DMA writes via the PCIe completion-ordering guarantee.

Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 6879f3458..f7fa9e7ad 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1984,6 +1984,14 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
 		 * actions if an interrupt is raised just after enabling them,
 		 * but this should be harmless.
 		 */
+		/*
+		 * PCIe read barrier: flush any in-flight peripheral DMA
+		 * writes (descriptor TX_USED updates) so the subsequent
+		 * macb_tx_complete_pending() check observes them.  IMR is
+		 * the read-only interrupt mask mirror; the read has no
+		 * side effects on either read-clear or W1C ISR silicon.
+		 */
+		(void)queue_readl(queue, IMR);
 		if (macb_tx_complete_pending(queue)) {
 			queue_writel(queue, IDR, MACB_BIT(TCOMP));
 			macb_queue_isr_clear(bp, queue, MACB_BIT(TCOMP));
-- 
2.54.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH net-next v2 3/3] net: macb: add TX stall watchdog to recover from lost TCOMP interrupts
  2026-05-14 21:54 ` [PATCH net-next v2 " Lukasz Raczylo
  2026-05-14 21:54   ` [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only) Lukasz Raczylo
  2026-05-14 21:54   ` [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX completion descriptor check Lukasz Raczylo
@ 2026-05-14 21:54   ` Lukasz Raczylo
  2 siblings, 0 replies; 8+ messages in thread
From: Lukasz Raczylo @ 2026-05-14 21:54 UTC (permalink / raw)
  To: netdev
  Cc: Theo Lebrun, Andrea della Porta, Nicolas Ferre, Claudiu Beznea,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel, linux-arm-kernel, linux-rpi-kernel

On PCIe-attached macb instances (BCM2712 + RP1 PCIe south bridge
on Raspberry Pi 5 is the case I have in front of me), a TCOMP
interrupt can be lost: the TSTART doorbell can be lost in the
posted-write fabric (addressed by an earlier patch), or the
descriptor TX_USED DMA write can be observed late by the driver
(also addressed earlier).  When that happens the TX ring stalls
silently until something else kicks TSTART.

Add a per-queue delayed_work that runs once per second.  It
detects forward progress on the TX completion path via a per-queue
bool tx_stall_tail_moved that macb_tx_complete() sets when tx_tail
advances and the watchdog clears on each tick.  If the ring is
non-empty (queue->tx_head != queue->tx_tail) and the flag is
unset when the tick runs, the watchdog calls the existing
macb_tx_restart() to re-assert TSTART.

The bool form (rather than a tx_tail snapshot) sidesteps any
concern about ring-index aliasing between ticks and is the form
suggested by Phil Elwell when reviewing the same series anchored
against rpi-6.18.y at raspberrypi/linux#7340 (merged 2026-05-08).

No new recovery logic is introduced.  macb_tx_restart() already
exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
and verifies that the hardware's TBQP is behind the driver's head
index before re-asserting TSTART.  On a healthy ring it is a
no-op at the hardware level.

Cost on a healthy queue: one spin_lock_irqsave / spin_unlock and
two field assignments per tick.  The delayed_work is only
scheduled between macb_open() and macb_close() and is cancelled
synchronously on close.

A netif_carrier_ok() gate at the top of the tick skips the stall
check when there is no carrier (no completion is possible without
a link), eliminating a boot-time false positive where queue->tx_head
can advance from kernel-queued packets between macb_open() and link
autoneg completion, while tx_tail stays unchanged because no TCOMPs
have arrived yet.

netdev_warn_ratelimited() is used rather than netdev_warn_once() so
operators can count occurrences across the lifetime of the netdev.

Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Link: https://github.com/raspberrypi/linux/pull/7340
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
---
 drivers/net/ethernet/cadence/macb.h      | 10 ++++
 drivers/net/ethernet/cadence/macb_main.c | 72 ++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index ce9037f9e..75df0f75b 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1282,6 +1282,16 @@ struct macb_queue {
 	dma_addr_t		tx_ring_dma;
 	struct work_struct	tx_error_task;
 	bool			txubr_pending;
+
+	/* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c.
+	 * tx_stall_tail_moved is set by macb_tx_complete() when tx_tail
+	 * advances and cleared by the watchdog tick on each pass (both
+	 * under tx_ptr_lock).  Using a bool sidesteps any ring-index
+	 * aliasing concern between ticks.
+	 */
+	struct delayed_work	tx_stall_watchdog_work;
+	bool			tx_stall_tail_moved;
+
 	struct napi_struct	napi_tx;
 
 	dma_addr_t		rx_ring_dma;
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index f7fa9e7ad..8245c69e1 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1473,6 +1473,8 @@ static int macb_tx_complete(struct macb_queue *queue, int budget)
 				  packets, bytes);
 
 	queue->tx_tail = tail;
+	if (packets)
+		queue->tx_stall_tail_moved = true;
 	if (__netif_subqueue_stopped(bp->dev, queue_index) &&
 	    CIRC_CNT(queue->tx_head, queue->tx_tail,
 		     bp->tx_ring_size) <= MACB_TX_WAKEUP_THRESH(bp))
@@ -2003,6 +2005,70 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
 	return work_done;
 }
 
+#define MACB_TX_STALL_INTERVAL_MS	1000
+
+/*
+ * TX stall watchdog.
+ *
+ * Recovers from lost TCOMP interrupts on PCIe-attached macb
+ * instances.  macb already has a recovery chain
+ * (txubr_pending -> macb_tx_restart()) that fires on TCOMP; if
+ * TCOMP itself is lost the TX ring stalls silently until something
+ * else kicks TSTART.  This watchdog runs once per second per queue
+ * and calls macb_tx_restart() if the ring is non-empty and
+ * tx_tail has not advanced since the previous tick.
+ *
+ * Movement is tracked via the tx_stall_tail_moved boolean rather
+ * than a tx_tail snapshot, sidestepping any ring-index aliasing
+ * concern.  The bool is set by macb_tx_complete() when tx_tail
+ * advances and cleared here on each tick; both writes are under
+ * tx_ptr_lock so no atomic is required.
+ *
+ * macb_tx_restart() already checks the hardware's TBQP against
+ * the driver's head index before re-asserting TSTART, so on a
+ * healthy ring this is a no-op at the hardware level.  The
+ * watchdog only supplies the missing trigger.
+ */
+static void macb_tx_stall_watchdog(struct work_struct *work)
+{
+	struct macb_queue *queue = container_of(to_delayed_work(work),
+						struct macb_queue,
+						tx_stall_watchdog_work);
+	struct macb *bp = queue->bp;
+	unsigned int cur_tail, cur_head;
+	bool stalled = false;
+	unsigned long flags;
+
+	if (!netif_running(bp->dev))
+		return;
+
+	/* No carrier => no completion is possible.  Skip the check
+	 * but keep the watchdog ticking for when carrier comes up.
+	 */
+	if (!netif_carrier_ok(bp->dev))
+		goto reschedule;
+
+	spin_lock_irqsave(&queue->tx_ptr_lock, flags);
+	cur_tail = queue->tx_tail;
+	cur_head = queue->tx_head;
+	if (cur_head != cur_tail && !queue->tx_stall_tail_moved)
+		stalled = true;
+	queue->tx_stall_tail_moved = false;
+	spin_unlock_irqrestore(&queue->tx_ptr_lock, flags);
+
+	if (stalled) {
+		netdev_warn_ratelimited(bp->dev,
+					"TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n",
+					(unsigned int)(queue - bp->queues),
+					cur_tail, cur_head);
+		macb_tx_restart(queue);
+	}
+
+reschedule:
+	schedule_delayed_work(&queue->tx_stall_watchdog_work,
+			      msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
+}
+
 static void macb_hresp_error_task(struct work_struct *work)
 {
 	struct macb *bp = from_work(bp, work, hresp_err_bh_work);
@@ -3192,6 +3258,9 @@ static int macb_open(struct net_device *dev)
 	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
 		napi_enable(&queue->napi_rx);
 		napi_enable(&queue->napi_tx);
+		queue->tx_stall_tail_moved = true;
+		schedule_delayed_work(&queue->tx_stall_watchdog_work,
+				      msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
 	}
 
 	macb_init_hw(bp);
@@ -3242,6 +3311,7 @@ static int macb_close(struct net_device *dev)
 	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
 		napi_disable(&queue->napi_rx);
 		napi_disable(&queue->napi_tx);
+		cancel_delayed_work_sync(&queue->tx_stall_watchdog_work);
 		netdev_tx_reset_queue(netdev_get_tx_queue(dev, q));
 	}
 
@@ -4804,6 +4874,8 @@ static int macb_init_dflt(struct platform_device *pdev)
 		}
 
 		INIT_WORK(&queue->tx_error_task, macb_tx_error_task);
+		INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work,
+				  macb_tx_stall_watchdog);
 		q++;
 	}
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-05-14 21:55 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <cover.1777064117.git.lukasz@raczylo.com>
     [not found] ` <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
2026-05-05 13:17   ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Andrea della Porta
     [not found] ` <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>
2026-05-05 13:30   ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Andrea della Porta
2026-05-14 10:31 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Théo Lebrun
2026-05-14 21:51 ` Lukasz Raczylo
2026-05-14 21:54 ` [PATCH net-next v2 " Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after TSTART doorbell (PCIe-only) Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX completion descriptor check Lukasz Raczylo
2026-05-14 21:54   ` [PATCH net-next v2 3/3] net: macb: add TX stall watchdog to recover from lost TCOMP interrupts Lukasz Raczylo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox