Re: [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* Re: [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell
       [not found] ` <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
@ 2026-05-05 13:17   ` Andrea della Porta
  0 siblings, 0 replies; 3+ messages in thread
From: Andrea della Porta @ 2026-05-05 13:17 UTC (permalink / raw)
  To: Lukasz Raczylo
  Cc: netdev, Nicolas Ferre, Claudiu Beznea, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, linux-arm-kernel, linux-rpi-kernel

Hi Lukasz,

On 23:38 Fri 24 Apr     , Lukasz Raczylo wrote:
> macb_start_xmit() and macb_tx_restart() kick transmission by
> OR-ing MACB_BIT(TSTART) into NCR.  On PCIe-attached macb instances
> (BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
> have in front of us), writes to NCR are posted PCIe writes: they
> are not guaranteed to reach the device before the issuing CPU
> returns.  If the TSTART doorbell does not reach the MAC, no TX
> begins, no TCOMP completion arrives, and the ring remains
> quiescent without any kernel-visible indication.
> 
> Note that the raspberrypi/linux vendor fork carries a local patch
> around the TSTART site (a queue->tx_pending breadcrumb that is
> promoted to queue->txubr_pending by the next TCOMP interrupt,
> triggering macb_tx_restart()).  That workaround makes the loss
> recoverable under traffic, but it cannot help if TCOMP itself is
> not raised because no TX started -- which is exactly the case we
> are targeting here.  The handshake is not present in mainline.
> 
> Add a read-back of NCR after each TSTART write in macb_start_xmit()
> and macb_tx_restart().  The read is an architected PCIe read
> barrier for earlier posted writes on the same path; it ensures the
> doorbell has reached the MAC before the functions return.
> 
> We do not yet have direct hardware evidence that TSTART is being
> lost on the RP1 path (that would require a PCIe protocol analyser,
> or at minimum a before/after counter on queue->tx_stall_last_tail
> with and without this patch applied in isolation).  This patch is
> one of a three-patch series ("candidate fixes for silent TX stall
> on BCM2712/RP1"); see the cover letter for context.  We have
> verified the series compiles and applies cleanly against mainline
> HEAD and against raspberrypi/linux rpi-6.18.y @ f2f68e79f16f;
> runtime verification is pending.
> 
> Link: https://github.com/cilium/cilium/issues/43198
> Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
> Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
> ---
>  drivers/net/ethernet/cadence/macb_main.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index a12aa2124..b6cca55ad 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -1922,6 +1922,13 @@ static void macb_tx_restart(struct macb_queue *queue)
>  
>  	spin_lock(&bp->lock);
>  	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
> +	/*
> +	 * Flush the PCIe posted-write queue so the TSTART doorbell
> +	 * reliably reaches the MAC.  Without this, the write can sit
> +	 * in the fabric and the MAC never advances, causing a silent
> +	 * TX stall.
> +	 */
> +	(void)macb_readl(bp, NCR);

Do you expect it to be a temporal issue (in the comment you stated that the write should be delivered
before the function returns) or is it just a dropped packet error? In the former case can you please
explain why? If it's the latter, I'd expect the dropped packet to increment the counter in the AER,
and I'm pretty sure it didn't when I performed my tests.

Many thanks,
Andrea

>  	spin_unlock(&bp->lock);
>  
>  out_tx_ptr_unlock:
> @@ -2560,6 +2567,11 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	spin_lock(&bp->lock);
>  	macb_tx_lpi_wake(bp);
>  	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
> +	/*
> +	 * Flush the PCIe posted-write queue; see the comment in
> +	 * macb_tx_restart() for the reasoning.
> +	 */
> +	(void)macb_readl(bp, NCR);
>  	spin_unlock(&bp->lock);
>  
>  	if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1)
> -- 
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net
       [not found] ` <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>
@ 2026-05-05 13:30   ` Andrea della Porta
  0 siblings, 0 replies; 3+ messages in thread
From: Andrea della Porta @ 2026-05-05 13:30 UTC (permalink / raw)
  To: Lukasz Raczylo
  Cc: netdev, Nicolas Ferre, Claudiu Beznea, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, linux-arm-kernel, linux-rpi-kernel

Hi Lukasz,

On 23:38 Fri 24 Apr     , Lukasz Raczylo wrote:
> Patches 1/3 and 2/3 address two candidate races that could lead
> to a TCOMP completion being missed on PCIe-attached macb
> instances.  This patch adds a defence-in-depth safety net, in
> case a further race remains that we have not identified.
> 
> The watchdog is a per-queue delayed_work that runs once per
> second.  It snapshots queue->tx_tail; if the ring is non-empty
> (queue->tx_head != queue->tx_tail) and tx_tail has not advanced
> since the previous tick, it calls macb_tx_restart().
> 
> No new recovery logic is introduced.  macb_tx_restart() already
> exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
> and verifies that the hardware's TBQP is behind the driver's
> head index before re-asserting TSTART.  On a healthy ring it is
> a no-op at the hardware level; the watchdog only supplies the
> missing trigger.
> 
> On a healthy queue the per-tick cost is one spin_lock_irqsave()
> / spin_unlock_irqrestore() and one branch.  The delayed_work is
> only scheduled between macb_open() and macb_close(), and is
> cancelled synchronously on close.
> 
> Context for submission: on our 24-node Raspberry Pi 5 fleet,
> before this series, an out-of-band user-space watchdog
> (monitoring tx_packets from /sys/class/net/.../statistics and
> toggling the link down/up when it froze) was required to keep
> nodes usable.  We include this kernel-side watchdog as a cleaner
> in-kernel equivalent for any residual stall that patches 1 and
> 2 do not cover.  We are willing to drop this patch if the view
> is that 1 and 2 should stand alone.
> 
> Link: https://github.com/cilium/cilium/issues/43198
> Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
> Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
> ---
>  drivers/net/ethernet/cadence/macb.h      |  5 ++
>  drivers/net/ethernet/cadence/macb_main.c | 59 ++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index 2de56017e..9115f2b47 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -1278,6 +1278,11 @@ struct macb_queue {
>  	dma_addr_t		tx_ring_dma;
>  	struct work_struct	tx_error_task;
>  	bool			txubr_pending;
> +
> +	/* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c */
> +	struct delayed_work	tx_stall_watchdog_work;
> +	unsigned int		tx_stall_last_tail;
> +
>  	struct napi_struct	napi_tx;
>  
>  	dma_addr_t		rx_ring_dma;
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index ea231b1c5..ea2306ef7 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -2002,6 +2002,59 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
>  	return work_done;
>  }
>  
> +#define MACB_TX_STALL_INTERVAL_MS	1000
> +
> +/*
> + * TX stall watchdog.
> + *
> + * Defence-in-depth against lost TCOMP interrupts.  macb already has a
> + * recovery chain (tx_pending -> txubr_pending -> macb_tx_restart())
> + * that fires on TCOMP; if TCOMP itself is lost the TX ring stalls
> + * silently until something else kicks TSTART.  This watchdog runs
> + * once per second per queue, snapshots tx_tail, and calls
> + * macb_tx_restart() if the ring is non-empty and tx_tail has not
> + * advanced since the previous tick.
> + *
> + * macb_tx_restart() already checks the hardware's TBQP against the
> + * driver's head index before re-asserting TSTART, so on a healthy
> + * ring this is a no-op at the hardware level.  The watchdog only
> + * adds the missing trigger.
> + */
> +static void macb_tx_stall_watchdog(struct work_struct *work)
> +{
> +	struct macb_queue *queue = container_of(to_delayed_work(work),
> +						struct macb_queue,
> +						tx_stall_watchdog_work);
> +	struct macb *bp = queue->bp;
> +	unsigned int cur_tail, cur_head;
> +	bool stalled = false;
> +	unsigned long flags;
> +
> +	if (!netif_running(bp->dev))
> +		return;
> +
> +	spin_lock_irqsave(&queue->tx_ptr_lock, flags);
> +	cur_tail = queue->tx_tail;
> +	cur_head = queue->tx_head;
> +	if (cur_head != cur_tail &&
> +	    cur_tail == queue->tx_stall_last_tail)
> +		stalled = true;
> +	else
> +		queue->tx_stall_last_tail = cur_tail;
> +	spin_unlock_irqrestore(&queue->tx_ptr_lock, flags);
> +
> +	if (stalled) {
> +		netdev_warn_once(bp->dev,
> +				 "TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n",
> +				 (unsigned int)(queue - bp->queues),
> +				 cur_tail, cur_head);
> +		macb_tx_restart(queue);
> +	}
> +
> +	schedule_delayed_work(&queue->tx_stall_watchdog_work,
> +			      msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
> +}
> +
>  static void macb_hresp_error_task(struct work_struct *work)
>  {
>  	struct macb *bp = from_work(bp, work, hresp_err_bh_work);
> @@ -3190,6 +3243,9 @@ static int macb_open(struct net_device *dev)
>  	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>  		napi_enable(&queue->napi_rx);
>  		napi_enable(&queue->napi_tx);
> +		queue->tx_stall_last_tail = queue->tx_tail;
> +		schedule_delayed_work(&queue->tx_stall_watchdog_work,
> +				      msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
>  	}
>  
>  	macb_init_hw(bp);
> @@ -3240,6 +3296,7 @@ static int macb_close(struct net_device *dev)
>  	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>  		napi_disable(&queue->napi_rx);
>  		napi_disable(&queue->napi_tx);
> +		cancel_delayed_work_sync(&queue->tx_stall_watchdog_work);
>  		netdev_tx_reset_queue(netdev_get_tx_queue(dev, q));
>  	}
>  
> @@ -4802,6 +4859,8 @@ static int macb_init_dflt(struct platform_device *pdev)
>  		}
>  
>  		INIT_WORK(&queue->tx_error_task, macb_tx_error_task);
> +		INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work,
> +				  macb_tx_stall_watchdog);
>  		q++;
>  	}
>  
> -- 
> 2.53.0
>

I've applied all three patches to v6.19.10 changing netdev_warn_once() from this one to
netdev_warn() and it the "TX stall" warning appears several time. So it seems that there
could be another cause escaping the filtering in the first two patches.

Interestingly enough, running the same tests after substituing the entire macb driver with
the downstream version works ok.

Not sure how to interpret these results since they seem to be the opposite of yours.
More investigation is ongoing from my side.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
       [not found] <cover.1777064117.git.lukasz@raczylo.com>
       [not found] ` <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
       [not found] ` <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>
@ 2026-05-14 10:31 ` Théo Lebrun
  2 siblings, 0 replies; 3+ messages in thread
From: Théo Lebrun @ 2026-05-14 10:31 UTC (permalink / raw)
  To: Lukasz Raczylo, netdev
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel,
	linux-arm-kernel, linux-rpi-kernel

Hello Lukasz,

On Sat Apr 25, 2026 at 12:38 AM CEST, Lukasz Raczylo wrote:
> This series proposes three candidate fixes for the silent TX stall
> observed on Raspberry Pi 5 (BCM2712 SoC, Cadence GEM via RP1 PCIe
> south bridge).  The bug has been reported, with reproducers, at:

I've taken over the MACB driver maintenance following Clandiu & Nicolas'
work. I have read with curiosity your series and links attached
(though I skimmed over parts because there's been a lot of discussion).

Have you moved forward since your initial post? I've seen your
still-working-on-it message from May 8th on the rpi kernel PR. I have a
few remarks and/or questions:

 - You still think the two fix patches solve it? Any clearer picture of
   which of the two patches inbetween [1/3] or[2/3] fixes it? Does
   [3/3] ever trigger on your targets?

 - Can you clarify the exact symptoms? I've seen a few contradictory
   facts. Two I remember:

    - You say here it is a Tx stall but I've seen messages in the linked
      threads that say explicitely broken Tx & Rx.
      https://github.com/cilium/cilium/issues/43198#issue-3706713821

    - You say here link down/up fixes it, but there is a comment that
      says they unload/reload the module (rtheobald). They don't say
      explicitely that link down/up doesn't work for them though, but
      someone before in the thread recommended link down/up. Another
      one says "Only power cycle recovers the node" (lexfrei).

    - Also, some messages point out disabling TSO / SG / EEE helped it.
      Any comment on that? It would help point fingers.

    - Some comments are about DT props missing. Is that lead dead now?

 - I've seen no mention of the bug having been reproduced on upstream
   kernel (?). What does the rpi kernel bring to the table that makes
   everyone use it?

 - Anything was found to increase the reproducibility of the bug? If it
   was then a bisect could be made possible, as I've seen mentions that
   it didn't appear on some older kernels.

Now about the patches:

> Reading the current driver we identified three plausible races
> between driver and hardware, each of which could independently
> produce the observed behaviour.  We did not determine which is the
> actual root cause -- that likely requires either BCM2712/RP1
> documentation we do not have, or dynamic tracing of the driver
> during an in-situ stall.  The series therefore attempts to close
> all three, with each commit message stating which specific race
> that patch is targeting.
>
>   Patch 1/3 -- flush PCIe posted write after TSTART doorbell.
>   Writes to NCR are posted PCIe writes and may not reach the MAC
>   before the driver returns.  If the TSTART doorbell is lost, no
>   TX starts, no TCOMP arrives, and the ring goes quiescent.  A
>   read-back of NCR after the write is a standard read-after-write
>   PCIe flush.

 - Makes sense, but only on MACB mounted over PCI, which is not the
   majority.
 - IDK if we can do better than a readl(NCR) on all platforms.
 - I am surprised it is the only writel() that needs to be flushed?

>   Patch 2/3 -- re-check ISR after IER re-enable in macb_tx_poll().
>   An existing comment in macb_tx_poll() notes that completions
>   raised while TCOMP is masked do not re-fire when IER is
>   re-enabled, and mitigates the window with macb_tx_complete_pending(),
>   which inspects driver-visible ring state only (after rmb()).  On
>   PCIe-attached parts the descriptor DMA write that sets TX_USED
>   can remain in flight when that check runs; the rmb() orders CPU
>   writes but does not retire peripheral DMA.  Reading ISR directly
>   after IER re-enable addresses this in two ways: (a) the MMIO read
>   is an architected PCIe read barrier for prior DMA writes, so a
>   subsequent macb_tx_complete_pending() sees up-to-date TX_USED
>   state; (b) it directly observes a pending TCOMP bit if the
>   hardware has one set.  Either signal reschedules NAPI.

This will not fly because ISR might be read-to-clear.
See macb_queue_isr_clear() and how it is used. So we cannot re-read ISR
safely on those platforms.

>   Patch 3/3 -- TX stall watchdog.  Defence-in-depth.  If patches
>   1 and 2 close the races we identified, this patch performs a
>   single spin_lock_irqsave/unlock and a branch per queue per
>   second with no other effect.  If a further race remains that we
>   have not identified, it invokes the driver's own existing
>   macb_tx_restart(), which already verifies that TBQP is behind
>   tx_head before re-asserting TSTART.  We include this patch
>   because we have empirically observed multi-minute stalls on this
>   hardware; we are willing to drop it if the preference is for
>   1 and 2 to stand alone.

Good idea, but that is what ndo_tx_timeout() is meant for no? It is a
mechanism that is not specific to our HW so that should be implemented
at the subsystem level, and it looks like it already is. :-)

We are aware of a few software scheduling races that we plan on fixing.
If your above patches ended up not fixing the issue, you could look
into those.
https://lore.kernel.org/netdev/DHIT9TPJQJ46.21A89R5UAFXVH@bootlin.com/

Thanks!

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-05-14 10:31 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <cover.1777064117.git.lukasz@raczylo.com>
     [not found] ` <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
2026-05-05 13:17   ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Andrea della Porta
     [not found] ` <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>
2026-05-05 13:30   ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Andrea della Porta
2026-05-14 10:31 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Théo Lebrun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox