* [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
@ 2026-04-24 22:38 ` Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable in macb_tx_poll Lukasz Raczylo
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Lukasz Raczylo @ 2026-04-24 22:38 UTC (permalink / raw)
To: netdev
Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel,
linux-arm-kernel, linux-rpi-kernel
macb_start_xmit() and macb_tx_restart() kick transmission by
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances
(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
have in front of us), writes to NCR are posted PCIe writes: they
are not guaranteed to reach the device before the issuing CPU
returns. If the TSTART doorbell does not reach the MAC, no TX
begins, no TCOMP completion arrives, and the ring remains
quiescent without any kernel-visible indication.
Note that the raspberrypi/linux vendor fork carries a local patch
around the TSTART site (a queue->tx_pending breadcrumb that is
promoted to queue->txubr_pending by the next TCOMP interrupt,
triggering macb_tx_restart()). That workaround makes the loss
recoverable under traffic, but it cannot help if TCOMP itself is
not raised because no TX started -- which is exactly the case we
are targeting here. The handshake is not present in mainline.
Add a read-back of NCR after each TSTART write in macb_start_xmit()
and macb_tx_restart(). The read is an architected PCIe read
barrier for earlier posted writes on the same path; it ensures the
doorbell has reached the MAC before the functions return.
We do not yet have direct hardware evidence that TSTART is being
lost on the RP1 path (that would require a PCIe protocol analyser,
or at minimum a before/after counter on queue->tx_stall_last_tail
with and without this patch applied in isolation). This patch is
one of a three-patch series ("candidate fixes for silent TX stall
on BCM2712/RP1"); see the cover letter for context. We have
verified the series compiles and applies cleanly against mainline
HEAD and against raspberrypi/linux rpi-6.18.y @ f2f68e79f16f;
runtime verification is pending.
Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
---
drivers/net/ethernet/cadence/macb_main.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index a12aa2124..b6cca55ad 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1922,6 +1922,13 @@ static void macb_tx_restart(struct macb_queue *queue)
spin_lock(&bp->lock);
macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
+ /*
+ * Flush the PCIe posted-write queue so the TSTART doorbell
+ * reliably reaches the MAC. Without this, the write can sit
+ * in the fabric and the MAC never advances, causing a silent
+ * TX stall.
+ */
+ (void)macb_readl(bp, NCR);
spin_unlock(&bp->lock);
out_tx_ptr_unlock:
@@ -2560,6 +2567,11 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
spin_lock(&bp->lock);
macb_tx_lpi_wake(bp);
macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
+ /*
+ * Flush the PCIe posted-write queue; see the comment in
+ * macb_tx_restart() for the reasoning.
+ */
+ (void)macb_readl(bp, NCR);
spin_unlock(&bp->lock);
if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1)
--
2.53.0
^ permalink raw reply related [flat|nested] 5+ messages in thread* [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable in macb_tx_poll
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Lukasz Raczylo
@ 2026-04-24 22:38 ` Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Lukasz Raczylo
2026-04-25 21:48 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
3 siblings, 0 replies; 5+ messages in thread
From: Lukasz Raczylo @ 2026-04-24 22:38 UTC (permalink / raw)
To: netdev
Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel,
linux-arm-kernel, linux-rpi-kernel
macb_tx_poll() runs with TCOMP masked, drains the TX ring, then
calls napi_complete_done() and re-enables TCOMP via IER. An
existing comment in the function notes:
/* Packet completions only seem to propagate to raise
* interrupts when interrupts are enabled at the time, so if
* packets were sent while interrupts were disabled,
* they will not cause another interrupt to be generated when
* interrupts are re-enabled.
*/
and mitigates this by calling macb_tx_complete_pending(), which
inspects driver-visible ring state (descriptor->ctrl, after rmb())
and reschedules NAPI if a completion is observable in memory.
On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the
setup we have in front of us), the descriptor DMA write that sets
TX_USED may not have retired to system memory at the point
macb_tx_complete_pending() runs. The rmb() synchronises the CPU
view of earlier CPU writes; it is not sufficient to retire an
in-flight peripheral DMA write. Under that ordering the in-memory
descriptor can still read TX_USED=0 when the hardware has in fact
completed the frame; the check returns false; NAPI exits; the
quirk above prevents the re-enabled IER from re-firing; the ring
goes quiescent.
Add an explicit ISR read after the IER write. The MMIO read
serves two independent purposes:
(1) It is an architected PCIe read barrier for earlier
peripheral-originated DMA writes on the same path, so a
subsequent macb_tx_complete_pending() observes any TX_USED
write that was in flight at the time of the barrier.
(2) It samples the hardware ISR directly, so a TCOMP bit that
the hardware set while TCOMP was masked is visible here,
independently of whether the descriptor DMA has retired.
If either signal indicates pending work, reschedule NAPI via the
same path as the existing check.
This patch addresses one of three candidate races for the silent
TX stall described in the cover letter. Whether it is sufficient
by itself, or whether it requires the PCIe posted-write flush in
patch 1/3 to cover the observed behaviour, we have not yet
verified at runtime.
Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
---
drivers/net/ethernet/cadence/macb_main.c | 28 +++++++++++++++---------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index b6cca55ad..ea231b1c5 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1973,17 +1973,25 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
if (work_done < budget && napi_complete_done(napi, work_done)) {
queue_writel(queue, IER, MACB_BIT(TCOMP));
- /* Packet completions only seem to propagate to raise
- * interrupts when interrupts are enabled at the time, so if
- * packets were sent while interrupts were disabled,
- * they will not cause another interrupt to be generated when
- * interrupts are re-enabled.
- * Check for this case here to avoid losing a wakeup. This can
- * potentially race with the interrupt handler doing the same
- * actions if an interrupt is raised just after enabling them,
- * but this should be harmless.
+ /*
+ * TCOMP events that fire while the interrupt is masked do
+ * not re-fire when IER is re-enabled. Catch this two ways
+ * to avoid losing a wakeup:
+ *
+ * (1) Read ISR -- catches completions the hardware flagged
+ * but that we did not see as an interrupt. The MMIO
+ * read doubles as a PCIe read barrier, flushing any
+ * in-flight descriptor TX_USED DMA writes into memory.
+ * (2) macb_tx_complete_pending() inspects the ring after
+ * that flush, catching a descriptor whose TX_USED is
+ * now visible as a result of the barrier.
+ *
+ * This can race with the interrupt handler taking the same
+ * path if an interrupt fires just after the IER write;
+ * rescheduling NAPI in that case is harmless.
*/
- if (macb_tx_complete_pending(queue)) {
+ if ((queue_readl(queue, ISR) & MACB_BIT(TCOMP)) ||
+ macb_tx_complete_pending(queue)) {
queue_writel(queue, IDR, MACB_BIT(TCOMP));
macb_queue_isr_clear(bp, queue, MACB_BIT(TCOMP));
netdev_vdbg(bp->dev, "TX poll: packets pending, reschedule\n");
--
2.53.0
^ permalink raw reply related [flat|nested] 5+ messages in thread* [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after TSTART doorbell Lukasz Raczylo
2026-04-24 22:38 ` [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable in macb_tx_poll Lukasz Raczylo
@ 2026-04-24 22:38 ` Lukasz Raczylo
2026-04-25 21:48 ` [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
3 siblings, 0 replies; 5+ messages in thread
From: Lukasz Raczylo @ 2026-04-24 22:38 UTC (permalink / raw)
To: netdev
Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel,
linux-arm-kernel, linux-rpi-kernel
Patches 1/3 and 2/3 address two candidate races that could lead
to a TCOMP completion being missed on PCIe-attached macb
instances. This patch adds a defence-in-depth safety net, in
case a further race remains that we have not identified.
The watchdog is a per-queue delayed_work that runs once per
second. It snapshots queue->tx_tail; if the ring is non-empty
(queue->tx_head != queue->tx_tail) and tx_tail has not advanced
since the previous tick, it calls macb_tx_restart().
No new recovery logic is introduced. macb_tx_restart() already
exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
and verifies that the hardware's TBQP is behind the driver's
head index before re-asserting TSTART. On a healthy ring it is
a no-op at the hardware level; the watchdog only supplies the
missing trigger.
On a healthy queue the per-tick cost is one spin_lock_irqsave()
/ spin_unlock_irqrestore() and one branch. The delayed_work is
only scheduled between macb_open() and macb_close(), and is
cancelled synchronously on close.
Context for submission: on our 24-node Raspberry Pi 5 fleet,
before this series, an out-of-band user-space watchdog
(monitoring tx_packets from /sys/class/net/.../statistics and
toggling the link down/up when it froze) was required to keep
nodes usable. We include this kernel-side watchdog as a cleaner
in-kernel equivalent for any residual stall that patches 1 and
2 do not cover. We are willing to drop this patch if the view
is that 1 and 2 should stand alone.
Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
---
drivers/net/ethernet/cadence/macb.h | 5 ++
drivers/net/ethernet/cadence/macb_main.c | 59 ++++++++++++++++++++++++
2 files changed, 64 insertions(+)
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 2de56017e..9115f2b47 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -1278,6 +1278,11 @@ struct macb_queue {
dma_addr_t tx_ring_dma;
struct work_struct tx_error_task;
bool txubr_pending;
+
+ /* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c */
+ struct delayed_work tx_stall_watchdog_work;
+ unsigned int tx_stall_last_tail;
+
struct napi_struct napi_tx;
dma_addr_t rx_ring_dma;
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index ea231b1c5..ea2306ef7 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -2002,6 +2002,59 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
return work_done;
}
+#define MACB_TX_STALL_INTERVAL_MS 1000
+
+/*
+ * TX stall watchdog.
+ *
+ * Defence-in-depth against lost TCOMP interrupts. macb already has a
+ * recovery chain (tx_pending -> txubr_pending -> macb_tx_restart())
+ * that fires on TCOMP; if TCOMP itself is lost the TX ring stalls
+ * silently until something else kicks TSTART. This watchdog runs
+ * once per second per queue, snapshots tx_tail, and calls
+ * macb_tx_restart() if the ring is non-empty and tx_tail has not
+ * advanced since the previous tick.
+ *
+ * macb_tx_restart() already checks the hardware's TBQP against the
+ * driver's head index before re-asserting TSTART, so on a healthy
+ * ring this is a no-op at the hardware level. The watchdog only
+ * adds the missing trigger.
+ */
+static void macb_tx_stall_watchdog(struct work_struct *work)
+{
+ struct macb_queue *queue = container_of(to_delayed_work(work),
+ struct macb_queue,
+ tx_stall_watchdog_work);
+ struct macb *bp = queue->bp;
+ unsigned int cur_tail, cur_head;
+ bool stalled = false;
+ unsigned long flags;
+
+ if (!netif_running(bp->dev))
+ return;
+
+ spin_lock_irqsave(&queue->tx_ptr_lock, flags);
+ cur_tail = queue->tx_tail;
+ cur_head = queue->tx_head;
+ if (cur_head != cur_tail &&
+ cur_tail == queue->tx_stall_last_tail)
+ stalled = true;
+ else
+ queue->tx_stall_last_tail = cur_tail;
+ spin_unlock_irqrestore(&queue->tx_ptr_lock, flags);
+
+ if (stalled) {
+ netdev_warn_once(bp->dev,
+ "TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n",
+ (unsigned int)(queue - bp->queues),
+ cur_tail, cur_head);
+ macb_tx_restart(queue);
+ }
+
+ schedule_delayed_work(&queue->tx_stall_watchdog_work,
+ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
+}
+
static void macb_hresp_error_task(struct work_struct *work)
{
struct macb *bp = from_work(bp, work, hresp_err_bh_work);
@@ -3190,6 +3243,9 @@ static int macb_open(struct net_device *dev)
for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
napi_enable(&queue->napi_rx);
napi_enable(&queue->napi_tx);
+ queue->tx_stall_last_tail = queue->tx_tail;
+ schedule_delayed_work(&queue->tx_stall_watchdog_work,
+ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
}
macb_init_hw(bp);
@@ -3240,6 +3296,7 @@ static int macb_close(struct net_device *dev)
for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
napi_disable(&queue->napi_rx);
napi_disable(&queue->napi_tx);
+ cancel_delayed_work_sync(&queue->tx_stall_watchdog_work);
netdev_tx_reset_queue(netdev_get_tx_queue(dev, q));
}
@@ -4802,6 +4859,8 @@ static int macb_init_dflt(struct platform_device *pdev)
}
INIT_WORK(&queue->tx_error_task, macb_tx_error_task);
+ INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work,
+ macb_tx_stall_watchdog);
q++;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1
2026-04-24 22:38 [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 Lukasz Raczylo
` (2 preceding siblings ...)
2026-04-24 22:38 ` [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as defence-in-depth safety net Lukasz Raczylo
@ 2026-04-25 21:48 ` Lukasz Raczylo
3 siblings, 0 replies; 5+ messages in thread
From: Lukasz Raczylo @ 2026-04-25 21:48 UTC (permalink / raw)
To: netdev
Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel,
linux-arm-kernel
A follow-up runtime data point on this series.
Fleet state at 2026-04-25 21:46 UTC:
* Patched uptime (since staggered rollout 2026-04-24 18:10-19:20 UTC):
- shortest: 26h 26m (last master upgraded)
- longest: 27h 34m (canary)
- cumulative across 24 nodes: ~651 node-hours
* Macb-attributable event counts (out-of-band userspace watchdog;
the [tx-stall] detector watches /sys/class/net/end0/statistics/
tx_packets + qdisc backlog every 1 s and would have fired
ip link down/up if any node's TX path froze):
- RECOVER trigger=tx-stall (actual stalls caught): 0
- partial [tx-stall] markers (transient 1 s freezes): 0
* Separately: 40 RECOVER events with trigger=ping fired in this
window across the fleet, attributable to a brief upstream-network
outage (gateway / switch event); each node simultaneously lost ping
to gateway, VIP, and NAS within seconds of each other, then
recovered. These are unrelated to the macb hang the patch series
targets — distinguishing them from a real TX stall is exactly what
the trigger= tag in the watchdog log is for.
At the pre-patch rate referenced in the cover letter (50 stalls in
95 node-hours observed in our 2026-04-24 14:00-18:10 UTC reference
window, ~0.5 per node-hour), the projected stall count in
651 node-hours is on the order of 342;
observed is 0.
Same observability runs forward; will reply again after a full week
of uptime unless something changes.
--
Lukasz Raczylo
^ permalink raw reply [flat|nested] 5+ messages in thread