* [PATCH net] net: macb: Shuffle the tx ring before enabling tx
@ 2026-03-05 14:32 Kevin Hao
2026-03-07 2:56 ` [net] " Jakub Kicinski
0 siblings, 1 reply; 3+ messages in thread
From: Kevin Hao @ 2026-03-05 14:32 UTC (permalink / raw)
To: netdev
Cc: Kevin Hao, Quanyang Wang, stable, Nicolas Ferre, Claudiu Beznea,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Russell King
Quanyang observed that when using an NFS rootfs on an AMD ZynqMP board,
the rootfs may take an extended time to recover after resuming from
suspend. Investigation traced the issue to a problem in the macb driver.
According to the Zynq UltraScale+ TRM [1], when transmit is disabled,
the transmit buffer queue pointer resets to point to the address
specified by the transmit buffer queue base address register.
In the current implementation, the code merely resets `queue->tx_head`
and `queue->tx_tail` to '0'. This approach presents several issues:
- Packets already queued in the tx ring are silently lost,
leading to memory leaks since the associated skbs cannot be released.
- Concurrent write access to `queue->tx_head` and `queue->tx_tail` may
occur from `macb_tx_poll()` or `macb_start_xmit()` when these values
are reset to '0'.
- Transmission may become stuck on a packet that has already been sent
  out (its 'TX_USED' bit is set) but has not yet been processed. Because
  'queue->tx_head' and 'queue->tx_tail' are overwritten,
  `macb_tx_poll()` incorrectly assumes there are no packets to handle
  (`queue->tx_head == queue->tx_tail`), and the situation only clears
  when a new packet lands at this position. This is the root cause of
  the prolonged recovery time observed for the NFS root filesystem.
To resolve this issue, shuffle the tx ring and tx skb array so that
the first unsent packet is positioned at the start of the tx ring.
Additionally, ensure that updates to `queue->tx_head` and
`queue->tx_tail` are properly protected with the appropriate lock.
[1] https://docs.amd.com/v/u/en-US/ug1085-zynq-ultrascale-trm
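
For reference, the juggling (GCD-cycle) rotation used by this patch can
be sketched standalone on a plain int array. This is illustrative only,
not the driver code: it performs gcd(size, shift) independent cycles,
moving each element exactly once, so the rotation is O(size) moves with
O(1) extra space:

```c
#include <stddef.h>

static size_t gcd_sz(size_t a, size_t b)
{
	while (b) {
		size_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Left-rotate buf[0..size-1] by shift positions, in place. */
static void rotate_left(int *buf, size_t size, size_t shift)
{
	size_t cycles, i, curr, next;
	int tmp;

	shift %= size;
	if (!shift)
		return;

	cycles = gcd_sz(size, shift);
	for (i = 0; i < cycles; i++) {
		/* Save the cycle's starting element, then pull each
		 * successor forward by shift until the cycle closes.
		 */
		tmp = buf[i];
		curr = i;
		next = (curr + shift) % size;
		while (next != i) {
			buf[curr] = buf[next];
			curr = next;
			next = (curr + shift) % size;
		}
		buf[curr] = tmp;
	}
}
```

The driver applies the same traversal, except that each "element" is a
(descriptor, tx_skb) pair and the TX_WRAP bit must follow the last ring
slot rather than the descriptor that carried it.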
Fixes: bf9cf80cab81 ("net: macb: Fix tx/rx malfunction after phy link down and up")
Reported-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: stable@vger.kernel.org
---
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
---
drivers/net/ethernet/cadence/macb_main.c | 89 ++++++++++++++++++++++++++++++--
1 file changed, 86 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 17f0ad3d7a0924a7dc2fc0a13505aff7d2499ffa..fce144a1830823da9821ad3245784e62fed97e33 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -36,6 +36,7 @@
#include <linux/tcp.h>
#include <linux/types.h>
#include <linux/udp.h>
+#include <linux/gcd.h>
#include <net/pkt_sched.h>
#include "macb.h"
@@ -684,6 +685,88 @@ static void macb_mac_link_down(struct phylink_config *config, unsigned int mode,
netif_tx_stop_all_queues(ndev);
}
+/* Use juggling algorithm to left rotate tx ring and tx skb array */
+static void gem_shuffle_tx_one_ring(struct macb_queue *queue)
+{
+ unsigned int i, head, tail, count, size, cycles, shift, curr, next;
+ struct macb_dma_desc desc, *desc_curr, *desc_next;
+ struct macb_tx_skb tx_skb, *skb_curr, *skb_next;
+ struct macb *bp = queue->bp;
+ unsigned long flags;
+
+ spin_lock_irqsave(&queue->tx_ptr_lock, flags);
+ head = queue->tx_head;
+ tail = queue->tx_tail;
+ size = bp->tx_ring_size;
+ count = CIRC_CNT(head, tail, size);
+
+ if (!(tail % size))
+ goto unlock;
+
+ if (!count) {
+ queue->tx_head = 0;
+ queue->tx_tail = 0;
+ goto unlock;
+ }
+
+ shift = tail % size;
+ cycles = gcd(size, shift);
+
+ for (i = 0; i < cycles; i++) {
+ memcpy(&desc, macb_tx_desc(queue, i), sizeof(struct macb_dma_desc));
+ memcpy(&tx_skb, macb_tx_skb(queue, i), sizeof(struct macb_tx_skb));
+
+ curr = i;
+ next = (curr + shift) % size;
+
+ while (next != i) {
+ desc_curr = macb_tx_desc(queue, curr);
+ desc_next = macb_tx_desc(queue, next);
+
+ memcpy(desc_curr, desc_next, sizeof(struct macb_dma_desc));
+
+ if (next == bp->tx_ring_size - 1)
+ desc_curr->ctrl &= ~MACB_BIT(TX_WRAP);
+ if (curr == bp->tx_ring_size - 1)
+ desc_curr->ctrl |= MACB_BIT(TX_WRAP);
+
+ skb_curr = macb_tx_skb(queue, curr);
+ skb_next = macb_tx_skb(queue, next);
+ memcpy(skb_curr, skb_next, sizeof(struct macb_tx_skb));
+
+ curr = next;
+ next = (curr + shift) % size;
+ }
+
+ desc_curr = macb_tx_desc(queue, curr);
+ memcpy(desc_curr, &desc, sizeof(struct macb_dma_desc));
+ if (i == bp->tx_ring_size - 1)
+ desc_curr->ctrl &= ~MACB_BIT(TX_WRAP);
+ if (curr == bp->tx_ring_size - 1)
+ desc_curr->ctrl |= MACB_BIT(TX_WRAP);
+ memcpy(macb_tx_skb(queue, curr), &tx_skb, sizeof(struct macb_tx_skb));
+ }
+
+ queue->tx_head = count;
+ queue->tx_tail = 0;
+
+ /* Make descriptor updates visible to hardware */
+ wmb();
+
+unlock:
+ spin_unlock_irqrestore(&queue->tx_ptr_lock, flags);
+}
+
+/* Rotate the queue so that the tail is at index 0 */
+static void gem_shuffle_tx_rings(struct macb *bp)
+{
+ struct macb_queue *queue;
+ int q;
+
+ for (q = 0, queue = bp->queues; q < bp->num_queues; q++, queue++)
+ gem_shuffle_tx_one_ring(queue);
+}
+
static void macb_mac_link_up(struct phylink_config *config,
struct phy_device *phy,
unsigned int mode, phy_interface_t interface,
@@ -722,8 +805,6 @@ static void macb_mac_link_up(struct phylink_config *config,
ctrl |= MACB_BIT(PAE);
for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
- queue->tx_head = 0;
- queue->tx_tail = 0;
queue_writel(queue, IER,
bp->rx_intr_mask | MACB_TX_INT_FLAGS | MACB_BIT(HRESP));
}
@@ -737,8 +818,10 @@ static void macb_mac_link_up(struct phylink_config *config,
spin_unlock_irqrestore(&bp->lock, flags);
- if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC))
+ if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
macb_set_tx_clk(bp, speed);
+ gem_shuffle_tx_rings(bp);
+ }
/* Enable Rx and Tx; Enable PTP unicast */
ctrl = macb_readl(bp, NCR);
---
base-commit: fc7b1a72c6cd5cbbd989c6c32a6486e3e4e3594d
change-id: 20260214-zynqmp-74d35f3862f8
Best regards,
--
Kevin Hao <haokexin@gmail.com>
* Re: [net] net: macb: Shuffle the tx ring before enabling tx
2026-03-05 14:32 [PATCH net] net: macb: Shuffle the tx ring before enabling tx Kevin Hao
@ 2026-03-07 2:56 ` Jakub Kicinski
2026-03-07 6:51 ` Kevin Hao
0 siblings, 1 reply; 3+ messages in thread
From: Jakub Kicinski @ 2026-03-07 2:56 UTC (permalink / raw)
To: haokexin
Cc: Jakub Kicinski, edumazet, claudiu.beznea, davem, stable, netdev,
andrew+netdev, linux, pabeni, nicolas.ferre, quanyang.wang
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net: macb: Shuffle the tx ring before enabling tx
Quanyang observed that when using an NFS rootfs on an AMD ZynqMp board,
the rootfs may take an extended time to recover after a suspend. Upon
investigation, it was determined that the issue originates from a problem
in the macb driver. To resolve this issue, shuffle the tx ring and tx skb
array so that the first unsent packet is positioned at the start of the tx
ring.
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 5bc35f651ebd..3ceeff22c12f 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
[ ... ]
> @@ -668,6 +669,88 @@ static void macb_mac_link_down(struct phylink_config *config, unsigned int mode,
> netif_tx_stop_all_queues(ndev);
> }
>
> +/* Use juggling algorithm to left rotate tx ring and tx skb array */
> +static void gem_shuffle_tx_one_ring(struct macb_queue *queue)
> +{
> + unsigned int i, head, tail, count, size, cycles, shift, curr, next;
> + struct macb_dma_desc desc, *desc_curr, *desc_next;
> + struct macb_tx_skb tx_skb, *skb_curr, *skb_next;
> + struct macb *bp = queue->bp;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&queue->tx_ptr_lock, flags);
> + head = queue->tx_head;
> + tail = queue->tx_tail;
> + size = bp->tx_ring_size;
> + count = CIRC_CNT(head, tail, size);
> +
> + if (!(tail % size))
> + goto unlock;
> +
> + if (!count) {
> + queue->tx_head = 0;
> + queue->tx_tail = 0;
> + goto unlock;
> + }
> +
> + shift = tail % size;
> + cycles = gcd(size, shift);
> +
> + for (i = 0; i < cycles; i++) {
> + memcpy(&desc, macb_tx_desc(queue, i), sizeof(struct macb_dma_desc));
> + memcpy(&tx_skb, macb_tx_skb(queue, i), sizeof(struct macb_tx_skb));
> +
> + curr = i;
> + next = (curr + shift) % size;
> +
> + while (next != i) {
> + desc_curr = macb_tx_desc(queue, curr);
> + desc_next = macb_tx_desc(queue, next);
> +
> + memcpy(desc_curr, desc_next, sizeof(struct macb_dma_desc));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Does this memcpy corrupt the descriptor fields on systems with 64-bit DMA?
The macb_tx_desc() macro uses macb_adj_dma_desc_idx() to properly space the
descriptor pointers in the ring according to the descriptor size reported by
macb_dma_desc_get_size(bp), which can be 8 bytes (base), 16 bytes (with
DMA64), or 24 bytes (with DMA64 + PTP). However, all three memcpy calls in
this function use sizeof(struct macb_dma_desc), which is only 8 bytes and
covers only the addr and ctrl fields.
On systems where macb_dma64(bp) == true, such as the AMD ZynqMp platform
this patch targets, each descriptor has an additional macb_dma_desc_64
struct containing the upper 32 bits of the DMA address (addrh field).
When the shuffle copies descriptors using only 8 bytes, it moves the lower
32-bit address (desc->addr) and control word (desc->ctrl) to new positions
but leaves the upper 32-bit address (addrh) unmoved in the old location.
After the shuffle, hardware reads a composite 64-bit DMA address from
mismatched descriptor slots: the old slot's addrh combined with the new
slot's addr.
This would cause DMA to a corrupted physical address, potentially leading
to memory corruption or bus errors.
The driver uses macb_dma_desc_get_size(bp) elsewhere for operations on the
descriptor ring:
drivers/net/ethernet/cadence/macb_main.c:macb_alloc_consistent():

	size = bp->tx_ring_size * macb_dma_desc_get_size(bp);
	queue->tx_ring = dma_alloc_coherent(&bp->pdev->dev, size,
					    &queue->tx_ring_dma, GFP_KERNEL);

and in macb_free_consistent():

	size = bp->tx_ring_size * macb_dma_desc_get_size(bp);
	dma_free_coherent(&bp->pdev->dev, size, queue->tx_ring,
			  queue->tx_ring_dma);
Should the memcpy size here use macb_dma_desc_get_size(bp) instead of
sizeof(struct macb_dma_desc)?
The same issue appears at two other locations in this function (the initial
memcpy saving desc at the start of the loop, and the final memcpy restoring
desc at the end). Additionally, the stack-allocated "struct macb_dma_desc
desc" temporary buffer may be too small to hold the extended descriptor.
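
The failure mode can be modeled outside the driver. In this sketch,
struct dma_desc64 and copy_slot() are stand-ins for the DMA64 slot
layout and a size-aware copy (using the runtime slot size the way
macb_dma_desc_get_size(bp) would report it), not the driver's actual
structures:

```c
#include <stdint.h>
#include <string.h>

/* Base descriptor: what sizeof(struct macb_dma_desc) covers (8 bytes). */
struct dma_desc {
	uint32_t addr;	/* low 32 bits of the DMA address */
	uint32_t ctrl;
};

/* Hypothetical DMA64 slot: base descriptor plus the high address word,
 * as on platforms where the per-slot footprint is 16 bytes.
 */
struct dma_desc64 {
	struct dma_desc base;
	uint32_t addrh;	/* high 32 bits of the DMA address */
	uint32_t resvd;
};

/* Copy one ring slot using the runtime slot size, so addrh (when
 * present) travels with addr/ctrl instead of staying behind in the
 * old slot. Copying only sizeof(struct dma_desc) here would leave
 * the destination slot's stale addrh paired with the new addr.
 */
static void copy_slot(uint8_t *ring, unsigned int dst, unsigned int src,
		      size_t desc_size)
{
	memcpy(ring + dst * desc_size, ring + src * desc_size, desc_size);
}
```

With an 8-byte copy, the hardware would read a composite address made of
the destination slot's old addrh and the source slot's addr, which is
exactly the mismatch described above.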
> +
> + if (next == bp->tx_ring_size - 1)
> + desc_curr->ctrl &= ~MACB_BIT(TX_WRAP);
> + if (curr == bp->tx_ring_size - 1)
> + desc_curr->ctrl |= MACB_BIT(TX_WRAP);
> +
> + skb_curr = macb_tx_skb(queue, curr);
> + skb_next = macb_tx_skb(queue, next);
> + memcpy(skb_curr, skb_next, sizeof(struct macb_tx_skb));
> +
> + curr = next;
> + next = (curr + shift) % size;
> + }
> +
> + desc_curr = macb_tx_desc(queue, curr);
> + memcpy(desc_curr, &desc, sizeof(struct macb_dma_desc));
> + if (i == bp->tx_ring_size - 1)
> + desc_curr->ctrl &= ~MACB_BIT(TX_WRAP);
> + if (curr == bp->tx_ring_size - 1)
> + desc_curr->ctrl |= MACB_BIT(TX_WRAP);
> + memcpy(macb_tx_skb(queue, curr), &tx_skb, sizeof(struct macb_tx_skb));
> + }
> +
> + queue->tx_head = count;
> + queue->tx_tail = 0;
> +
> + /* Make descriptor updates visible to hardware */
> + wmb();
> +
> +unlock:
> + spin_unlock_irqrestore(&queue->tx_ptr_lock, flags);
> +}
[ ... ]
--
pw-bot: cr
* Re: [net] net: macb: Shuffle the tx ring before enabling tx
2026-03-07 2:56 ` [net] " Jakub Kicinski
@ 2026-03-07 6:51 ` Kevin Hao
0 siblings, 0 replies; 3+ messages in thread
From: Kevin Hao @ 2026-03-07 6:51 UTC (permalink / raw)
To: Jakub Kicinski
Cc: edumazet, claudiu.beznea, davem, stable, netdev, andrew+netdev,
linux, pabeni, nicolas.ferre, quanyang.wang
On Fri, Mar 06, 2026 at 06:56:38PM -0800, Jakub Kicinski wrote:
> Does this memcpy corrupt the descriptor fields on systems with 64-bit DMA?
> The macb_tx_desc() macro uses macb_adj_dma_desc_idx() to properly space the
> descriptor pointers in the ring according to the descriptor size reported by
> macb_dma_desc_get_size(bp), which can be 8 bytes (base), 16 bytes (with
> DMA64), or 24 bytes (with DMA64 + PTP). However, all three memcpy calls in
> this function use sizeof(struct macb_dma_desc), which is only 8 bytes and
> covers only the addr and ctrl fields.
>
> On systems where macb_dma64(bp) == true, such as the AMD ZynqMp platform
> this patch targets, each descriptor has an additional macb_dma_desc_64
> struct containing the upper 32 bits of the DMA address (addrh field).
>
> When the shuffle copies descriptors using only 8 bytes, it moves the lower
> 32-bit address (desc->addr) and control word (desc->ctrl) to new positions
> but leaves the upper 32-bit address (addrh) unmoved in the old location.
> After the shuffle, hardware reads a composite 64-bit DMA address from
> mismatched descriptor slots: the old slot's addrh combined with the new
> slot's addr.
This is indeed an issue. Coincidentally, all the upper 32-bit addresses in the
tx ring were identical, which is why it went undetected during testing.
My apologies for the oversight; I will address it in v2.
Thanks,
Kevin