netdev.vger.kernel.org archive mirror
* [PATCH net-next v3 0/4] net: stmmac: RX performance improvement
@ 2025-01-15  3:27 Furong Xu
  2025-01-15  3:27 ` [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Furong Xu @ 2025-01-15  3:27 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	Furong Xu

This series improves RX performance considerably: a ~40% TCP RX throughput
boost has been observed with DWXGMAC CORE 3.20a running on Cortex-A65 CPUs,
from 2.18 Gbits/sec to 3.06 Gbits/sec.

---
Changes in v3:
  1. Convert prefetch() to net_prefetch() to get better performance (Joe Damato)

  v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=924912&state=%2A&archive=both

Changes in v2:
  1. No cache prefetch for frags (Alexander Lobakin)
  2. Fix code style warning reported by netdev CI on Patchwork

  v1: https://patchwork.kernel.org/project/netdevbpf/list/?series=924103&state=%2A&archive=both
---

Furong Xu (4):
  net: stmmac: Switch to zero-copy in non-XDP RX path
  net: stmmac: Set page_pool_params.max_len to a precise size
  net: stmmac: Optimize cache prefetch in RX path
  net: stmmac: Convert prefetch() to net_prefetch() for received frames

 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 34 +++++++++++--------
 .../net/ethernet/stmicro/stmmac/stmmac_xdp.h  |  1 -
 3 files changed, 21 insertions(+), 15 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-15  3:27 [PATCH net-next v3 0/4] net: stmmac: RX performance improvement Furong Xu
@ 2025-01-15  3:27 ` Furong Xu
  2025-01-15 16:58   ` Larysa Zaremba
                     ` (2 more replies)
  2025-01-15  3:27 ` [PATCH net-next v3 2/4] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 35+ messages in thread
From: Furong Xu @ 2025-01-15  3:27 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	Furong Xu

Avoid memcpy in the non-XDP RX path by marking all allocated SKBs to
be recycled by the upper network stack.

This patch brings a ~11.5% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.18 Gbits/sec to 2.43 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 26 ++++++++++++-------
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index e8dbce20129c..f05cae103d83 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -126,6 +126,7 @@ struct stmmac_rx_queue {
 	unsigned int cur_rx;
 	unsigned int dirty_rx;
 	unsigned int buf_alloc_num;
+	unsigned int napi_skb_frag_size;
 	dma_addr_t dma_rx_phy;
 	u32 rx_tail_addr;
 	unsigned int state_saved;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index acd6994c1764..1d98a5e8c98c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1341,7 +1341,7 @@ static unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
 	if (stmmac_xdp_is_enabled(priv))
 		return XDP_PACKET_HEADROOM;
 
-	return 0;
+	return NET_SKB_PAD;
 }
 
 static int stmmac_set_bfsize(int mtu, int bufsize)
@@ -2040,17 +2040,21 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
 	struct stmmac_channel *ch = &priv->channel[queue];
 	bool xdp_prog = stmmac_xdp_is_enabled(priv);
 	struct page_pool_params pp_params = { 0 };
-	unsigned int num_pages;
+	unsigned int dma_buf_sz_pad, num_pages;
 	unsigned int napi_id;
 	int ret;
 
+	dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
+
 	rx_q->queue_index = queue;
 	rx_q->priv_data = priv;
+	rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
 
 	pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 	pp_params.pool_size = dma_conf->dma_rx_size;
-	num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
-	pp_params.order = ilog2(num_pages);
+	pp_params.order = order_base_2(num_pages);
 	pp_params.nid = dev_to_node(priv->device);
 	pp_params.dev = priv->device;
 	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
@@ -5582,22 +5586,26 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 		}
 
 		if (!skb) {
+			unsigned int head_pad_len;
+
 			/* XDP program may expand or reduce tail */
 			buf1_len = ctx.xdp.data_end - ctx.xdp.data;
 
-			skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
+			skb = napi_build_skb(page_address(buf->page),
+					     rx_q->napi_skb_frag_size);
 			if (!skb) {
+				page_pool_recycle_direct(rx_q->page_pool,
+							 buf->page);
 				rx_dropped++;
 				count++;
 				goto drain_data;
 			}
 
 			/* XDP program may adjust header */
-			skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
+			head_pad_len = ctx.xdp.data - ctx.xdp.data_hard_start;
+			skb_reserve(skb, head_pad_len);
 			skb_put(skb, buf1_len);
-
-			/* Data payload copied into SKB, page ready for recycle */
-			page_pool_recycle_direct(rx_q->page_pool, buf->page);
+			skb_mark_for_recycle(skb);
 			buf->page = NULL;
 		} else if (buf1_len) {
 			dma_sync_single_for_cpu(priv->device, buf->addr,
-- 
2.34.1



* [PATCH net-next v3 2/4] net: stmmac: Set page_pool_params.max_len to a precise size
  2025-01-15  3:27 [PATCH net-next v3 0/4] net: stmmac: RX performance improvement Furong Xu
  2025-01-15  3:27 ` [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
@ 2025-01-15  3:27 ` Furong Xu
  2025-01-15 10:07   ` Yanteng Si
  2025-01-15  3:27 ` [PATCH net-next v3 3/4] net: stmmac: Optimize cache prefetch in RX path Furong Xu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Furong Xu @ 2025-01-15  3:27 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	Furong Xu

The DMA engine will never write more than dma_buf_sz bytes of a received
frame into a page buffer; the remaining space is unused, or used
exclusively by the CPU.
Setting page_pool_params.max_len to almost the full size of the page(s)
gains nothing and only wastes CPU cycles on cache maintenance.

For a standard MTU of 1500, dma_buf_sz is set to 1536, and this
patch brings a ~16.9% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.43 Gbits/sec to 2.84 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h  | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 1d98a5e8c98c..811e2d372abf 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2059,7 +2059,7 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
 	pp_params.dev = priv->device;
 	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	pp_params.offset = stmmac_rx_offset(priv);
-	pp_params.max_len = STMMAC_MAX_RX_BUF_SIZE(num_pages);
+	pp_params.max_len = dma_conf->dma_buf_sz;
 
 	rx_q->page_pool = page_pool_create(&pp_params);
 	if (IS_ERR(rx_q->page_pool)) {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h b/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
index 896dc987d4ef..77ce8cfbe976 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
@@ -4,7 +4,6 @@
 #ifndef _STMMAC_XDP_H_
 #define _STMMAC_XDP_H_
 
-#define STMMAC_MAX_RX_BUF_SIZE(num)	(((num) * PAGE_SIZE) - XDP_PACKET_HEADROOM)
 #define STMMAC_RX_DMA_ATTR	(DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
 
 int stmmac_xdp_setup_pool(struct stmmac_priv *priv, struct xsk_buff_pool *pool,
-- 
2.34.1



* [PATCH net-next v3 3/4] net: stmmac: Optimize cache prefetch in RX path
  2025-01-15  3:27 [PATCH net-next v3 0/4] net: stmmac: RX performance improvement Furong Xu
  2025-01-15  3:27 ` [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
  2025-01-15  3:27 ` [PATCH net-next v3 2/4] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
@ 2025-01-15  3:27 ` Furong Xu
  2025-01-15 16:24   ` Yanteng Si
  2025-01-15  3:27 ` [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames Furong Xu
  2025-01-16 11:40 ` [PATCH net-next v3 0/4] net: stmmac: RX performance improvement patchwork-bot+netdevbpf
  4 siblings, 1 reply; 35+ messages in thread
From: Furong Xu @ 2025-01-15  3:27 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	Furong Xu

The current code prefetches cache lines for the received frame first and
only then calls dma_sync_single_for_cpu() against it; this is wrong,
because the sync may invalidate the lines just prefetched. Cache prefetch
should be triggered after dma_sync_single_for_cpu().

This patch brings a ~2.8% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.84 Gbits/sec to 2.92 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 811e2d372abf..ad928e8e21a9 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5508,10 +5508,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 		/* Buffer is good. Go on. */
 
-		prefetch(page_address(buf->page) + buf->page_offset);
-		if (buf->sec_page)
-			prefetch(page_address(buf->sec_page));
-
 		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
 		len += buf1_len;
 		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
@@ -5533,6 +5529,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 			dma_sync_single_for_cpu(priv->device, buf->addr,
 						buf1_len, dma_dir);
+			prefetch(page_address(buf->page) + buf->page_offset);
 
 			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
 			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),
-- 
2.34.1



* [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames
  2025-01-15  3:27 [PATCH net-next v3 0/4] net: stmmac: RX performance improvement Furong Xu
                   ` (2 preceding siblings ...)
  2025-01-15  3:27 ` [PATCH net-next v3 3/4] net: stmmac: Optimize cache prefetch in RX path Furong Xu
@ 2025-01-15  3:27 ` Furong Xu
  2025-01-15 16:33   ` Yanteng Si
                     ` (2 more replies)
  2025-01-16 11:40 ` [PATCH net-next v3 0/4] net: stmmac: RX performance improvement patchwork-bot+netdevbpf
  4 siblings, 3 replies; 35+ messages in thread
From: Furong Xu @ 2025-01-15  3:27 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	Furong Xu

DMA descriptors are at most 32 bytes, so a single prefetch() covers them.
Use net_prefetch() for received frames, and keep prefetch() for
descriptors.

This patch brings a ~4.8% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.92 Gbits/sec to 3.06 Gbits/sec.

Suggested-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ad928e8e21a9..49b41148d594 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5529,7 +5529,8 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 			dma_sync_single_for_cpu(priv->device, buf->addr,
 						buf1_len, dma_dir);
-			prefetch(page_address(buf->page) + buf->page_offset);
+			net_prefetch(page_address(buf->page) +
+				     buf->page_offset);
 
 			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
 			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),
-- 
2.34.1



* Re: [PATCH net-next v3 2/4] net: stmmac: Set page_pool_params.max_len to a precise size
  2025-01-15  3:27 ` [PATCH net-next v3 2/4] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
@ 2025-01-15 10:07   ` Yanteng Si
  0 siblings, 0 replies; 35+ messages in thread
From: Yanteng Si @ 2025-01-15 10:07 UTC (permalink / raw)
  To: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On 2025/1/15 11:27, Furong Xu wrote:
> DMA engine will always write no more than dma_buf_sz bytes of a received
> frame into a page buffer, the remaining spaces are unused or used by CPU
> exclusively.
> Setting page_pool_params.max_len to almost the full size of page(s) helps
> nothing more, but wastes more CPU cycles on cache maintenance.
> 
> For a standard MTU of 1500, then dma_buf_sz is assigned to 1536, and this
> patch brings ~16.9% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, from 2.43 Gbits/sec increased to 2.84 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>

Thanks,
Yanteng

> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
>   drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h  | 1 -
>   2 files changed, 1 insertion(+), 2 deletions(-)




* Re: [PATCH net-next v3 3/4] net: stmmac: Optimize cache prefetch in RX path
  2025-01-15  3:27 ` [PATCH net-next v3 3/4] net: stmmac: Optimize cache prefetch in RX path Furong Xu
@ 2025-01-15 16:24   ` Yanteng Si
  0 siblings, 0 replies; 35+ messages in thread
From: Yanteng Si @ 2025-01-15 16:24 UTC (permalink / raw)
  To: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On 1/15/25 11:27, Furong Xu wrote:
> Current code prefetches cache lines for the received frame first, and
> then dma_sync_single_for_cpu() against this frame, this is wrong.
> Cache prefetch should be triggered after dma_sync_single_for_cpu().
> 
> This patch brings ~2.8% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, 2.84 Gbits/sec increased to 2.92 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>

Thanks,
Yanteng
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 811e2d372abf..ad928e8e21a9 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5508,10 +5508,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>   
>   		/* Buffer is good. Go on. */
>   
> -		prefetch(page_address(buf->page) + buf->page_offset);
> -		if (buf->sec_page)
> -			prefetch(page_address(buf->sec_page));
> -
>   		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
>   		len += buf1_len;
>   		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
> @@ -5533,6 +5529,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>   
>   			dma_sync_single_for_cpu(priv->device, buf->addr,
>   						buf1_len, dma_dir);
> +			prefetch(page_address(buf->page) + buf->page_offset);
>   
>   			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
>   			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),



* Re: [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames
  2025-01-15  3:27 ` [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames Furong Xu
@ 2025-01-15 16:33   ` Yanteng Si
  2025-01-15 16:35   ` Larysa Zaremba
  2025-01-15 17:35   ` Joe Damato
  2 siblings, 0 replies; 35+ messages in thread
From: Yanteng Si @ 2025-01-15 16:33 UTC (permalink / raw)
  To: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On 1/15/25 11:27, Furong Xu wrote:
> The size of DMA descriptors is 32 bytes at most.
> net_prefetch() for received frames, and keep prefetch() for descriptors.
> 
> This patch brings ~4.8% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, 2.92 Gbits/sec increased to 3.06 Gbits/sec.
> 
> Suggested-by: Joe Damato <jdamato@fastly.com>
> Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>

Thanks,
Yanteng
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index ad928e8e21a9..49b41148d594 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5529,7 +5529,8 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>   
>   			dma_sync_single_for_cpu(priv->device, buf->addr,
>   						buf1_len, dma_dir);
> -			prefetch(page_address(buf->page) + buf->page_offset);
> +			net_prefetch(page_address(buf->page) +
> +				     buf->page_offset);
>   
>   			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
>   			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),



* Re: [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames
  2025-01-15  3:27 ` [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames Furong Xu
  2025-01-15 16:33   ` Yanteng Si
@ 2025-01-15 16:35   ` Larysa Zaremba
  2025-01-15 17:35   ` Joe Damato
  2 siblings, 0 replies; 35+ messages in thread
From: Larysa Zaremba @ 2025-01-15 16:35 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On Wed, Jan 15, 2025 at 11:27:05AM +0800, Furong Xu wrote:
> The size of DMA descriptors is 32 bytes at most.
> net_prefetch() for received frames, and keep prefetch() for descriptors.
> 
> This patch brings ~4.8% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, 2.92 Gbits/sec increased to 3.06 Gbits/sec.
> 
> Suggested-by: Joe Damato <jdamato@fastly.com>
> Signed-off-by: Furong Xu <0x1207@gmail.com>

Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>

> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index ad928e8e21a9..49b41148d594 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5529,7 +5529,8 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  
>  			dma_sync_single_for_cpu(priv->device, buf->addr,
>  						buf1_len, dma_dir);
> -			prefetch(page_address(buf->page) + buf->page_offset);
> +			net_prefetch(page_address(buf->page) +
> +				     buf->page_offset);
>  
>  			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
>  			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),
> -- 
> 2.34.1
> 
> 


* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-15  3:27 ` [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
@ 2025-01-15 16:58   ` Larysa Zaremba
  2025-01-16  2:05   ` Yanteng Si
  2025-01-23 14:06   ` Jon Hunter
  2 siblings, 0 replies; 35+ messages in thread
From: Larysa Zaremba @ 2025-01-15 16:58 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On Wed, Jan 15, 2025 at 11:27:02AM +0800, Furong Xu wrote:
> Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
> be recycled in the upper network stack.
> 
> This patch brings ~11.5% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>

> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
>  .../net/ethernet/stmicro/stmmac/stmmac_main.c | 26 ++++++++++++-------
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> index e8dbce20129c..f05cae103d83 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> @@ -126,6 +126,7 @@ struct stmmac_rx_queue {
>  	unsigned int cur_rx;
>  	unsigned int dirty_rx;
>  	unsigned int buf_alloc_num;
> +	unsigned int napi_skb_frag_size;
>  	dma_addr_t dma_rx_phy;
>  	u32 rx_tail_addr;
>  	unsigned int state_saved;
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index acd6994c1764..1d98a5e8c98c 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1341,7 +1341,7 @@ static unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
>  	if (stmmac_xdp_is_enabled(priv))
>  		return XDP_PACKET_HEADROOM;
>  
> -	return 0;
> +	return NET_SKB_PAD;
>  }
>  
>  static int stmmac_set_bfsize(int mtu, int bufsize)
> @@ -2040,17 +2040,21 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
>  	struct stmmac_channel *ch = &priv->channel[queue];
>  	bool xdp_prog = stmmac_xdp_is_enabled(priv);
>  	struct page_pool_params pp_params = { 0 };
> -	unsigned int num_pages;
> +	unsigned int dma_buf_sz_pad, num_pages;
>  	unsigned int napi_id;
>  	int ret;
>  
> +	dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
> +
>  	rx_q->queue_index = queue;
>  	rx_q->priv_data = priv;
> +	rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
>  
>  	pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>  	pp_params.pool_size = dma_conf->dma_rx_size;
> -	num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
> -	pp_params.order = ilog2(num_pages);
> +	pp_params.order = order_base_2(num_pages);
>  	pp_params.nid = dev_to_node(priv->device);
>  	pp_params.dev = priv->device;
>  	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> @@ -5582,22 +5586,26 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  		}
>  
>  		if (!skb) {
> +			unsigned int head_pad_len;
> +
>  			/* XDP program may expand or reduce tail */
>  			buf1_len = ctx.xdp.data_end - ctx.xdp.data;
>  
> -			skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
> +			skb = napi_build_skb(page_address(buf->page),
> +					     rx_q->napi_skb_frag_size);
>  			if (!skb) {
> +				page_pool_recycle_direct(rx_q->page_pool,
> +							 buf->page);
>  				rx_dropped++;
>  				count++;
>  				goto drain_data;
>  			}
>  
>  			/* XDP program may adjust header */
> -			skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
> +			head_pad_len = ctx.xdp.data - ctx.xdp.data_hard_start;
> +			skb_reserve(skb, head_pad_len);
>  			skb_put(skb, buf1_len);
> -
> -			/* Data payload copied into SKB, page ready for recycle */
> -			page_pool_recycle_direct(rx_q->page_pool, buf->page);
> +			skb_mark_for_recycle(skb);
>  			buf->page = NULL;
>  		} else if (buf1_len) {
>  			dma_sync_single_for_cpu(priv->device, buf->addr,
> -- 
> 2.34.1
> 
> 


* Re: [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames
  2025-01-15  3:27 ` [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames Furong Xu
  2025-01-15 16:33   ` Yanteng Si
  2025-01-15 16:35   ` Larysa Zaremba
@ 2025-01-15 17:35   ` Joe Damato
  2 siblings, 0 replies; 35+ messages in thread
From: Joe Damato @ 2025-01-15 17:35 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On Wed, Jan 15, 2025 at 11:27:05AM +0800, Furong Xu wrote:
> The size of DMA descriptors is 32 bytes at most.
> net_prefetch() for received frames, and keep prefetch() for descriptors.
> 
> This patch brings ~4.8% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, 2.92 Gbits/sec increased to 3.06 Gbits/sec.
> 
> Suggested-by: Joe Damato <jdamato@fastly.com>
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Joe Damato <jdamato@fastly.com>


* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-15  3:27 ` [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
  2025-01-15 16:58   ` Larysa Zaremba
@ 2025-01-16  2:05   ` Yanteng Si
  2025-01-23 14:06   ` Jon Hunter
  2 siblings, 0 replies; 35+ messages in thread
From: Yanteng Si @ 2025-01-16  2:05 UTC (permalink / raw)
  To: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On 2025/1/15 11:27, Furong Xu wrote:
> Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
> be recycled in the upper network stack.
> 
> This patch brings ~11.5% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>

Thanks,
Yanteng
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
>   .../net/ethernet/stmicro/stmmac/stmmac_main.c | 26 ++++++++++++-------
>   2 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> index e8dbce20129c..f05cae103d83 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> @@ -126,6 +126,7 @@ struct stmmac_rx_queue {
>   	unsigned int cur_rx;
>   	unsigned int dirty_rx;
>   	unsigned int buf_alloc_num;
> +	unsigned int napi_skb_frag_size;
>   	dma_addr_t dma_rx_phy;
>   	u32 rx_tail_addr;
>   	unsigned int state_saved;
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index acd6994c1764..1d98a5e8c98c 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1341,7 +1341,7 @@ static unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
>   	if (stmmac_xdp_is_enabled(priv))
>   		return XDP_PACKET_HEADROOM;
>   
> -	return 0;
> +	return NET_SKB_PAD;
>   }
>   
>   static int stmmac_set_bfsize(int mtu, int bufsize)
> @@ -2040,17 +2040,21 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
>   	struct stmmac_channel *ch = &priv->channel[queue];
>   	bool xdp_prog = stmmac_xdp_is_enabled(priv);
>   	struct page_pool_params pp_params = { 0 };
> -	unsigned int num_pages;
> +	unsigned int dma_buf_sz_pad, num_pages;
>   	unsigned int napi_id;
>   	int ret;
>   
> +	dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
> +
>   	rx_q->queue_index = queue;
>   	rx_q->priv_data = priv;
> +	rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
>   
>   	pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>   	pp_params.pool_size = dma_conf->dma_rx_size;
> -	num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
> -	pp_params.order = ilog2(num_pages);
> +	pp_params.order = order_base_2(num_pages);
>   	pp_params.nid = dev_to_node(priv->device);
>   	pp_params.dev = priv->device;
>   	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> @@ -5582,22 +5586,26 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>   		}
>   
>   		if (!skb) {
> +			unsigned int head_pad_len;
> +
>   			/* XDP program may expand or reduce tail */
>   			buf1_len = ctx.xdp.data_end - ctx.xdp.data;
>   
> -			skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
> +			skb = napi_build_skb(page_address(buf->page),
> +					     rx_q->napi_skb_frag_size);
>   			if (!skb) {
> +				page_pool_recycle_direct(rx_q->page_pool,
> +							 buf->page);
>   				rx_dropped++;
>   				count++;
>   				goto drain_data;
>   			}
>   
>   			/* XDP program may adjust header */
> -			skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
> +			head_pad_len = ctx.xdp.data - ctx.xdp.data_hard_start;
> +			skb_reserve(skb, head_pad_len);
>   			skb_put(skb, buf1_len);
> -
> -			/* Data payload copied into SKB, page ready for recycle */
> -			page_pool_recycle_direct(rx_q->page_pool, buf->page);
> +			skb_mark_for_recycle(skb);
>   			buf->page = NULL;
>   		} else if (buf1_len) {
>   			dma_sync_single_for_cpu(priv->device, buf->addr,


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 0/4] net: stmmac: RX performance improvement
  2025-01-15  3:27 [PATCH net-next v3 0/4] net: stmmac: RX performance improvement Furong Xu
                   ` (3 preceding siblings ...)
  2025-01-15  3:27 ` [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames Furong Xu
@ 2025-01-16 11:40 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 35+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-01-16 11:40 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	aleksander.lobakin, jdamato, andrew+netdev, davem, edumazet, kuba,
	pabeni, mcoquelin.stm32, xfr

Hello:

This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Wed, 15 Jan 2025 11:27:01 +0800 you wrote:
> This series improves RX performance a lot, ~40% TCP RX throughput boost
> has been observed with DWXGMAC CORE 3.20a running on Cortex-A65 CPUs:
> from 2.18 Gbits/sec increased to 3.06 Gbits/sec.
> 
> ---
> Changes in v3:
>   1. Convert prefetch() to net_prefetch() to get better performance (Joe Damato)
> 
> [...]

Here is the summary with links:
  - [net-next,v3,1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
    https://git.kernel.org/netdev/net-next/c/df542f669307
  - [net-next,v3,2/4] net: stmmac: Set page_pool_params.max_len to a precise size
    https://git.kernel.org/netdev/net-next/c/2324c78a75c5
  - [net-next,v3,3/4] net: stmmac: Optimize cache prefetch in RX path
    https://git.kernel.org/netdev/net-next/c/2a2931517c9a
  - [net-next,v3,4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames
    https://git.kernel.org/netdev/net-next/c/204182edb310

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-15  3:27 ` [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
  2025-01-15 16:58   ` Larysa Zaremba
  2025-01-16  2:05   ` Yanteng Si
@ 2025-01-23 14:06   ` Jon Hunter
  2025-01-23 16:35     ` Furong Xu
  2 siblings, 1 reply; 35+ messages in thread
From: Jon Hunter @ 2025-01-23 14:06 UTC (permalink / raw)
  To: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	Brad Griffis, linux-tegra@vger.kernel.org

Hi Furong,

On 15/01/2025 03:27, Furong Xu wrote:
> Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
> be recycled in the upper network stack.
> 
> This patch brings ~11.5% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
>   .../net/ethernet/stmicro/stmmac/stmmac_main.c | 26 ++++++++++++-------
>   2 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> index e8dbce20129c..f05cae103d83 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> @@ -126,6 +126,7 @@ struct stmmac_rx_queue {
>   	unsigned int cur_rx;
>   	unsigned int dirty_rx;
>   	unsigned int buf_alloc_num;
> +	unsigned int napi_skb_frag_size;
>   	dma_addr_t dma_rx_phy;
>   	u32 rx_tail_addr;
>   	unsigned int state_saved;
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index acd6994c1764..1d98a5e8c98c 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1341,7 +1341,7 @@ static unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
>   	if (stmmac_xdp_is_enabled(priv))
>   		return XDP_PACKET_HEADROOM;
>   
> -	return 0;
> +	return NET_SKB_PAD;
>   }
>   
>   static int stmmac_set_bfsize(int mtu, int bufsize)
> @@ -2040,17 +2040,21 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
>   	struct stmmac_channel *ch = &priv->channel[queue];
>   	bool xdp_prog = stmmac_xdp_is_enabled(priv);
>   	struct page_pool_params pp_params = { 0 };
> -	unsigned int num_pages;
> +	unsigned int dma_buf_sz_pad, num_pages;
>   	unsigned int napi_id;
>   	int ret;
>   
> +	dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
> +
>   	rx_q->queue_index = queue;
>   	rx_q->priv_data = priv;
> +	rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
>   
>   	pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>   	pp_params.pool_size = dma_conf->dma_rx_size;
> -	num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
> -	pp_params.order = ilog2(num_pages);
> +	pp_params.order = order_base_2(num_pages);
>   	pp_params.nid = dev_to_node(priv->device);
>   	pp_params.dev = priv->device;
>   	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> @@ -5582,22 +5586,26 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>   		}
>   
>   		if (!skb) {
> +			unsigned int head_pad_len;
> +
>   			/* XDP program may expand or reduce tail */
>   			buf1_len = ctx.xdp.data_end - ctx.xdp.data;
>   
> -			skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
> +			skb = napi_build_skb(page_address(buf->page),
> +					     rx_q->napi_skb_frag_size);
>   			if (!skb) {
> +				page_pool_recycle_direct(rx_q->page_pool,
> +							 buf->page);
>   				rx_dropped++;
>   				count++;
>   				goto drain_data;
>   			}
>   
>   			/* XDP program may adjust header */
> -			skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
> +			head_pad_len = ctx.xdp.data - ctx.xdp.data_hard_start;
> +			skb_reserve(skb, head_pad_len);
>   			skb_put(skb, buf1_len);
> -
> -			/* Data payload copied into SKB, page ready for recycle */
> -			page_pool_recycle_direct(rx_q->page_pool, buf->page);
> +			skb_mark_for_recycle(skb);
>   			buf->page = NULL;
>   		} else if (buf1_len) {
>   			dma_sync_single_for_cpu(priv->device, buf->addr,


We have noticed a boot regression on -next when booting with NFS. Bisect 
is pointing to this commit and reverting this on top of -next does fix 
the problem.

I only see this on Tegra234 which uses the 
drivers/net/ethernet/stmicro/stmmac/dwmac-tegra.c driver. Tegra194 which 
uses the drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c driver 
appears to be fine.

I tried booting without NFS and the network interface seems to come up 
fine. It only seems to be a problem when booting with NFS. However, I 
guess I have also not stressed the network interface when booting 
without NFS. Have you tried booting with NFS?

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-23 14:06   ` Jon Hunter
@ 2025-01-23 16:35     ` Furong Xu
  2025-01-23 19:53       ` Brad Griffis
  0 siblings, 1 reply; 35+ messages in thread
From: Furong Xu @ 2025-01-23 16:35 UTC (permalink / raw)
  To: Jon Hunter
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	Brad Griffis, linux-tegra@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 963 bytes --]

Hi Jon,

On Thu, 23 Jan 2025 14:06:42 +0000, Jon Hunter wrote: 
> We have noticed a boot regression on -next when booting with NFS.
> Bisect is pointing to this commit and reverting this on top of -next
> does fix the problem.
> 
> I only see this on Tegra234 which uses the 
> drivers/net/ethernet/stmicro/stmmac/dwmac-tegra.c driver. Tegra194
> which uses the
> drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c driver
> appears to be fine.

What is the MTU of Tegra234 and NFS server? Are they both 1500?

Could you please try the attached patch to confirm whether this
regression is fixed?

If the attached patch fixes this regression, then this seems to be a
cache coherence issue specific to Tegra234: this patch avoids the
memcpy, so the page buffers may of course be modified by the upper
network stack, and the cache lines of the page buffers may become
dirty. With this patch reverted, the cache lines of the page buffers
never become dirty; this is the core difference.

[-- Attachment #2: force-disable-rx-checksum.diff --]
[-- Type: text/x-patch, Size: 744 bytes --]

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index edbf8994455d..f00bcfc65dd0 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5442,7 +5442,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
        struct stmmac_rx_queue *rx_q = &priv->dma_conf.rx_queue[queue];
        struct stmmac_channel *ch = &priv->channel[queue];
        unsigned int count = 0, error = 0, len = 0;
-       int status = 0, coe = priv->hw->rx_csum;
+       int status = 0, coe = 0;
        unsigned int next_entry = rx_q->cur_rx;
        enum dma_data_direction dma_dir;
        unsigned int desc_size;

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-23 16:35     ` Furong Xu
@ 2025-01-23 19:53       ` Brad Griffis
  2025-01-23 21:48         ` Andrew Lunn
  2025-01-24  1:53         ` Furong Xu
  0 siblings, 2 replies; 35+ messages in thread
From: Brad Griffis @ 2025-01-23 19:53 UTC (permalink / raw)
  To: Furong Xu, Jon Hunter
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	linux-tegra@vger.kernel.org

On 1/23/25 08:35, Furong Xu wrote:
> What is the MTU of Tegra234 and NFS server? Are they both 1500?

I see the same issue.  Yes, both are 1500.

> Could you please try attached patch to confirm if this regression is
> fixed?

Patch fixes the issue.

> If the attached patch fixes this regression, and so it seems to be a
> cache coherence issue specific to Tegra234, since this patch avoid
> memcpy and the page buffers may be modified by upper network stack of
> course, then cache lines of page buffers may become dirty. But by
> reverting this patch, cache lines of page buffers never become dirty,
> this is the core difference.

Thanks for these insights. I don't have specific experience in this 
driver, but I see we have dma-coherent turned on for this driver in our 
downstream device tree files (i.e. dtbs that coincide with our 
out-of-tree implementation of this driver).  I went back to the original 
code and verified that the issue was there. I did a new test where I 
added dma-coherent to this ethernet node in the dtb and retested. It worked!

Just to clarify, the patch that you had us try was not intended as an 
actual fix, correct? It was only for diagnostic purposes, i.e. to see if 
there is some kind of cache coherence issue, which seems to be the case? 
  So perhaps the only fix needed is to add dma-coherent to our device tree?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-23 19:53       ` Brad Griffis
@ 2025-01-23 21:48         ` Andrew Lunn
  2025-01-24  2:42           ` Furong Xu
  2025-01-24  1:53         ` Furong Xu
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Lunn @ 2025-01-23 21:48 UTC (permalink / raw)
  To: Brad Griffis
  Cc: Furong Xu, Jon Hunter, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, Alexander Lobakin, Joe Damato, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

> Just to clarify, the patch that you had us try was not intended as an actual
> fix, correct? It was only for diagnostic purposes, i.e. to see if there is
> some kind of cache coherence issue, which seems to be the case?  So perhaps
> the only fix needed is to add dma-coherent to our device tree?

That sounds quite error prone. How many other DT blobs are missing the
property? If the memory should be coherent, i would expect the driver
to allocate coherent memory. Or the driver needs to handle
non-coherent memory and add the necessary flush/invalidates etc.

	Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-23 19:53       ` Brad Griffis
  2025-01-23 21:48         ` Andrew Lunn
@ 2025-01-24  1:53         ` Furong Xu
  2025-01-24 15:14           ` Andrew Lunn
  1 sibling, 1 reply; 35+ messages in thread
From: Furong Xu @ 2025-01-24  1:53 UTC (permalink / raw)
  To: Brad Griffis
  Cc: Jon Hunter, netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Joe Damato, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	linux-tegra@vger.kernel.org

On Thu, 23 Jan 2025 11:53:21 -0800, Brad Griffis <bgriffis@nvidia.com> wrote:

> On 1/23/25 08:35, Furong Xu wrote:
> > What is the MTU of Tegra234 and NFS server? Are they both 1500?  
> 
> I see the same issue.  Yes, both are 1500.
> 
> > Could you please try attached patch to confirm if this regression is
> > fixed?  
> 
> Patch fixes the issue.
> 
> > If the attached patch fixes this regression, and so it seems to be a
> > cache coherence issue specific to Tegra234, since this patch avoid
> > memcpy and the page buffers may be modified by upper network stack of
> > course, then cache lines of page buffers may become dirty. But by
> > reverting this patch, cache lines of page buffers never become dirty,
> > this is the core difference.  
> 
> Thanks for these insights. I don't have specific experience in this 
> driver, but I see we have dma-coherent turned on for this driver in our 
> downstream device tree files (i.e. dtbs that coincide with our 
> out-of-tree implementation of this driver).  I went back to the original 
> code and verified that the issue was there. I did a new test where I 
> added dma-coherent to this ethernet node in the dtb and retested. It worked!
> 
> Just to clarify, the patch that you had us try was not intended as an 
> actual fix, correct? It was only for diagnostic purposes, i.e. to see if 
> there is some kind of cache coherence issue, which seems to be the case? 

It is not an actual fix; it is only for diagnostic purposes.

>   So perhaps the only fix needed is to add dma-coherent to our device tree?

Yes, adding dma-coherent to the ethernet node is the correct fix.
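For reference, the fix is a one-line device tree addition. A hypothetical node is sketched below; the node name, unit address, and compatible string are illustrative, not the actual Tegra234 device tree:

```dts
ethernet@2310000 {
	compatible = "nvidia,tegra234-mgbe";
	/* reg, interrupts, clocks, phy-handle, etc. elided */

	/* Mark the controller's DMA as cache-coherent so the DMA API
	 * treats dma_sync_*() calls as no-ops for this device. */
	dma-coherent;
};
```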

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-23 21:48         ` Andrew Lunn
@ 2025-01-24  2:42           ` Furong Xu
  2025-01-24 13:15             ` Thierry Reding
  2025-01-25 10:20             ` Ido Schimmel
  0 siblings, 2 replies; 35+ messages in thread
From: Furong Xu @ 2025-01-24  2:42 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Brad Griffis, Jon Hunter, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, Alexander Lobakin, Joe Damato, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch> wrote:

> > Just to clarify, the patch that you had us try was not intended as an actual
> > fix, correct? It was only for diagnostic purposes, i.e. to see if there is
> > some kind of cache coherence issue, which seems to be the case?  So perhaps
> > the only fix needed is to add dma-coherent to our device tree?  
> 
> That sounds quite error prone. How many other DT blobs are missing the
> property? If the memory should be coherent, i would expect the driver
> to allocate coherent memory. Or the driver needs to handle
> non-coherent memory and add the necessary flush/invalidates etc.

The stmmac driver does the necessary cache flushes/invalidates to
maintain cache lines explicitly.

See dma_sync_single_for_cpu():
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/include/linux/dma-mapping.h#n297

dma_dev_need_sync() is supposed to return false for Tegra234, since the
ethernet controller on Tegra234 is DMA-coherent by SoC design: as Brad
said, their downstream device tree has dma-coherent turned on by
default, and after adding dma-coherent to the mainline ethernet node,
the stmmac driver works fine. But the dma-coherent property is missing
from the mainline Tegra234 ethernet device tree node, so
dma_dev_need_sync() returns true, which is not the expected behavior.

The dma-coherent property in a device tree node is SoC-specific, so
only the vendors know whether their stmmac ethernet controller is
DMA-coherent and whether their device trees are missing the critical
dma-coherent property.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-24  2:42           ` Furong Xu
@ 2025-01-24 13:15             ` Thierry Reding
  2025-01-28 20:04               ` Lucas Stach
  2025-01-25 10:20             ` Ido Schimmel
  1 sibling, 1 reply; 35+ messages in thread
From: Thierry Reding @ 2025-01-24 13:15 UTC (permalink / raw)
  To: Furong Xu
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2830 bytes --]

On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
> On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > > Just to clarify, the patch that you had us try was not intended as an actual
> > > fix, correct? It was only for diagnostic purposes, i.e. to see if there is
> > > some kind of cache coherence issue, which seems to be the case?  So perhaps
> > > the only fix needed is to add dma-coherent to our device tree?  
> > 
> > That sounds quite error prone. How many other DT blobs are missing the
> > property? If the memory should be coherent, i would expect the driver
> > to allocate coherent memory. Or the driver needs to handle
> > non-coherent memory and add the necessary flush/invalidates etc.
> 
> stmmac driver does the necessary cache flush/invalidates to maintain cache lines
> explicitly.
> 
> See dma_sync_single_for_cpu():
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/include/linux/dma-mapping.h#n297
> 
> dma_dev_need_sync() is supposed to return false for Tegra234, since the ethernet
> controller on Tegra234 is dma coherent by SoC design as Brad said their
> downstream device tree has dma-coherent turned on by default, and after add
> dma-coherent to mainline ethernet node, stmmac driver works fine.
> But dma-coherent property is missing in mainline Tegra234 ethernet device tree
> node, dma_dev_need_sync() returns true and this is not the expected behavior.

My understanding is that the Ethernet controller itself is not DMA
coherent. Instead, it is by accessing memory through the IOMMU that the
accesses become coherent. It's technically possible for the IOMMU to be
disabled via command-line, or simply compile out the driver for it, and
the devices are supposed to keep working with it (though it's not a
recommended configuration), so exclusively relying on the presence of
an IOMMU node or even a dma-coherent property is bound to fail in these
corner cases. We don't currently have a good way of describing this in
device tree, but a discussion with the DT maintainers is probably
warranted.

> The dma-coherent property in device tree node is SoC specific, so only the
> vendors know if their stmmac ethernet controller is dma coherent and
> whether their device tree are missing the critical dma-coherent property.

What I fail to understand is how dma-coherent can make a difference in
this case. If it's not present, then the driver is supposed to maintain
caches explicitly. But it seems like explicit cache maintenance actually
causes things to break. So do we need to assume that DMA coherent
devices in generally won't work if the driver manages caches explicitly?

I always figured dma-coherent was more of an optimization, but this
seems to indicate it isn't.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-24  1:53         ` Furong Xu
@ 2025-01-24 15:14           ` Andrew Lunn
  0 siblings, 0 replies; 35+ messages in thread
From: Andrew Lunn @ 2025-01-24 15:14 UTC (permalink / raw)
  To: Furong Xu
  Cc: Brad Griffis, Jon Hunter, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, Alexander Lobakin, Joe Damato, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

> It is not an actual fix, it is only for diagnostic purposes.
> 
> >   So perhaps the only fix needed is to add dma-coherent to our device tree?
> 
> Yes, add dma-coherent to ethernet node is the correct fix.

What do you think about the comment i made?

     Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-24  2:42           ` Furong Xu
  2025-01-24 13:15             ` Thierry Reding
@ 2025-01-25 10:20             ` Ido Schimmel
  2025-01-25 14:43               ` Furong Xu
  2025-01-25 15:03               ` Furong Xu
  1 sibling, 2 replies; 35+ messages in thread
From: Ido Schimmel @ 2025-01-25 10:20 UTC (permalink / raw)
  To: Furong Xu
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
> On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > > Just to clarify, the patch that you had us try was not intended as an actual
> > > fix, correct? It was only for diagnostic purposes, i.e. to see if there is
> > > some kind of cache coherence issue, which seems to be the case?  So perhaps
> > > the only fix needed is to add dma-coherent to our device tree?  
> > 
> > That sounds quite error prone. How many other DT blobs are missing the
> > property? If the memory should be coherent, i would expect the driver
> > to allocate coherent memory. Or the driver needs to handle
> > non-coherent memory and add the necessary flush/invalidates etc.
> 
> stmmac driver does the necessary cache flush/invalidates to maintain cache lines
> explicitly.

Given the problem happens when the kernel performs syncing, is it
possible that there is a problem with how the syncing is performed?

I am not familiar with this driver, but it seems to allocate multiple
buffers per packet when split header is enabled and these buffers are
allocated from the same page pool (see stmmac_init_rx_buffers()).
Despite that, the driver is creating the page pool with a non-zero
offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
headroom, which is only present in the head buffer.

I asked Thierry to test the following patch [1] and initial testing
seems OK. He also confirmed that "SPH feature enabled" shows up in the
kernel log.

BTW, the commit that added split header support (67afd6d1cfdf0) says
that it "reduces CPU usage because without the feature all the entire
packet is memcpy'ed, while that with the feature only the header is".
This is no longer correct after your patch, so is there still value in
the split header feature? With two large buffers being allocated from
the same page pool for each packet (headers + data), the end result is
an inflated skb->truesize, no?

[1]
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index edbf8994455d..42170ce2927e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2091,8 +2091,8 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
 	pp_params.nid = dev_to_node(priv->device);
 	pp_params.dev = priv->device;
 	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
-	pp_params.offset = stmmac_rx_offset(priv);
-	pp_params.max_len = dma_conf->dma_buf_sz;
+	pp_params.offset = 0;
+	pp_params.max_len = num_pages * PAGE_SIZE;
 
 	rx_q->page_pool = page_pool_create(&pp_params);
 	if (IS_ERR(rx_q->page_pool)) {

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-25 10:20             ` Ido Schimmel
@ 2025-01-25 14:43               ` Furong Xu
  2025-01-26  8:41                 ` Ido Schimmel
  2025-01-25 15:03               ` Furong Xu
  1 sibling, 1 reply; 35+ messages in thread
From: Furong Xu @ 2025-01-25 14:43 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

Hi Ido

On Sat, 25 Jan 2025 12:20:38 +0200, Ido Schimmel wrote:

> On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
> > On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch>
> > wrote: 
> > > > Just to clarify, the patch that you had us try was not intended
> > > > as an actual fix, correct? It was only for diagnostic purposes,
> > > > i.e. to see if there is some kind of cache coherence issue,
> > > > which seems to be the case?  So perhaps the only fix needed is
> > > > to add dma-coherent to our device tree?    
> > > 
> > > That sounds quite error prone. How many other DT blobs are
> > > missing the property? If the memory should be coherent, i would
> > > expect the driver to allocate coherent memory. Or the driver
> > > needs to handle non-coherent memory and add the necessary
> > > flush/invalidates etc.  
> > 
> > stmmac driver does the necessary cache flush/invalidates to
> > maintain cache lines explicitly.  
> 
> Given the problem happens when the kernel performs syncing, is it
> possible that there is a problem with how the syncing is performed?
> 
> I am not familiar with this driver, but it seems to allocate multiple
> buffers per packet when split header is enabled and these buffers are
> allocated from the same page pool (see stmmac_init_rx_buffers()).
> Despite that, the driver is creating the page pool with a non-zero
> offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
> headroom, which is only present in the head buffer.
> 
> I asked Thierry to test the following patch [1] and initial testing
> seems OK. He also confirmed that "SPH feature enabled" shows up in the
> kernel log.
> BTW, the commit that added split header support (67afd6d1cfdf0) says
> that it "reduces CPU usage because without the feature all the entire
> packet is memcpy'ed, while that with the feature only the header is".
> This is no longer correct after your patch, so is there still value in
> the split header feature? With two large buffers being allocated from

Thanks for these great insights!

Yes, with the SPH feature enabled this is not correct after my patch;
pp_params.offset should be updated to match the offset of the split
payload.

But I would like to keep pp_params.max_len at dma_conf->dma_buf_sz,
since the sizes of both header and payload are limited to
dma_conf->dma_buf_sz by the DMA engine; no more than
dma_conf->dma_buf_sz bytes will be written into a page buffer.
So my patch would look like [2]:

BTW, the split header feature is very useful in certain cases; the
stmmac driver should keep supporting it.

[2]
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index edbf8994455d..def0d893efbb 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2091,7 +2091,7 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
        pp_params.nid = dev_to_node(priv->device);
        pp_params.dev = priv->device;
        pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
-       pp_params.offset = stmmac_rx_offset(priv);
+       pp_params.offset = priv->sph ? 0 : stmmac_rx_offset(priv);
        pp_params.max_len = dma_conf->dma_buf_sz;

        rx_q->page_pool = page_pool_create(&pp_params);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-25 10:20             ` Ido Schimmel
  2025-01-25 14:43               ` Furong Xu
@ 2025-01-25 15:03               ` Furong Xu
  2025-01-25 19:08                 ` Andrew Lunn
  2025-01-27 13:28                 ` Thierry Reding
  1 sibling, 2 replies; 35+ messages in thread
From: Furong Xu @ 2025-01-25 15:03 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

Hi Thierry

On Sat, 25 Jan 2025 12:20:38 +0200, Ido Schimmel wrote:

> On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
> > On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch>
> > wrote: 
> > > > Just to clarify, the patch that you had us try was not intended
> > > > as an actual fix, correct? It was only for diagnostic purposes,
> > > > i.e. to see if there is some kind of cache coherence issue,
> > > > which seems to be the case?  So perhaps the only fix needed is
> > > > to add dma-coherent to our device tree?    
> > > 
> > > That sounds quite error prone. How many other DT blobs are
> > > missing the property? If the memory should be coherent, i would
> > > expect the driver to allocate coherent memory. Or the driver
> > > needs to handle non-coherent memory and add the necessary
> > > flush/invalidates etc.  
> > 
> > stmmac driver does the necessary cache flush/invalidates to
> > maintain cache lines explicitly.  
> 
> Given the problem happens when the kernel performs syncing, is it
> possible that there is a problem with how the syncing is performed?
> 
> I am not familiar with this driver, but it seems to allocate multiple
> buffers per packet when split header is enabled and these buffers are
> allocated from the same page pool (see stmmac_init_rx_buffers()).
> Despite that, the driver is creating the page pool with a non-zero
> offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
> headroom, which is only present in the head buffer.
> 
> I asked Thierry to test the following patch [1] and initial testing
> seems OK. He also confirmed that "SPH feature enabled" shows up in the
> kernel log.

It is recommended to disable the SPH feature by default unless certain
use cases depend on it. As Ido said, allocating two large buffers from
the same page pool for each packet is a huge waste of memory, and it
brings performance drops for most general cases.

Our downstream driver and two mainline drivers disable SPH by default:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c#n357
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c#n471

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-25 15:03               ` Furong Xu
@ 2025-01-25 19:08                 ` Andrew Lunn
  2025-01-26  2:39                   ` Furong Xu
  2025-01-27 13:28                 ` Thierry Reding
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Lunn @ 2025-01-25 19:08 UTC (permalink / raw)
  To: Furong Xu
  Cc: Ido Schimmel, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

> It is recommended to disable the "SPH feature" by default unless some
> certain cases depend on it. Like Ido said, two large buffers being
> allocated from the same page pool for each packet, this is a huge waste
> of memory, and brings performance drops for most of general cases.

I don't know this driver, but it looks like SPH is required for
NETIF_F_GRO? Can you add this flag to hw_features, but not
wanted_features and leave SPH disabled until ethtool is used to enable
GRO?

Are there other use cases where SPH is needed?

	Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-25 19:08                 ` Andrew Lunn
@ 2025-01-26  2:39                   ` Furong Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Furong Xu @ 2025-01-26  2:39 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Ido Schimmel, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

On Sat, 25 Jan 2025 20:08:12 +0100, Andrew Lunn <andrew@lunn.ch> wrote:

> > It is recommended to disable the "SPH feature" by default unless
> > some certain cases depend on it. Like Ido said, two large buffers
> > being allocated from the same page pool for each packet, this is a
> > huge waste of memory, and brings performance drops for most of
> > general cases.  
> 
> I don't know this driver, but it looks like SPH is required for
> NETIF_F_GRO? Can you add this flag to hw_features, but not
> wanted_features and leave SPH disabled until ethtool is used to enable
> GRO?

SPH has its own ethtool command, which the stmmac driver does not
implement yet; see:
https://patchwork.kernel.org/project/netdevbpf/cover/20250114142852.3364986-1-ap420073@gmail.com/

> Are there other use cases where SPH is needed?

https://patchwork.kernel.org/project/netdevbpf/cover/20240910171458.219195-1-almasrymina@google.com/
https://patchwork.kernel.org/project/netdevbpf/cover/20250116231704.2402455-1-dw@davidwei.uk/

The stmmac driver does not support either of them yet, but it will someday :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-25 14:43               ` Furong Xu
@ 2025-01-26  8:41                 ` Ido Schimmel
  2025-01-26 10:37                   ` Furong Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Ido Schimmel @ 2025-01-26  8:41 UTC (permalink / raw)
  To: Furong Xu
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

Hi,

On Sat, Jan 25, 2025 at 10:43:42PM +0800, Furong Xu wrote:
> Hi Ido
> 
> On Sat, 25 Jan 2025 12:20:38 +0200, Ido Schimmel wrote:
> 
> > On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
> > > On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch>
> > > wrote: 
> > > > > Just to clarify, the patch that you had us try was not intended
> > > > > as an actual fix, correct? It was only for diagnostic purposes,
> > > > > i.e. to see if there is some kind of cache coherence issue,
> > > > > which seems to be the case?  So perhaps the only fix needed is
> > > > > to add dma-coherent to our device tree?    
> > > > 
> > > > That sounds quite error prone. How many other DT blobs are
> > > > missing the property? If the memory should be coherent, i would
> > > > expect the driver to allocate coherent memory. Or the driver
> > > > needs to handle non-coherent memory and add the necessary
> > > > flush/invalidates etc.  
> > > 
> > > stmmac driver does the necessary cache flush/invalidates to
> > > maintain cache lines explicitly.  
> > 
> > Given the problem happens when the kernel performs syncing, is it
> > possible that there is a problem with how the syncing is performed?
> > 
> > I am not familiar with this driver, but it seems to allocate multiple
> > buffers per packet when split header is enabled and these buffers are
> > allocated from the same page pool (see stmmac_init_rx_buffers()).
> > Despite that, the driver is creating the page pool with a non-zero
> > offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
> > headroom, which is only present in the head buffer.
> > 
> > I asked Thierry to test the following patch [1] and initial testing
> > seems OK. He also confirmed that "SPH feature enabled" shows up in the
> > kernel log.
> > BTW, the commit that added split header support (67afd6d1cfdf0) says
> > that it "reduces CPU usage because without the feature all the entire
> > packet is memcpy'ed, while that with the feature only the header is".
> > This is no longer correct after your patch, so is there still value in
> > the split header feature? With two large buffers being allocated from
> 
> Thanks for these great insights!
> 
> Yes, when the SPH feature is enabled, this is no longer correct after
> my patch; pp_params.offset should be updated to match the offset of
> the split payload.
> 
> But I would like to keep pp_params.max_len at dma_conf->dma_buf_sz,
> since the sizes of both header and payload are limited to
> dma_conf->dma_buf_sz by the DMA engine: no more than
> dma_conf->dma_buf_sz bytes will be written into a page buffer.
> So my patch would be like [2]:
> 
> BTW, the split header feature will be very useful in certain cases;
> the stmmac driver should always support it.
> 
> [2]
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index edbf8994455d..def0d893efbb 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -2091,7 +2091,7 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
>         pp_params.nid = dev_to_node(priv->device);
>         pp_params.dev = priv->device;
>         pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> -       pp_params.offset = stmmac_rx_offset(priv);
> +       pp_params.offset = priv->sph ? 0 : stmmac_rx_offset(priv);

SPH is the only scenario in which the driver uses multiple buffers per
packet?

>         pp_params.max_len = dma_conf->dma_buf_sz;

Are you sure this is correct? Page pool documentation says that "For
pages recycled on the XDP xmit and skb paths the page pool will use the
max_len member of struct page_pool_params to decide how much of the page
needs to be synced (starting at offset)" [1].

While "no more than dma_conf->dma_buf_sz bytes will be written into a
page buffer", for the head buffer they will be written starting at a
non-zero offset unlike buffers used for the data, no?

[1] https://docs.kernel.org/networking/page_pool.html#dma-sync

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-26  8:41                 ` Ido Schimmel
@ 2025-01-26 10:37                   ` Furong Xu
  2025-01-26 11:35                     ` Ido Schimmel
  0 siblings, 1 reply; 35+ messages in thread
From: Furong Xu @ 2025-01-26 10:37 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

On Sun, 26 Jan 2025 10:41:23 +0200, Ido Schimmel wrote:
 
> SPH is the only scenario in which the driver uses multiple buffers per
> packet?

Yes.

Jumbo mode may use multiple buffers per packet too, but those are
high-order pages, which behave just like a single page-pool page when
using a standard MTU.

> >         pp_params.max_len = dma_conf->dma_buf_sz;  
> 
> Are you sure this is correct? Page pool documentation says that "For
> pages recycled on the XDP xmit and skb paths the page pool will use
> the max_len member of struct page_pool_params to decide how much of
> the page needs to be synced (starting at offset)" [1].

The page pool must sync the area of the buffer that both the DMA engine
and the CPU may touch; the other areas are CPU-exclusive, so skipping
the sync for them seems better.

> While "no more than dma_conf->dma_buf_sz bytes will be written into a
> page buffer", for the head buffer they will be written starting at a
> non-zero offset unlike buffers used for the data, no?

Correct, they have different offsets.

The "SPH feature" splits the header into buf->page (non-zero offset)
and the payload into buf->sec_page (zero offset).

For buf->page, pp_params.max_len should be the size of the L3/L4 header,
with an offset of NET_SKB_PAD.

For buf->sec_page, pp_params.max_len should be dma_conf->dma_buf_sz,
with an offset of 0.

This is always true:
sizeof(L3/L4 header) + NET_SKB_PAD < dma_conf->dma_buf_sz + 0

So setting
pp_params.max_len = dma_conf->dma_buf_sz;
makes things simpler :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-26 10:37                   ` Furong Xu
@ 2025-01-26 11:35                     ` Ido Schimmel
  2025-01-26 12:56                       ` Furong Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Ido Schimmel @ 2025-01-26 11:35 UTC (permalink / raw)
  To: Furong Xu
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

On Sun, Jan 26, 2025 at 06:37:14PM +0800, Furong Xu wrote:
> The "SPH feature" splits header into buf->page (non-zero offset) and
> splits payload into buf->sec_page (zero offset).
> 
> For buf->page, pp_params.max_len should be the size of L3/L4 header,
> and with a offset of NET_SKB_PAD.
> 
> For buf->sec_page, pp_params.max_len should be dma_conf->dma_buf_sz,
> and with a offset of 0.
> 
> This is always true:
> sizeof(L3/L4 header) + NET_SKB_PAD < dma_conf->dma_buf_sz + 0

Thanks, understood, but are there situations where the device is unable
to split a packet? For example, a large L2 packet. I am trying to
understand if there are situations where the device will write more than
"dma_conf->dma_buf_sz - NET_SKB_PAD" to the head buffer.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-26 11:35                     ` Ido Schimmel
@ 2025-01-26 12:56                       ` Furong Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Furong Xu @ 2025-01-26 12:56 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

On Sun, 26 Jan 2025 13:35:45 +0200, Ido Schimmel wrote:

> On Sun, Jan 26, 2025 at 06:37:14PM +0800, Furong Xu wrote:
> > The "SPH feature" splits header into buf->page (non-zero offset) and
> > splits payload into buf->sec_page (zero offset).
> > 
> > For buf->page, pp_params.max_len should be the size of L3/L4 header,
> > and with a offset of NET_SKB_PAD.
> > 
> > For buf->sec_page, pp_params.max_len should be dma_conf->dma_buf_sz,
> > and with a offset of 0.
> > 
> > This is always true:
> > sizeof(L3/L4 header) + NET_SKB_PAD < dma_conf->dma_buf_sz + 0  
> 
> Thanks, understood, but are there situations where the device is
> unable to split a packet? For example, a large L2 packet. I am trying
> to understand if there are situations where the device will write
> more than "dma_conf->dma_buf_sz - NET_SKB_PAD" to the head buffer.

Nice catch!
When receiving a large L2/non-IP packet, more than
"dma_conf->dma_buf_sz - NET_SKB_PAD" bytes will be written into the
head buffer.

So we should:
pp_params.max_len = dma_conf->dma_buf_sz + stmmac_rx_offset(priv);

Thanks a lot, Ido
Have a nice weekend :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-25 15:03               ` Furong Xu
  2025-01-25 19:08                 ` Andrew Lunn
@ 2025-01-27 13:28                 ` Thierry Reding
  2025-01-29 14:51                   ` Jon Hunter
  1 sibling, 1 reply; 35+ messages in thread
From: Thierry Reding @ 2025-01-27 13:28 UTC (permalink / raw)
  To: Furong Xu
  Cc: Ido Schimmel, Andrew Lunn, Brad Griffis, Jon Hunter, netdev,
	linux-stm32, linux-arm-kernel, linux-kernel, Alexander Lobakin,
	Joe Damato, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	linux-tegra@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2752 bytes --]

On Sat, Jan 25, 2025 at 11:03:47PM +0800, Furong Xu wrote:
> Hi Thierry
> 
> On Sat, 25 Jan 2025 12:20:38 +0200, Ido Schimmel wrote:
> 
> > On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
> > > On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch>
> > > wrote: 
> > > > > Just to clarify, the patch that you had us try was not intended
> > > > > as an actual fix, correct? It was only for diagnostic purposes,
> > > > > i.e. to see if there is some kind of cache coherence issue,
> > > > > which seems to be the case?  So perhaps the only fix needed is
> > > > > to add dma-coherent to our device tree?    
> > > > 
> > > > That sounds quite error prone. How many other DT blobs are
> > > > missing the property? If the memory should be coherent, i would
> > > > expect the driver to allocate coherent memory. Or the driver
> > > > needs to handle non-coherent memory and add the necessary
> > > > flush/invalidates etc.  
> > > 
> > > stmmac driver does the necessary cache flush/invalidates to
> > > maintain cache lines explicitly.  
> > 
> > Given the problem happens when the kernel performs syncing, is it
> > possible that there is a problem with how the syncing is performed?
> > 
> > I am not familiar with this driver, but it seems to allocate multiple
> > buffers per packet when split header is enabled and these buffers are
> > allocated from the same page pool (see stmmac_init_rx_buffers()).
> > Despite that, the driver is creating the page pool with a non-zero
> > offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
> > headroom, which is only present in the head buffer.
> > 
> > I asked Thierry to test the following patch [1] and initial testing
> > seems OK. He also confirmed that "SPH feature enabled" shows up in the
> > kernel log.
> 
> It is recommended to disable the "SPH feature" by default unless some
> certain cases depend on it. Like Ido said, two large buffers being
> allocated from the same page pool for each packet, this is a huge waste
> of memory, and brings performance drops for most of general cases.
> 
> Our downstream driver and two mainline drivers disable SPH by default:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c#n357
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c#n471

Okay, that's something we can look into changing. What would be an
example of a use-case depending on SPH? Also, isn't this something
that should be a policy that users can configure?

Irrespective of that we should fix the problems we are seeing with
SPH enabled.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-24 13:15             ` Thierry Reding
@ 2025-01-28 20:04               ` Lucas Stach
  0 siblings, 0 replies; 35+ messages in thread
From: Lucas Stach @ 2025-01-28 20:04 UTC (permalink / raw)
  To: Thierry Reding, Furong Xu
  Cc: Andrew Lunn, Brad Griffis, Jon Hunter, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

Am Freitag, dem 24.01.2025 um 14:15 +0100 schrieb Thierry Reding:
> 
[...]
> > The dma-coherent property in device tree node is SoC specific, so only the
> > vendors know if their stmmac ethernet controller is dma coherent and
> > whether their device tree are missing the critical dma-coherent property.
> 
> What I fail to understand is how dma-coherent can make a difference in
> this case. If it's not present, then the driver is supposed to maintain
> caches explicitly. But it seems like explicit cache maintenance actually
> causes things to break. So do we need to assume that DMA coherent
> devices in generally won't work if the driver manages caches explicitly?
> 
> I always figured dma-coherent was more of an optimization, but this
> seems to indicate it isn't.

There is a real failure scenario when the device is actually dma-
coherent, but the DT claims it isn't: If a location from e.g. a receive
buffer is already allocated in the cache, the write from the device
will update the cache line and not go out to memory. If you then do the
usual non-coherent cache maintenance, i.e. invalidate the RX buffer
area after the device indicated the reception of the buffer, you will
destroy the updated data in the cache and read stale data from memory.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-27 13:28                 ` Thierry Reding
@ 2025-01-29 14:51                   ` Jon Hunter
  2025-02-07  9:07                     ` Furong Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Jon Hunter @ 2025-01-29 14:51 UTC (permalink / raw)
  To: Thierry Reding, Furong Xu
  Cc: Ido Schimmel, Andrew Lunn, Brad Griffis, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel, Alexander Lobakin, Joe Damato,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, linux-tegra@vger.kernel.org

Hi Furong,

On 27/01/2025 13:28, Thierry Reding wrote:
> On Sat, Jan 25, 2025 at 11:03:47PM +0800, Furong Xu wrote:
>> Hi Thierry
>>
>> On Sat, 25 Jan 2025 12:20:38 +0200, Ido Schimmel wrote:
>>
>>> On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
>>>> On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch>
>>>> wrote:
>>>>>> Just to clarify, the patch that you had us try was not intended
>>>>>> as an actual fix, correct? It was only for diagnostic purposes,
>>>>>> i.e. to see if there is some kind of cache coherence issue,
>>>>>> which seems to be the case?  So perhaps the only fix needed is
>>>>>> to add dma-coherent to our device tree?
>>>>>
>>>>> That sounds quite error prone. How many other DT blobs are
>>>>> missing the property? If the memory should be coherent, i would
>>>>> expect the driver to allocate coherent memory. Or the driver
>>>>> needs to handle non-coherent memory and add the necessary
>>>>> flush/invalidates etc.
>>>>
>>>> stmmac driver does the necessary cache flush/invalidates to
>>>> maintain cache lines explicitly.
>>>
>>> Given the problem happens when the kernel performs syncing, is it
>>> possible that there is a problem with how the syncing is performed?
>>>
>>> I am not familiar with this driver, but it seems to allocate multiple
>>> buffers per packet when split header is enabled and these buffers are
>>> allocated from the same page pool (see stmmac_init_rx_buffers()).
>>> Despite that, the driver is creating the page pool with a non-zero
>>> offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
>>> headroom, which is only present in the head buffer.
>>>
>>> I asked Thierry to test the following patch [1] and initial testing
>>> seems OK. He also confirmed that "SPH feature enabled" shows up in the
>>> kernel log.
>>
>> It is recommended to disable the "SPH feature" by default unless some
>> certain cases depend on it. Like Ido said, two large buffers being
>> allocated from the same page pool for each packet, this is a huge waste
>> of memory, and brings performance drops for most of general cases.
>>
>> Our downstream driver and two mainline drivers disable SPH by default:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c#n357
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c#n471
> 
> Okay, that's something we can look into changing. What would be an
> example of a use-case depending on SPH? Also, isn't this something
> that should be a policy that users can configure?
> 
> Irrespective of that we should fix the problems we are seeing with
> SPH enabled.


Any update on this?

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-29 14:51                   ` Jon Hunter
@ 2025-02-07  9:07                     ` Furong Xu
  2025-02-07 13:42                       ` Jon Hunter
  0 siblings, 1 reply; 35+ messages in thread
From: Furong Xu @ 2025-02-07  9:07 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Thierry Reding, Ido Schimmel, Andrew Lunn, Brad Griffis, netdev,
	linux-stm32, linux-arm-kernel, linux-kernel, Alexander Lobakin,
	Joe Damato, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	linux-tegra@vger.kernel.org

Hi Jon,

On Wed, 29 Jan 2025 14:51:35 +0000, Jon Hunter <jonathanh@nvidia.com> wrote:
> Hi Furong,
> 
> On 27/01/2025 13:28, Thierry Reding wrote:
> > On Sat, Jan 25, 2025 at 11:03:47PM +0800, Furong Xu wrote:  
> >> Hi Thierry
> >>
> >> On Sat, 25 Jan 2025 12:20:38 +0200, Ido Schimmel wrote:
> >>  
> >>> On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:  
> >>>> On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch>
> >>>> wrote:  
> >>>>>> Just to clarify, the patch that you had us try was not intended
> >>>>>> as an actual fix, correct? It was only for diagnostic purposes,
> >>>>>> i.e. to see if there is some kind of cache coherence issue,
> >>>>>> which seems to be the case?  So perhaps the only fix needed is
> >>>>>> to add dma-coherent to our device tree?  
> >>>>>
> >>>>> That sounds quite error prone. How many other DT blobs are
> >>>>> missing the property? If the memory should be coherent, i would
> >>>>> expect the driver to allocate coherent memory. Or the driver
> >>>>> needs to handle non-coherent memory and add the necessary
> >>>>> flush/invalidates etc.  
> >>>>
> >>>> stmmac driver does the necessary cache flush/invalidates to
> >>>> maintain cache lines explicitly.  
> >>>
> >>> Given the problem happens when the kernel performs syncing, is it
> >>> possible that there is a problem with how the syncing is performed?
> >>>
> >>> I am not familiar with this driver, but it seems to allocate multiple
> >>> buffers per packet when split header is enabled and these buffers are
> >>> allocated from the same page pool (see stmmac_init_rx_buffers()).
> >>> Despite that, the driver is creating the page pool with a non-zero
> >>> offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
> >>> headroom, which is only present in the head buffer.
> >>>
> >>> I asked Thierry to test the following patch [1] and initial testing
> >>> seems OK. He also confirmed that "SPH feature enabled" shows up in the
> >>> kernel log.  
> >>
> >> It is recommended to disable the "SPH feature" by default unless some
> >> certain cases depend on it. Like Ido said, two large buffers being
> >> allocated from the same page pool for each packet, this is a huge waste
> >> of memory, and brings performance drops for most of general cases.
> >>
> >> Our downstream driver and two mainline drivers disable SPH by default:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c#n357
> >> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c#n471  
> > 
> > Okay, that's something we can look into changing. What would be an
> > example of a use-case depending on SPH? Also, isn't this something
> > that should be a policy that users can configure?
> > 
> > Irrespective of that we should fix the problems we are seeing with
> > SPH enabled.  
> 
> 
> Any update on this?

Sorry for my late response, I was on Chinese New Year holiday.

The fix has been sent; it would be very nice to have your Tested-by: tag there:
https://lore.kernel.org/all/20250207085639.13580-1-0x1207@gmail.com/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-02-07  9:07                     ` Furong Xu
@ 2025-02-07 13:42                       ` Jon Hunter
  0 siblings, 0 replies; 35+ messages in thread
From: Jon Hunter @ 2025-02-07 13:42 UTC (permalink / raw)
  To: Furong Xu
  Cc: Thierry Reding, Ido Schimmel, Andrew Lunn, Brad Griffis, netdev,
	linux-stm32, linux-arm-kernel, linux-kernel, Alexander Lobakin,
	Joe Damato, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr,
	linux-tegra@vger.kernel.org



On 07/02/2025 09:07, Furong Xu wrote:
> Hi Jon,
> 
> On Wed, 29 Jan 2025 14:51:35 +0000, Jon Hunter <jonathanh@nvidia.com> wrote:
>> Hi Furong,
>>
>> On 27/01/2025 13:28, Thierry Reding wrote:
>>> On Sat, Jan 25, 2025 at 11:03:47PM +0800, Furong Xu wrote:
>>>> Hi Thierry
>>>>
>>>> On Sat, 25 Jan 2025 12:20:38 +0200, Ido Schimmel wrote:
>>>>   
>>>>> On Fri, Jan 24, 2025 at 10:42:56AM +0800, Furong Xu wrote:
>>>>>> On Thu, 23 Jan 2025 22:48:42 +0100, Andrew Lunn <andrew@lunn.ch>
>>>>>> wrote:
>>>>>>>> Just to clarify, the patch that you had us try was not intended
>>>>>>>> as an actual fix, correct? It was only for diagnostic purposes,
>>>>>>>> i.e. to see if there is some kind of cache coherence issue,
>>>>>>>> which seems to be the case?  So perhaps the only fix needed is
>>>>>>>> to add dma-coherent to our device tree?
>>>>>>>
>>>>>>> That sounds quite error prone. How many other DT blobs are
>>>>>>> missing the property? If the memory should be coherent, i would
>>>>>>> expect the driver to allocate coherent memory. Or the driver
>>>>>>> needs to handle non-coherent memory and add the necessary
>>>>>>> flush/invalidates etc.
>>>>>>
>>>>>> stmmac driver does the necessary cache flush/invalidates to
>>>>>> maintain cache lines explicitly.
>>>>>
>>>>> Given the problem happens when the kernel performs syncing, is it
>>>>> possible that there is a problem with how the syncing is performed?
>>>>>
>>>>> I am not familiar with this driver, but it seems to allocate multiple
>>>>> buffers per packet when split header is enabled and these buffers are
>>>>> allocated from the same page pool (see stmmac_init_rx_buffers()).
>>>>> Despite that, the driver is creating the page pool with a non-zero
>>>>> offset (see __alloc_dma_rx_desc_resources()) to avoid syncing the
>>>>> headroom, which is only present in the head buffer.
>>>>>
>>>>> I asked Thierry to test the following patch [1] and initial testing
>>>>> seems OK. He also confirmed that "SPH feature enabled" shows up in the
>>>>> kernel log.
>>>>
>>>> It is recommended to disable the "SPH feature" by default unless some
>>>> certain cases depend on it. Like Ido said, two large buffers being
>>>> allocated from the same page pool for each packet, this is a huge waste
>>>> of memory, and brings performance drops for most of general cases.
>>>>
>>>> Our downstream driver and two mainline drivers disable SPH by default:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c#n357
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c#n471
>>>
>>> Okay, that's something we can look into changing. What would be an
>>> example of a use-case depending on SPH? Also, isn't this something
>>> that should be a policy that users can configure?
>>>
>>> Irrespective of that we should fix the problems we are seeing with
>>> SPH enabled.
>>
>>
>> Any update on this?
> 
> Sorry for my late response, I was on Chinese New Year holiday.

No problem! Happy new year!

> The fix is sent, and it will be so nice to have your Tested-by: tag there:
> https://lore.kernel.org/all/20250207085639.13580-1-0x1207@gmail.com/

Thanks, I have tested and responded. All looking good!

Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-02-07 13:42 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-15  3:27 [PATCH net-next v3 0/4] net: stmmac: RX performance improvement Furong Xu
2025-01-15  3:27 ` [PATCH net-next v3 1/4] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
2025-01-15 16:58   ` Larysa Zaremba
2025-01-16  2:05   ` Yanteng Si
2025-01-23 14:06   ` Jon Hunter
2025-01-23 16:35     ` Furong Xu
2025-01-23 19:53       ` Brad Griffis
2025-01-23 21:48         ` Andrew Lunn
2025-01-24  2:42           ` Furong Xu
2025-01-24 13:15             ` Thierry Reding
2025-01-28 20:04               ` Lucas Stach
2025-01-25 10:20             ` Ido Schimmel
2025-01-25 14:43               ` Furong Xu
2025-01-26  8:41                 ` Ido Schimmel
2025-01-26 10:37                   ` Furong Xu
2025-01-26 11:35                     ` Ido Schimmel
2025-01-26 12:56                       ` Furong Xu
2025-01-25 15:03               ` Furong Xu
2025-01-25 19:08                 ` Andrew Lunn
2025-01-26  2:39                   ` Furong Xu
2025-01-27 13:28                 ` Thierry Reding
2025-01-29 14:51                   ` Jon Hunter
2025-02-07  9:07                     ` Furong Xu
2025-02-07 13:42                       ` Jon Hunter
2025-01-24  1:53         ` Furong Xu
2025-01-24 15:14           ` Andrew Lunn
2025-01-15  3:27 ` [PATCH net-next v3 2/4] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
2025-01-15 10:07   ` Yanteng Si
2025-01-15  3:27 ` [PATCH net-next v3 3/4] net: stmmac: Optimize cache prefetch in RX path Furong Xu
2025-01-15 16:24   ` Yanteng Si
2025-01-15  3:27 ` [PATCH net-next v3 4/4] net: stmmac: Convert prefetch() to net_prefetch() for received frames Furong Xu
2025-01-15 16:33   ` Yanteng Si
2025-01-15 16:35   ` Larysa Zaremba
2025-01-15 17:35   ` Joe Damato
2025-01-16 11:40 ` [PATCH net-next v3 0/4] net: stmmac: RX performance improvement patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).