linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH net-next v1 0/3] net: stmmac: RX performance improvement
@ 2025-01-10  9:53 Furong Xu
  2025-01-10  9:53 ` [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Furong Xu @ 2025-01-10  9:53 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

This series improves RX performance significantly: a ~34% TCP RX
throughput boost has been observed with DWXGMAC CORE 3.20a running on
Cortex-A65 CPUs, from 2.18 Gbits/sec to 2.92 Gbits/sec.

Furong Xu (3):
  net: stmmac: Switch to zero-copy in non-XDP RX path
  net: stmmac: Set page_pool_params.max_len to a precise size
  net: stmmac: Optimize cache prefetch in RX path

 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 32 +++++++++++--------
 .../net/ethernet/stmicro/stmmac/stmmac_xdp.h  |  1 -
 3 files changed, 19 insertions(+), 15 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-10  9:53 [PATCH net-next v1 0/3] net: stmmac: RX performance improvement Furong Xu
@ 2025-01-10  9:53 ` Furong Xu
  2025-01-13  9:41   ` Yanteng Si
  2025-01-10  9:53 ` [PATCH net-next v1 2/3] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
  2025-01-10  9:53 ` [PATCH net-next v1 3/3] net: stmmac: Optimize cache prefetch in RX path Furong Xu
  2 siblings, 1 reply; 11+ messages in thread
From: Furong Xu @ 2025-01-10  9:53 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
be recycled in the upper network stack.

This patch brings ~11.5% driver performance improvement in a TCP RX
throughput test with iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.18 Gbits/sec to 2.43 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 23 +++++++++++--------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 548b28fed9b6..5c39292313de 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -126,6 +126,7 @@ struct stmmac_rx_queue {
 	unsigned int cur_rx;
 	unsigned int dirty_rx;
 	unsigned int buf_alloc_num;
+	unsigned int napi_skb_frag_size;
 	dma_addr_t dma_rx_phy;
 	u32 rx_tail_addr;
 	unsigned int state_saved;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 038df1b2bb58..43125a6f8f6b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1320,7 +1320,7 @@ static unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
 	if (stmmac_xdp_is_enabled(priv))
 		return XDP_PACKET_HEADROOM;
 
-	return 0;
+	return NET_SKB_PAD;
 }
 
 static int stmmac_set_bfsize(int mtu, int bufsize)
@@ -2019,17 +2019,21 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
 	struct stmmac_channel *ch = &priv->channel[queue];
 	bool xdp_prog = stmmac_xdp_is_enabled(priv);
 	struct page_pool_params pp_params = { 0 };
-	unsigned int num_pages;
+	unsigned int dma_buf_sz_pad, num_pages;
 	unsigned int napi_id;
 	int ret;
 
+	dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
+
 	rx_q->queue_index = queue;
 	rx_q->priv_data = priv;
+	rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
 
 	pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 	pp_params.pool_size = dma_conf->dma_rx_size;
-	num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
-	pp_params.order = ilog2(num_pages);
+	pp_params.order = order_base_2(num_pages);
 	pp_params.nid = dev_to_node(priv->device);
 	pp_params.dev = priv->device;
 	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
@@ -5574,19 +5578,20 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 			/* XDP program may expand or reduce tail */
 			buf1_len = ctx.xdp.data_end - ctx.xdp.data;
 
-			skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
+			skb = napi_build_skb(page_address(buf->page),
+					     rx_q->napi_skb_frag_size);
 			if (!skb) {
+				page_pool_recycle_direct(rx_q->page_pool,
+							 buf->page);
 				rx_dropped++;
 				count++;
 				goto drain_data;
 			}
 
 			/* XDP program may adjust header */
-			skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
+			skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);
 			skb_put(skb, buf1_len);
-
-			/* Data payload copied into SKB, page ready for recycle */
-			page_pool_recycle_direct(rx_q->page_pool, buf->page);
+			skb_mark_for_recycle(skb);
 			buf->page = NULL;
 		} else if (buf1_len) {
 			dma_sync_single_for_cpu(priv->device, buf->addr,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next v1 2/3] net: stmmac: Set page_pool_params.max_len to a precise size
  2025-01-10  9:53 [PATCH net-next v1 0/3] net: stmmac: RX performance improvement Furong Xu
  2025-01-10  9:53 ` [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
@ 2025-01-10  9:53 ` Furong Xu
  2025-01-10  9:53 ` [PATCH net-next v1 3/3] net: stmmac: Optimize cache prefetch in RX path Furong Xu
  2 siblings, 0 replies; 11+ messages in thread
From: Furong Xu @ 2025-01-10  9:53 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

The DMA engine always writes no more than dma_buf_sz bytes of a received
frame into a page buffer; the remaining space is unused or used
exclusively by the CPU.
Setting page_pool_params.max_len to almost the full size of the page(s)
brings no benefit, only wasted CPU cycles on cache maintenance.

For a standard MTU of 1500, dma_buf_sz is set to 1536, and this patch
brings ~16.9% driver performance improvement in a TCP RX throughput
test with iPerf tool on a single isolated Cortex-A65 CPU core, from
2.43 Gbits/sec to 2.84 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h  | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 43125a6f8f6b..c1aeaec53b4c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2038,7 +2038,7 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
 	pp_params.dev = priv->device;
 	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	pp_params.offset = stmmac_rx_offset(priv);
-	pp_params.max_len = STMMAC_MAX_RX_BUF_SIZE(num_pages);
+	pp_params.max_len = dma_conf->dma_buf_sz;
 
 	rx_q->page_pool = page_pool_create(&pp_params);
 	if (IS_ERR(rx_q->page_pool)) {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h b/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
index 896dc987d4ef..77ce8cfbe976 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
@@ -4,7 +4,6 @@
 #ifndef _STMMAC_XDP_H_
 #define _STMMAC_XDP_H_
 
-#define STMMAC_MAX_RX_BUF_SIZE(num)	(((num) * PAGE_SIZE) - XDP_PACKET_HEADROOM)
 #define STMMAC_RX_DMA_ATTR	(DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
 
 int stmmac_xdp_setup_pool(struct stmmac_priv *priv, struct xsk_buff_pool *pool,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next v1 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-10  9:53 [PATCH net-next v1 0/3] net: stmmac: RX performance improvement Furong Xu
  2025-01-10  9:53 ` [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
  2025-01-10  9:53 ` [PATCH net-next v1 2/3] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
@ 2025-01-10  9:53 ` Furong Xu
  2025-01-13 12:10   ` Alexander Lobakin
  2 siblings, 1 reply; 11+ messages in thread
From: Furong Xu @ 2025-01-10  9:53 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

The current code prefetches cache lines for the received frame first,
and only then calls dma_sync_single_for_cpu() on this frame; this is
wrong. The cache prefetch should be triggered after
dma_sync_single_for_cpu().

This patch brings ~2.8% driver performance improvement in a TCP RX
throughput test with iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.84 Gbits/sec to 2.92 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c1aeaec53b4c..1b4e8b035b1a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5497,10 +5497,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 		/* Buffer is good. Go on. */
 
-		prefetch(page_address(buf->page) + buf->page_offset);
-		if (buf->sec_page)
-			prefetch(page_address(buf->sec_page));
-
 		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
 		len += buf1_len;
 		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
@@ -5522,6 +5518,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 			dma_sync_single_for_cpu(priv->device, buf->addr,
 						buf1_len, dma_dir);
+			prefetch(page_address(buf->page) + buf->page_offset);
 
 			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
 			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),
@@ -5596,6 +5593,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 		} else if (buf1_len) {
 			dma_sync_single_for_cpu(priv->device, buf->addr,
 						buf1_len, dma_dir);
+			prefetch(page_address(buf->page) + buf->page_offset);
 			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
 					buf->page, buf->page_offset, buf1_len,
 					priv->dma_conf.dma_buf_sz);
@@ -5608,6 +5606,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 		if (buf2_len) {
 			dma_sync_single_for_cpu(priv->device, buf->sec_addr,
 						buf2_len, dma_dir);
+			prefetch(page_address(buf->sec_page));
 			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
 					buf->sec_page, 0, buf2_len,
 					priv->dma_conf.dma_buf_sz);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-10  9:53 ` [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
@ 2025-01-13  9:41   ` Yanteng Si
  2025-01-13 12:03     ` Alexander Lobakin
  0 siblings, 1 reply; 11+ messages in thread
From: Yanteng Si @ 2025-01-13  9:41 UTC (permalink / raw)
  To: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr

在 2025/1/10 17:53, Furong Xu 写道:
> Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
> be recycled in the upper network stack.
> 
> This patch brings ~11.5% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
>   .../net/ethernet/stmicro/stmmac/stmmac_main.c | 23 +++++++++++--------
>   2 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> index 548b28fed9b6..5c39292313de 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> @@ -126,6 +126,7 @@ struct stmmac_rx_queue {
>   	unsigned int cur_rx;
>   	unsigned int dirty_rx;
>   	unsigned int buf_alloc_num;
> +	unsigned int napi_skb_frag_size;
>   	dma_addr_t dma_rx_phy;
>   	u32 rx_tail_addr;
>   	unsigned int state_saved;
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 038df1b2bb58..43125a6f8f6b 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1320,7 +1320,7 @@ static unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
>   	if (stmmac_xdp_is_enabled(priv))
>   		return XDP_PACKET_HEADROOM;
>   
> -	return 0;
> +	return NET_SKB_PAD;
>   }
>   
>   static int stmmac_set_bfsize(int mtu, int bufsize)
> @@ -2019,17 +2019,21 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
>   	struct stmmac_channel *ch = &priv->channel[queue];
>   	bool xdp_prog = stmmac_xdp_is_enabled(priv);
>   	struct page_pool_params pp_params = { 0 };
> -	unsigned int num_pages;
> +	unsigned int dma_buf_sz_pad, num_pages;
>   	unsigned int napi_id;
>   	int ret;
>   
> +	dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
> +
>   	rx_q->queue_index = queue;
>   	rx_q->priv_data = priv;
> +	rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
>   
>   	pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>   	pp_params.pool_size = dma_conf->dma_rx_size;
> -	num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
> -	pp_params.order = ilog2(num_pages);
> +	pp_params.order = order_base_2(num_pages);
>   	pp_params.nid = dev_to_node(priv->device);
>   	pp_params.dev = priv->device;
>   	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> @@ -5574,19 +5578,20 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>   			/* XDP program may expand or reduce tail */
>   			buf1_len = ctx.xdp.data_end - ctx.xdp.data;
>   
> -			skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
> +			skb = napi_build_skb(page_address(buf->page),
> +					     rx_q->napi_skb_frag_size);
>   			if (!skb) {
> +				page_pool_recycle_direct(rx_q->page_pool,
> +							 buf->page);
>   				rx_dropped++;
>   				count++;
>   				goto drain_data;
>   			}
>   
>   			/* XDP program may adjust header */
> -			skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
> +			skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);
The networking subsystem still requires that code lines not exceed
80 columns, so let's silence this warning:

WARNING: line length of 81 exceeds 80 columns
#87: FILE: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:5592:
+			skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);

Thanks,
Yanteng


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-13  9:41   ` Yanteng Si
@ 2025-01-13 12:03     ` Alexander Lobakin
  2025-01-13 15:16       ` Yanteng Si
  2025-01-13 16:48       ` Andrew Lunn
  0 siblings, 2 replies; 11+ messages in thread
From: Alexander Lobakin @ 2025-01-13 12:03 UTC (permalink / raw)
  To: Yanteng Si
  Cc: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr

From: Yanteng Si <si.yanteng@linux.dev>
Date: Mon, 13 Jan 2025 17:41:41 +0800

> 在 2025/1/10 17:53, Furong Xu 写道:
>> Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
>> be recycled in the upper network stack.
>>
>> This patch brings ~11.5% driver performance improvement in a TCP RX
>> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
>> core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec.
>>
>> Signed-off-by: Furong Xu <0x1207@gmail.com>
>> ---
>>   drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
>>   .../net/ethernet/stmicro/stmmac/stmmac_main.c | 23 +++++++++++--------
>>   2 files changed, 15 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/
>> net/ethernet/stmicro/stmmac/stmmac.h
>> index 548b28fed9b6..5c39292313de 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
>> @@ -126,6 +126,7 @@ struct stmmac_rx_queue {
>>       unsigned int cur_rx;
>>       unsigned int dirty_rx;
>>       unsigned int buf_alloc_num;
>> +    unsigned int napi_skb_frag_size;
>>       dma_addr_t dma_rx_phy;
>>       u32 rx_tail_addr;
>>       unsigned int state_saved;
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/
>> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index 038df1b2bb58..43125a6f8f6b 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -1320,7 +1320,7 @@ static unsigned int stmmac_rx_offset(struct
>> stmmac_priv *priv)
>>       if (stmmac_xdp_is_enabled(priv))
>>           return XDP_PACKET_HEADROOM;
>>   -    return 0;
>> +    return NET_SKB_PAD;
>>   }
>>     static int stmmac_set_bfsize(int mtu, int bufsize)
>> @@ -2019,17 +2019,21 @@ static int
>> __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
>>       struct stmmac_channel *ch = &priv->channel[queue];
>>       bool xdp_prog = stmmac_xdp_is_enabled(priv);
>>       struct page_pool_params pp_params = { 0 };
>> -    unsigned int num_pages;
>> +    unsigned int dma_buf_sz_pad, num_pages;
>>       unsigned int napi_id;
>>       int ret;
>>   +    dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
>> +             SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +    num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
>> +
>>       rx_q->queue_index = queue;
>>       rx_q->priv_data = priv;
>> +    rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
>>         pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>>       pp_params.pool_size = dma_conf->dma_rx_size;
>> -    num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
>> -    pp_params.order = ilog2(num_pages);
>> +    pp_params.order = order_base_2(num_pages);
>>       pp_params.nid = dev_to_node(priv->device);
>>       pp_params.dev = priv->device;
>>       pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
>> @@ -5574,19 +5578,20 @@ static int stmmac_rx(struct stmmac_priv *priv,
>> int limit, u32 queue)
>>               /* XDP program may expand or reduce tail */
>>               buf1_len = ctx.xdp.data_end - ctx.xdp.data;
>>   -            skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
>> +            skb = napi_build_skb(page_address(buf->page),
>> +                         rx_q->napi_skb_frag_size);
>>               if (!skb) {
>> +                page_pool_recycle_direct(rx_q->page_pool,
>> +                             buf->page);
>>                   rx_dropped++;
>>                   count++;
>>                   goto drain_data;
>>               }
>>                 /* XDP program may adjust header */
>> -            skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
>> +            skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);
> The network subsystem still requires that the length
> of each line of code should not exceed 80 characters.
> So let's silence the warning:
> 
> WARNING: line length of 81 exceeds 80 columns
> #87: FILE: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:5592:
> +            skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);

I agree that &ctx.xdp could be made an onstack pointer to shorten these
lines, but please don't spam with the checkpatch output.

1. It's the author's responsibility to read the netdev CI output on
   Patchwork; reviewers shouldn't copy its logs.
2. The only alternative without making a shortcut for &ctx.xdp is

			skb_reserve(skb,
				    ctx.xdp.data - ctx.xdp.data_hard_start);

This looks really ugly and does more harm than good.

If you really want to help, please come up with good proposals for how
to avoid such warnings in an elegant way.

> 
> Thanks,
> Yanteng

Thanks,
Olek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v1 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-10  9:53 ` [PATCH net-next v1 3/3] net: stmmac: Optimize cache prefetch in RX path Furong Xu
@ 2025-01-13 12:10   ` Alexander Lobakin
  2025-01-13 12:37     ` Furong Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Lobakin @ 2025-01-13 12:10 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Coquelin, xfr

From: Furong Xu <0x1207@gmail.com>
Date: Fri, 10 Jan 2025 17:53:59 +0800

> Current code prefetches cache lines for the received frame first, and
> then dma_sync_single_for_cpu() against this frame, this is wrong.
> Cache prefetch should be triggered after dma_sync_single_for_cpu().
> 
> This patch brings ~2.8% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, 2.84 Gbits/sec increased to 2.92 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index c1aeaec53b4c..1b4e8b035b1a 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5497,10 +5497,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  
>  		/* Buffer is good. Go on. */
>  
> -		prefetch(page_address(buf->page) + buf->page_offset);
> -		if (buf->sec_page)
> -			prefetch(page_address(buf->sec_page));
> -
>  		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
>  		len += buf1_len;
>  		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
> @@ -5522,6 +5518,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  
>  			dma_sync_single_for_cpu(priv->device, buf->addr,
>  						buf1_len, dma_dir);
> +			prefetch(page_address(buf->page) + buf->page_offset);
>  
>  			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
>  			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),
> @@ -5596,6 +5593,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  		} else if (buf1_len) {
>  			dma_sync_single_for_cpu(priv->device, buf->addr,
>  						buf1_len, dma_dir);
> +			prefetch(page_address(buf->page) + buf->page_offset);
>  			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
>  					buf->page, buf->page_offset, buf1_len,
>  					priv->dma_conf.dma_buf_sz);

Are you sure you need to prefetch frags as well? I'd say this is a waste
of cycles, as the kernel core stack barely looks at payload...
Probably prefetching only header buffers would be enough.

> @@ -5608,6 +5606,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  		if (buf2_len) {
>  			dma_sync_single_for_cpu(priv->device, buf->sec_addr,
>  						buf2_len, dma_dir);
> +			prefetch(page_address(buf->sec_page));
>  			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
>  					buf->sec_page, 0, buf2_len,
>  					priv->dma_conf.dma_buf_sz);

Thanks,
Olek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v1 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-13 12:10   ` Alexander Lobakin
@ 2025-01-13 12:37     ` Furong Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Furong Xu @ 2025-01-13 12:37 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Coquelin, xfr

On Mon, 13 Jan 2025 13:10:46 +0100, Alexander Lobakin <aleksander.lobakin@intel.com> wrote:

> > @@ -5596,6 +5593,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> >  		} else if (buf1_len) {
> >  			dma_sync_single_for_cpu(priv->device, buf->addr,
> >  						buf1_len, dma_dir);
> > +			prefetch(page_address(buf->page) + buf->page_offset);
> >  			skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> >  					buf->page, buf->page_offset, buf1_len,
> >  					priv->dma_conf.dma_buf_sz);  
> 
> Are you sure you need to prefetch frags as well? I'd say this is a waste
> of cycles, as the kernel core stack barely looks at payload...
> Probably prefetching only header buffers would be enough.
> 

Yes, not prefetching frags is more reasonable.
Thanks!

pw-bot: changes-requested


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-13 12:03     ` Alexander Lobakin
@ 2025-01-13 15:16       ` Yanteng Si
  2025-01-13 16:48       ` Andrew Lunn
  1 sibling, 0 replies; 11+ messages in thread
From: Yanteng Si @ 2025-01-13 15:16 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, xfr


在 1/13/25 20:03, Alexander Lobakin 写道:
> From: Yanteng Si <si.yanteng@linux.dev>
> Date: Mon, 13 Jan 2025 17:41:41 +0800
>
>> 在 2025/1/10 17:53, Furong Xu 写道:
>>> Avoid memcpy in non-XDP RX path by marking all allocated SKBs to
>>> be recycled in the upper network stack.
>>>
>>> This patch brings ~11.5% driver performance improvement in a TCP RX
>>> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
>>> core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec.
>>>
>>> Signed-off-by: Furong Xu <0x1207@gmail.com>
>>> ---
>>>    drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
>>>    .../net/ethernet/stmicro/stmmac/stmmac_main.c | 23 +++++++++++--------
>>>    2 files changed, 15 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/
>>> net/ethernet/stmicro/stmmac/stmmac.h
>>> index 548b28fed9b6..5c39292313de 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
>>> @@ -126,6 +126,7 @@ struct stmmac_rx_queue {
>>>        unsigned int cur_rx;
>>>        unsigned int dirty_rx;
>>>        unsigned int buf_alloc_num;
>>> +    unsigned int napi_skb_frag_size;
>>>        dma_addr_t dma_rx_phy;
>>>        u32 rx_tail_addr;
>>>        unsigned int state_saved;
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/
>>> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>> index 038df1b2bb58..43125a6f8f6b 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>> @@ -1320,7 +1320,7 @@ static unsigned int stmmac_rx_offset(struct
>>> stmmac_priv *priv)
>>>        if (stmmac_xdp_is_enabled(priv))
>>>            return XDP_PACKET_HEADROOM;
>>>    -    return 0;
>>> +    return NET_SKB_PAD;
>>>    }
>>>      static int stmmac_set_bfsize(int mtu, int bufsize)
>>> @@ -2019,17 +2019,21 @@ static int
>>> __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
>>>        struct stmmac_channel *ch = &priv->channel[queue];
>>>        bool xdp_prog = stmmac_xdp_is_enabled(priv);
>>>        struct page_pool_params pp_params = { 0 };
>>> -    unsigned int num_pages;
>>> +    unsigned int dma_buf_sz_pad, num_pages;
>>>        unsigned int napi_id;
>>>        int ret;
>>>    +    dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
>>> +             SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>> +    num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
>>> +
>>>        rx_q->queue_index = queue;
>>>        rx_q->priv_data = priv;
>>> +    rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
>>>          pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>>>        pp_params.pool_size = dma_conf->dma_rx_size;
>>> -    num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
>>> -    pp_params.order = ilog2(num_pages);
>>> +    pp_params.order = order_base_2(num_pages);
>>>        pp_params.nid = dev_to_node(priv->device);
>>>        pp_params.dev = priv->device;
>>>        pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
>>> @@ -5574,19 +5578,20 @@ static int stmmac_rx(struct stmmac_priv *priv,
>>> int limit, u32 queue)
>>>                /* XDP program may expand or reduce tail */
>>>                buf1_len = ctx.xdp.data_end - ctx.xdp.data;
>>>    -            skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
>>> +            skb = napi_build_skb(page_address(buf->page),
>>> +                         rx_q->napi_skb_frag_size);
>>>                if (!skb) {
>>> +                page_pool_recycle_direct(rx_q->page_pool,
>>> +                             buf->page);
>>>                    rx_dropped++;
>>>                    count++;
>>>                    goto drain_data;
>>>                }
>>>                  /* XDP program may adjust header */
>>> -            skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
>>> +            skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);
>> The network subsystem still requires that the length
>> of each line of code should not exceed 80 characters.
>> So let's silence the warning:
>>
>> WARNING: line length of 81 exceeds 80 columns
>> #87: FILE: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:5592:
>> +            skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);
> I agree that &ctx.xdp could be made an onstack pointer to shorten these
> lines, but please don't spam with the checkpatch output.
>
> 1. It's author's responsibility to read netdev CI output on Patchwork,
>     reviewers shouldn't copy its logs.
Sorry, I shouldn't have made noise on the mailing list.
> 2. The only alternative without making a shortcut for &ctx.xdp is
>
> 			skb_reserve(skb,
> 				    ctx.xdp.data - ctx.xdp.data_hard_start);
>
> This looks really ugly and does more harm than good.
Agree!
>
> If you really want to help, pls come with good propositions how to avoid
> such warnings in an elegant way.

Simple. Just do as with buf1_len.

Thanks,

Yanteng

>
>> Thanks,
>> Yanteng
> Thanks,
> Olek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-13 12:03     ` Alexander Lobakin
  2025-01-13 15:16       ` Yanteng Si
@ 2025-01-13 16:48       ` Andrew Lunn
  2025-01-14 17:23         ` Alexander Lobakin
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Lunn @ 2025-01-13 16:48 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Yanteng Si, Furong Xu, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

> 1. It's author's responsibility to read netdev CI output on Patchwork,
>    reviewers shouldn't copy its logs.

I somewhat disagree with that. We want the author to have already run
all the static analysers before they post code. We don't want the
mailing list abused as a CI system.

So rather than pointing out a specific problem, it can be better to
say that static analyser XYZ is not happy with this patch, and ask the
author to run it and fix the issues it reports.

	Andrew


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-13 16:48       ` Andrew Lunn
@ 2025-01-14 17:23         ` Alexander Lobakin
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Lobakin @ 2025-01-14 17:23 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Yanteng Si, Furong Xu, netdev, linux-stm32, linux-arm-kernel,
	linux-kernel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 13 Jan 2025 17:48:22 +0100

>> 1. It's author's responsibility to read netdev CI output on Patchwork,
>>    reviewers shouldn't copy its logs.
> 
> I somewhat disagree with that. We want the author to of already run
> all the static analysers before they post code. We don't want the
> mailing list abused as a CI system.

Sure. Maybe I wasn't clear enough, but I don't encourage using our
MLs/CIs to test stuff :D What I meant is that reviewers shouldn't copy
stuff from the Patchwork output. The authors themselves should track
their series there, but only to make sure everything is fine, not in a
"let's see if I need to fix anything else" fashion.

> 
> So rather than pointing out a specific problem, it can be better to
> say that static analysers XZY is not happy with this patch, please run
> it and fix the issues it reports.

Right, probably the best way.

> 
> 	Andrew

Thanks,
Olek


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-01-14 17:27 UTC | newest]

Thread overview: 11+ messages
2025-01-10  9:53 [PATCH net-next v1 0/3] net: stmmac: RX performance improvement Furong Xu
2025-01-10  9:53 ` [PATCH net-next v1 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
2025-01-13  9:41   ` Yanteng Si
2025-01-13 12:03     ` Alexander Lobakin
2025-01-13 15:16       ` Yanteng Si
2025-01-13 16:48       ` Andrew Lunn
2025-01-14 17:23         ` Alexander Lobakin
2025-01-10  9:53 ` [PATCH net-next v1 2/3] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
2025-01-10  9:53 ` [PATCH net-next v1 3/3] net: stmmac: Optimize cache prefetch in RX path Furong Xu
2025-01-13 12:10   ` Alexander Lobakin
2025-01-13 12:37     ` Furong Xu
