netdev.vger.kernel.org archive mirror
* [PATCH net-next v2 0/3] net: stmmac: RX performance improvement
@ 2025-01-13 14:20 Furong Xu
  2025-01-13 14:20 ` [PATCH net-next v2 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Furong Xu @ 2025-01-13 14:20 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

This series improves RX performance significantly: a ~34% TCP RX throughput
boost has been observed with DWXGMAC CORE 3.20a running on Cortex-A65 CPUs,
from 2.18 Gbits/sec to 2.92 Gbits/sec.

---
Changes in v2:
  1. No cache prefetch for frags (Alexander Lobakin)
  2. Fix code style warning reported by netdev CI on Patchwork

  v1: https://patchwork.kernel.org/project/netdevbpf/list/?series=924103&state=%2A&archive=both
---

Furong Xu (3):
  net: stmmac: Switch to zero-copy in non-XDP RX path
  net: stmmac: Set page_pool_params.max_len to a precise size
  net: stmmac: Optimize cache prefetch in RX path

 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 33 +++++++++++--------
 .../net/ethernet/stmicro/stmmac/stmmac_xdp.h  |  1 -
 3 files changed, 20 insertions(+), 15 deletions(-)

-- 
2.34.1



* [PATCH net-next v2 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path
  2025-01-13 14:20 [PATCH net-next v2 0/3] net: stmmac: RX performance improvement Furong Xu
@ 2025-01-13 14:20 ` Furong Xu
  2025-01-13 14:20 ` [PATCH net-next v2 2/3] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Furong Xu @ 2025-01-13 14:20 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

Avoid memcpy in the non-XDP RX path by building SKBs directly on top of
the page pool buffers and marking them for recycling in the upper
network stack.

This patch brings a ~11.5% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.18 Gbits/sec to 2.43 Gbits/sec.
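
The shape of the change, as a minimal sketch (condensed from the hunk
below, not literal driver code):

	/* Before: allocate a fresh SKB, copy the payload out of the page,
	 * then return the page to the pool right away.
	 */
	skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
	skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
	skb_put(skb, buf1_len);
	page_pool_recycle_direct(rx_q->page_pool, buf->page);

	/* After: build the SKB directly on the page pool buffer; the page
	 * goes back to the pool only when the stack has consumed the SKB.
	 */
	skb = napi_build_skb(page_address(buf->page), rx_q->napi_skb_frag_size);
	skb_reserve(skb, ctx.xdp.data - ctx.xdp.data_hard_start);
	skb_put(skb, buf1_len);
	skb_mark_for_recycle(skb);
	buf->page = NULL;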

Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  1 +
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 26 ++++++++++++-------
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index e8dbce20129c..f05cae103d83 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -126,6 +126,7 @@ struct stmmac_rx_queue {
 	unsigned int cur_rx;
 	unsigned int dirty_rx;
 	unsigned int buf_alloc_num;
+	unsigned int napi_skb_frag_size;
 	dma_addr_t dma_rx_phy;
 	u32 rx_tail_addr;
 	unsigned int state_saved;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 58b013528dea..6ec7bc61df9b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1330,7 +1330,7 @@ static unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
 	if (stmmac_xdp_is_enabled(priv))
 		return XDP_PACKET_HEADROOM;
 
-	return 0;
+	return NET_SKB_PAD;
 }
 
 static int stmmac_set_bfsize(int mtu, int bufsize)
@@ -2029,17 +2029,21 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
 	struct stmmac_channel *ch = &priv->channel[queue];
 	bool xdp_prog = stmmac_xdp_is_enabled(priv);
 	struct page_pool_params pp_params = { 0 };
-	unsigned int num_pages;
+	unsigned int dma_buf_sz_pad, num_pages;
 	unsigned int napi_id;
 	int ret;
 
+	dma_buf_sz_pad = stmmac_rx_offset(priv) + dma_conf->dma_buf_sz +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	num_pages = DIV_ROUND_UP(dma_buf_sz_pad, PAGE_SIZE);
+
 	rx_q->queue_index = queue;
 	rx_q->priv_data = priv;
+	rx_q->napi_skb_frag_size = num_pages * PAGE_SIZE;
 
 	pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 	pp_params.pool_size = dma_conf->dma_rx_size;
-	num_pages = DIV_ROUND_UP(dma_conf->dma_buf_sz, PAGE_SIZE);
-	pp_params.order = ilog2(num_pages);
+	pp_params.order = order_base_2(num_pages);
 	pp_params.nid = dev_to_node(priv->device);
 	pp_params.dev = priv->device;
 	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
@@ -5574,22 +5578,26 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 		}
 
 		if (!skb) {
+			unsigned int head_pad_len;
+
 			/* XDP program may expand or reduce tail */
 			buf1_len = ctx.xdp.data_end - ctx.xdp.data;
 
-			skb = napi_alloc_skb(&ch->rx_napi, buf1_len);
+			skb = napi_build_skb(page_address(buf->page),
+					     rx_q->napi_skb_frag_size);
 			if (!skb) {
+				page_pool_recycle_direct(rx_q->page_pool,
+							 buf->page);
 				rx_dropped++;
 				count++;
 				goto drain_data;
 			}
 
 			/* XDP program may adjust header */
-			skb_copy_to_linear_data(skb, ctx.xdp.data, buf1_len);
+			head_pad_len = ctx.xdp.data - ctx.xdp.data_hard_start;
+			skb_reserve(skb, head_pad_len);
 			skb_put(skb, buf1_len);
-
-			/* Data payload copied into SKB, page ready for recycle */
-			page_pool_recycle_direct(rx_q->page_pool, buf->page);
+			skb_mark_for_recycle(skb);
 			buf->page = NULL;
 		} else if (buf1_len) {
 			dma_sync_single_for_cpu(priv->device, buf->addr,
-- 
2.34.1



* [PATCH net-next v2 2/3] net: stmmac: Set page_pool_params.max_len to a precise size
  2025-01-13 14:20 [PATCH net-next v2 0/3] net: stmmac: RX performance improvement Furong Xu
  2025-01-13 14:20 ` [PATCH net-next v2 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
@ 2025-01-13 14:20 ` Furong Xu
  2025-01-13 14:20 ` [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path Furong Xu
  2025-01-13 14:27 ` [PATCH net-next v2 0/3] net: stmmac: RX performance improvement Alexander Lobakin
  3 siblings, 0 replies; 9+ messages in thread
From: Furong Xu @ 2025-01-13 14:20 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

The DMA engine never writes more than dma_buf_sz bytes of a received
frame into a page buffer; the remaining space is either unused or used
exclusively by the CPU.
Setting page_pool_params.max_len to nearly the full size of the page(s)
gains nothing and only wastes CPU cycles on cache maintenance.

For a standard MTU of 1500, dma_buf_sz is set to 1536, and this patch
brings a ~16.9% driver performance improvement in a TCP RX throughput
test with the iPerf tool on a single isolated Cortex-A65 CPU core, from
2.43 Gbits/sec to 2.84 Gbits/sec.
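
The resulting pool setup, and why a precise max_len matters (a simplified
sketch; it assumes the usual PP_FLAG_DMA_SYNC_DEV behaviour of the
page_pool core):

	pp_params.flags   = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
	pp_params.offset  = stmmac_rx_offset(priv);	/* headroom before the frame */
	pp_params.max_len = dma_conf->dma_buf_sz;	/* e.g. 1536 for MTU 1500 */

	/* With PP_FLAG_DMA_SYNC_DEV the pool syncs at most max_len bytes
	 * (starting at offset) back to the device when a page is recycled.
	 * The old bound, STMMAC_MAX_RX_BUF_SIZE(num_pages), is
	 * num_pages * PAGE_SIZE - XDP_PACKET_HEADROOM, i.e. ~3840 bytes on
	 * a 4K-page system, even though the DMA engine writes at most
	 * dma_buf_sz (1536) bytes.
	 */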

Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h  | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 6ec7bc61df9b..ca340fd8c937 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2048,7 +2048,7 @@ static int __alloc_dma_rx_desc_resources(struct stmmac_priv *priv,
 	pp_params.dev = priv->device;
 	pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	pp_params.offset = stmmac_rx_offset(priv);
-	pp_params.max_len = STMMAC_MAX_RX_BUF_SIZE(num_pages);
+	pp_params.max_len = dma_conf->dma_buf_sz;
 
 	rx_q->page_pool = page_pool_create(&pp_params);
 	if (IS_ERR(rx_q->page_pool)) {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h b/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
index 896dc987d4ef..77ce8cfbe976 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_xdp.h
@@ -4,7 +4,6 @@
 #ifndef _STMMAC_XDP_H_
 #define _STMMAC_XDP_H_
 
-#define STMMAC_MAX_RX_BUF_SIZE(num)	(((num) * PAGE_SIZE) - XDP_PACKET_HEADROOM)
 #define STMMAC_RX_DMA_ATTR	(DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
 
 int stmmac_xdp_setup_pool(struct stmmac_priv *priv, struct xsk_buff_pool *pool,
-- 
2.34.1



* [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-13 14:20 [PATCH net-next v2 0/3] net: stmmac: RX performance improvement Furong Xu
  2025-01-13 14:20 ` [PATCH net-next v2 1/3] net: stmmac: Switch to zero-copy in non-XDP RX path Furong Xu
  2025-01-13 14:20 ` [PATCH net-next v2 2/3] net: stmmac: Set page_pool_params.max_len to a precise size Furong Xu
@ 2025-01-13 14:20 ` Furong Xu
  2025-01-14 23:31   ` Joe Damato
  2025-01-13 14:27 ` [PATCH net-next v2 0/3] net: stmmac: RX performance improvement Alexander Lobakin
  3 siblings, 1 reply; 9+ messages in thread
From: Furong Xu @ 2025-01-13 14:20 UTC (permalink / raw)
  To: netdev, linux-stm32, linux-arm-kernel, linux-kernel
  Cc: Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr, Furong Xu

The current code prefetches cache lines for the received frame first and
only then calls dma_sync_single_for_cpu() on it, which is wrong: the
cache prefetch should be triggered after dma_sync_single_for_cpu().

This patch brings a ~2.8% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core, from 2.84 Gbits/sec to 2.92 Gbits/sec.
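
In other words, only the ordering changes (a minimal before/after sketch
using the same calls as the hunk below):

	/* Wrong: the prefetch runs before the buffer is synced for the CPU,
	 * so on a non-coherent system the lines it pulls in are invalidated
	 * again by dma_sync_single_for_cpu() and the prefetch is wasted.
	 */
	prefetch(page_address(buf->page) + buf->page_offset);
	dma_sync_single_for_cpu(priv->device, buf->addr, buf1_len, dma_dir);

	/* Right: sync first, then warm the cache with the now-valid data. */
	dma_sync_single_for_cpu(priv->device, buf->addr, buf1_len, dma_dir);
	prefetch(page_address(buf->page) + buf->page_offset);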

Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ca340fd8c937..b60f2f27140c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5500,10 +5500,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 		/* Buffer is good. Go on. */
 
-		prefetch(page_address(buf->page) + buf->page_offset);
-		if (buf->sec_page)
-			prefetch(page_address(buf->sec_page));
-
 		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
 		len += buf1_len;
 		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
@@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 			dma_sync_single_for_cpu(priv->device, buf->addr,
 						buf1_len, dma_dir);
+			prefetch(page_address(buf->page) + buf->page_offset);
 
 			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
 			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),
-- 
2.34.1



* Re: [PATCH net-next v2 0/3] net: stmmac: RX performance improvement
  2025-01-13 14:20 [PATCH net-next v2 0/3] net: stmmac: RX performance improvement Furong Xu
                   ` (2 preceding siblings ...)
  2025-01-13 14:20 ` [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path Furong Xu
@ 2025-01-13 14:27 ` Alexander Lobakin
  3 siblings, 0 replies; 9+ messages in thread
From: Alexander Lobakin @ 2025-01-13 14:27 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maxime Coquelin, xfr

From: Furong Xu <0x1207@gmail.com>
Date: Mon, 13 Jan 2025 22:20:28 +0800

> This series improves RX performance significantly: a ~34% TCP RX throughput
> boost has been observed with DWXGMAC CORE 3.20a running on Cortex-A65 CPUs,
> from 2.18 Gbits/sec to 2.92 Gbits/sec.

Series:

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Thanks,
Olek


* Re: [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-13 14:20 ` [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path Furong Xu
@ 2025-01-14 23:31   ` Joe Damato
  2025-01-15  2:20     ` Jakub Kicinski
  2025-01-15  2:33     ` Furong Xu
  0 siblings, 2 replies; 9+ messages in thread
From: Joe Damato @ 2025-01-14 23:31 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On Mon, Jan 13, 2025 at 10:20:31PM +0800, Furong Xu wrote:
> The current code prefetches cache lines for the received frame first and
> only then calls dma_sync_single_for_cpu() on it, which is wrong: the
> cache prefetch should be triggered after dma_sync_single_for_cpu().
> 
> This patch brings a ~2.8% driver performance improvement in a TCP RX
> throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
> core, from 2.84 Gbits/sec to 2.92 Gbits/sec.
> 
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index ca340fd8c937..b60f2f27140c 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5500,10 +5500,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  
>  		/* Buffer is good. Go on. */
>  
> -		prefetch(page_address(buf->page) + buf->page_offset);
> -		if (buf->sec_page)
> -			prefetch(page_address(buf->sec_page));
> -
>  		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
>  		len += buf1_len;
>  		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
> @@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>  
>  			dma_sync_single_for_cpu(priv->device, buf->addr,
>  						buf1_len, dma_dir);
> +			prefetch(page_address(buf->page) + buf->page_offset);

Minor nit: I've seen authors of other drivers using net_prefetch.
Probably not worth a re-roll just for something this minor.
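
For reference, net_prefetch() (from include/linux/netdevice.h) is roughly:

	static inline void net_prefetch(void *p)
	{
		prefetch(p);
	#if L1_CACHE_BYTES < 128
		prefetch((u8 *)p + L1_CACHE_BYTES);
	#endif
	}

i.e. on CPUs with cache lines smaller than 128 bytes it issues a second
prefetch, which tends to cover the whole header area of the frame.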


* Re: [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-14 23:31   ` Joe Damato
@ 2025-01-15  2:20     ` Jakub Kicinski
  2025-01-15  2:33     ` Furong Xu
  1 sibling, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2025-01-15  2:20 UTC (permalink / raw)
  To: Joe Damato
  Cc: Furong Xu, netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Maxime Coquelin, xfr

On Tue, 14 Jan 2025 15:31:05 -0800 Joe Damato wrote:
> > @@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> >  
> >  			dma_sync_single_for_cpu(priv->device, buf->addr,
> >  						buf1_len, dma_dir);
> > +			prefetch(page_address(buf->page) + buf->page_offset);  
> 
> Minor nit: I've seen authors of other drivers using net_prefetch.
> Probably not worth a re-roll just for something this minor.

Let's respin. I don't know how likely stmmac is to be integrated into
an SoC with 64B cachelines these days, but since you caught this -
why not potentially save someone from investigating this later.
-- 
pw-bot: cr


* Re: [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-14 23:31   ` Joe Damato
  2025-01-15  2:20     ` Jakub Kicinski
@ 2025-01-15  2:33     ` Furong Xu
  2025-01-15 17:27       ` Joe Damato
  1 sibling, 1 reply; 9+ messages in thread
From: Furong Xu @ 2025-01-15  2:33 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On Tue, 14 Jan 2025 15:31:05 -0800, Joe Damato <jdamato@fastly.com> wrote:

> On Mon, Jan 13, 2025 at 10:20:31PM +0800, Furong Xu wrote:
> > The current code prefetches cache lines for the received frame first and
> > only then calls dma_sync_single_for_cpu() on it, which is wrong: the
> > cache prefetch should be triggered after dma_sync_single_for_cpu().
> > 
> > This patch brings a ~2.8% driver performance improvement in a TCP RX
> > throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
> > core, from 2.84 Gbits/sec to 2.92 Gbits/sec.
> > 
> > Signed-off-by: Furong Xu <0x1207@gmail.com>
> > ---
> >  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index ca340fd8c937..b60f2f27140c 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -5500,10 +5500,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> >  
> >  		/* Buffer is good. Go on. */
> >  
> > -		prefetch(page_address(buf->page) + buf->page_offset);
> > -		if (buf->sec_page)
> > -			prefetch(page_address(buf->sec_page));
> > -
> >  		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
> >  		len += buf1_len;
> >  		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
> > @@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> >  
> >  			dma_sync_single_for_cpu(priv->device, buf->addr,
> >  						buf1_len, dma_dir);
> > +			prefetch(page_address(buf->page) + buf->page_offset);  
> 
> Minor nit: I've seen authors of other drivers using net_prefetch.
> Probably not worth a re-roll just for something this minor.

After switching to net_prefetch(), I get another 4.5% throughput improvement :)
Thanks! This is definitely worth a v3 of this series.

pw-bot: changes-requested


* Re: [PATCH net-next v2 3/3] net: stmmac: Optimize cache prefetch in RX path
  2025-01-15  2:33     ` Furong Xu
@ 2025-01-15 17:27       ` Joe Damato
  0 siblings, 0 replies; 9+ messages in thread
From: Joe Damato @ 2025-01-15 17:27 UTC (permalink / raw)
  To: Furong Xu
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexander Lobakin, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, xfr

On Wed, Jan 15, 2025 at 10:33:58AM +0800, Furong Xu wrote:
> On Tue, 14 Jan 2025 15:31:05 -0800, Joe Damato <jdamato@fastly.com> wrote:
> 
> > On Mon, Jan 13, 2025 at 10:20:31PM +0800, Furong Xu wrote:
> > > The current code prefetches cache lines for the received frame first and
> > > only then calls dma_sync_single_for_cpu() on it, which is wrong: the
> > > cache prefetch should be triggered after dma_sync_single_for_cpu().
> > > 
> > > This patch brings a ~2.8% driver performance improvement in a TCP RX
> > > throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
> > > core, from 2.84 Gbits/sec to 2.92 Gbits/sec.
> > > 
> > > Signed-off-by: Furong Xu <0x1207@gmail.com>
> > > ---
> > >  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
> > >  1 file changed, 1 insertion(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > index ca340fd8c937..b60f2f27140c 100644
> > > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > @@ -5500,10 +5500,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> > >  
> > >  		/* Buffer is good. Go on. */
> > >  
> > > -		prefetch(page_address(buf->page) + buf->page_offset);
> > > -		if (buf->sec_page)
> > > -			prefetch(page_address(buf->sec_page));
> > > -
> > >  		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
> > >  		len += buf1_len;
> > >  		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
> > > @@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> > >  
> > >  			dma_sync_single_for_cpu(priv->device, buf->addr,
> > >  						buf1_len, dma_dir);
> > > +			prefetch(page_address(buf->page) + buf->page_offset);  
> > 
> > Minor nit: I've seen authors of other drivers using net_prefetch.
> > Probably not worth a re-roll just for something this minor.
> 
> After switching to net_prefetch(), I get another 4.5% throughput improvement :)
> Thanks! This is definitely worth a v3 of this series.

No worries. For what it's worth, it looks like there are a few other
instances in this driver where net_prefetch or net_prefetchw can be
used instead. That might be better as a followup / cleanup and
separate from this series though.

Just thought I'd mention it as you have a way to test the
improvements and I, unfortunately, do not have one of these devices.

