* [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
@ 2025-05-14 20:03 Tariq Toukan
  2025-05-15  0:26 ` Alexei Starovoitov
  2025-05-16 22:50 ` patchwork-bot+netdevbpf
  0 siblings, 2 replies; 6+ messages in thread
From: Tariq Toukan @ 2025-05-14 20:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	linux-rdma, linux-kernel, bpf, Moshe Shemesh, Mark Bloch,
	Gal Pressman, Carolina Jubran

From: Carolina Jubran <cjubran@nvidia.com>

CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.

This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.
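
Illustration (a self-contained userspace analogue with stand-in types,
not the driver code; the compiler option named below is the one that
CONFIG_INIT_STACK_ALL_ZERO selects):

/*
 * Illustrative userspace analogue only -- stand-in types, not the mlx5
 * driver code.  Built with -ftrivial-auto-var-init=zero, the "before"
 * variant declares a fresh stack struct that gets zero-initialized on
 * every call; the "after" variant reuses a struct embedded in the
 * long-lived per-queue context.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct xdp_buff_like {			/* stand-in for mlx5e_xdp_buff */
	void *data;
	void *data_end;
	uint8_t scratch[96];		/* makes the per-call zeroing visible */
};

struct rq_like {			/* stand-in for mlx5e_rq */
	struct xdp_buff_like mxbuf;	/* reused for every CQE */
};

/* Before: fresh stack buffer per CQE, zero-initialized on entry. */
static void handle_cqe_stack(void *va, size_t len)
{
	struct xdp_buff_like mxbuf;

	mxbuf.data = va;
	mxbuf.data_end = (char *)va + len;
	/* ... run the XDP program against &mxbuf ... */
}

/* After: reuse the per-RQ buffer; only the needed fields are rewritten. */
static void handle_cqe_rq(struct rq_like *rq, void *va, size_t len)
{
	struct xdp_buff_like *mxbuf = &rq->mxbuf;

	mxbuf->data = va;
	mxbuf->data_end = (char *)va + len;
	/* ... run the XDP program against mxbuf ... */
}

int main(void)
{
	static char frame[2048];
	struct rq_like rq = { 0 };

	handle_cqe_stack(frame, sizeof(frame));
	handle_cqe_rq(&rq, frame, sizeof(frame));
	puts("ok");
	return 0;
}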

With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.

Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.

Stack zeroing disabled:
- XDP_DROP:
    * baseline:                     31.47 Mpps
    * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)

- XDP_TX:
    * baseline:                     12.41 Mpps
    * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)

Stack zeroing enabled:
- XDP_DROP:
    * baseline:                     24.32 Mpps
    * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

- XDP_TX:
    * baseline:                     11.80 Mpps
    * baseline + per-RQ allocation: 12.24 Mpps (+3.72%)

Reported-by: Sebastiano Miano <mianosebastiano@gmail.com>
Reported-by: Samuel Dobron <sdobron@redhat.com>
Link: https://lore.kernel.org/all/CAMENy5pb8ea+piKLg5q5yRTMZacQqYWAoVLE1FE9WhQPq92E0g@mail.gmail.com/
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  7 ++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 --
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 81 ++++++++++---------
 3 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 32ed4963b8ad..5b0d03b3efe8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -520,6 +520,12 @@ struct mlx5e_xdpsq {
 	struct mlx5e_channel      *channel;
 } ____cacheline_aligned_in_smp;
 
+struct mlx5e_xdp_buff {
+	struct xdp_buff xdp;
+	struct mlx5_cqe64 *cqe;
+	struct mlx5e_rq *rq;
+};
+
 struct mlx5e_ktls_resync_resp;
 
 struct mlx5e_icosq {
@@ -716,6 +722,7 @@ struct mlx5e_rq {
 	struct mlx5e_xdpsq    *xdpsq;
 	DECLARE_BITMAP(flags, 8);
 	struct page_pool      *page_pool;
+	struct mlx5e_xdp_buff mxbuf;
 
 	/* AF_XDP zero-copy */
 	struct xsk_buff_pool  *xsk_pool;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 446e492c6bb8..46ab0a9e8cdd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -45,12 +45,6 @@
 	(MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
 	 sizeof(struct mlx5_wqe_inline_seg))
 
-struct mlx5e_xdp_buff {
-	struct xdp_buff xdp;
-	struct mlx5_cqe64 *cqe;
-	struct mlx5e_rq *rq;
-};
-
 /* XDP packets can be transmitted in different ways. On completion, we need to
  * distinguish between them to clean up things in a proper way.
  */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5fd70b4d55be..84b1ab8233b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1684,17 +1684,17 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct mlx5e_xdp_buff mxbuf;
+		struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
 		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
-				 cqe_bcnt, &mxbuf);
-		if (mlx5e_xdp_handle(rq, prog, &mxbuf))
+				 cqe_bcnt, mxbuf);
+		if (mlx5e_xdp_handle(rq, prog, mxbuf))
 			return NULL; /* page/packet was consumed by XDP */
 
-		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
-		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
-		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
+		rx_headroom = mxbuf->xdp.data - mxbuf->xdp.data_hard_start;
+		metasize = mxbuf->xdp.data - mxbuf->xdp.data_meta;
+		cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -1713,11 +1713,11 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
+	struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
 	struct mlx5e_wqe_frag_info *head_wi = wi;
 	u16 rx_headroom = rq->buff.headroom;
 	struct mlx5e_frag_page *frag_page;
 	struct skb_shared_info *sinfo;
-	struct mlx5e_xdp_buff mxbuf;
 	u32 frag_consumed_bytes;
 	struct bpf_prog *prog;
 	struct sk_buff *skb;
@@ -1737,8 +1737,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetch(va + rx_headroom);
 
 	mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
-			 frag_consumed_bytes, &mxbuf);
-	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
+			 frag_consumed_bytes, mxbuf);
+	sinfo = xdp_get_shared_info_from_buff(&mxbuf->xdp);
 	truesize = 0;
 
 	cqe_bcnt -= frag_consumed_bytes;
@@ -1750,8 +1750,9 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 
 		frag_consumed_bytes = min_t(u32, frag_info->frag_size, cqe_bcnt);
 
-		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf.xdp, frag_page,
-					       wi->offset, frag_consumed_bytes);
+		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf->xdp,
+					       frag_page, wi->offset,
+					       frag_consumed_bytes);
 		truesize += frag_info->frag_stride;
 
 		cqe_bcnt -= frag_consumed_bytes;
@@ -1760,7 +1761,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	}
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (prog && mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+	if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 			struct mlx5e_wqe_frag_info *pwi;
 
@@ -1770,21 +1771,23 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		return NULL; /* page/packet was consumed by XDP */
 	}
 
-	skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
-				     mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
-				     mxbuf.xdp.data_end - mxbuf.xdp.data,
-				     mxbuf.xdp.data - mxbuf.xdp.data_meta);
+	skb = mlx5e_build_linear_skb(
+		rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
+		mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
+		mxbuf->xdp.data_end - mxbuf->xdp.data,
+		mxbuf->xdp.data - mxbuf->xdp.data_meta);
 	if (unlikely(!skb))
 		return NULL;
 
 	skb_mark_for_recycle(skb);
 	head_wi->frag_page->frags++;
 
-	if (xdp_buff_has_frags(&mxbuf.xdp)) {
+	if (xdp_buff_has_frags(&mxbuf->xdp)) {
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
 		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
 					   sinfo->xdp_frags_size, truesize,
-					   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+					   xdp_buff_is_frag_pfmemalloc(
+						&mxbuf->xdp));
 
 		for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
 			pwi->frag_page->frags++;
@@ -1984,10 +1987,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_frag_page *head_page = frag_page;
+	struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
 	u32 frag_offset    = head_offset;
 	u32 byte_cnt       = cqe_bcnt;
 	struct skb_shared_info *sinfo;
-	struct mlx5e_xdp_buff mxbuf;
 	unsigned int truesize = 0;
 	struct bpf_prog *prog;
 	struct sk_buff *skb;
@@ -2033,9 +2036,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		}
 	}
 
-	mlx5e_fill_mxbuf(rq, cqe, va, linear_hr, linear_frame_sz, linear_data_len, &mxbuf);
+	mlx5e_fill_mxbuf(rq, cqe, va, linear_hr, linear_frame_sz,
+			 linear_data_len, mxbuf);
 
-	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
+	sinfo = xdp_get_shared_info_from_buff(&mxbuf->xdp);
 
 	while (byte_cnt) {
 		/* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */
@@ -2046,7 +2050,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		else
 			truesize += ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz));
 
-		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf.xdp, frag_page, frag_offset,
+		mlx5e_add_skb_shared_info_frag(rq, sinfo, &mxbuf->xdp,
+					       frag_page, frag_offset,
 					       pg_consumed_bytes);
 		byte_cnt -= pg_consumed_bytes;
 		frag_offset = 0;
@@ -2054,7 +2059,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	}
 
 	if (prog) {
-		if (mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 				struct mlx5e_frag_page *pfp;
 
@@ -2067,10 +2072,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start,
-					     linear_frame_sz,
-					     mxbuf.xdp.data - mxbuf.xdp.data_hard_start, 0,
-					     mxbuf.xdp.data - mxbuf.xdp.data_meta);
+		skb = mlx5e_build_linear_skb(
+			rq, mxbuf->xdp.data_hard_start, linear_frame_sz,
+			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
+			mxbuf->xdp.data - mxbuf->xdp.data_meta);
 		if (unlikely(!skb)) {
 			mlx5e_page_release_fragmented(rq, &wi->linear_page);
 			return NULL;
@@ -2080,13 +2085,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		wi->linear_page.frags++;
 		mlx5e_page_release_fragmented(rq, &wi->linear_page);
 
-		if (xdp_buff_has_frags(&mxbuf.xdp)) {
+		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
 
 			/* sinfo->nr_frags is reset by build_skb, calculate again. */
 			xdp_update_skb_shared_info(skb, frag_page - head_page,
 						   sinfo->xdp_frags_size, truesize,
-						   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+						   xdp_buff_is_frag_pfmemalloc(
+							&mxbuf->xdp));
 
 			pagep = head_page;
 			do
@@ -2097,12 +2103,13 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	} else {
 		dma_addr_t addr;
 
-		if (xdp_buff_has_frags(&mxbuf.xdp)) {
+		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
 
 			xdp_update_skb_shared_info(skb, sinfo->nr_frags,
 						   sinfo->xdp_frags_size, truesize,
-						   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
+						   xdp_buff_is_frag_pfmemalloc(
+							&mxbuf->xdp));
 
 			pagep = frag_page - sinfo->nr_frags;
 			do
@@ -2152,20 +2159,20 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct mlx5e_xdp_buff mxbuf;
+		struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
 		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, rq->buff.frame0_sz,
-				 cqe_bcnt, &mxbuf);
-		if (mlx5e_xdp_handle(rq, prog, &mxbuf)) {
+				 cqe_bcnt, mxbuf);
+		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				frag_page->frags++;
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
-		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
-		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
+		rx_headroom = mxbuf->xdp.data - mxbuf->xdp.data_hard_start;
+		metasize = mxbuf->xdp.data - mxbuf->xdp.data_meta;
+		cqe_bcnt = mxbuf->xdp.data_end - mxbuf->xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);

base-commit: 664bf117a30804b442a88a8462591bb23f5a0f22
-- 
2.31.1



* Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
  2025-05-14 20:03 [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead Tariq Toukan
@ 2025-05-15  0:26 ` Alexei Starovoitov
  2025-05-16 13:47   ` Tariq Toukan
  2025-05-16 22:50 ` patchwork-bot+netdevbpf
  1 sibling, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2025-05-15  0:26 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Network Development, linux-rdma, LKML, bpf, Moshe Shemesh,
	Mark Bloch, Gal Pressman, Carolina Jubran

On Wed, May 14, 2025 at 1:04 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Carolina Jubran <cjubran@nvidia.com>
>
> CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
> zero-initializing all stack variables on function entry. The mlx5 XDP
> RX path previously allocated a struct mlx5e_xdp_buff on the stack per
> received CQE, resulting in measurable performance degradation under
> this config.
>
> This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
> avoiding per-CQE stack allocations and repeated zeroing.
>
> With this change, XDP_DROP and XDP_TX performance matches that of
> kernels built without CONFIG_INIT_STACK_ALL_ZERO.
>
> Performance was measured on a ConnectX-6Dx using a single RX channel
> (1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
> net-next-6.15.
>
> Stack zeroing disabled:
> - XDP_DROP:
>     * baseline:                     31.47 Mpps
>     * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)
>
> - XDP_TX:
>     * baseline:                     12.41 Mpps
>     * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)

Looks good, but where are these gains coming from ?
The patch just moves mxbuf from stack to rq.
The number of operations should really be the same.

> Stack zeroing enabled:
> - XDP_DROP:
>     * baseline:                     24.32 Mpps
>     * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

This part makes sense.


* Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
  2025-05-15  0:26 ` Alexei Starovoitov
@ 2025-05-16 13:47   ` Tariq Toukan
  2025-05-16 14:43     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 6+ messages in thread
From: Tariq Toukan @ 2025-05-16 13:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Network Development, linux-rdma, LKML, bpf, Moshe Shemesh,
	Mark Bloch, Gal Pressman, Carolina Jubran, Sebastiano Miano,
	Samuel Dobron



On 15/05/2025 3:26, Alexei Starovoitov wrote:
> On Wed, May 14, 2025 at 1:04 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>>
>> From: Carolina Jubran <cjubran@nvidia.com>
>>
>> CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
>> zero-initializing all stack variables on function entry. The mlx5 XDP
>> RX path previously allocated a struct mlx5e_xdp_buff on the stack per
>> received CQE, resulting in measurable performance degradation under
>> this config.
>>
>> This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
>> avoiding per-CQE stack allocations and repeated zeroing.
>>
>> With this change, XDP_DROP and XDP_TX performance matches that of
>> kernels built without CONFIG_INIT_STACK_ALL_ZERO.
>>
>> Performance was measured on a ConnectX-6Dx using a single RX channel
>> (1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
>> net-next-6.15.
>>
>> Stack zeroing disabled:
>> - XDP_DROP:
>>      * baseline:                     31.47 Mpps
>>      * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)
>>
>> - XDP_TX:
>>      * baseline:                     12.41 Mpps
>>      * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)
> 
> Looks good, but where are these gains coming from ?
> The patch just moves mxbuf from stack to rq.
> The number of operations should really be the same.
> 

I guess it's cache related. Hot/cold areas, alignments, movement of 
other fields in the mlx5e_rq structure...
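
One way to sanity-check that is to compare the struct layout
before/after with pahole (assuming a build with debug info), e.g.:

  pahole -C mlx5e_rq mlx5_core.ko

and see where the new mxbuf member lands relative to the hot RX fields.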

>> Stack zeroing enabled:
>> - XDP_DROP:
>>      * baseline:                     24.32 Mpps
>>      * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)
> 
> This part makes sense.



* Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
  2025-05-16 13:47   ` Tariq Toukan
@ 2025-05-16 14:43     ` Jesper Dangaard Brouer
  2025-05-21  8:56       ` Samuel Dobron
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2025-05-16 14:43 UTC (permalink / raw)
  To: Tariq Toukan, Alexei Starovoitov
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Network Development, linux-rdma,
	LKML, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Carolina Jubran, Sebastiano Miano, Samuel Dobron



On 16/05/2025 15.47, Tariq Toukan wrote:
> 
> 
> On 15/05/2025 3:26, Alexei Starovoitov wrote:
>> On Wed, May 14, 2025 at 1:04 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>>>
>>> From: Carolina Jubran <cjubran@nvidia.com>
>>>
>>> CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
>>> zero-initializing all stack variables on function entry. The mlx5 XDP
>>> RX path previously allocated a struct mlx5e_xdp_buff on the stack per
>>> received CQE, resulting in measurable performance degradation under
>>> this config.
>>>
>>> This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
>>> avoiding per-CQE stack allocations and repeated zeroing.
>>>
>>> With this change, XDP_DROP and XDP_TX performance matches that of
>>> kernels built without CONFIG_INIT_STACK_ALL_ZERO.
>>>
>>> Performance was measured on a ConnectX-6Dx using a single RX channel
>>> (1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
>>> net-next-6.15.
>>>
>>> Stack zeroing disabled:
>>> - XDP_DROP:
>>>      * baseline:                     31.47 Mpps
>>>      * baseline + per-RQ allocation: 32.31 Mpps (+2.68%)
>>>

31.47 Mpps = 31.77 nanosec per packet
32.31 Mpps = 30.95 nanosec per packet
Improvement:  0.82 nanosec faster

>>> - XDP_TX:
>>>      * baseline:                     12.41 Mpps
>>>      * baseline + per-RQ allocation: 12.95 Mpps (+4.30%)
>>

The XDP_TX numbers are actually lower than I expected.
Hmm... I wonder if we regressed here(?)

12.41 Mpps = 80.58 nanosec per packet
12.95 Mpps = 77.22 nanosec per packet
Improvement:  3.36 nanosec faster

>> Looks good, but where are these gains coming from ?
>> The patch just moves mxbuf from stack to rq.
>> The number of operations should really be the same.
>>
> 
> I guess it's cache related. Hot/cold areas, alignments, movement of 
> other fields in the mlx5e_rq structure...

The improvement for XDP_DROP (see calc above) is so small in
nanoseconds that it is hard to measure accurately/stably on any system.

The improvement for XDP_TX is above 2 nanosec, which looks like an 
actual improvement...


>>> Stack zeroing enabled:
>>> - XDP_DROP:
>>>      * baseline:                     24.32 Mpps
>>>      * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)
>>
>> This part makes sense.
> 

Yes, this makes sense as it is a measurable improvement.

24.32 Mpps = 41.12 nanosec per packet
32.27 Mpps = 30.99 nanosec per packet
Improvement: 10.13 nanosec faster
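
For reference, the conversion above is just ns/packet = 1000 / Mpps; a
trivial standalone check in C:

#include <stdio.h>

/* 1e9 ns per second divided by (Mpps * 1e6) packets per second */
static double ns_per_pkt(double mpps)
{
	return 1000.0 / mpps;
}

int main(void)
{
	printf("%.2f %.2f\n", ns_per_pkt(24.32), ns_per_pkt(32.27)); /* 41.12 30.99 */
	return 0;
}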

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

--Jesper


* Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
  2025-05-14 20:03 [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead Tariq Toukan
  2025-05-15  0:26 ` Alexei Starovoitov
@ 2025-05-16 22:50 ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 6+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-05-16 22:50 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: davem, kuba, pabeni, edumazet, andrew+netdev, saeedm, leon, ast,
	daniel, hawk, john.fastabend, netdev, linux-rdma, linux-kernel,
	bpf, moshe, mbloch, gal, cjubran

Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 14 May 2025 23:03:52 +0300 you wrote:
> From: Carolina Jubran <cjubran@nvidia.com>
> 
> CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
> zero-initializing all stack variables on function entry. The mlx5 XDP
> RX path previously allocated a struct mlx5e_xdp_buff on the stack per
> received CQE, resulting in measurable performance degradation under
> this config.
> 
> [...]

Here is the summary with links:
  - [net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
    https://git.kernel.org/netdev/net-next/c/b66b76a82c88

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next] net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead
  2025-05-16 14:43     ` Jesper Dangaard Brouer
@ 2025-05-21  8:56       ` Samuel Dobron
  0 siblings, 0 replies; 6+ messages in thread
From: Samuel Dobron @ 2025-05-21  8:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tariq Toukan, Alexei Starovoitov, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Saeed Mahameed,
	Leon Romanovsky, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Network Development, linux-rdma, LKML, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Carolina Jubran,
	Sebastiano Miano, Benjamin Poirier, Toke Hoiland Jorgensen

Hey,
I only ran tests on a kernel with stack zeroing enabled.

> The XDP_TX numbers are actually lower than I expected.
> Hmm... I wonder if we regressed here(?)

The absolute numbers look more or less the same,
so I would say no. The first TX results we have are from
6.13.0-0.rc1.20241202gite70140ba0d2b.14.eln144;
compared to 6.15.0-0.rc5.250509g9c69f8884904.47.eln148,
there is actually a 1% improvement. But that might be
random fluctuation (the numbers are based on 1 iteration).
We don't have data for earlier kernels...

However, for TX I get better results:

XDP_TX (DPA, swap MACs):
- baseline: 9.75 Mpps
- patched:  10.78 Mpps (+10%)

Maybe it's just a different test configuration? We use xdp-bench
in DPA mode + swapping MACs.

XDP_DROP:
> >>> Stack zeroing enabled:
> >>> - XDP_DROP:
> >>>      * baseline:                     24.32 Mpps
> >>>      * baseline + per-RQ allocation: 32.27 Mpps (+32.7%)

Same relative improvement on my side:
- baseline: 16.6 Mpps
- patched:  24.6 Mpps (+32.5%)

Seems to be fixed :)

Sam.

