* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-29 3:36 ` [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
@ 2025-08-29 16:34 ` Eric Dumazet
2025-08-29 22:39 ` Saeed Mahameed
2025-09-03 23:38 ` Amery Hung
2 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-08-29 16:34 UTC (permalink / raw)
To: cpaasch
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
netdev, linux-rdma, bpf
On Thu, Aug 28, 2025 at 8:36 PM Christoph Paasch via B4 Relay
<devnull+cpaasch.openai.com@kernel.org> wrote:
>
> From: Christoph Paasch <cpaasch@openai.com>
>
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
>
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
>
> This is of course not good when we are CPU-limited. It also causes a
> worse skb->len/truesize ratio.
>
> So, let's avoid copying parts of the payload to the linear part. We use
> eth_get_headlen() to parse the headers and compute the length of the
> protocol headers, which will be used to copy the relevant bits to the
> skb's linear part.
>
> We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
> stack needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
>
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
>
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.01 32547.82
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52531.67
>
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52896.06
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 85094.90
>
> Additional tests across a larger range of parameters w/ and w/o LRO, w/
> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> better performance with this patch.
>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-29 3:36 ` [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-08-29 16:34 ` Eric Dumazet
@ 2025-08-29 22:39 ` Saeed Mahameed
2025-09-03 23:38 ` Amery Hung
2 siblings, 0 replies; 16+ messages in thread
From: Saeed Mahameed @ 2025-08-29 22:39 UTC (permalink / raw)
To: cpaasch
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On 28 Aug 20:36, Christoph Paasch via B4 Relay wrote:
>From: Christoph Paasch <cpaasch@openai.com>
>
>[...]
>
>Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-29 3:36 ` [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-08-29 16:34 ` Eric Dumazet
2025-08-29 22:39 ` Saeed Mahameed
@ 2025-09-03 23:38 ` Amery Hung
2025-09-03 23:57 ` Christoph Paasch
2 siblings, 1 reply; 16+ messages in thread
From: Amery Hung @ 2025-09-03 23:38 UTC (permalink / raw)
To: cpaasch, Gal Pressman, Dragos Tatulea, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev
Cc: netdev, linux-rdma, bpf
On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
>
> [...]
>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> rq->buff.map_dir);
>
> + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> +
Hi,
I am building on top of this patchset and got a kernel crash. It was
triggered by attaching an xdp program.
I think the problem is skb->dev is still NULL here. It will be set later by:
mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
> frag_offset += headlen;
> byte_cnt -= headlen;
> linear_hr = skb_headroom(skb);
> @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> pagep->frags++;
> while (++pagep < frag_page);
> }
> +
> + headlen = eth_get_headlen(skb->dev, mxbuf->xdp.data, headlen);
> +
> __pskb_pull_tail(skb, headlen);
> } else {
> if (xdp_buff_has_frags(&mxbuf->xdp)) {
>
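[Editorial note: the fix itself is not part of this thread. Going by the analysis above, a plausible sketch is to use the net_device hanging off the receive queue, which is already valid at this point in the RX path. Treat the field name as an assumption — rq->netdev is how the device pointer appears elsewhere in this driver, but the actual follow-up commit may differ:]

```
-	headlen = eth_get_headlen(skb->dev, head_addr, headlen);
+	headlen = eth_get_headlen(rq->netdev, head_addr, headlen);
```

The same substitution would apply to the second eth_get_headlen() call in the XDP path quoted above.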
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-09-03 23:38 ` Amery Hung
@ 2025-09-03 23:57 ` Christoph Paasch
2025-09-04 0:11 ` Amery Hung
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Paasch @ 2025-09-03 23:57 UTC (permalink / raw)
To: Amery Hung
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@gmail.com> wrote:
>
>
>
> On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > [...]
> >
> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > ---
> > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > rq->buff.map_dir);
> >
> > + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > +
>
> Hi,
>
> I am building on top of this patchset and got a kernel crash. It was
> triggered by attaching an xdp program.
>
> I think the problem is skb->dev is still NULL here. It will be set later by:
> mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
Hmmm... Not sure what happened here...
I'm almost certain I tested with xdp as well...
I will try again later/tomorrow.
Thanks!
Christoph
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-09-03 23:57 ` Christoph Paasch
@ 2025-09-04 0:11 ` Amery Hung
2025-09-04 3:58 ` Christoph Paasch
0 siblings, 1 reply; 16+ messages in thread
From: Amery Hung @ 2025-09-04 0:11 UTC (permalink / raw)
To: Christoph Paasch
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On Wed, Sep 3, 2025 at 4:57 PM Christoph Paasch <cpaasch@openai.com> wrote:
>
> On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> >
> >
> > On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > > From: Christoph Paasch <cpaasch@openai.com>
> > >
> > > [...]
> > >
> > > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > > ---
> > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > > rq->buff.map_dir);
> > >
> > > + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > > +
> >
> > Hi,
> >
> > I am building on top of this patchset and got a kernel crash. It was
> > triggered by attaching an xdp program.
> >
> > I think the problem is skb->dev is still NULL here. It will be set later by:
> > mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
>
> Hmmm... Not sure what happened here...
> I'm almost certain I tested with xdp as well...
>
> I will try again later/tomorrow.
>
Here is the command that triggers the panic:
ip link set dev eth0 mtu 8000 xdp obj
/root/ksft-net-drv/net/lib/xdp_native.bpf.o sec xdp.frags
and I should have attached the log:
[ 2851.287387] BUG: kernel NULL pointer dereference, address: 0000000000000100
[ 2851.301329] #PF: supervisor read access in kernel mode
[ 2851.311602] #PF: error_code(0x0000) - not-present page
[ 2851.321879] PGD 0 P4D 0
[ 2851.326944] Oops: Oops: 0000 [#1] SMP
[ 2851.334272] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump: loaded
Tainted: G S E 6.17.0-rc1-gcf50ef415525 #305 NONE
[ 2851.357759] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
[ 2851.369252] Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1,
BIOS Y3DL401 09/04/2024
[ 2851.385787] RIP: 0010:eth_get_headlen+0x16/0x90
[ 2851.394850] Code: 5e 41 5f 5d c3 b8 f2 ff ff ff eb f0 cc cc cc cc
cc cc cc cc 0f 1f 44 00 00 41 56 53 48 83 ec 10 89 d3 83 fa 0e 72 68
49 89 f6 <48> 8b bf 00 01 00 00 44 0f b7 4e 0c c7 44 24 08 00 00 00 00
48 c7
[ 2851.432413] RSP: 0018:ffffc90000720cc8 EFLAGS: 00010212
[ 2851.442864] RAX: 0000000000000000 RBX: 000000000000008a RCX: 00000000000000a0
[ 2851.457141] RDX: 000000000000008a RSI: ffff8885a5aee100 RDI: 0000000000000000
[ 2851.471417] RBP: ffff8883d01f3900 R08: ffff888204c7c000 R09: 0000000000000000
[ 2851.485696] R10: ffff8883d01f3900 R11: ffff8885a5aee340 R12: ffff8885add00030
[ 2851.499969] R13: ffff8885add00030 R14: ffff8885a5aee100 R15: 0000000000000000
[ 2851.514245] FS: 0000000000000000(0000) GS:ffff8890b4427000(0000)
knlGS:0000000000000000
[ 2851.530433] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2851.541931] CR2: 0000000000000100 CR3: 000000107d412003 CR4: 00000000007726f0
[ 2851.556208] PKRU: 55555554
[ 2851.561623] Call Trace:
[ 2851.566514] <IRQ>
[ 2851.570540] mlx5e_skb_from_cqe_mpwrq_nonlinear+0x7af/0x8d0
[ 2851.581689] mlx5e_handle_rx_cqe_mpwrq+0xbc/0x180
[ 2851.591096] mlx5e_poll_rx_cq+0x2ef/0x780
[ 2851.599114] mlx5e_napi_poll+0x10c/0x710
[ 2851.606959] __napi_poll+0x28/0x160
[ 2851.613934] net_rx_action+0x1c0/0x350
[ 2851.621434] ? mlx5_eq_comp_int+0xdf/0x190
[ 2851.629628] ? sched_clock+0x5/0x10
[ 2851.636603] ? sched_clock_cpu+0xc/0x170
[ 2851.644450] handle_softirqs+0xd8/0x280
[ 2851.652121] __irq_exit_rcu.llvm.7416059615185659459+0x44/0xd0
[ 2851.663788] common_interrupt+0x85/0x90
[ 2851.671457] </IRQ>
[ 2851.675653] <TASK>
[ 2851.679850] asm_common_interrupt+0x22/0x40
Thanks for taking a look!
Amery
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-09-04 0:11 ` Amery Hung
@ 2025-09-04 3:58 ` Christoph Paasch
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Paasch @ 2025-09-04 3:58 UTC (permalink / raw)
To: Amery Hung
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On Wed, Sep 3, 2025 at 5:12 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> On Wed, Sep 3, 2025 at 4:57 PM Christoph Paasch <cpaasch@openai.com> wrote:
> >
> > On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > >
> > >
> > > On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > > > From: Christoph Paasch <cpaasch@openai.com>
> > > >
> > > > [...]
> > > >
> > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > > > ---
> > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > > > rq->buff.map_dir);
> > > >
> > > > + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > > > +
> > >
> > > Hi,
> > >
> > > I am building on top of this patchset and got a kernel crash. It was
> > > triggered by attaching an xdp program.
> > >
> > > I think the problem is skb->dev is still NULL here. It will be set later by:
> > > mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
> >
> > Hmmm... Not sure what happened here...
> > I'm almost certain I tested with xdp as well...
> >
> > I will try again later/tomorrow.
> >
>
> Here is the command that triggers the panic:
>
> ip link set dev eth0 mtu 8000 xdp obj
> /root/ksft-net-drv/net/lib/xdp_native.bpf.o sec xdp.frags
>
> and I should have attached the log:
>
> [...]
Oh, I see why I didn't hit the bug when testing with xdp... I wasn't
using a multi-buffer xdp prog and thus had to reduce the MTU and so
ended up not using the mlx5e_skb_from_cqe_mpwrq_nonlinear()
code-path...
I can reproduce the panic and will fix it.
Christoph