* [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing
@ 2025-09-04 22:53 Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
0 siblings, 2 replies; 10+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-09-04 22:53 UTC (permalink / raw)
To: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Amery Hung
Cc: netdev, linux-rdma, bpf, Christoph Paasch
When LRO is enabled on the mlx5 NIC, mlx5e_skb_from_cqe_mpwrq_nonlinear()
copies parts of the payload to the linear part of the skb.

This triggers suboptimal processing in GRO and hurts throughput.

This patch series addresses this by using eth_get_headlen() to compute
the size of the protocol headers and copying only those bytes. This
results in a significant throughput improvement (detailed results in the
specific patch).
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
Changes in v5:
- Fix patch #2 NULL-ptr deref when an xdp-prog is set. Instead of using
skb->dev for eth_get_headlen, use rq->netdev. skb->dev will not be set
in the xdp code-path. (Reported-by: Amery Hung <ameryhung@gmail.com>)
- Kept previous Reviewed-by as the change is minor.
- Link to v4: https://lore.kernel.org/r/20250828-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v4-0-bfcd5033a77c@openai.com
Changes in v4:
- Use eth_get_headlen() instead of building a dissector based on
  struct mlx5_cqe64. This mimics what other drivers are doing as well.
  (Eric Dumazet <edumazet@google.com>)
- Link to v3: https://lore.kernel.org/r/20250825-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v3-0-5527e9eb6efc@openai.com
Changes in v3:
- Avoid computing headlen when it is not absolutely necessary (e.g., xdp
decides to "consume" the packet) (Dragos Tatulea <dtatulea@nvidia.com> & Jakub Kicinski <kuba@kernel.org>)
- Given the above change, consolidate the check for min3(...) in the new
function to avoid code duplication.
- Make sure local variables are in reverse xmas-tree order.
- Refine comment about why the check for l4_type works as-is.
- Link to v2: https://lore.kernel.org/r/20250816-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v2-0-b11b30bc2d10@openai.com
Changes in v2:
- Refine commit-message with more info and testing data
- Make mlx5e_cqe_get_min_hdr_len() return MLX5E_RX_MAX_HEAD when l3_type
is neither IPv4 nor IPv6. Same for the l4_type. That way behavior is
unchanged for other traffic types.
- Rename mlx5e_cqe_get_min_hdr_len to mlx5e_cqe_estimate_hdr_len
- Link to v1: https://lore.kernel.org/r/20250713-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v1-0-ecaed8c2844e@openai.com
---
Christoph Paasch (2):
net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear
net/mlx5: Avoid copying payload to the skb's linear part
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
---
base-commit: 5ef04a7b068cbb828eba226aacb42f880f7924d7
change-id: 20250712-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-6524917455a6
Best regards,
--
Christoph Paasch <cpaasch@openai.com>
^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear
  2025-09-04 22:53 [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
@ 2025-09-04 22:53 ` Christoph Paasch via B4 Relay
  2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-09-04 22:53 UTC (permalink / raw)
  To: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung
  Cc: netdev, linux-rdma, bpf, Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

Doing the call to dma_sync_single_for_cpu() earlier will allow us to
adjust headlen based on the actual size of the protocol headers.

Doing this earlier means that we don't need to call
mlx5e_copy_skb_header() anymore and rather can call
skb_copy_to_linear_data() directly.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..8bedbda522808cbabc8e62ae91a8c25d66725ebb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2005,17 +2005,19 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	struct skb_shared_info *sinfo;
 	unsigned int truesize = 0;
 	struct bpf_prog *prog;
+	void *va, *head_addr;
 	struct sk_buff *skb;
 	u32 linear_frame_sz;
 	u16 linear_data_len;
 	u16 linear_hr;
-	void *va;
 
 	prog = rcu_dereference(rq->xdp_prog);
 
+	head_addr = netmem_address(head_page->netmem) + head_offset;
+
 	if (prog) {
 		/* area for bpf_xdp_[store|load]_bytes */
-		net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
+		net_prefetchw(head_addr);
 		if (unlikely(mlx5e_page_alloc_fragmented(rq->page_pool,
 							 &wi->linear_page))) {
 			rq->stats->buff_alloc_err++;
@@ -2028,6 +2030,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		linear_data_len = 0;
 		linear_frame_sz = MLX5_SKB_FRAG_SZ(linear_hr + MLX5E_RX_MAX_HEAD);
 	} else {
+		dma_addr_t addr;
+
 		skb = napi_alloc_skb(rq->cq.napi,
 				     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
 		if (unlikely(!skb)) {
@@ -2039,6 +2043,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		net_prefetchw(va); /* xdp_frame data area */
 		net_prefetchw(skb->data);
 
+		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
+		dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
+					rq->buff.map_dir);
+
 		frag_offset += headlen;
 		byte_cnt -= headlen;
 		linear_hr = skb_headroom(skb);
@@ -2117,8 +2125,6 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		}
 		__pskb_pull_tail(skb, headlen);
 	} else {
-		dma_addr_t addr;
-
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
 
@@ -2133,9 +2139,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			while (++pagep < frag_page);
 		}
 		/* copy header */
-		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
-		mlx5e_copy_skb_header(rq, skb, head_page->netmem, addr,
-				      head_offset, head_offset, headlen);
+		skb_copy_to_linear_data(skb, head_addr, headlen);
+
 		/* skb linear part was allocated with headlen and aligned to long */
 		skb->tail += headlen;
 		skb->len += headlen;

-- 
2.50.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread
* [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-04 22:53 [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
  2025-09-04 22:53 ` [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Christoph Paasch via B4 Relay
@ 2025-09-04 22:53 ` Christoph Paasch via B4 Relay
  2025-09-04 23:30   ` Amery Hung
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-09-04 22:53 UTC (permalink / raw)
  To: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung
  Cc: netdev, linux-rdma, bpf, Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
bytes from the page-pool to the skb's linear part. Those 256 bytes
include part of the payload.

When attempting to do GRO in skb_gro_receive, if headlen > data_offset
(and skb->head_frag is not set), we end up aggregating packets in the
frag_list.

This is of course not good when we are CPU-limited. It also causes a
worse skb->len/truesize ratio.

So, let's avoid copying parts of the payload to the linear part. We use
eth_get_headlen() to parse the headers and compute the length of the
protocol headers, which will be used to copy the relevant bits to the
skb's linear part.

We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
stack needs to call pskb_may_pull() later on, we don't need to reallocate
memory.

This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
LRO enabled):

BEFORE:
=======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.01    32547.82

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.00    52531.67

AFTER:
======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.00    52896.06

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.00    85094.90

Additional tests across a larger range of parameters w/ and w/o LRO, w/
and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
better performance with this patch.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
 					rq->buff.map_dir);
 
+		headlen = eth_get_headlen(rq->netdev, head_addr, headlen);
+
 		frag_offset += headlen;
 		byte_cnt -= headlen;
 		linear_hr = skb_headroom(skb);
@@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 				pagep->frags++;
 			while (++pagep < frag_page);
 		}
+
+		headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen);
+
 		__pskb_pull_tail(skb, headlen);
 	} else {
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {

-- 
2.50.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
@ 2025-09-04 23:30   ` Amery Hung
  2025-09-09  4:00     ` Christoph Paasch
  0 siblings, 1 reply; 10+ messages in thread
From: Amery Hung @ 2025-09-04 23:30 UTC (permalink / raw)
  To: cpaasch
  Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev, linux-rdma, bpf

On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay
<devnull+cpaasch.openai.com@kernel.org> wrote:
>
> From: Christoph Paasch <cpaasch@openai.com>
>
> [... commit message and netperf results trimmed ...]
>
> @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  				pagep->frags++;
>  			while (++pagep < frag_page);
>  		}
> +
> +		headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen);
> +

The size of mxbuf->xdp.data is most likely not headlen here.

The driver currently generates a xdp_buff with empty linear data, pass
it to the xdp program and assumes the layout if the xdp program does
not change the layout of the xdp_buff through bpf_xdp_adjust_head() or
bpf_xdp_adjust_tail(). The assumption is not correct and I am working
on a fix. But, if we keep that assumption for now, mxbuf->xdp.data
will not contain any headers or payload. The thing that you try to do
probably should be:

    skb_frag_t *frag = &sinfo->frags[0];

    headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
                              skb_frag_size(frag));

>  		__pskb_pull_tail(skb, headlen);
>  	} else {
>  		if (xdp_buff_has_frags(&mxbuf->xdp)) {
>
> --
> 2.50.1

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-04 23:30   ` Amery Hung
@ 2025-09-09  4:00     ` Christoph Paasch
  2025-09-09 15:18       ` Christoph Paasch
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Paasch @ 2025-09-09  4:00 UTC (permalink / raw)
  To: Amery Hung
  Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev, linux-rdma, bpf

On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> [... quoted patch trimmed ...]
>
> The driver currently generates a xdp_buff with empty linear data, pass
> it to the xdp program and assumes the layout if the xdp program does
> not change the layout of the xdp_buff through bpf_xdp_adjust_head() or
> bpf_xdp_adjust_tail(). The assumption is not correct and I am working
> on a fix. But, if we keep that assumption for now, mxbuf->xdp.data
> will not contain any headers or payload. The thing that you try to do
> probably should be:
>
>     skb_frag_t *frag = &sinfo->frags[0];
>
>     headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
>                               skb_frag_size(frag));

Ok, I think I understand what you mean! Thanks for taking the time to explain!

I will do some tests on my side to make sure I get it right.

As your change goes to net and mine to net-next, I can wait until yours
is in the tree so that there aren't any conflicts that need to be
taken care of.


Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-09  4:00     ` Christoph Paasch
@ 2025-09-09 15:18       ` Christoph Paasch
  2025-09-10  3:17         ` Amery Hung
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Paasch @ 2025-09-09 15:18 UTC (permalink / raw)
  To: Amery Hung
  Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev, linux-rdma, bpf

On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote:
>
> On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> > [... quoted patch trimmed ...]
> >
> > The driver currently generates a xdp_buff with empty linear data, pass
> > it to the xdp program and assumes the layout if the xdp program does
> > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or
> > bpf_xdp_adjust_tail(). The assumption is not correct and I am working
> > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data
> > will not contain any headers or payload. The thing that you try to do
> > probably should be:
> >
> >     skb_frag_t *frag = &sinfo->frags[0];
> >
> >     headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> >                               skb_frag_size(frag));

So, when I look at the headlen I get, it is correct (even with my old
code using mxbuf->xdp.data).

To make sure I test the right thing, in which scenario would
mxbuf->xdp.data not contain any headers or payload? What do I need to
do to reproduce that?

Thanks,
Christoph

> Ok, I think I understand what you mean! Thanks for taking the time to explain!
>
> I will do some tests on my side to make sure I get it right.
>
> As your change goes to net and mine to net-next, I can wait until yours
> is in the tree so that there aren't any conflicts that need to be
> taken care of.
>
>
> Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-09 15:18 ` Christoph Paasch @ 2025-09-10 3:17 ` Amery Hung 2025-09-10 17:36 ` Christoph Paasch 0 siblings, 1 reply; 10+ messages in thread From: Amery Hung @ 2025-09-10 3:17 UTC (permalink / raw) To: Christoph Paasch Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf On Tue, Sep 9, 2025 at 11:18 AM Christoph Paasch <cpaasch@openai.com> wrote: > > On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote: > > > > On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > > > On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay > > > <devnull+cpaasch.openai.com@kernel.org> wrote: > > > > > > > > From: Christoph Paasch <cpaasch@openai.com> > > > > > > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256) > > > > bytes from the page-pool to the skb's linear part. Those 256 bytes > > > > include part of the payload. > > > > > > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset > > > > (and skb->head_frag is not set), we end up aggregating packets in the > > > > frag_list. > > > > > > > > This is of course not good when we are CPU-limited. Also causes a worse > > > > skb->len/truesize ratio,... > > > > > > > > So, let's avoid copying parts of the payload to the linear part. We use > > > > eth_get_headlen() to parse the headers and compute the length of the > > > > protocol headers, which will be used to copy the relevant bits ot the > > > > skb's linear part. > > > > > > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking > > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate > > > > memory. 
> > > > > > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and > > > > LRO enabled): > > > > > > > > BEFORE: > > > > ======= > > > > (netserver pinned to core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.01 32547.82 > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.00 52531.67 > > > > > > > > AFTER: > > > > ====== > > > > (netserver pinned to core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.00 52896.06 > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.00 85094.90 > > > > > > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/ > > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different > > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or > > > > better performance with this patch. 
> > > > > > > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > > > Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> > > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com> > > > > --- > > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++ > > > > 1 file changed, 5 insertions(+) > > > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644 > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen, > > > > rq->buff.map_dir); > > > > > > > > + headlen = eth_get_headlen(rq->netdev, head_addr, headlen); > > > > + > > > > frag_offset += headlen; > > > > byte_cnt -= headlen; > > > > linear_hr = skb_headroom(skb); > > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > pagep->frags++; > > > > while (++pagep < frag_page); > > > > } > > > > + > > > > + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); > > > > + > > > > > > The size of mxbuf->xdp.data is most likely not headlen here. > > > > > > The driver currently generates a xdp_buff with empty linear data, pass > > > it to the xdp program and assumes the layout If the xdp program does > > > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or > > > bpf_xdp_adjust_tail(). The assumption is not correct and I am working > > > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data > > > will not contain any headers or payload. 
> > > The thing that you try to do probably should be:
> > >
> > > skb_frag_t *frag = &sinfo->frags[0];
> > >
> > > headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> > >                           skb_frag_size(frag));
>
> So, when I look at the headlen I get, it is correct (even with my old
> code using mxbuf->xdp.data).
>
> To make sure I test the right thing, in which scenario would
> mxbuf->xdp.data not contain any headers or payload? What do I need to
> do to reproduce that?

A quick look at the code, could it be that
skb_flow_dissect_flow_keys_basic() returns false so that
eth_get_headlen() always returns sizeof(*eth)? The linear part
contains nothing meaningful before __pskb_pull_tail(), so it is
possible for skb_flow_dissect_flow_keys_basic() to fail.

>
> Thanks,
> Christoph
>
> > Ok, I think I understand what you mean! Thanks for taking the time to explain!
> >
> > I will do some tests on my side to make sure I get it right.
> >
> > As your change goes to net and mine to net-next, I can wait until yours
> > is in the tree so that there aren't any conflicts that need to be
> > taken care of.

Will copy you in the mlx5 non-linear xdp fixing patchset.

> > Christoph
> > >
> > > > __pskb_pull_tail(skb, headlen);
> > > > } else {
> > > > if (xdp_buff_has_frags(&mxbuf->xdp)) {
> > > >
> > > > --
> > > > 2.50.1

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-10 3:17 ` Amery Hung @ 2025-09-10 17:36 ` Christoph Paasch 2025-09-10 19:09 ` Amery Hung 0 siblings, 1 reply; 10+ messages in thread From: Christoph Paasch @ 2025-09-10 17:36 UTC (permalink / raw) To: Amery Hung Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf On Tue, Sep 9, 2025 at 8:17 PM Amery Hung <ameryhung@gmail.com> wrote: > > On Tue, Sep 9, 2025 at 11:18 AM Christoph Paasch <cpaasch@openai.com> wrote: > > > > On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote: > > > > > > On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > > > > > On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay > > > > <devnull+cpaasch.openai.com@kernel.org> wrote: > > > > > > > > > > From: Christoph Paasch <cpaasch@openai.com> > > > > > > > > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256) > > > > > bytes from the page-pool to the skb's linear part. Those 256 bytes > > > > > include part of the payload. > > > > > > > > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset > > > > > (and skb->head_frag is not set), we end up aggregating packets in the > > > > > frag_list. > > > > > > > > > > This is of course not good when we are CPU-limited. Also causes a worse > > > > > skb->len/truesize ratio,... > > > > > > > > > > So, let's avoid copying parts of the payload to the linear part. We use > > > > > eth_get_headlen() to parse the headers and compute the length of the > > > > > protocol headers, which will be used to copy the relevant bits ot the > > > > > skb's linear part. 
> > > > > > > > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking > > > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate > > > > > memory. > > > > > > > > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and > > > > > LRO enabled): > > > > > > > > > > BEFORE: > > > > > ======= > > > > > (netserver pinned to core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.01 32547.82 > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.00 52531.67 > > > > > > > > > > AFTER: > > > > > ====== > > > > > (netserver pinned to core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.00 52896.06 > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.00 85094.90 > > > > > > > > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/ > > > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different > > > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or > > > > > better performance with this patch. 
> > > > > > > > > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > > > > Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> > > > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com> > > > > > --- > > > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++ > > > > > 1 file changed, 5 insertions(+) > > > > > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644 > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen, > > > > > rq->buff.map_dir); > > > > > > > > > > + headlen = eth_get_headlen(rq->netdev, head_addr, headlen); > > > > > + > > > > > frag_offset += headlen; > > > > > byte_cnt -= headlen; > > > > > linear_hr = skb_headroom(skb); > > > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > pagep->frags++; > > > > > while (++pagep < frag_page); > > > > > } > > > > > + > > > > > + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); > > > > > + > > > > > > > > The size of mxbuf->xdp.data is most likely not headlen here. > > > > > > > > The driver currently generates a xdp_buff with empty linear data, pass > > > > it to the xdp program and assumes the layout If the xdp program does > > > > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or > > > > bpf_xdp_adjust_tail(). The assumption is not correct and I am working > > > > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data > > > > will not contain any headers or payload. 
> > > > The thing that you try to do probably should be:
> > > >
> > > > skb_frag_t *frag = &sinfo->frags[0];
> > > >
> > > > headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> > > >                           skb_frag_size(frag));
> >
> > So, when I look at the headlen I get, it is correct (even with my old
> > code using mxbuf->xdp.data).
> >
> > To make sure I test the right thing, in which scenario would
> > mxbuf->xdp.data not contain any headers or payload? What do I need to
> > do to reproduce that?
>
> A quick look at the code, could it be that
> skb_flow_dissect_flow_keys_basic() returns false so that
> eth_get_headlen() always returns sizeof(*eth)?

No, the headlen values were correct (meaning, it was the actual length
of the headers). This is TCP traffic with a simple print after
eth_get_headlen():

[130982.311088] mlx5e_skb_from_cqe_mpwrq_nonlinear xdp headlen is 86

So, eth_get_headlen() was able to correctly parse things.

My xdp-program is as simple as possible:

SEC("xdp.frags")
int xdp_pass_prog(struct xdp_md *ctx)
{
	return XDP_PASS;
}

> The linear part
> contains nothing meaningful before __pskb_pull_tail(), so it is
> possible for skb_flow_dissect_flow_keys_basic() to fail.
>
> >
> > Thanks,
> > Christoph
> >
> > > Ok, I think I understand what you mean! Thanks for taking the time to explain!
> > >
> > > I will do some tests on my side to make sure I get it right.
> > >
> > > As your change goes to net and mine to net-next, I can wait until yours
> > > is in the tree so that there aren't any conflicts that need to be
> > > taken care of.
>
> Will copy you in the mlx5 non-linear xdp fixing patchset.

Thx!


Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-10 17:36 ` Christoph Paasch @ 2025-09-10 19:09 ` Amery Hung 2026-02-11 17:12 ` Dragos Tatulea 0 siblings, 1 reply; 10+ messages in thread From: Amery Hung @ 2025-09-10 19:09 UTC (permalink / raw) To: Christoph Paasch Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf On Wed, Sep 10, 2025 at 1:36 PM Christoph Paasch <cpaasch@openai.com> wrote: > > On Tue, Sep 9, 2025 at 8:17 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > On Tue, Sep 9, 2025 at 11:18 AM Christoph Paasch <cpaasch@openai.com> wrote: > > > > > > On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote: > > > > > > > > On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > > > > > > > On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay > > > > > <devnull+cpaasch.openai.com@kernel.org> wrote: > > > > > > > > > > > > From: Christoph Paasch <cpaasch@openai.com> > > > > > > > > > > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256) > > > > > > bytes from the page-pool to the skb's linear part. Those 256 bytes > > > > > > include part of the payload. > > > > > > > > > > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset > > > > > > (and skb->head_frag is not set), we end up aggregating packets in the > > > > > > frag_list. > > > > > > > > > > > > This is of course not good when we are CPU-limited. Also causes a worse > > > > > > skb->len/truesize ratio,... > > > > > > > > > > > > So, let's avoid copying parts of the payload to the linear part. 
We use > > > > > > eth_get_headlen() to parse the headers and compute the length of the > > > > > > protocol headers, which will be used to copy the relevant bits ot the > > > > > > skb's linear part. > > > > > > > > > > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking > > > > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate > > > > > > memory. > > > > > > > > > > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and > > > > > > LRO enabled): > > > > > > > > > > > > BEFORE: > > > > > > ======= > > > > > > (netserver pinned to core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.01 32547.82 > > > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.00 52531.67 > > > > > > > > > > > > AFTER: > > > > > > ====== > > > > > > (netserver pinned to core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.00 52896.06 > > > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.00 85094.90 > > > > > > > > > > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/ > > > > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different > > > > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or > > > > > > better performance with this patch. 
> > > > > > > > > > > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > > > > > Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> > > > > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com> > > > > > > --- > > > > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++ > > > > > > 1 file changed, 5 insertions(+) > > > > > > > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644 > > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen, > > > > > > rq->buff.map_dir); > > > > > > > > > > > > + headlen = eth_get_headlen(rq->netdev, head_addr, headlen); > > > > > > + > > > > > > frag_offset += headlen; > > > > > > byte_cnt -= headlen; > > > > > > linear_hr = skb_headroom(skb); > > > > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > > pagep->frags++; > > > > > > while (++pagep < frag_page); > > > > > > } > > > > > > + > > > > > > + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); > > > > > > + > > > > > > > > > > The size of mxbuf->xdp.data is most likely not headlen here. > > > > > > > > > > The driver currently generates a xdp_buff with empty linear data, pass > > > > > it to the xdp program and assumes the layout If the xdp program does > > > > > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or > > > > > bpf_xdp_adjust_tail(). The assumption is not correct and I am working > > > > > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data > > > > > will not contain any headers or payload. 
> > > > > The thing that you try to do probably should be:
> > > > >
> > > > > skb_frag_t *frag = &sinfo->frags[0];
> > > > >
> > > > > headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> > > > >                           skb_frag_size(frag));
> > >
> > > So, when I look at the headlen I get, it is correct (even with my old
> > > code using mxbuf->xdp.data).
> > >
> > > To make sure I test the right thing, in which scenario would
> > > mxbuf->xdp.data not contain any headers or payload? What do I need to
> > > do to reproduce that?
> >
> > A quick look at the code, could it be that
> > skb_flow_dissect_flow_keys_basic() returns false so that
> > eth_get_headlen() always returns sizeof(*eth)?
>
> No, the headlen values were correct (meaning, it was the actual length
> of the headers):

Another possibility is that mxbuf->xdp is reused and not zeroed
between calls to mlx5e_skb_from_cqe_mpwrq_nonlinear(). The stale
headers might have been written to mxbuf->xdp.data before the XDP
program is attached.

I am not sure what exactly happens, but my main question remains: when
the XDP program is attached and does nothing, the linear data will be
empty, so what is eth_get_headlen() parsing here...?

> This is TCP traffic with a simple print after eth_get_headlen():
> [130982.311088] mlx5e_skb_from_cqe_mpwrq_nonlinear xdp headlen is 86
>
> So, eth_get_headlen() was able to correctly parse things.
>
> My xdp-program is as simple as possible:
> SEC("xdp.frags")
> int xdp_pass_prog(struct xdp_md *ctx)
> {
>         return XDP_PASS;
> }
>
> > The linear part
> > contains nothing meaningful before __pskb_pull_tail(), so it is
> > possible for skb_flow_dissect_flow_keys_basic() to fail.
> >
> > >
> > > Thanks,
> > > Christoph
> > >
> > > > Ok, I think I understand what you mean! Thanks for taking the time to explain!
> > > >
> > > > I will do some tests on my side to make sure I get it right.
> > > >
> > > > As your change goes to net and mine to net-next, I can wait until yours
> > > > is in the tree so that there aren't any conflicts that need to be
> > > > taken care of.
> >
> > Will copy you in the mlx5 non-linear xdp fixing patchset.
>
> Thx!
>
>
> Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-10 19:09 ` Amery Hung @ 2026-02-11 17:12 ` Dragos Tatulea 0 siblings, 0 replies; 10+ messages in thread From: Dragos Tatulea @ 2026-02-11 17:12 UTC (permalink / raw) To: Amery Hung, Christoph Paasch Cc: Gal Pressman, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf Hi Christoph, On 10.09.25 21:09, Amery Hung wrote: > On Wed, Sep 10, 2025 at 1:36 PM Christoph Paasch <cpaasch@openai.com> wrote: [...] >>>>>>> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and >>>>>>> LRO enabled): >>>>>>> >>>>>>> BEFORE: >>>>>>> ======= >>>>>>> (netserver pinned to core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.01 32547.82 >>>>>>> >>>>>>> (netserver pinned to adjacent core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.00 52531.67 >>>>>>> >>>>>>> AFTER: >>>>>>> ====== >>>>>>> (netserver pinned to core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.00 52896.06 >>>>>>> >>>>>>> (netserver pinned to adjacent core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.00 85094.90 >>>>>>> We still want this nice optimization. Seems like last discussion was stuck on reading the header length for XDP. See below for sugession. [...] 
>>>>>>> @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w >>>>>>> pagep->frags++; >>>>>>> while (++pagep < frag_page); >>>>>>> } >>>>>>> + >>>>>>> + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); >>>>>>> + >>>>>> >>>>>> The size of mxbuf->xdp.data is most likely not headlen here. >>>>>> >>>>>> The driver currently generates a xdp_buff with empty linear data, pass >>>>>> it to the xdp program and assumes the layout If the xdp program does >>>>>> not change the layout of the xdp_buff through bpf_xdp_adjust_head() or >>>>>> bpf_xdp_adjust_tail(). The assumption is not correct and I am working >>>>>> on a fix. But, if we keep that assumption for now, mxbuf->xdp.data >>>>>> will not contain any headers or payload. The thing that you try to do >>>>>> probably should be: >>>>>> >>>>>> skb_frag_t *frag = &sinfo->frags[0]; >>>>>> >>>>>> headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag), >>>>>> skb_frag_size(frag)); >>>> >>>> So, when I look at the headlen I get, it is correct (even with my old >>>> code using mxbuf->xdp.data). >>>> >>>> To make sure I test the right thing, which scenario would >>>> mxbuf->xdp.data not contain any headers or payload ? What do I need to >>>> do to reproduce that ? >>> >>> A quick look at the code, could it be that >>> skb_flow_dissect_flow_keys_basic() returns false so that >>> eth_get_headlen() always returns sizeof(*eth)? >> >> No, the headlen values were correct (meaning, it was the actual length >> of the headers): >> > > Another possibility is that mxbuf->xdp is reused and not zeroed > between calls to mlx5e_skb_from_cqe_mpwrq_nonlinear(). The stale > headers might have been written to mxbuf->xdp.data before the XDP is > attached. > > I am not sure what exactly happens, but my main question remains. when > the XDP program is attached and does nothing, the linear data will be > empty, so what is eth_get_headlen() parsing here...? 
>

AFAIU there could be 2 cases here:
1) The XDP program doesn't modify the XDP buffer, so the header is in
   the first frag.
2) The XDP program modifies the XDP buffer and pulls some len bytes
   into the linear part.

So how about always reading headlen from the first frag *before* the
XDP program execution, and afterwards calculating the new headlen for
__pskb_pull_tail() based on previous headlen - len?

Like this (untested):

```
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2159,8 +2159,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	if (prog) {
 		u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
+		skb_frag_t *frag = &sinfo->frags[0];
 		u32 len;
 
+		headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
+					  skb_frag_size(frag));
+
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT,
 						 rq->flags)) {
@@ -2208,8 +2212,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			pagep->frags++;
 		while (++pagep < frag_page);
 
-		headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
-				skb->data_len);
+		headlen = min_t(u16, headlen - len, skb->data_len);
 		__pskb_pull_tail(skb, headlen);
 	}
 } else {
```

Amery, do you agree? Would you like to re-send the patch after the
merge window closes? If you are busy we can also pick it up ourselves,
do the required modifications and send it keeping your authorship and
the existing tags.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-02-11 17:12 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-04 22:53 [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-09-04 23:30   ` Amery Hung
2025-09-09  4:00     ` Christoph Paasch
2025-09-09 15:18       ` Christoph Paasch
2025-09-10  3:17         ` Amery Hung
2025-09-10 17:36           ` Christoph Paasch
2025-09-10 19:09             ` Amery Hung
2026-02-11 17:12               ` Dragos Tatulea