The Linux Kernel Mailing List
* [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing
@ 2026-05-07  9:53 Tariq Toukan
  2026-05-07  9:53 ` [PATCH net-next V6 1/3] net/mlx5e: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Tariq Toukan
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-05-07  9:53 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Dragos Tatulea,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung, Alexei Starovoitov

Hi,

This is V6 of a series originally submitted by Christoph.

When LRO is enabled on the mlx5 device, mlx5e_skb_from_cqe_mpwrq_nonlinear
copies parts of the payload into the linear part of the skb.

This triggers suboptimal processing in GRO, reducing throughput.

This patch series addresses this by using eth_get_headlen to compute the
size of the protocol headers and copying only those bytes. This results
in a significant throughput improvement (detailed results in the
corresponding patch).
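
A minimal sketch of the idea (simplified; "netdev" and "data" here
stand in for the RX device and the start of the received buffer, the
real code is in patch 2):

	/* parse at most headlen bytes; returns protocol-header length */
	headlen = eth_get_headlen(netdev, data, headlen);
	/* copy only the headers; the payload stays in the frags */
	skb_copy_to_linear_data(skb, data, headlen);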

Regards,
Tariq

---

V6:
- Rebase after Amery's changes.
- Address Amery's concern about header length after XDP pull.
- Add a small optimization to memcpy the header length aligned to cache
  line.

V5: https://lore.kernel.org/all/20250904-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v5-0-ea492f7b11ac@openai.com/


Christoph Paasch (2):
  net/mlx5e: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear
  net/mlx5e: Avoid copying payload to the skb's linear part

Dragos Tatulea (1):
  net/mlx5e: Align header copy to cache line for Striding RQ non-linear

 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 31 +++++++++++++------
 1 file changed, 22 insertions(+), 9 deletions(-)


base-commit: dacf281771a9aed1a723b196120a0de8637910b9
-- 
2.44.0



* [PATCH net-next V6 1/3] net/mlx5e: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear
  2026-05-07  9:53 [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Tariq Toukan
@ 2026-05-07  9:53 ` Tariq Toukan
  2026-05-07  9:53 ` [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part Tariq Toukan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-05-07  9:53 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Dragos Tatulea,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung, Alexei Starovoitov

From: Christoph Paasch <cpaasch@openai.com>

Doing the call to dma_sync_single_for_cpu() earlier will allow us to
adjust headlen based on the actual size of the protocol headers.

Doing this earlier means that we don't need to call
mlx5e_copy_skb_header() anymore and rather can call
skb_copy_to_linear_data() directly.
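
A rough sketch of the ordering this enables (the eth_get_headlen()
call itself is only added in the next patch):

	/* sync first, so the CPU may safely read the header bytes */
	addr = page_pool_get_dma_addr_netmem(head_page->netmem);
	dma_sync_single_for_cpu(rq->pdev, addr + head_offset,
				ALIGN(headlen, sizeof(long)),
				rq->buff.map_dir);
	/* ...then the headers can be parsed and copied directly */
	skb_copy_to_linear_data(skb, head_addr,
				ALIGN(headlen, sizeof(long)));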

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 22 +++++++++++++------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5b60aa47c75b..75ccf40a7f17 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1923,11 +1923,11 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	unsigned int truesize = 0;
 	u32 pg_consumed_bytes;
 	struct bpf_prog *prog;
+	void *va, *head_addr;
 	struct sk_buff *skb;
 	u32 linear_frame_sz;
 	u16 linear_data_len;
 	u16 linear_hr;
-	void *va;
 
 	if (unlikely(cqe_bcnt > rq->hw_mtu)) {
 		u8 lro_num_seg = get_cqe_lro_num_seg(cqe);
@@ -1940,9 +1940,11 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 	prog = rcu_dereference(rq->xdp_prog);
 
+	head_addr = netmem_address(head_page->netmem) + head_offset;
+
 	if (prog) {
 		/* area for bpf_xdp_[store|load]_bytes */
-		net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
+		net_prefetchw(head_addr);
 
 		va = mlx5e_mpwqe_get_linear_page_frag(rq);
 		if (!va) {
@@ -1956,6 +1958,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		linear_frame_sz = MLX5_SKB_FRAG_SZ(linear_hr + MLX5E_RX_MAX_HEAD);
 		linear_page = &rq->mpwqe.linear_info->frag_page;
 	} else {
+		dma_addr_t addr;
+
 		skb = napi_alloc_skb(rq->cq.napi,
 				     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
 		if (unlikely(!skb)) {
@@ -1967,6 +1971,11 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		net_prefetchw(va); /* xdp_frame data area */
 		net_prefetchw(skb->data);
 
+		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
+		dma_sync_single_for_cpu(rq->pdev, addr + head_offset,
+					ALIGN(headlen, sizeof(long)),
+					rq->buff.map_dir);
+
 		frag_offset += headlen;
 		byte_cnt -= headlen;
 		linear_hr = skb_headroom(skb);
@@ -2056,8 +2065,6 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			__pskb_pull_tail(skb, headlen);
 		}
 	} else {
-		dma_addr_t addr;
-
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
 
@@ -2071,10 +2078,11 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 				pagep->frags++;
 			while (++pagep < frag_page);
 		}
+
 		/* copy header */
-		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
-		mlx5e_copy_skb_header(rq, skb, head_page->netmem, addr,
-				      head_offset, head_offset, headlen);
+		skb_copy_to_linear_data(skb, head_addr,
+					ALIGN(headlen, sizeof(long)));
+
 		/* skb linear part was allocated with headlen and aligned to long */
 		skb->tail += headlen;
 		skb->len  += headlen;
-- 
2.44.0



* [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-07  9:53 [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Tariq Toukan
  2026-05-07  9:53 ` [PATCH net-next V6 1/3] net/mlx5e: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Tariq Toukan
@ 2026-05-07  9:53 ` Tariq Toukan
  2026-05-07 13:53   ` Amery Hung
  2026-05-08 12:43   ` David Laight
  2026-05-07  9:53 ` [PATCH net-next V6 3/3] net/mlx5e: Align header copy to cache line for Striding RQ non-linear Tariq Toukan
  2026-05-07 19:58 ` [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch
  3 siblings, 2 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-05-07  9:53 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Dragos Tatulea,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung, Alexei Starovoitov

From: Christoph Paasch <cpaasch@openai.com>

mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
bytes from the page-pool to the skb's linear part. Those 256 bytes
include part of the payload.

When attempting to do GRO in skb_gro_receive, if headlen > data_offset
(and skb->head_frag is not set), we end up aggregating packets in the
frag_list.

This is of course not good when we are CPU-limited. It also causes a
worse skb->len/truesize ratio.

So, let's avoid copying parts of the payload to the linear part. We use
eth_get_headlen() to parse the headers and compute the length of the
protocol headers, which is then used to copy only those bytes into the
skb's linear part.

We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
stack needs to call pskb_may_pull() later on, we don't need to reallocate
memory.
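
A rough illustration of the pull this keeps cheap (hypothetical
consumer code, not part of this patch); a pull that fits in the
preallocated linear buffer copies from the frags instead of
reallocating the head:

	/* e.g. a protocol handler later doing */
	if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
		goto drop;	/* packet shorter than a TCP header */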

This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
LRO enabled):

BEFORE:
=======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.01    32547.82

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.00    52531.67

AFTER:
======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.00    52896.06

(netserver pinned to adjacent core receiving interrupts)
 $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.00    85094.90

Additional tests across a larger range of parameters w/ and w/o LRO, w/
and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
better performance with this patch.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 75ccf40a7f17..301b33419207 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1976,6 +1976,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 					ALIGN(headlen, sizeof(long)),
 					rq->buff.map_dir);
 
+		headlen = eth_get_headlen(rq->netdev, head_addr, headlen);
+
 		frag_offset += headlen;
 		byte_cnt -= headlen;
 		linear_hr = skb_headroom(skb);
@@ -2012,9 +2014,13 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 	if (prog) {
 		u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
+		skb_frag_t *frag = &sinfo->frags[0];
 		u8 new_nr_frags;
 		u32 len;
 
+		headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
+					  skb_frag_size(frag));
+
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 				struct mlx5e_frag_page *pfp;
@@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 				pagep->frags++;
 			while (++pagep < frag_page);
 
-			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
-					skb->data_len);
+			headlen = min_t(u16, headlen - len, skb->data_len);
 			__pskb_pull_tail(skb, headlen);
 		}
 	} else {
-- 
2.44.0



* [PATCH net-next V6 3/3] net/mlx5e: Align header copy to cache line for Striding RQ non-linear
  2026-05-07  9:53 [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Tariq Toukan
  2026-05-07  9:53 ` [PATCH net-next V6 1/3] net/mlx5e: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Tariq Toukan
  2026-05-07  9:53 ` [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part Tariq Toukan
@ 2026-05-07  9:53 ` Tariq Toukan
  2026-05-07 19:58 ` [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch
  3 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2026-05-07  9:53 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, netdev,
	linux-rdma, linux-kernel, Gal Pressman, Dragos Tatulea,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung, Alexei Starovoitov

From: Dragos Tatulea <dtatulea@nvidia.com>

In Striding RQ non-linear mode, there is a memcpy to pull the
header from the first fragment into the linear part of the skb.
As the header length is not aligned, it can cause cache thrashing
from a Read-Modify-Write cycle for the remaining bytes of the
cache line.

This patch changes the memcpy length to be aligned to the cache line.
The DMA sync is also aligned to cache line size accordingly. Note that
the original DMA sync is done on the initial conservative headlen
which is min(MLX5E_RX_MAX_HEAD, cqe_bcnt).
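
A hedged illustration of the alignment: ALIGN() rounds up to the next
multiple, so e.g. a 54-byte header copy becomes a 64-byte copy on a
system with 64-byte cache lines, while skb->tail still advances by
headlen only:

	skb_copy_to_linear_data(skb, head_addr,
				ALIGN(headlen, cache_line_size()));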

To show the improvement, a test was run with an XDP_DROP program
processing 64B packets at 100% CPU utilization over a single queue at
9000 MTU:

|----------+----------+------|
| Before   | After    | Diff |
|----------+----------+------|
| 3.6 Mpps | 3.8 Mpps | 5%   |
|----------+----------+------|

(CX7 NIC on Intel Xeon Platinum 8580 system)

While small packets benefit most from this improvement, large packets
are not negatively affected (no regressions).

Suggested-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 301b33419207..e5963e1b5309 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1973,7 +1973,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
 		dma_sync_single_for_cpu(rq->pdev, addr + head_offset,
-					ALIGN(headlen, sizeof(long)),
+					ALIGN(headlen, cache_line_size()),
 					rq->buff.map_dir);
 
 		headlen = eth_get_headlen(rq->netdev, head_addr, headlen);
@@ -2086,7 +2086,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 		/* copy header */
 		skb_copy_to_linear_data(skb, head_addr,
-					ALIGN(headlen, sizeof(long)));
+					ALIGN(headlen, cache_line_size()));
 
 		/* skb linear part was allocated with headlen and aligned to long */
 		skb->tail += headlen;
-- 
2.44.0



* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-07  9:53 ` [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part Tariq Toukan
@ 2026-05-07 13:53   ` Amery Hung
  2026-05-07 15:49     ` Dragos Tatulea
  2026-05-08 12:43   ` David Laight
  1 sibling, 1 reply; 14+ messages in thread
From: Amery Hung @ 2026-05-07 13:53 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller, Saeed Mahameed, Mark Bloch,
	Leon Romanovsky, netdev, linux-rdma, linux-kernel, Gal Pressman,
	Dragos Tatulea, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Alexei Starovoitov

On Thu, May 7, 2026 at 10:54 AM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Christoph Paasch <cpaasch@openai.com>
>
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
>
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
>
> This is of course not good when we are CPU-limited. It also causes a
> worse skb->len/truesize ratio.
>
> So, let's avoid copying parts of the payload to the linear part. We use
> eth_get_headlen() to parse the headers and compute the length of the
> protocol headers, which is then used to copy only those bytes into the
> skb's linear part.
>
> We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
> stack needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
>
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
>
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.01    32547.82
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52531.67
>
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52896.06
>
> (netserver pinned to adjacent core receiving interrupts)
>  $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    85094.90
>
> Additional tests across a larger range of parameters w/ and w/o LRO, w/
> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> better performance with this patch.
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 75ccf40a7f17..301b33419207 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1976,6 +1976,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>                                         ALIGN(headlen, sizeof(long)),
>                                         rq->buff.map_dir);
>
> +               headlen = eth_get_headlen(rq->netdev, head_addr, headlen);
> +
>                 frag_offset += headlen;
>                 byte_cnt -= headlen;
>                 linear_hr = skb_headroom(skb);
> @@ -2012,9 +2014,13 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>
>         if (prog) {
>                 u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
> +               skb_frag_t *frag = &sinfo->frags[0];
>                 u8 new_nr_frags;
>                 u32 len;
>
> +               headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> +                                         skb_frag_size(frag));
> +
>                 if (mlx5e_xdp_handle(rq, prog, mxbuf)) {

Hello,

Am I understanding correctly that the better performance comes with
the assumption that the XDP program does not change headers?

headlen is determined before the XDP program runs. If it pushes/pops
headers, there could be headers in the frags or payload data in the
linear region after __pskb_pull_tail().

>                         if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>                                 struct mlx5e_frag_page *pfp;
> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>                                 pagep->frags++;
>                         while (++pagep < frag_page);
>
> -                       headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
> -                                       skb->data_len);
> +                       headlen = min_t(u16, headlen - len, skb->data_len);

headlen - len can underflow (e.g. 60 - 64 wraps to 65532 as u16) but
will be capped by skb->data_len, so this should be okay, right?

>                         __pskb_pull_tail(skb, headlen);
>                 }
>         } else {
> --
> 2.44.0
>


* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-07 13:53   ` Amery Hung
@ 2026-05-07 15:49     ` Dragos Tatulea
  2026-05-07 20:50       ` Amery Hung
  0 siblings, 1 reply; 14+ messages in thread
From: Dragos Tatulea @ 2026-05-07 15:49 UTC (permalink / raw)
  To: Amery Hung, Tariq Toukan
  Cc: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller, Saeed Mahameed, Mark Bloch,
	Leon Romanovsky, netdev, linux-rdma, linux-kernel, Gal Pressman,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Alexei Starovoitov


Hi Amery,

On 07.05.26 15:53, Amery Hung wrote:
> [...]
> Am I understanding correctly that the better performance comes with
> the assumption that the XDP does not change headers?
> 
> headlen is determined before the XDP program runs. If it push/pop
> headers, there could be headers in frags or data in the linear region
> after __pskb_pull_tail().
> 
That's right.

>>                         if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>>                                 struct mlx5e_frag_page *pfp;
>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>                                 pagep->frags++;
>>                         while (++pagep < frag_page);
>>
>> -                       headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
>> -                                       skb->data_len);
>> +                       headlen = min_t(u16, headlen - len, skb->data_len);
> 
> headlen - len can underflow but will be capped by skb->data_len, so
> this should be okay, right?
It is safe. But it might trigger an extra allocation in the pull when
len > headlen. We could also skip the pull in that case. Or do a
min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?

Thanks,
Dragos



* Re: [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing
  2026-05-07  9:53 [PATCH net-next V6 0/3] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Tariq Toukan
                   ` (2 preceding siblings ...)
  2026-05-07  9:53 ` [PATCH net-next V6 3/3] net/mlx5e: Align header copy to cache line for Striding RQ non-linear Tariq Toukan
@ 2026-05-07 19:58 ` Christoph Paasch
  3 siblings, 0 replies; 14+ messages in thread
From: Christoph Paasch @ 2026-05-07 19:58 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Saeed Mahameed, Mark Bloch, Leon Romanovsky,
	netdev, linux-rdma, linux-kernel, Gal Pressman, Dragos Tatulea,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung, Alexei Starovoitov

On Thu, May 7, 2026 at 2:54 AM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> Hi,
>
> This is V6 of a series originally submitted by Christoph.

Sorry for having dropped the ball on this!

Thanks for pushing it forward!


Christoph

>
> When LRO is enabled on the mlx5 device, mlx5e_skb_from_cqe_mpwrq_nonlinear
> copies parts of the payload into the linear part of the skb.
>
> This triggers suboptimal processing in GRO, reducing throughput.
>
> This patch series addresses this by using eth_get_headlen to compute the
> size of the protocol headers and copying only those bytes. This results
> in a significant throughput improvement (detailed results in the
> corresponding patch).
>
> Regards,
> Tariq
>
> ---
>
> V6:
> - Rebase after Amery's changes.
> - Address Amery's concern about header length after XDP pull.
> - Add a small optimization to memcpy the header length aligned to cache
>   line.
>
> V5: https://lore.kernel.org/all/20250904-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v5-0-ea492f7b11ac@openai.com/
>
>
> Christoph Paasch (2):
>   net/mlx5e: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear
>   net/mlx5e: Avoid copying payload to the skb's linear part
>
> Dragos Tatulea (1):
>   net/mlx5e: Align header copy to cache line for Striding RQ non-linear
>
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 31 +++++++++++++------
>  1 file changed, 22 insertions(+), 9 deletions(-)
>
>
> base-commit: dacf281771a9aed1a723b196120a0de8637910b9
> --
> 2.44.0
>


* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-07 15:49     ` Dragos Tatulea
@ 2026-05-07 20:50       ` Amery Hung
  2026-05-08  9:15         ` Dragos Tatulea
  0 siblings, 1 reply; 14+ messages in thread
From: Amery Hung @ 2026-05-07 20:50 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Tariq Toukan, Christoph Paasch, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
	Mark Bloch, Leon Romanovsky, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Alexei Starovoitov

On Thu, May 7, 2026 at 4:50 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
>
> Hi Amery,
>
> On 07.05.26 15:53, Amery Hung wrote:
> > [...]
> > Am I understanding correctly that the better performance comes with
> > the assumption that the XDP does not change headers?
> >
> > headlen is determined before the XDP program runs. If it push/pop
> > headers, there could be headers in frags or data in the linear region
> > after __pskb_pull_tail().
> >
> That's right.
>
> >>                         if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> >>                                 struct mlx5e_frag_page *pfp;
> >> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >>                                 pagep->frags++;
> >>                         while (++pagep < frag_page);
> >>
> >> -                       headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
> >> -                                       skb->data_len);
> >> +                       headlen = min_t(u16, headlen - len, skb->data_len);
> >
> > headlen - len can underflow but will be capped by skb->data_len, so
> > this should be okay, right?
> It is safe. But it might trigger an extra allocation in the pull when
> len > headlen. We could also skip the pull in that case. Or do a
> min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?

Makes sense, but this line took me a bit to understand. Maybe consider
checking len < headlen first?

if (len < headlen) {
        headlen = min_t(u32, headlen - len, skb->data_len);
        __pskb_pull_tail(skb, headlen);
}

Another clarifying question. So this patch will improve performance
when the XDP program doesn't change the header length. For programs
that encap/decap, they should pull precisely the headers into the
linear area for optimal performance. Is that correct?

>
> Thanks,
> Dragos
>


* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-07 20:50       ` Amery Hung
@ 2026-05-08  9:15         ` Dragos Tatulea
  2026-05-08 17:44           ` Amery Hung
  0 siblings, 1 reply; 14+ messages in thread
From: Dragos Tatulea @ 2026-05-08  9:15 UTC (permalink / raw)
  To: Amery Hung
  Cc: Tariq Toukan, Christoph Paasch, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
	Mark Bloch, Leon Romanovsky, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Alexei Starovoitov



On 07.05.26 22:50, Amery Hung wrote:
> On Thu, May 7, 2026 at 4:50 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>>
>>
>> Hi Amery,
>>
>> On 07.05.26 15:53, Amery Hung wrote:
>>> [...]
>>> Am I understanding correctly that the better performance comes with
>>> the assumption that the XDP does not change headers?
>>>
>>> headlen is determined before the XDP program runs. If it push/pop
>>> headers, there could be headers in frags or data in the linear region
>>> after __pskb_pull_tail().
>>>
>> That's right.
>>
>>>>                         if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>>>>                                 struct mlx5e_frag_page *pfp;
>>>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>>>                                 pagep->frags++;
>>>>                         while (++pagep < frag_page);
>>>>
>>>> -                       headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
>>>> -                                       skb->data_len);
>>>> +                       headlen = min_t(u16, headlen - len, skb->data_len);
>>>
>>> headlen - len can underflow but will be capped by skb->data_len, so
>>> this should be okay, right?
>> It is safe. But it might trigger an extra allocation in the pull when
>> len > headlen. We could also skip the pull in that case. Or do a
>> min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?
> 
> Make sense, but this line took me a bit to understand. Maybe consider
> checking len < headlen first?
> 
> if (len < headlen) {
>         headlen = min_t(u32, headlen - len, skb->data_len);
>         __pskb_pull_tail(skb, headlen);
> }
> 
Yes, that's what I had in mind when skipping the pull. I would also
tag the branch as likely().

> Another clarifying question. So this patch will improve the
> performance when the XDP programs don't change header length. For
> those that encap/decap, they should precisely pull only headers into
> the linear area for optimal performance. Is it correct?
> 
Right for encap, but for decap not quite:

Let's say that the XDP program pulls 64B header into the linear part
and snips 4B of the encap out. This would result in a pull of an
additional 4B (headlen (64B) - len (60B) = 4B) which are now
data bytes => sub-optimal layout.

I don't see how we can improve this corner case though.

Thanks,
Dragos


* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-07  9:53 ` [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part Tariq Toukan
  2026-05-07 13:53   ` Amery Hung
@ 2026-05-08 12:43   ` David Laight
  2026-05-08 13:30     ` Dragos Tatulea
  1 sibling, 1 reply; 14+ messages in thread
From: David Laight @ 2026-05-08 12:43 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller, Saeed Mahameed, Mark Bloch,
	Leon Romanovsky, netdev, linux-rdma, linux-kernel, Gal Pressman,
	Dragos Tatulea, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Amery Hung,
	Alexei Starovoitov

On Thu, 7 May 2026 12:53:29 +0300
Tariq Toukan <tariqt@nvidia.com> wrote:

> From: Christoph Paasch <cpaasch@openai.com>
> 
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
> 
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
> 
> This is of course not good when we are CPU-limited. It also causes a
> worse skb->len/truesize ratio.
>
> So, let's avoid copying parts of the payload to the linear part. We use
> eth_get_headlen() to parse the headers and compute the length of the
> protocol headers, which is then used to copy only those bytes into the
> skb's linear part.
> 
> We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
> stack needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
> 
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
> 
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.01    32547.82
> 
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52531.67
> 
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52896.06
> 
> (netserver pinned to adjacent core receiving interrupts)
>  $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    85094.90
> 
> Additional tests across a larger range of parameters w/ and w/o LRO, w/
> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> better performance with this patch.
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 75ccf40a7f17..301b33419207 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
...
> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  				pagep->frags++;
>  			while (++pagep < frag_page);
>  
> -			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
> -					skb->data_len);
> +			headlen = min_t(u16, headlen - len, skb->data_len);

That looks entirely broken.
skb->data_len can be larger than 65535 so (u16)skb->data_len can
discard significant bits.

I can't quite see why the subtract can't overflow either.
It is entirely non-obvious.

There seem to be far too many u16 local variables in this code.
Typically they just make the code larger because they require the
compiler to mask arithmetic results to 16 bits all the time.
(Only x86 and m68k have instructions for 8 and 16bit arithmetic.)
The same is true for function parameters and results.

I think all the min_t() in this file can easily be changed to min().
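
A toy illustration of the difference (made-up values, not driver
state):

	u32 data_len = 70000;

	/* min_t() casts both sides to u16 first: (u16)70000 == 4464,
	 * silently discarding the high bits before the comparison. */
	u16 a = min_t(u16, 5000, data_len);	/* == 4464 */

	/* min() keeps the full width and complains at build time
	 * when the types disagree. */
	u32 b = min(5000U, data_len);		/* == 5000 */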

-- David

>  			__pskb_pull_tail(skb, headlen);
>  		}
>  	} else {



* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-08 12:43   ` David Laight
@ 2026-05-08 13:30     ` Dragos Tatulea
  0 siblings, 0 replies; 14+ messages in thread
From: Dragos Tatulea @ 2026-05-08 13:30 UTC (permalink / raw)
  To: David Laight, Tariq Toukan
  Cc: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller, Saeed Mahameed, Mark Bloch,
	Leon Romanovsky, netdev, linux-rdma, linux-kernel, Gal Pressman,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung, Alexei Starovoitov



On 08.05.26 14:43, David Laight wrote:
> On Thu, 7 May 2026 12:53:29 +0300
> Tariq Toukan <tariqt@nvidia.com> wrote:
> 
>> From: Christoph Paasch <cpaasch@openai.com>
>>
>> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
>> bytes from the page-pool to the skb's linear part. Those 256 bytes
>> include part of the payload.
>>
>> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
>> (and skb->head_frag is not set), we end up aggregating packets in the
>> frag_list.
>>
>> This is of course not good when we are CPU-limited. It also causes a
>> worse skb->len/truesize ratio.
>>
>> So, let's avoid copying parts of the payload to the linear part. We use
>> eth_get_headlen() to parse the headers and compute the length of the
>> protocol headers, which is then used to copy only those bytes into the
>> skb's linear part.
>>
>> We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
>> stack needs to call pskb_may_pull() later on, we don't need to reallocate
>> memory.
>>
>> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
>> LRO enabled):
>>
>> BEFORE:
>> =======
>> (netserver pinned to core receiving interrupts)
>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>>  87380  16384 262144    60.01    32547.82
>>
>> (netserver pinned to adjacent core receiving interrupts)
>> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>>  87380  16384 262144    60.00    52531.67
>>
>> AFTER:
>> ======
>> (netserver pinned to core receiving interrupts)
>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>>  87380  16384 262144    60.00    52896.06
>>
>> (netserver pinned to adjacent core receiving interrupts)
>>  $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>>  87380  16384 262144    60.00    85094.90
>>
>> Additional tests across a larger range of parameters w/ and w/o LRO, w/
>> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
>> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
>> better performance with this patch.
>>
>> Reviewed-by: Eric Dumazet <edumazet@google.com>
>> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
>> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
>> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 9 +++++++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> index 75ccf40a7f17..301b33419207 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> ...
>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>  				pagep->frags++;
>>  			while (++pagep < frag_page);
>>  
>> -			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
>> -					skb->data_len);
>> +			headlen = min_t(u16, headlen - len, skb->data_len);
> 
> That looks entirely broken.
> skb->data_len can be larger than 65535 so (u16)skb->data_len can
> discard significant bits.
> 
> I can't quite see why the subtract can't overflow either.
> It is entirely non-obvious.
>
A check will be added for that.
 
> There seem to be far too many u16 local variables in this code.
> Typically they just make the code larger because they require the
> compiler mask arithmetic results to 16bits all the time.
> (Only x86 and m68k have instructions for 8 and 16bit arithmetic.)
> The same is true for function parameters and results.
> 
> I think all the min_t() in this file can easily be changed to min().
>
Will use min() here. And for the rest of the datapath files I will look
into a follow-up patch.

Thanks,
Dragos


* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-08  9:15         ` Dragos Tatulea
@ 2026-05-08 17:44           ` Amery Hung
  2026-05-08 18:42             ` Dragos Tatulea
  0 siblings, 1 reply; 14+ messages in thread
From: Amery Hung @ 2026-05-08 17:44 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Tariq Toukan, Christoph Paasch, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
	Mark Bloch, Leon Romanovsky, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Alexei Starovoitov

On Fri, May 8, 2026 at 2:15 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
>
>
> On 07.05.26 22:50, Amery Hung wrote:
> > On Thu, May 7, 2026 at 4:50 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> >>
> >>
> >> Hi Amery,
> >>
> >> On 07.05.26 15:53, Amery Hung wrote:
> >>> [...]
> >>> Am I understanding correctly that the better performance comes with
> >>> the assumption that the XDP does not change headers?
> >>>
> >>> headlen is determined before the XDP program runs. If it push/pop
> >>> headers, there could be headers in frags or data in the linear region
> >>> after __pskb_pull_tail().
> >>>
> >> That's right.
> >>
> >>>>                         if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> >>>>                                 struct mlx5e_frag_page *pfp;
> >>>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >>>>                                 pagep->frags++;
> >>>>                         while (++pagep < frag_page);
> >>>>
> >>>> -                       headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
> >>>> -                                       skb->data_len);
> >>>> +                       headlen = min_t(u16, headlen - len, skb->data_len);
> >>>
> >>> headlen - len can underflow but will be capped by skb->data_len, so
> >>> this should be okay, right?
> >> It is safe. But it might trigger an extra allocation in the pull when
> >> len > headlen. We could also skip the pull in that case. Or do a
> >> min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?
> >
> > Make sense, but this line took me a bit to understand. Maybe consider
> > checking len < headlen first?
> >
> > if (len < headlen) {
> >         headlen = min_t(u32, headlen - len, skb->data_len);
> >         __pskb_pull_tail(skb, headlen);
> > }
> >
> Yes, that's what I had in mind when skipping the pull. I would also
> tag this as likely.
>
> > Another clarifying question. So this patch will improve the
> > performance when the XDP programs don't change header length. For
> > those that encap/decap, they should precisely pull only headers into
> > the linear area for optimal performance. Is it correct?
> >
> Right for encap, but for decap not quite:
>
> Let's say that the XDP program pulls 64B header into the linear part
> and snips 4B of the encap out. This would result in a pull of an
> additional 4B (headlen (64B) - len (60B) = 4B) which are now
> data bytes => sub-optimal layout.
>
> I don't see how we can improve this corner case though.

I see. Thanks for the clarification.

I think the "if (len < headlen)" makes too many assumptions about what
the XDP program did.

How about this policy instead: If the XDP program did not create/pull
data into the linear area, pull the parsed headers; otherwise, assume
the XDP program owns the geometry. min() is still needed since the
program can shrink the packet.

if (!len) {
        headlen = min(headlen, skb->data_len);
        __pskb_pull_tail(skb, headlen);
}

This preserves the optimization for the default no-modification case,
and most importantly allow XDP program to get the optimal performance
if it gets the final geometry right.

>
> Thanks,
> Dragos


* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-08 17:44           ` Amery Hung
@ 2026-05-08 18:42             ` Dragos Tatulea
  2026-05-10  6:50               ` Dragos Tatulea
  0 siblings, 1 reply; 14+ messages in thread
From: Dragos Tatulea @ 2026-05-08 18:42 UTC (permalink / raw)
  To: Amery Hung
  Cc: Tariq Toukan, Christoph Paasch, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
	Mark Bloch, Leon Romanovsky, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Alexei Starovoitov



On 08.05.26 19:44, Amery Hung wrote:
> On Fri, May 8, 2026 at 2:15 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>>
>>
>>
>> On 07.05.26 22:50, Amery Hung wrote:
>>> On Thu, May 7, 2026 at 4:50 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>>>>
>>>>
>>>> Hi Amery,
>>>>
>>>> On 07.05.26 15:53, Amery Hung wrote:
>>>>> [...]
>>>>> Am I understanding correctly that the better performance comes with
>>>>> the assumption that the XDP does not change headers?
>>>>>
>>>>> headlen is determined before the XDP program runs. If it push/pop
>>>>> headers, there could be headers in frags or data in the linear region
>>>>> after __pskb_pull_tail().
>>>>>
>>>> That's right.
>>>>
>>>>>>                         if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>>>>>>                                 struct mlx5e_frag_page *pfp;
>>>>>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>>>>>                                 pagep->frags++;
>>>>>>                         while (++pagep < frag_page);
>>>>>>
>>>>>> -                       headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
>>>>>> -                                       skb->data_len);
>>>>>> +                       headlen = min_t(u16, headlen - len, skb->data_len);
>>>>>
>>>>> headlen - len can underflow but will be capped by skb->data_len, so
>>>>> this should be okay, right?
>>>> It is safe. But it might trigger an extra allocation in the pull when
>>>> len > headlen. We could also skip the pull in that case. Or do a
>>>> min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?
>>>
>>> Make sense, but this line took me a bit to understand. Maybe consider
>>> checking len < headlen first?
>>>
>>> if (len < headlen) {
>>>         headlen = min_t(u32, headlen - len, skb->data_len);
>>>         __pskb_pull_tail(skb, headlen);
>>> }
>>>
>> Yes, that's what I had in mind when skipping the pull. I would also
>> tag this as likely.
>>
>>> Another clarifying question. So this patch will improve the
>>> performance when the XDP programs don't change header length. For
>>> those that encap/decap, they should precisely pull only headers into
>>> the linear area for optimal performance. Is it correct?
>>>
>> Right for encap, but for decap not quite:
>>
>> Let's say that the XDP program pulls 64B header into the linear part
>> and snips 4B of the encap out. This would result in a pull of an
>> additional 4B (headlen (64B) - len (60B) = 4B) which are now
>> data bytes => sub-optimal layout.
>>
>> I don't see how we can improve this corner case though.
> 
> I see. Thanks for the clarification.
> 
> I think the "if (len < headlen)" makes too many assumptions about what
> the XDP program did.
> 
> How about this policy instead: If the XDP program did not create/pull
> data into the linear area, pull the parsed headers; otherwise, assume
> the XDP program owns the geometry. min() is still needed since the
> program can shrink the packet.
> 
> if (!len) {
>         headlen = min(headlen, skb->data_len);
>         __pskb_pull_tail(skb, headen);
> }
> 
> This preserves the optimization for the default no-modification case,
> and most importantly allow XDP program to get the optimal performance
> if it gets the final geometry right.
> 
I like this. It will also save us some neurons next time we need to
touch these lines.

Thanks,
Dragos


* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
  2026-05-08 18:42             ` Dragos Tatulea
@ 2026-05-10  6:50               ` Dragos Tatulea
  0 siblings, 0 replies; 14+ messages in thread
From: Dragos Tatulea @ 2026-05-10  6:50 UTC (permalink / raw)
  To: Amery Hung
  Cc: Tariq Toukan, Christoph Paasch, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
	Mark Bloch, Leon Romanovsky, netdev, linux-rdma, linux-kernel,
	Gal Pressman, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Alexei Starovoitov



On 08.05.26 20:42, Dragos Tatulea wrote:
> 
> 
> On 08.05.26 19:44, Amery Hung wrote:
>> On Fri, May 8, 2026 at 2:15 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>>>
>>>
>>>
>>> On 07.05.26 22:50, Amery Hung wrote:
>>>> On Thu, May 7, 2026 at 4:50 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>>>>>
>>>>>
>>>>> Hi Amery,
>>>>>
>>>>> On 07.05.26 15:53, Amery Hung wrote:
>>>>>> [...]
>>>>>> Am I understanding correctly that the better performance comes with
>>>>>> the assumption that the XDP does not change headers?
>>>>>>
>>>>>> headlen is determined before the XDP program runs. If it push/pop
>>>>>> headers, there could be headers in frags or data in the linear region
>>>>>> after __pskb_pull_tail().
>>>>>>
>>>>> That's right.
>>>>>
>>>>>>>                         if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>>>>>>>                                 struct mlx5e_frag_page *pfp;
>>>>>>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>>>>>>                                 pagep->frags++;
>>>>>>>                         while (++pagep < frag_page);
>>>>>>>
>>>>>>> -                       headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
>>>>>>> -                                       skb->data_len);
>>>>>>> +                       headlen = min_t(u16, headlen - len, skb->data_len);
>>>>>>
>>>>>> headlen - len can underflow but will be capped by skb->data_len, so
>>>>>> this should be okay, right?
>>>>> It is safe. But it might trigger an extra allocation in the pull when
>>>>> len > headlen. We could also skip the pull in that case. Or do a
>>>>> min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?
>>>>
>>>> Make sense, but this line took me a bit to understand. Maybe consider
>>>> checking len < headlen first?
>>>>
>>>> if (len < headlen) {
>>>>         headlen = min_t(u32, headlen - len, skb->data_len);
>>>>         __pskb_pull_tail(skb, headlen);
>>>> }
>>>>
>>> Yes, that's what I had in mind when skipping the pull. I would also
>>> tag this as likely.
>>>
>>>> Another clarifying question. So this patch will improve the
>>>> performance when the XDP programs don't change header length. For
>>>> those that encap/decap, they should precisely pull only headers into
>>>> the linear area for optimal performance. Is it correct?
>>>>
>>> Right for encap, but for decap not quite:
>>>
>>> Let's say that the XDP program pulls 64B header into the linear part
>>> and snips 4B of the encap out. This would result in a pull of an
>>> additional 4B (headlen (64B) - len (60B) = 4B) which are now
>>> data bytes => sub-optimal layout.
>>>
>>> I don't see how we can improve this corner case though.
>>
>> I see. Thanks for the clarification.
>>
>> I think the "if (len < headlen)" makes too many assumptions about what
>> the XDP program did.
>>
>> How about this policy instead: If the XDP program did not create/pull
>> data into the linear area, pull the parsed headers; otherwise, assume
>> the XDP program owns the geometry. min() is still needed since the
>> program can shrink the packet.
>>
>> if (!len) {
>>         headlen = min(headlen, skb->data_len);
>>         __pskb_pull_tail(skb, headlen);
>> }
>>
>> This preserves the optimization for the default no-modification case,
>> and most importantly allow XDP program to get the optimal performance
>> if it gets the final geometry right.
>>
> I like this. It will also save us some neurons next time we need to
> touch these lines.
> 
Sashiko disagrees:

"""
If an XDP program changes the packet geometry by prepending data, len will
be greater than 0, which skips the __pskb_pull_tail() call entirely.
The resulting SKB's linear part will only contain the prepended data, with
the Ethernet headers remaining in the fragments.
When the network stack later calls eth_type_trans(), it unconditionally
pulls ETH_HLEN:
net/ethernet/eth.c:eth_type_trans() {
    ...
    skb_pull_inline(skb, ETH_HLEN);
    ...
}
Could pulling 14 bytes from a smaller linear area cause skb->len to drop
below skb->data_len and trigger a BUG() in __skb_pull()?
"""

So I think we need an else where we preserve the old behavior:
  if (!len)
          headlen = min(headlen, skb->data_len);
  else
          headlen = min(MLX5E_RX_MAX_HEAD - len, skb->data_len);

  __pskb_pull_tail(skb, headlen);

Thanks,
Dragos
