* [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing
@ 2025-09-04 22:53 Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
0 siblings, 2 replies; 10+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-09-04 22:53 UTC (permalink / raw)
To: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Amery Hung
Cc: netdev, linux-rdma, bpf, Christoph Paasch
When LRO is enabled on the mlx5 NIC, mlx5e_skb_from_cqe_mpwrq_nonlinear()
copies parts of the payload to the linear part of the skb.

This triggers suboptimal processing in GRO and hurts throughput.

This patch series addresses this by using eth_get_headlen() to compute
the size of the protocol headers and copying only those bytes. This
results in a significant throughput improvement (detailed results in the
specific patch).
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
Changes in v5:
- Fix patch #2 NULL-ptr deref when an xdp-prog is set. Instead of using
skb->dev for eth_get_headlen, use rq->netdev. skb->dev will not be set
in the xdp code-path. (Reported-by: Amery Hung <ameryhung@gmail.com>)
- Kept previous Reviewed-by as the change is minor.
- Link to v4: https://lore.kernel.org/r/20250828-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v4-0-bfcd5033a77c@openai.com
Changes in v4:
- Use eth_get_headlen() instead of building a dissector based on
  struct mlx5_cqe64. This mimics what other drivers are doing as well.
  (Eric Dumazet <edumazet@google.com>)
- Link to v3: https://lore.kernel.org/r/20250825-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v3-0-5527e9eb6efc@openai.com
Changes in v3:
- Avoid computing headlen when it is not absolutely necessary (e.g., xdp
decides to "consume" the packet) (Dragos Tatulea <dtatulea@nvidia.com> & Jakub Kicinski <kuba@kernel.org>)
- Given the above change, consolidate the check for min3(...) in the new
function to avoid code duplication.
- Make sure local variables are in reverse xmas-tree order.
- Refine comment about why the check for l4_type works as-is.
- Link to v2: https://lore.kernel.org/r/20250816-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v2-0-b11b30bc2d10@openai.com
Changes in v2:
- Refine commit-message with more info and testing data
- Make mlx5e_cqe_get_min_hdr_len() return MLX5E_RX_MAX_HEAD when l3_type
is neither IPv4 nor IPv6. Same for the l4_type. That way behavior is
unchanged for other traffic types.
- Rename mlx5e_cqe_get_min_hdr_len to mlx5e_cqe_estimate_hdr_len
- Link to v1: https://lore.kernel.org/r/20250713-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v1-0-ecaed8c2844e@openai.com
---
Christoph Paasch (2):
net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear
net/mlx5: Avoid copying payload to the skb's linear part
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
---
base-commit: 5ef04a7b068cbb828eba226aacb42f880f7924d7
change-id: 20250712-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-6524917455a6
Best regards,
--
Christoph Paasch <cpaasch@openai.com>
^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear
  2025-09-04 22:53 [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
@ 2025-09-04 22:53 ` Christoph Paasch via B4 Relay
  2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-09-04 22:53 UTC (permalink / raw)
  To: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung
  Cc: netdev, linux-rdma, bpf, Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

Doing the call to dma_sync_single_for_cpu() earlier will allow us to
adjust headlen based on the actual size of the protocol headers.

Doing this earlier means that we don't need to call
mlx5e_copy_skb_header() anymore and rather can call
skb_copy_to_linear_data() directly.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..8bedbda522808cbabc8e62ae91a8c25d66725ebb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2005,17 +2005,19 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	struct skb_shared_info *sinfo;
 	unsigned int truesize = 0;
 	struct bpf_prog *prog;
+	void *va, *head_addr;
 	struct sk_buff *skb;
 	u32 linear_frame_sz;
 	u16 linear_data_len;
 	u16 linear_hr;
-	void *va;
 
 	prog = rcu_dereference(rq->xdp_prog);
 
+	head_addr = netmem_address(head_page->netmem) + head_offset;
+
 	if (prog) {
 		/* area for bpf_xdp_[store|load]_bytes */
-		net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
+		net_prefetchw(head_addr);
 		if (unlikely(mlx5e_page_alloc_fragmented(rq->page_pool,
 							 &wi->linear_page))) {
 			rq->stats->buff_alloc_err++;
@@ -2028,6 +2030,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		linear_data_len = 0;
 		linear_frame_sz = MLX5_SKB_FRAG_SZ(linear_hr + MLX5E_RX_MAX_HEAD);
 	} else {
+		dma_addr_t addr;
+
 		skb = napi_alloc_skb(rq->cq.napi,
 				     ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
 		if (unlikely(!skb)) {
@@ -2039,6 +2043,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		net_prefetchw(va); /* xdp_frame data area */
 		net_prefetchw(skb->data);
 
+		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
+		dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
+					rq->buff.map_dir);
+
 		frag_offset += headlen;
 		byte_cnt -= headlen;
 		linear_hr = skb_headroom(skb);
@@ -2117,8 +2125,6 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		}
 		__pskb_pull_tail(skb, headlen);
 	} else {
-		dma_addr_t addr;
-
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
 
@@ -2133,9 +2139,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			while (++pagep < frag_page);
 		}
 		/* copy header */
-		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
-		mlx5e_copy_skb_header(rq, skb, head_page->netmem, addr,
-				      head_offset, head_offset, headlen);
+		skb_copy_to_linear_data(skb, head_addr, headlen);
+
 		/* skb linear part was allocated with headlen and aligned to long */
 		skb->tail += headlen;
 		skb->len += headlen;

-- 
2.50.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread
* [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-04 22:53 [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
  2025-09-04 22:53 ` [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Christoph Paasch via B4 Relay
@ 2025-09-04 22:53 ` Christoph Paasch via B4 Relay
  2025-09-04 23:30   ` Amery Hung
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-09-04 22:53 UTC (permalink / raw)
  To: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Amery Hung
  Cc: netdev, linux-rdma, bpf, Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
bytes from the page-pool to the skb's linear part. Those 256 bytes
include part of the payload.

When attempting to do GRO in skb_gro_receive, if headlen > data_offset
(and skb->head_frag is not set), we end up aggregating packets in the
frag_list.

This is of course not good when we are CPU-limited. It also causes a
worse skb->len/truesize ratio.

So, let's avoid copying parts of the payload to the linear part. We use
eth_get_headlen() to parse the headers and compute the length of the
protocol headers, which will be used to copy the relevant bits to the
skb's linear part.

We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
stack needs to call pskb_may_pull() later on, we don't need to reallocate
memory.

This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
LRO enabled):

BEFORE:
=======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.01    32547.82

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.00    52531.67

AFTER:
======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.00    52896.06

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144    60.00    85094.90

Additional tests across a larger range of parameters w/ and w/o LRO, w/
and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
better performance with this patch.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
 					rq->buff.map_dir);
 
+		headlen = eth_get_headlen(rq->netdev, head_addr, headlen);
+
 		frag_offset += headlen;
 		byte_cnt -= headlen;
 		linear_hr = skb_headroom(skb);
@@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 				pagep->frags++;
 			while (++pagep < frag_page);
 		}
+
+		headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen);
+
 		__pskb_pull_tail(skb, headlen);
 	} else {
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {

-- 
2.50.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
@ 2025-09-04 23:30   ` Amery Hung
  2025-09-09  4:00     ` Christoph Paasch
  0 siblings, 1 reply; 10+ messages in thread
From: Amery Hung @ 2025-09-04 23:30 UTC (permalink / raw)
  To: cpaasch
  Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev, linux-rdma, bpf

On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay
<devnull+cpaasch.openai.com@kernel.org> wrote:
>
> From: Christoph Paasch <cpaasch@openai.com>
>
> [... commit message and netperf results trimmed ...]
>
> @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  				pagep->frags++;
>  			while (++pagep < frag_page);
>  		}
> +
> +		headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen);
> +

The size of mxbuf->xdp.data is most likely not headlen here.

The driver currently generates a xdp_buff with empty linear data, pass
it to the xdp program and assumes the layout if the xdp program does
not change the layout of the xdp_buff through bpf_xdp_adjust_head() or
bpf_xdp_adjust_tail(). The assumption is not correct and I am working
on a fix. But, if we keep that assumption for now, mxbuf->xdp.data
will not contain any headers or payload. The thing that you try to do
probably should be:

    skb_frag_t *frag = &sinfo->frags[0];

    headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
                              skb_frag_size(frag));

>  		__pskb_pull_tail(skb, headlen);
>  	} else {
>  		if (xdp_buff_has_frags(&mxbuf->xdp)) {
>
> --
> 2.50.1

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-04 23:30   ` Amery Hung
@ 2025-09-09  4:00     ` Christoph Paasch
  2025-09-09 15:18       ` Christoph Paasch
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Paasch @ 2025-09-09  4:00 UTC (permalink / raw)
  To: Amery Hung
  Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev, linux-rdma, bpf

On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> [... quoted patch trimmed ...]
>
> The driver currently generates a xdp_buff with empty linear data, pass
> it to the xdp program and assumes the layout if the xdp program does
> not change the layout of the xdp_buff through bpf_xdp_adjust_head() or
> bpf_xdp_adjust_tail(). The assumption is not correct and I am working
> on a fix. But, if we keep that assumption for now, mxbuf->xdp.data
> will not contain any headers or payload. The thing that you try to do
> probably should be:
>
>     skb_frag_t *frag = &sinfo->frags[0];
>
>     headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
>                               skb_frag_size(frag));

Ok, I think I understand what you mean! Thanks for taking the time to explain!

I will do some tests on my side to make sure I get it right.

As your change goes to net and mine to net-next, I can wait until yours
is in the tree so that there aren't any conflicts that need to be
taken care of.


Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-09-09  4:00     ` Christoph Paasch
@ 2025-09-09 15:18       ` Christoph Paasch
  2025-09-10  3:17         ` Amery Hung
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Paasch @ 2025-09-09 15:18 UTC (permalink / raw)
  To: Amery Hung
  Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev, linux-rdma, bpf

On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote:
>
> On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> > [... quoted patch trimmed ...]
> >
> > The driver currently generates a xdp_buff with empty linear data, pass
> > it to the xdp program and assumes the layout if the xdp program does
> > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or
> > bpf_xdp_adjust_tail(). The assumption is not correct and I am working
> > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data
> > will not contain any headers or payload. The thing that you try to do
> > probably should be:
> >
> >     skb_frag_t *frag = &sinfo->frags[0];
> >
> >     headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> >                               skb_frag_size(frag));

So, when I look at the headlen I get, it is correct (even with my old
code using mxbuf->xdp.data).

To make sure I test the right thing, in which scenario would
mxbuf->xdp.data not contain any headers or payload? What do I need to
do to reproduce that?

Thanks,
Christoph

> Ok, I think I understand what you mean! Thanks for taking the time to explain!
>
> I will do some tests on my side to make sure I get it right.
>
> As your change goes to net and mine to net-next, I can wait until yours
> is in the tree so that there aren't any conflicts that need to be
> taken care of.
>
>
> Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-09 15:18 ` Christoph Paasch @ 2025-09-10 3:17 ` Amery Hung 2025-09-10 17:36 ` Christoph Paasch 0 siblings, 1 reply; 10+ messages in thread From: Amery Hung @ 2025-09-10 3:17 UTC (permalink / raw) To: Christoph Paasch Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf On Tue, Sep 9, 2025 at 11:18 AM Christoph Paasch <cpaasch@openai.com> wrote: > > On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote: > > > > On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > > > On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay > > > <devnull+cpaasch.openai.com@kernel.org> wrote: > > > > > > > > From: Christoph Paasch <cpaasch@openai.com> > > > > > > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256) > > > > bytes from the page-pool to the skb's linear part. Those 256 bytes > > > > include part of the payload. > > > > > > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset > > > > (and skb->head_frag is not set), we end up aggregating packets in the > > > > frag_list. > > > > > > > > This is of course not good when we are CPU-limited. Also causes a worse > > > > skb->len/truesize ratio,... > > > > > > > > So, let's avoid copying parts of the payload to the linear part. We use > > > > eth_get_headlen() to parse the headers and compute the length of the > > > > protocol headers, which will be used to copy the relevant bits ot the > > > > skb's linear part. > > > > > > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking > > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate > > > > memory. 
> > > > > > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and > > > > LRO enabled): > > > > > > > > BEFORE: > > > > ======= > > > > (netserver pinned to core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.01 32547.82 > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.00 52531.67 > > > > > > > > AFTER: > > > > ====== > > > > (netserver pinned to core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.00 52896.06 > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > 87380 16384 262144 60.00 85094.90 > > > > > > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/ > > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different > > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or > > > > better performance with this patch. 
> > > > > > > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > > > Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> > > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com> > > > > --- > > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++ > > > > 1 file changed, 5 insertions(+) > > > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644 > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen, > > > > rq->buff.map_dir); > > > > > > > > + headlen = eth_get_headlen(rq->netdev, head_addr, headlen); > > > > + > > > > frag_offset += headlen; > > > > byte_cnt -= headlen; > > > > linear_hr = skb_headroom(skb); > > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > pagep->frags++; > > > > while (++pagep < frag_page); > > > > } > > > > + > > > > + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); > > > > + > > > > > > The size of mxbuf->xdp.data is most likely not headlen here. > > > > > > The driver currently generates a xdp_buff with empty linear data, pass > > > it to the xdp program and assumes the layout If the xdp program does > > > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or > > > bpf_xdp_adjust_tail(). The assumption is not correct and I am working > > > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data > > > will not contain any headers or payload. 
> > > The thing that you try to do probably should be:
> > >
> > > skb_frag_t *frag = &sinfo->frags[0];
> > >
> > > headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> > >                           skb_frag_size(frag));
>
> So, when I look at the headlen I get, it is correct (even with my old
> code using mxbuf->xdp.data).
>
> To make sure I test the right thing, in which scenario would
> mxbuf->xdp.data not contain any headers or payload? What do I need to
> do to reproduce that?

A quick look at the code, could it be that
skb_flow_dissect_flow_keys_basic() returns false so that
eth_get_headlen() always returns sizeof(*eth)? The linear part
contains nothing meaningful before __pskb_pull_tail(), so it is
possible for skb_flow_dissect_flow_keys_basic() to fail.

>
> Thanks,
> Christoph
>
> > Ok, I think I understand what you mean! Thanks for taking the time to explain!
> >
> > I will do some tests on my side to make sure I get it right.
> >
> > As your change goes to net and mine to net-next, I can wait until yours
> > is in the tree so that there aren't any conflicts that need to be
> > taken care of.

Will copy you in the mlx5 non-linear xdp fixing patchset.

> > Christoph
> > >
> > > > __pskb_pull_tail(skb, headlen);
> > > > } else {
> > > > if (xdp_buff_has_frags(&mxbuf->xdp)) {
> > > >
> > > > --
> > > > 2.50.1

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-10 3:17 ` Amery Hung @ 2025-09-10 17:36 ` Christoph Paasch 2025-09-10 19:09 ` Amery Hung 0 siblings, 1 reply; 10+ messages in thread From: Christoph Paasch @ 2025-09-10 17:36 UTC (permalink / raw) To: Amery Hung Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf On Tue, Sep 9, 2025 at 8:17 PM Amery Hung <ameryhung@gmail.com> wrote: > > On Tue, Sep 9, 2025 at 11:18 AM Christoph Paasch <cpaasch@openai.com> wrote: > > > > On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote: > > > > > > On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > > > > > On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay > > > > <devnull+cpaasch.openai.com@kernel.org> wrote: > > > > > > > > > > From: Christoph Paasch <cpaasch@openai.com> > > > > > > > > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256) > > > > > bytes from the page-pool to the skb's linear part. Those 256 bytes > > > > > include part of the payload. > > > > > > > > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset > > > > > (and skb->head_frag is not set), we end up aggregating packets in the > > > > > frag_list. > > > > > > > > > > This is of course not good when we are CPU-limited. Also causes a worse > > > > > skb->len/truesize ratio,... > > > > > > > > > > So, let's avoid copying parts of the payload to the linear part. We use > > > > > eth_get_headlen() to parse the headers and compute the length of the > > > > > protocol headers, which will be used to copy the relevant bits ot the > > > > > skb's linear part. 
> > > > > > > > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking > > > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate > > > > > memory. > > > > > > > > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and > > > > > LRO enabled): > > > > > > > > > > BEFORE: > > > > > ======= > > > > > (netserver pinned to core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.01 32547.82 > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.00 52531.67 > > > > > > > > > > AFTER: > > > > > ====== > > > > > (netserver pinned to core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.00 52896.06 > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > 87380 16384 262144 60.00 85094.90 > > > > > > > > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/ > > > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different > > > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or > > > > > better performance with this patch. 
> > > > > > > > > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > > > > Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> > > > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com> > > > > > --- > > > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++ > > > > > 1 file changed, 5 insertions(+) > > > > > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644 > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen, > > > > > rq->buff.map_dir); > > > > > > > > > > + headlen = eth_get_headlen(rq->netdev, head_addr, headlen); > > > > > + > > > > > frag_offset += headlen; > > > > > byte_cnt -= headlen; > > > > > linear_hr = skb_headroom(skb); > > > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > pagep->frags++; > > > > > while (++pagep < frag_page); > > > > > } > > > > > + > > > > > + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); > > > > > + > > > > > > > > The size of mxbuf->xdp.data is most likely not headlen here. > > > > > > > > The driver currently generates a xdp_buff with empty linear data, pass > > > > it to the xdp program and assumes the layout If the xdp program does > > > > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or > > > > bpf_xdp_adjust_tail(). The assumption is not correct and I am working > > > > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data > > > > will not contain any headers or payload. 
> > > > The thing that you try to do probably should be:
> > > >
> > > > skb_frag_t *frag = &sinfo->frags[0];
> > > >
> > > > headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> > > >                           skb_frag_size(frag));
> >
> > So, when I look at the headlen I get, it is correct (even with my old
> > code using mxbuf->xdp.data).
> >
> > To make sure I test the right thing, in which scenario would
> > mxbuf->xdp.data not contain any headers or payload? What do I need to
> > do to reproduce that?
>
> A quick look at the code, could it be that
> skb_flow_dissect_flow_keys_basic() returns false so that
> eth_get_headlen() always returns sizeof(*eth)?

No, the headlen values were correct (meaning, it was the actual length
of the headers). This is TCP traffic with a simple print after
eth_get_headlen():

[130982.311088] mlx5e_skb_from_cqe_mpwrq_nonlinear xdp headlen is 86

So, eth_get_headlen() was able to correctly parse things.

My xdp-program is as simple as possible:

SEC("xdp.frags")
int xdp_pass_prog(struct xdp_md *ctx)
{
	return XDP_PASS;
}

> The linear part
> contains nothing meaningful before __pskb_pull_tail(), so it is
> possible for skb_flow_dissect_flow_keys_basic() to fail.
>
> >
> > Thanks,
> > Christoph
> >
> > > Ok, I think I understand what you mean! Thanks for taking the time to explain!
> > >
> > > I will do some tests on my side to make sure I get it right.
> > >
> > > As your change goes to net and mine to net-next, I can wait until yours
> > > is in the tree so that there aren't any conflicts that need to be
> > > taken care of.
>
> Will copy you in the mlx5 non-linear xdp fixing patchset.

Thx!


Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-10 17:36 ` Christoph Paasch @ 2025-09-10 19:09 ` Amery Hung 2026-02-11 17:12 ` Dragos Tatulea 0 siblings, 1 reply; 10+ messages in thread From: Amery Hung @ 2025-09-10 19:09 UTC (permalink / raw) To: Christoph Paasch Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf On Wed, Sep 10, 2025 at 1:36 PM Christoph Paasch <cpaasch@openai.com> wrote: > > On Tue, Sep 9, 2025 at 8:17 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > On Tue, Sep 9, 2025 at 11:18 AM Christoph Paasch <cpaasch@openai.com> wrote: > > > > > > On Mon, Sep 8, 2025 at 9:00 PM Christoph Paasch <cpaasch@openai.com> wrote: > > > > > > > > On Thu, Sep 4, 2025 at 4:30 PM Amery Hung <ameryhung@gmail.com> wrote: > > > > > > > > > > On Thu, Sep 4, 2025 at 3:57 PM Christoph Paasch via B4 Relay > > > > > <devnull+cpaasch.openai.com@kernel.org> wrote: > > > > > > > > > > > > From: Christoph Paasch <cpaasch@openai.com> > > > > > > > > > > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256) > > > > > > bytes from the page-pool to the skb's linear part. Those 256 bytes > > > > > > include part of the payload. > > > > > > > > > > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset > > > > > > (and skb->head_frag is not set), we end up aggregating packets in the > > > > > > frag_list. > > > > > > > > > > > > This is of course not good when we are CPU-limited. Also causes a worse > > > > > > skb->len/truesize ratio,... > > > > > > > > > > > > So, let's avoid copying parts of the payload to the linear part. 
We use > > > > > > eth_get_headlen() to parse the headers and compute the length of the > > > > > > protocol headers, which will be used to copy the relevant bits ot the > > > > > > skb's linear part. > > > > > > > > > > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking > > > > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate > > > > > > memory. > > > > > > > > > > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and > > > > > > LRO enabled): > > > > > > > > > > > > BEFORE: > > > > > > ======= > > > > > > (netserver pinned to core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.01 32547.82 > > > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.00 52531.67 > > > > > > > > > > > > AFTER: > > > > > > ====== > > > > > > (netserver pinned to core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.00 52896.06 > > > > > > > > > > > > (netserver pinned to adjacent core receiving interrupts) > > > > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K > > > > > > 87380 16384 262144 60.00 85094.90 > > > > > > > > > > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/ > > > > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different > > > > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or > > > > > > better performance with this patch. 
> > > > > > > > > > > > Reviewed-by: Eric Dumazet <edumazet@google.com> > > > > > > Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> > > > > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com> > > > > > > --- > > > > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++ > > > > > > 1 file changed, 5 insertions(+) > > > > > > > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..0ac31c7fb64cd60720d390de45a5b6b453ed0a3f 100644 > > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen, > > > > > > rq->buff.map_dir); > > > > > > > > > > > > + headlen = eth_get_headlen(rq->netdev, head_addr, headlen); > > > > > > + > > > > > > frag_offset += headlen; > > > > > > byte_cnt -= headlen; > > > > > > linear_hr = skb_headroom(skb); > > > > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > > > pagep->frags++; > > > > > > while (++pagep < frag_page); > > > > > > } > > > > > > + > > > > > > + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); > > > > > > + > > > > > > > > > > The size of mxbuf->xdp.data is most likely not headlen here. > > > > > > > > > > The driver currently generates a xdp_buff with empty linear data, pass > > > > > it to the xdp program and assumes the layout If the xdp program does > > > > > not change the layout of the xdp_buff through bpf_xdp_adjust_head() or > > > > > bpf_xdp_adjust_tail(). The assumption is not correct and I am working > > > > > on a fix. But, if we keep that assumption for now, mxbuf->xdp.data > > > > > will not contain any headers or payload. 
> > > > > The thing that you try to do probably should be:
> > > > >
> > > > > skb_frag_t *frag = &sinfo->frags[0];
> > > > >
> > > > > headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
> > > > >                           skb_frag_size(frag));
> > >
> > > So, when I look at the headlen I get, it is correct (even with my old
> > > code using mxbuf->xdp.data).
> > >
> > > To make sure I test the right thing, in which scenario would
> > > mxbuf->xdp.data not contain any headers or payload? What do I need to
> > > do to reproduce that?
> >
> > A quick look at the code, could it be that
> > skb_flow_dissect_flow_keys_basic() returns false so that
> > eth_get_headlen() always returns sizeof(*eth)?
>
> No, the headlen values were correct (meaning, it was the actual length
> of the headers):

Another possibility is that mxbuf->xdp is reused and not zeroed
between calls to mlx5e_skb_from_cqe_mpwrq_nonlinear(). The stale
headers might have been written to mxbuf->xdp.data before the XDP
program is attached.

I am not sure what exactly happens, but my main question remains: when
the XDP program is attached and does nothing, the linear data will be
empty, so what is eth_get_headlen() parsing here...?

> This is TCP traffic with a simple print after eth_get_headlen():
> [130982.311088] mlx5e_skb_from_cqe_mpwrq_nonlinear xdp headlen is 86
>
> So, eth_get_headlen() was able to correctly parse things.
>
> My xdp-program is as simple as possible:
> SEC("xdp.frags")
> int xdp_pass_prog(struct xdp_md *ctx)
> {
>         return XDP_PASS;
> }
>
> > The linear part
> > contains nothing meaningful before __pskb_pull_tail(), so it is
> > possible for skb_flow_dissect_flow_keys_basic() to fail.
> >
> > >
> > > Thanks,
> > > Christoph
> > >
> > > > Ok, I think I understand what you mean! Thanks for taking the time to explain!
> > > >
> > > > I will do some tests on my side to make sure I get it right.
> > > >
> > > > As your change goes to net and mine to net-next, I can wait until yours
> > > > is in the tree so that there aren't any conflicts that need to be
> > > > taken care of.
> >
> > Will copy you in the mlx5 non-linear xdp fixing patchset.
>
> Thx!
>
>
> Christoph

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part 2025-09-10 19:09 ` Amery Hung @ 2026-02-11 17:12 ` Dragos Tatulea 0 siblings, 0 replies; 10+ messages in thread From: Dragos Tatulea @ 2026-02-11 17:12 UTC (permalink / raw) To: Amery Hung, Christoph Paasch Cc: Gal Pressman, Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev, netdev, linux-rdma, bpf Hi Christoph, On 10.09.25 21:09, Amery Hung wrote: > On Wed, Sep 10, 2025 at 1:36 PM Christoph Paasch <cpaasch@openai.com> wrote: [...] >>>>>>> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and >>>>>>> LRO enabled): >>>>>>> >>>>>>> BEFORE: >>>>>>> ======= >>>>>>> (netserver pinned to core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.01 32547.82 >>>>>>> >>>>>>> (netserver pinned to adjacent core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.00 52531.67 >>>>>>> >>>>>>> AFTER: >>>>>>> ====== >>>>>>> (netserver pinned to core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.00 52896.06 >>>>>>> >>>>>>> (netserver pinned to adjacent core receiving interrupts) >>>>>>> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K >>>>>>> 87380 16384 262144 60.00 85094.90 >>>>>>> We still want this nice optimization. Seems like last discussion was stuck on reading the header length for XDP. See below for sugession. [...] 
>>>>>>> @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w >>>>>>> pagep->frags++; >>>>>>> while (++pagep < frag_page); >>>>>>> } >>>>>>> + >>>>>>> + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen); >>>>>>> + >>>>>> >>>>>> The size of mxbuf->xdp.data is most likely not headlen here. >>>>>> >>>>>> The driver currently generates a xdp_buff with empty linear data, pass >>>>>> it to the xdp program and assumes the layout If the xdp program does >>>>>> not change the layout of the xdp_buff through bpf_xdp_adjust_head() or >>>>>> bpf_xdp_adjust_tail(). The assumption is not correct and I am working >>>>>> on a fix. But, if we keep that assumption for now, mxbuf->xdp.data >>>>>> will not contain any headers or payload. The thing that you try to do >>>>>> probably should be: >>>>>> >>>>>> skb_frag_t *frag = &sinfo->frags[0]; >>>>>> >>>>>> headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag), >>>>>> skb_frag_size(frag)); >>>> >>>> So, when I look at the headlen I get, it is correct (even with my old >>>> code using mxbuf->xdp.data). >>>> >>>> To make sure I test the right thing, which scenario would >>>> mxbuf->xdp.data not contain any headers or payload ? What do I need to >>>> do to reproduce that ? >>> >>> A quick look at the code, could it be that >>> skb_flow_dissect_flow_keys_basic() returns false so that >>> eth_get_headlen() always returns sizeof(*eth)? >> >> No, the headlen values were correct (meaning, it was the actual length >> of the headers): >> > > Another possibility is that mxbuf->xdp is reused and not zeroed > between calls to mlx5e_skb_from_cqe_mpwrq_nonlinear(). The stale > headers might have been written to mxbuf->xdp.data before the XDP is > attached. > > I am not sure what exactly happens, but my main question remains. when > the XDP program is attached and does nothing, the linear data will be > empty, so what is eth_get_headlen() parsing here...? 
>

AFAIU there could be 2 cases here:
1) The XDP program doesn't modify the XDP buffer, so the header is in
   the first frag.
2) The XDP program modifies the XDP buffer and pulls some len bytes
   into the linear part.

So how about always reading headlen from the first frag *before* the
XDP program execution, and afterwards calculating the new headlen for
__pskb_pull_tail() based on previous headlen - len?

Like this (untested):

```
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2159,8 +2159,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	if (prog) {
 		u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
+		skb_frag_t *frag = &sinfo->frags[0];
 		u32 len;
 
+		headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
+					  skb_frag_size(frag));
+
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT,
 						 rq->flags)) {
@@ -2208,8 +2212,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			pagep->frags++;
 		while (++pagep < frag_page);
 
-		headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
-				skb->data_len);
+		headlen = min_t(u16, headlen - len, skb->data_len);
 		__pskb_pull_tail(skb, headlen);
 	}
 } else {
```

Amery, do you agree? Would you like to re-send the patch after the
merge window closes? If you are busy we can also pick it up ourselves,
do the required modifications and send it keeping your authorship and
the existing tags.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-02-11 17:12 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-04 22:53 [PATCH net-next v5 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 1/2] net/mlx5: DMA-sync earlier in mlx5e_skb_from_cqe_mpwrq_nonlinear Christoph Paasch via B4 Relay
2025-09-04 22:53 ` [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-09-04 23:30   ` Amery Hung
2025-09-09  4:00     ` Christoph Paasch
2025-09-09 15:18       ` Christoph Paasch
2025-09-10  3:17         ` Amery Hung
2025-09-10 17:36           ` Christoph Paasch
2025-09-10 19:09             ` Amery Hung
2026-02-11 17:12               ` Dragos Tatulea