netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing
@ 2025-07-13 23:33 Christoph Paasch via B4 Relay
  2025-07-13 23:33 ` [PATCH net-next 1/2] net/mlx5: Bring back get_cqe_l3_hdr_type Christoph Paasch via B4 Relay
  2025-07-13 23:33 ` [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-07-13 23:33 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: linux-rdma, netdev, Christoph Paasch

When LRO is enabled on the mlx5 NIC, mlx5e_skb_from_cqe_mpwrq_nonlinear()
copies parts of the payload to the linear part of the skb.

This triggers suboptimal processing in GRO and hurts throughput.

This patch series addresses this by copying only a lower-bound estimate
of the protocol headers, avoiding the payload as much as possible. This
results in a significant throughput improvement (detailed results in the
respective patch).

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
Christoph Paasch (2):
      net/mlx5: Bring back get_cqe_l3_hdr_type
      net/mlx5: Avoid copying payload to the skb's linear part

 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 33 ++++++++++++++++++++++++-
 include/linux/mlx5/device.h                     | 12 ++++++++-
 2 files changed, 43 insertions(+), 2 deletions(-)
---
base-commit: a52f9f0d77f20efc285908a28b5697603b6597c7
change-id: 20250712-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-6524917455a6

Best regards,
-- 
Christoph Paasch <cpaasch@openai.com>




* [PATCH net-next 1/2] net/mlx5: Bring back get_cqe_l3_hdr_type
  2025-07-13 23:33 [PATCH net-next 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
@ 2025-07-13 23:33 ` Christoph Paasch via B4 Relay
  2025-07-13 23:33 ` [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-07-13 23:33 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: linux-rdma, netdev, Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

Commit 66af4fe37119 ("net/mlx5: Remove unused functions") removed
get_cqe_l3_hdr_type. Let's bring it back.

Also, define CQE_L3_HDR_TYPE_* to identify IPv6 and IPv4 packets.

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 include/linux/mlx5/device.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 6822cfa5f4ad31ef6fb61be2dcfae3ac9a46d659..a0098df1b82bf746bf0132d547440b9dab22e57f 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -926,11 +926,16 @@ static inline u8 get_cqe_lro_tcppsh(struct mlx5_cqe64 *cqe)
 	return (cqe->lro.tcppsh_abort_dupack >> 6) & 1;
 }
 
-static inline u8 get_cqe_l4_hdr_type(struct mlx5_cqe64 *cqe)
+static inline u8 get_cqe_l4_hdr_type(const struct mlx5_cqe64 *cqe)
 {
 	return (cqe->l4_l3_hdr_type >> 4) & 0x7;
 }
 
+static inline u8 get_cqe_l3_hdr_type(const struct mlx5_cqe64 *cqe)
+{
+	return (cqe->l4_l3_hdr_type >> 2) & 0x3;
+}
+
 static inline bool cqe_is_tunneled(struct mlx5_cqe64 *cqe)
 {
 	return cqe->tls_outer_l3_tunneled & 0x1;
@@ -1011,6 +1016,11 @@ enum {
 	CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA	= 0x4,
 };
 
+enum {
+	CQE_L3_HDR_TYPE_IPV6			= 0x1,
+	CQE_L3_HDR_TYPE_IPV4			= 0x2,
+};
+
 enum {
 	CQE_RSS_HTYPE_IP	= GENMASK(3, 2),
 	/* cqe->rss_hash_type[3:2] - IP destination selected for hash

-- 
2.49.0




* [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-13 23:33 [PATCH net-next 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
  2025-07-13 23:33 ` [PATCH net-next 1/2] net/mlx5: Bring back get_cqe_l3_hdr_type Christoph Paasch via B4 Relay
@ 2025-07-13 23:33 ` Christoph Paasch via B4 Relay
  2025-07-14  7:29   ` Tariq Toukan
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-07-13 23:33 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: linux-rdma, netdev, Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
bytes from the page-pool to the skb's linear part. Those 256 bytes
include part of the payload.

When attempting to do GRO in skb_gro_receive, if headlen > data_offset
(and skb->head_frag is not set), we end up aggregating packets in the
frag_list.
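
For reference, a simplified model of that decision (illustrative only;
the helpers below are the real skb/GRO accessors, but this is not the
exact upstream code):

	/* Rough sketch of what skb_gro_receive() (net/core/gro.c) checks
	 * before falling back to frag_list chaining.
	 */
	static bool gro_needs_frag_list(const struct sk_buff *skb)
	{
		unsigned int offset  = skb_gro_offset(skb); /* headers consumed by GRO */
		unsigned int headlen = skb_headlen(skb);    /* bytes in the linear part */

		if (headlen <= offset)	/* payload fully in frags: frag coalescing */
			return false;
		if (skb->head_frag)	/* linear part is a page frag: still coalesces */
			return false;
		return true;		/* otherwise the skb gets chained on the frag_list */
	}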

This is of course not good when we are CPU-limited. It also causes a
worse skb->len/truesize ratio.

So, let's avoid copying parts of the payload to the linear part. The
goal here is to err on the side of caution and prefer to copy too little
rather than too much (because once payload has been copied over, we
trigger the behavior described above in skb_gro_receive()).

So, we do a rough lower-bound estimate of the header length by looking
at the CQE's l3/l4_hdr_type. This is done in the new
mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
are present, as that's the most common use-case.

That header length is then used as the headlen in
mlx5e_skb_from_cqe_mpwrq_nonlinear() (which defines how many bytes are
copied to the skb's linear part). We still allocate MLX5E_RX_MAX_HEAD
bytes for the skb so that if the networking stack needs to call
pskb_may_pull() later on, we don't need to reallocate memory.

This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
LRO enabled):

BEFORE:
=======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.01    32547.82

(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.00    52531.67

AFTER:
======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.00    52896.06

(netserver pinned to adjacent core receiving interrupts)
 $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
 87380  16384 262144    60.00    85094.90

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 33 ++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2bb32082bfccdc85d26987f792eb8c1047e44dd0..2de669707623882058e3e77f82d74893e5d6fefe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1986,13 +1986,40 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
 	} while (data_bcnt);
 }
 
+static u16
+mlx5e_cqe_get_min_hdr_len(const struct mlx5_cqe64 *cqe)
+{
+	u16 min_hdr_len = sizeof(struct ethhdr);
+	u8 l3_type = get_cqe_l3_hdr_type(cqe);
+	u8 l4_type = get_cqe_l4_hdr_type(cqe);
+
+	if (cqe_has_vlan(cqe))
+		min_hdr_len += VLAN_HLEN;
+
+	if (l3_type == CQE_L3_HDR_TYPE_IPV4)
+		min_hdr_len += sizeof(struct iphdr);
+	else if (l3_type == CQE_L3_HDR_TYPE_IPV6)
+		min_hdr_len += sizeof(struct ipv6hdr);
+
+	if (l4_type == CQE_L4_HDR_TYPE_UDP)
+		min_hdr_len += sizeof(struct udphdr);
+	else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
+			    CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
+			    CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA))
+		/* Previous condition works because we know that
+		 * l4_type != 0x2 (CQE_L4_HDR_TYPE_UDP)
+		 */
+		min_hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
+
+	return min_hdr_len;
+}
+
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
 				   u32 page_idx)
 {
 	struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
-	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_frag_page *head_page = frag_page;
 	struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
 	u32 frag_offset    = head_offset;
@@ -2004,10 +2031,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	u32 linear_frame_sz;
 	u16 linear_data_len;
 	u16 linear_hr;
+	u16 headlen;
 	void *va;
 
 	prog = rcu_dereference(rq->xdp_prog);
 
+	headlen = min3(mlx5e_cqe_get_min_hdr_len(cqe), cqe_bcnt,
+		       (u16)MLX5E_RX_MAX_HEAD);
+
 	if (prog) {
 		/* area for bpf_xdp_[store|load]_bytes */
 		net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);

-- 
2.49.0




* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-13 23:33 ` [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
@ 2025-07-14  7:29   ` Tariq Toukan
  2025-07-14 21:41     ` Christoph Paasch
  2025-07-21 21:44     ` Christoph Paasch
  2025-07-14 13:59   ` Gal Pressman
  2025-07-14 14:20   ` Alexander Lobakin
  2 siblings, 2 replies; 11+ messages in thread
From: Tariq Toukan @ 2025-07-14  7:29 UTC (permalink / raw)
  To: cpaasch, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: linux-rdma, netdev



On 14/07/2025 2:33, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
> 
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
> 
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
> 
> This is of course not good when we are CPU-limited. Also causes a worse
> skb->len/truesize ratio,...
> 
> So, let's avoid copying parts of the payload to the linear part. The
> goal here is to err on the side of caution and prefer to copy too little
> instead of copying too much (because once it has been copied over, we
> trigger the above described behavior in skb_gro_receive).
> 
> So, we can do a rough estimate of the header-space by looking at
> cqe_l3/l4_hdr_type and kind of do a lower-bound estimate. This is now
> done in mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
> are present, as that's the most common use-case.
> 
> That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> the headlen (which defines what is being copied over). We still
> allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
> 
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
> 
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>   87380  16384 262144    60.01    32547.82
> 
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>   87380  16384 262144    60.00    52531.67
> 
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>   87380  16384 262144    60.00    52896.06
> 
> (netserver pinned to adjacent core receiving interrupts)
>   $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>   87380  16384 262144    60.00    85094.90
> 

Nice improvement.

Did you test impact on other archs?

Did you test impact on non-LRO flows?
Specifically:
a. Large MTU, tcp stream.
b. Large MTU, small UDP packets.


> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 33 ++++++++++++++++++++++++-
>   1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 2bb32082bfccdc85d26987f792eb8c1047e44dd0..2de669707623882058e3e77f82d74893e5d6fefe 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1986,13 +1986,40 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
>   	} while (data_bcnt);
>   }
>   
> +static u16
> +mlx5e_cqe_get_min_hdr_len(const struct mlx5_cqe64 *cqe)
> +{
> +	u16 min_hdr_len = sizeof(struct ethhdr);
> +	u8 l3_type = get_cqe_l3_hdr_type(cqe);
> +	u8 l4_type = get_cqe_l4_hdr_type(cqe);
> +
> +	if (cqe_has_vlan(cqe))
> +		min_hdr_len += VLAN_HLEN;
> +
> +	if (l3_type == CQE_L3_HDR_TYPE_IPV4)
> +		min_hdr_len += sizeof(struct iphdr);
> +	else if (l3_type == CQE_L3_HDR_TYPE_IPV6)
> +		min_hdr_len += sizeof(struct ipv6hdr);
> +
> +	if (l4_type == CQE_L4_HDR_TYPE_UDP)
> +		min_hdr_len += sizeof(struct udphdr);
> +	else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> +			    CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> +			    CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA))
> +		/* Previous condition works because we know that
> +		 * l4_type != 0x2 (CQE_L4_HDR_TYPE_UDP)
> +		 */
> +		min_hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> +
> +	return min_hdr_len;
> +}
> +
>   static struct sk_buff *
>   mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>   				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
>   				   u32 page_idx)

BTW, this function handles IPoIB as well, not only Eth.

>   {
>   	struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> -	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
>   	struct mlx5e_frag_page *head_page = frag_page;
>   	struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
>   	u32 frag_offset    = head_offset;
> @@ -2004,10 +2031,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>   	u32 linear_frame_sz;
>   	u16 linear_data_len;
>   	u16 linear_hr;
> +	u16 headlen;
>   	void *va;
>   
>   	prog = rcu_dereference(rq->xdp_prog);
>   
> +	headlen = min3(mlx5e_cqe_get_min_hdr_len(cqe), cqe_bcnt,
> +		       (u16)MLX5E_RX_MAX_HEAD);
> +
>   	if (prog) {
>   		/* area for bpf_xdp_[store|load]_bytes */
>   		net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
> 



* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-13 23:33 ` [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
  2025-07-14  7:29   ` Tariq Toukan
@ 2025-07-14 13:59   ` Gal Pressman
  2025-07-14 22:06     ` Christoph Paasch
  2025-07-14 14:20   ` Alexander Lobakin
  2 siblings, 1 reply; 11+ messages in thread
From: Gal Pressman @ 2025-07-14 13:59 UTC (permalink / raw)
  To: cpaasch, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: linux-rdma, netdev

On 14/07/2025 2:33, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
> 
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
> 
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
> 
> This is of course not good when we are CPU-limited. Also causes a worse
> skb->len/truesize ratio,...
> 
> So, let's avoid copying parts of the payload to the linear part. The
> goal here is to err on the side of caution and prefer to copy too little
> instead of copying too much (because once it has been copied over, we
> trigger the above described behavior in skb_gro_receive).
> 
> So, we can do a rough estimate of the header-space by looking at
> cqe_l3/l4_hdr_type and kind of do a lower-bound estimate. This is now
> done in mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
> are present, as that's the most common use-case.
> 
> That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> the headlen (which defines what is being copied over). We still
> allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
> 
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
> 
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.01    32547.82
> 
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52531.67
> 
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52896.06
> 
> (netserver pinned to adjacent core receiving interrupts)
>  $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    85094.90
> 
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>

Cool change, thanks!

This patch doesn't take encapsulated packets into account, where the
l3/l4 indications apply for the inner packet, while you assume outer.

Also, for encapsulated packets we will *always* have to pull data into
the linear part, which might overshadow the improvement you're trying to
achieve?


* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-13 23:33 ` [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
  2025-07-14  7:29   ` Tariq Toukan
  2025-07-14 13:59   ` Gal Pressman
@ 2025-07-14 14:20   ` Alexander Lobakin
  2025-07-14 22:22     ` Christoph Paasch
  2 siblings, 1 reply; 11+ messages in thread
From: Alexander Lobakin @ 2025-07-14 14:20 UTC (permalink / raw)
  To: cpaasch
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-rdma, netdev

From: Christoph Paasch Via B4 Relay <devnull+cpaasch.openai.com@kernel.org>
Date: Sun, 13 Jul 2025 16:33:07 -0700

> From: Christoph Paasch <cpaasch@openai.com>
> 
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
> 
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the

How did you end up with ->head_frag not set? IIRC mlx5 uses
napi_build_skb(), which explicitly sets ->head_frag to true.
It should be false only for kmalloced linear parts.

> frag_list.
> 
> This is of course not good when we are CPU-limited. Also causes a worse
> skb->len/truesize ratio,...
> 
> So, let's avoid copying parts of the payload to the linear part. The
> goal here is to err on the side of caution and prefer to copy too little
> instead of copying too much (because once it has been copied over, we
> trigger the above described behavior in skb_gro_receive).
> 
> So, we can do a rough estimate of the header-space by looking at
> cqe_l3/l4_hdr_type and kind of do a lower-bound estimate. This is now
> done in mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
> are present, as that's the most common use-case.
> 
> That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> the headlen (which defines what is being copied over). We still
> allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
> 
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
> 
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.01    32547.82
> 
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52531.67
> 
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    52896.06
> 
> (netserver pinned to adjacent core receiving interrupts)
>  $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>  87380  16384 262144    60.00    85094.90
> 
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 33 ++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 2bb32082bfccdc85d26987f792eb8c1047e44dd0..2de669707623882058e3e77f82d74893e5d6fefe 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1986,13 +1986,40 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
>  	} while (data_bcnt);
>  }
>  
> +static u16
> +mlx5e_cqe_get_min_hdr_len(const struct mlx5_cqe64 *cqe)
> +{
> +	u16 min_hdr_len = sizeof(struct ethhdr);
> +	u8 l3_type = get_cqe_l3_hdr_type(cqe);
> +	u8 l4_type = get_cqe_l4_hdr_type(cqe);
> +
> +	if (cqe_has_vlan(cqe))
> +		min_hdr_len += VLAN_HLEN;

Can't Q-in-Q be here?

> +
> +	if (l3_type == CQE_L3_HDR_TYPE_IPV4)
> +		min_hdr_len += sizeof(struct iphdr);
> +	else if (l3_type == CQE_L3_HDR_TYPE_IPV6)
> +		min_hdr_len += sizeof(struct ipv6hdr);

You don't account extensions and stuff here.

> +
> +	if (l4_type == CQE_L4_HDR_TYPE_UDP)
> +		min_hdr_len += sizeof(struct udphdr);
> +	else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> +			    CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> +			    CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA))
> +		/* Previous condition works because we know that
> +		 * l4_type != 0x2 (CQE_L4_HDR_TYPE_UDP)
> +		 */
> +		min_hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> +
> +	return min_hdr_len;
> +}
> +
>  static struct sk_buff *
>  mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>  				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
>  				   u32 page_idx)
>  {
>  	struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> -	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
>  	struct mlx5e_frag_page *head_page = frag_page;
>  	struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
>  	u32 frag_offset    = head_offset;
> @@ -2004,10 +2031,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  	u32 linear_frame_sz;
>  	u16 linear_data_len;
>  	u16 linear_hr;
> +	u16 headlen;
>  	void *va;
>  
>  	prog = rcu_dereference(rq->xdp_prog);
>  
> +	headlen = min3(mlx5e_cqe_get_min_hdr_len(cqe), cqe_bcnt,
> +		       (u16)MLX5E_RX_MAX_HEAD);

For your usecase, have you tried setting headlen to just ETH_HLEN here?
Fast GRO should still work for this case, then VLAN/IP/L4 layers will
do a couple memcpy()s to pull their headers, but even on 32-bit MIPS
this was faster than let's say eth_get_headlen() (which involves Flow
Dissector) or open-coded header length assumptions as above.

(the above was correct for 2020 when I last time played with router
 drivers, but I hope nothing's been broken since then)
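
I.e. something along these lines (untested sketch, reusing the
identifiers from the hunk above; `hdr' is just an illustrative name for
the start of the received frame):

	void *hdr = netmem_address(frag_page->netmem) + head_offset;

	/* 1) copy only the Ethernet header, let the stack pull the rest */
	headlen = min_t(u16, ETH_HLEN, cqe_bcnt);

	/* 2) or: precise header length via the Flow Dissector
	 * (eth_get_headlen() walks the actual headers, at extra cost)
	 */
	headlen = min_t(u16, eth_get_headlen(rq->netdev, hdr, MLX5E_RX_MAX_HEAD),
			cqe_bcnt);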

> +
>  	if (prog) {
>  		/* area for bpf_xdp_[store|load]_bytes */
>  		net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);

Thanks,
Olek


* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-14  7:29   ` Tariq Toukan
@ 2025-07-14 21:41     ` Christoph Paasch
  2025-07-21 21:44     ` Christoph Paasch
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Paasch @ 2025-07-14 21:41 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-rdma, netdev

On Mon, Jul 14, 2025 at 12:29 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 14/07/2025 2:33, Christoph Paasch via B4 Relay wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > include part of the payload.
> >
> > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > (and skb->head_frag is not set), we end up aggregating packets in the
> > frag_list.
> >
> > This is of course not good when we are CPU-limited. Also causes a worse
> > skb->len/truesize ratio,...
> >
> > So, let's avoid copying parts of the payload to the linear part. The
> > goal here is to err on the side of caution and prefer to copy too little
> > instead of copying too much (because once it has been copied over, we
> > trigger the above described behavior in skb_gro_receive).
> >
> > So, we can do a rough estimate of the header-space by looking at
> > cqe_l3/l4_hdr_type and kind of do a lower-bound estimate. This is now
> > done in mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
> > are present, as that's the most common use-case.
> >
> > That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> > the headlen (which defines what is being copied over). We still
> > allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> > needs to call pskb_may_pull() later on, we don't need to reallocate
> > memory.
> >
> > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > LRO enabled):
> >
> > BEFORE:
> > =======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.01    32547.82
> >
> > (netserver pinned to adjacent core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.00    52531.67
> >
> > AFTER:
> > ======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.00    52896.06
> >
> > (netserver pinned to adjacent core receiving interrupts)
> >   $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.00    85094.90
> >
>
> Nice improvement.
>
> Did you test impact on other archs?

I will see if I can get my hands on x86.

> Did you test impact on non-LRO flows?
> Specifically:
> a. Large MTU, tcp stream.
> b. Large MTU, small UDP packets.

You are right, I will extend the test-matrix and report results.

What tool do you recommend for generating the UDP load?

> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 33 ++++++++++++++++++++++++-
> >   1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 2bb32082bfccdc85d26987f792eb8c1047e44dd0..2de669707623882058e3e77f82d74893e5d6fefe 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -1986,13 +1986,40 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> >       } while (data_bcnt);
> >   }
> >
> > +static u16
> > +mlx5e_cqe_get_min_hdr_len(const struct mlx5_cqe64 *cqe)
> > +{
> > +     u16 min_hdr_len = sizeof(struct ethhdr);
> > +     u8 l3_type = get_cqe_l3_hdr_type(cqe);
> > +     u8 l4_type = get_cqe_l4_hdr_type(cqe);
> > +
> > +     if (cqe_has_vlan(cqe))
> > +             min_hdr_len += VLAN_HLEN;
> > +
> > +     if (l3_type == CQE_L3_HDR_TYPE_IPV4)
> > +             min_hdr_len += sizeof(struct iphdr);
> > +     else if (l3_type == CQE_L3_HDR_TYPE_IPV6)
> > +             min_hdr_len += sizeof(struct ipv6hdr);
> > +
> > +     if (l4_type == CQE_L4_HDR_TYPE_UDP)
> > +             min_hdr_len += sizeof(struct udphdr);
> > +     else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> > +                         CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> > +                         CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA))
> > +             /* Previous condition works because we know that
> > +              * l4_type != 0x2 (CQE_L4_HDR_TYPE_UDP)
> > +              */
> > +             min_hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> > +
> > +     return min_hdr_len;
> > +}
> > +
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> >                                  struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> >                                  u32 page_idx)
>
> BTW, this function handles IPoIB as well, not only Eth.

I see - is there something in the cqe that I can key off of to know
whether this is Ethernet or not?

Alternatively, I could simply not add sizeof(struct ethhdr) to the
headlen. The primary goal is to not copy any payload. If I copy less,
it's still ok (got to benchmark this though).
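
E.g. something like this (purely illustrative and untested; assumes the
rq gets passed down to mlx5e_cqe_get_min_hdr_len() and that IPoIB
netdevs report ARPHRD_INFINIBAND):

	u16 min_hdr_len = 0;

	/* Illustrative only: don't assume an Ethernet header on IPoIB */
	if (rq->netdev->type != ARPHRD_INFINIBAND)
		min_hdr_len += sizeof(struct ethhdr);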


Thanks,
Christoph

>
> >   {
> >       struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> > -     u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> >       struct mlx5e_frag_page *head_page = frag_page;
> >       struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> >       u32 frag_offset    = head_offset;
> > @@ -2004,10 +2031,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >       u32 linear_frame_sz;
> >       u16 linear_data_len;
> >       u16 linear_hr;
> > +     u16 headlen;
> >       void *va;
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> >
> > +     headlen = min3(mlx5e_cqe_get_min_hdr_len(cqe), cqe_bcnt,
> > +                    (u16)MLX5E_RX_MAX_HEAD);
> > +
> >       if (prog) {
> >               /* area for bpf_xdp_[store|load]_bytes */
> >               net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
> >
>


* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-14 13:59   ` Gal Pressman
@ 2025-07-14 22:06     ` Christoph Paasch
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Paasch @ 2025-07-14 22:06 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-rdma, netdev

On Mon, Jul 14, 2025 at 6:59 AM Gal Pressman <gal@nvidia.com> wrote:
>
> On 14/07/2025 2:33, Christoph Paasch via B4 Relay wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > include part of the payload.
> >
> > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > (and skb->head_frag is not set), we end up aggregating packets in the
> > frag_list.
> >
> > This is of course not good when we are CPU-limited. Also causes a worse
> > skb->len/truesize ratio,...
> >
> > So, let's avoid copying parts of the payload to the linear part. The
> > goal here is to err on the side of caution and prefer to copy too little
> > instead of copying too much (because once it has been copied over, we
> > trigger the above described behavior in skb_gro_receive).
> >
> > So, we can do a rough estimate of the header-space by looking at
> > cqe_l3/l4_hdr_type and kind of do a lower-bound estimate. This is now
> > done in mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
> > are present, as that's the most common use-case.
> >
> > That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> > the headlen (which defines what is being copied over). We still
> > allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> > needs to call pskb_may_pull() later on, we don't need to reallocate
> > memory.
> >
> > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > LRO enabled):
> >
> > BEFORE:
> > =======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.01    32547.82
> >
> > (netserver pinned to adjacent core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.00    52531.67
> >
> > AFTER:
> > ======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.00    52896.06
> >
> > (netserver pinned to adjacent core receiving interrupts)
> >  $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.00    85094.90
> >
> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
>
> Cool change, thanks!
>
> This patch doesn't take encapsulated packets into account, where the
> l3/l4 indications apply for the inner packet, while you assume outer.

Yes - my goal really is to avoid copying the inner packet's payload as
that is what will "break" GRO.

Alternatively, if I can extract all the necessary info out of the cqe,
to know the real header-size, I can use that as well.

> Also, for encapsulated packets we will *always* have to pull data into
> the linear part, which might overshadow the improvement you're trying to
> achieve?

Yes, the mlx5 driver will end up copying less, but later on in the
stack we may have to take the slow path in pskb_may_pull(). I would
hope that has less of an impact (given that the malloc'ed size does not
change and thus we end up just copying bytes we would have copied
anyway).
But, let me set up some tunnelling and measure the impact.

Thanks,
Christoph


* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-14 14:20   ` Alexander Lobakin
@ 2025-07-14 22:22     ` Christoph Paasch
  2025-07-16 11:42       ` Alexander Lobakin
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Paasch @ 2025-07-14 22:22 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-rdma, netdev

On Mon, Jul 14, 2025 at 7:23 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Christoph Paasch Via B4 Relay <devnull+cpaasch.openai.com@kernel.org>
> Date: Sun, 13 Jul 2025 16:33:07 -0700
>
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > include part of the payload.
> >
> > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > (and skb->head_frag is not set), we end up aggregating packets in the
>
> How did you end up with ->head_frag not set? IIRC mlx5 uses
> napi_build_skb(), which explicitly sets ->head_frag to true.
> It should be false only for kmalloced linear parts.

This particular code-path calls napi_alloc_skb() which ends up calling
__alloc_skb() and won't set head_frag to 1.

> > frag_list.
> >
> > This is of course not good when we are CPU-limited. Also causes a worse
> > skb->len/truesize ratio,...
> >
> > So, let's avoid copying parts of the payload to the linear part. The
> > goal here is to err on the side of caution and prefer to copy too little
> > instead of copying too much (because once it has been copied over, we
> > trigger the above described behavior in skb_gro_receive).
> >
> > So, we can do a rough estimate of the header-space by looking at
> > cqe_l3/l4_hdr_type and kind of do a lower-bound estimate. This is now
> > done in mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
> > are present, as that's the most common use-case.
> >
> > That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> > the headlen (which defines what is being copied over). We still
> > allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> > needs to call pskb_may_pull() later on, we don't need to reallocate
> > memory.
> >
> > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > LRO enabled):
> >
> > BEFORE:
> > =======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.01    32547.82
> >
> > (netserver pinned to adjacent core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.00    52531.67
> >
> > AFTER:
> > ======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.00    52896.06
> >
> > (netserver pinned to adjacent core receiving interrupts)
> >  $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >  87380  16384 262144    60.00    85094.90
> >
> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 33 ++++++++++++++++++++++++-
> >  1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 2bb32082bfccdc85d26987f792eb8c1047e44dd0..2de669707623882058e3e77f82d74893e5d6fefe 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -1986,13 +1986,40 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> >       } while (data_bcnt);
> >  }
> >
> > +static u16
> > +mlx5e_cqe_get_min_hdr_len(const struct mlx5_cqe64 *cqe)
> > +{
> > +     u16 min_hdr_len = sizeof(struct ethhdr);
> > +     u8 l3_type = get_cqe_l3_hdr_type(cqe);
> > +     u8 l4_type = get_cqe_l4_hdr_type(cqe);
> > +
> > +     if (cqe_has_vlan(cqe))
> > +             min_hdr_len += VLAN_HLEN;
>
> Can't Q-in-Q be here?

Yes, see my reply below.

>
> > +
> > +     if (l3_type == CQE_L3_HDR_TYPE_IPV4)
> > +             min_hdr_len += sizeof(struct iphdr);
> > +     else if (l3_type == CQE_L3_HDR_TYPE_IPV6)
> > +             min_hdr_len += sizeof(struct ipv6hdr);
>
> You don't account extensions and stuff here.

Yes - see my reply below.

>
> > +
> > +     if (l4_type == CQE_L4_HDR_TYPE_UDP)
> > +             min_hdr_len += sizeof(struct udphdr);
> > +     else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> > +                         CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> > +                         CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA))
> > +             /* Previous condition works because we know that
> > +              * l4_type != 0x2 (CQE_L4_HDR_TYPE_UDP)
> > +              */
> > +             min_hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> > +
> > +     return min_hdr_len;
> > +}
> > +
> >  static struct sk_buff *
> >  mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> >                                  struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> >                                  u32 page_idx)
> >  {
> >       struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> > -     u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> >       struct mlx5e_frag_page *head_page = frag_page;
> >       struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> >       u32 frag_offset    = head_offset;
> > @@ -2004,10 +2031,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >       u32 linear_frame_sz;
> >       u16 linear_data_len;
> >       u16 linear_hr;
> > +     u16 headlen;
> >       void *va;
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> >
> > +     headlen = min3(mlx5e_cqe_get_min_hdr_len(cqe), cqe_bcnt,
> > +                    (u16)MLX5E_RX_MAX_HEAD);
>
> For your usecase, have you tried setting headlen to just ETH_HLEN here?
> Fast GRO should still work for this case, then VLAN/IP/L4 layers will
> do a couple memcpy()s to pull their headers, but even on 32-bit MIPS
> this was faster than let's say eth_get_headlen() (which involves Flow
> Dissector) or open-coded header length assumptions as above.
>
> (the above was correct for 2020 when I last time played with router
>  drivers, but I hope nothing's been broken since then)

Yes, as you correctly point out, it is all about avoiding copying any
payload so that GRO stays fast.

I can give it a shot and just copy ETH_HLEN, and see what perf I get.
You are probably right that it won't matter much. I just thought that
since I have bits in the cqe that give me some hints on which headers
are present, I can be slightly more efficient.

Thanks,
Christoph

>
> > +
> >       if (prog) {
> >               /* area for bpf_xdp_[store|load]_bytes */
> >               net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
>
> Thanks,
> Olek


* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-14 22:22     ` Christoph Paasch
@ 2025-07-16 11:42       ` Alexander Lobakin
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Lobakin @ 2025-07-16 11:42 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-rdma, netdev

From: Christoph Paasch <cpaasch@openai.com>
Date: Mon, 14 Jul 2025 15:22:34 -0700

> On Mon, Jul 14, 2025 at 7:23 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Christoph Paasch Via B4 Relay <devnull+cpaasch.openai.com@kernel.org>
>> Date: Sun, 13 Jul 2025 16:33:07 -0700
>>
>>> From: Christoph Paasch <cpaasch@openai.com>
>>>
>>> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
>>> bytes from the page-pool to the skb's linear part. Those 256 bytes
>>> include part of the payload.
>>>
>>> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
>>> (and skb->head_frag is not set), we end up aggregating packets in the
>>
>> How did you end up with ->head_frag not set? IIRC mlx5 uses
>> napi_build_skb(), which explicitly sets ->head_frag to true.
>> It should be false only for kmalloced linear parts.
> 
> This particular code-path calls napi_alloc_skb() which ends up calling
> __alloc_skb() and won't set head_frag to 1.

Hmmm. I haven't looked deep into mlx5 HW GRO internals, but
napi_alloc_skb() falls back to __alloc_skb() only in certain cases; in
most common cases, it should go "the Eric route" with allocating a small
frag for its payload.

[...]

>> (the above was correct for 2020 when I last time played with router
>>  drivers, but I hope nothing's been broken since then)
> 
> Yes, as you correctly point out, it is all about avoiding to copy any
> payload to have fast GRO.
> 
> I can give it a shot of just copying eth_hlen. And see what perf I
> get. You are probably right that it won't matter much. I just thought
> that as I have the bits in the cqe that give me some hints on what
> headers are present, I can just be slightly more efficient.

Yeah it just depends on the results. On some setups and workloads, just
copying ETH_HLEN might perform better than trying to calculate the
precise payload offset (but not always).
If you really want precise numbers, then eth_get_headlen() would do that
for you, but it introduces overhead related to the Flow Dissector, so
again, only a test comparison will tell.

> 
> Thanks,
> Christoph
> 
>>
>>> +
>>>       if (prog) {
>>>               /* area for bpf_xdp_[store|load]_bytes */
>>>               net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
>>
>> Thanks,
>> Olek

Thanks,
Olek


* Re: [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part
  2025-07-14  7:29   ` Tariq Toukan
  2025-07-14 21:41     ` Christoph Paasch
@ 2025-07-21 21:44     ` Christoph Paasch
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Paasch @ 2025-07-21 21:44 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-rdma, netdev

Hello!

On Mon, Jul 14, 2025 at 12:29 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 14/07/2025 2:33, Christoph Paasch via B4 Relay wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > include part of the payload.
> >
> > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > (and skb->head_frag is not set), we end up aggregating packets in the
> > frag_list.
> >
> > This is of course not good when we are CPU-limited. Also causes a worse
> > skb->len/truesize ratio,...
> >
> > So, let's avoid copying parts of the payload to the linear part. The
> > goal here is to err on the side of caution and prefer to copy too little
> > instead of copying too much (because once it has been copied over, we
> > trigger the above described behavior in skb_gro_receive).
> >
> > So, we can do a rough estimate of the header-space by looking at
> > cqe_l3/l4_hdr_type and kind of do a lower-bound estimate. This is now
> > done in mlx5e_cqe_get_min_hdr_len(). We always assume that TCP timestamps
> > are present, as that's the most common use-case.
> >
> > That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> > the headlen (which defines what is being copied over). We still
> > allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> > needs to call pskb_may_pull() later on, we don't need to reallocate
> > memory.
> >
> > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > LRO enabled):
> >
> > BEFORE:
> > =======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.01    32547.82
> >
> > (netserver pinned to adjacent core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.00    52531.67
> >
> > AFTER:
> > ======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.00    52896.06
> >
> > (netserver pinned to adjacent core receiving interrupts)
> >   $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> >   87380  16384 262144    60.00    85094.90
> >
>
> Nice improvement.
>
> Did you test impact on other archs?
>
> Did you test impact on non-LRO flows?
> Specifically:
> a. Large MTU, tcp stream.
> b. Large MTU, small UDP packets.

took a minute, but I have extended my benchmarks to a much larger test matrix:

With / Without LRO
With / Without IPv6 encap
MTU: 1500, 4096, 9000
IRQs on same core as the app / IRQs on adjacent core as the app
TCP with write/read-size 64KB and 512KB
UDP with 64B and 1400B

A full matrix across all of the above for a total of 96 tests.

No measurable significant regressions (10% threshold).

Numerous improvements (above 10% threshold) in the TCP workloads:

  TCP 512-Kbyte, core 8, MTU 1500, LRO on, tunnel off     49810.51 -> 61924.39  ( +24.3% ↑)
  TCP 512-Kbyte, core 8, MTU 1500, LRO on, tunnel on      24897.29 -> 42404.18  ( +70.3% ↑)
  TCP 512-Kbyte, core 8, MTU 4096, LRO off, tunnel on     35218.00 -> 41608.82  ( +18.1% ↑)
  TCP 512-Kbyte, core 8, MTU 4096, LRO on, tunnel on      25056.58 -> 42231.90  ( +68.5% ↑)
  TCP 512-Kbyte, core 8, MTU 9000, LRO off, tunnel off    38688.81 -> 50152.49  ( +29.6% ↑)
  TCP 512-Kbyte, core 8, MTU 9000, LRO off, tunnel on     23067.36 -> 42593.14  ( +84.6% ↑)
  TCP 512-Kbyte, core 8, MTU 9000, LRO on, tunnel on      24671.25 -> 41276.60  ( +67.3% ↑)
  TCP 512-Kbyte, core 9, MTU 1500, LRO on, tunnel on      25078.41 -> 42473.55  ( +69.4% ↑)
  TCP 512-Kbyte, core 9, MTU 4096, LRO off, tunnel off    36962.68 -> 40727.38  ( +10.2% ↑)
  TCP 512-Kbyte, core 9, MTU 4096, LRO on, tunnel on      24890.12 -> 42248.13  ( +69.7% ↑)
  TCP 512-Kbyte, core 9, MTU 9000, LRO off, tunnel off    45620.36 -> 58454.83  ( +28.1% ↑)
  TCP 512-Kbyte, core 9, MTU 9000, LRO off, tunnel on     23006.81 -> 42985.67  ( +86.8% ↑)
  TCP 512-Kbyte, core 9, MTU 9000, LRO on, tunnel on      24539.75 -> 42295.60  ( +72.4% ↑)
  TCP 64-Kbyte, core 8, MTU 1500, LRO on, tunnel off      38187.87 -> 45568.38  ( +19.3% ↑)
  TCP 64-Kbyte, core 8, MTU 1500, LRO on, tunnel on       22683.89 -> 43351.23  ( +91.1% ↑)
  TCP 64-Kbyte, core 8, MTU 4096, LRO on, tunnel on       23653.41 -> 43988.30  ( +86.0% ↑)
  TCP 64-Kbyte, core 8, MTU 9000, LRO off, tunnel off     37677.10 -> 48778.02  ( +29.5% ↑)
  TCP 64-Kbyte, core 8, MTU 9000, LRO off, tunnel on      23960.71 -> 41828.04  ( +74.6% ↑)
  TCP 64-Kbyte, core 8, MTU 9000, LRO on, tunnel off      57001.62 -> 68577.28  ( +20.3% ↑)
  TCP 64-Kbyte, core 8, MTU 9000, LRO on, tunnel on       24068.93 -> 43836.63  ( +82.1% ↑)
  TCP 64-Kbyte, core 9, MTU 1500, LRO on, tunnel off      60887.66 -> 68647.38  ( +12.7% ↑)
  TCP 64-Kbyte, core 9, MTU 1500, LRO on, tunnel on       22463.53 -> 34560.19  ( +53.9% ↑)
  TCP 64-Kbyte, core 9, MTU 4096, LRO on, tunnel on       23253.21 -> 43358.30  ( +86.5% ↑)
  TCP 64-Kbyte, core 9, MTU 9000, LRO off, tunnel off     40471.13 -> 55189.89  ( +36.4% ↑)
  TCP 64-Kbyte, core 9, MTU 9000, LRO off, tunnel on      23880.19 -> 42457.94  ( +77.8% ↑)
  TCP 64-Kbyte, core 9, MTU 9000, LRO on, tunnel on       22040.72 -> 30249.36  ( +37.2% ↑)

(and I learned that even when LRO is off,
mlx5e_skb_from_cqe_mpwrq_nonlinear() is being used when MTU is large,
which is why we see improvements above even when LRO is off)

(I will include the additional benchmark data in a resubmission)

The primary remaining question is how to handle the IB case. If
get_cqe_l3_hdr_type() is 0x0 in the IB case, I can key off of that.

Thoughts ?


Thanks,
Christoph



>
>
> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 33 ++++++++++++++++++++++++-
> >   1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 2bb32082bfccdc85d26987f792eb8c1047e44dd0..2de669707623882058e3e77f82d74893e5d6fefe 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -1986,13 +1986,40 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> >       } while (data_bcnt);
> >   }
> >
> > +static u16
> > +mlx5e_cqe_get_min_hdr_len(const struct mlx5_cqe64 *cqe)
> > +{
> > +     u16 min_hdr_len = sizeof(struct ethhdr);
> > +     u8 l3_type = get_cqe_l3_hdr_type(cqe);
> > +     u8 l4_type = get_cqe_l4_hdr_type(cqe);
> > +
> > +     if (cqe_has_vlan(cqe))
> > +             min_hdr_len += VLAN_HLEN;
> > +
> > +     if (l3_type == CQE_L3_HDR_TYPE_IPV4)
> > +             min_hdr_len += sizeof(struct iphdr);
> > +     else if (l3_type == CQE_L3_HDR_TYPE_IPV6)
> > +             min_hdr_len += sizeof(struct ipv6hdr);
> > +
> > +     if (l4_type == CQE_L4_HDR_TYPE_UDP)
> > +             min_hdr_len += sizeof(struct udphdr);
> > +     else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> > +                         CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> > +                         CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA))
> > +             /* Previous condition works because we know that
> > +              * l4_type != 0x2 (CQE_L4_HDR_TYPE_UDP)
> > +              */
> > +             min_hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> > +
> > +     return min_hdr_len;
> > +}
> > +
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> >                                  struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> >                                  u32 page_idx)
>
> BTW, this function handles IPoIB as well, not only Eth.
>
> >   {
> >       struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> > -     u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> >       struct mlx5e_frag_page *head_page = frag_page;
> >       struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> >       u32 frag_offset    = head_offset;
> > @@ -2004,10 +2031,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >       u32 linear_frame_sz;
> >       u16 linear_data_len;
> >       u16 linear_hr;
> > +     u16 headlen;
> >       void *va;
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> >
> > +     headlen = min3(mlx5e_cqe_get_min_hdr_len(cqe), cqe_bcnt,
> > +                    (u16)MLX5E_RX_MAX_HEAD);
> > +
> >       if (prog) {
> >               /* area for bpf_xdp_[store|load]_bytes */
> >               net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
> >
>


Thread overview: 11+ messages (newest: 2025-07-21 21:44 UTC)
2025-07-13 23:33 [PATCH net-next 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
2025-07-13 23:33 ` [PATCH net-next 1/2] net/mlx5: Bring back get_cqe_l3_hdr_type Christoph Paasch via B4 Relay
2025-07-13 23:33 ` [PATCH net-next 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-07-14  7:29   ` Tariq Toukan
2025-07-14 21:41     ` Christoph Paasch
2025-07-21 21:44     ` Christoph Paasch
2025-07-14 13:59   ` Gal Pressman
2025-07-14 22:06     ` Christoph Paasch
2025-07-14 14:20   ` Alexander Lobakin
2025-07-14 22:22     ` Christoph Paasch
2025-07-16 11:42       ` Alexander Lobakin
