* [PATCH net-next v3 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing
@ 2025-08-26 3:47 Christoph Paasch via B4 Relay
2025-08-26 3:47 ` [PATCH net-next v3 1/2] net/mlx5: Bring back get_cqe_l3_hdr_type Christoph Paasch via B4 Relay
2025-08-26 3:47 ` [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
0 siblings, 2 replies; 8+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-08-26 3:47 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexander Lobakin, Gal Pressman, Dragos Tatulea
Cc: linux-rdma, netdev, Christoph Paasch
When LRO is enabled on the mlx5 NIC, mlx5e_skb_from_cqe_mpwrq_nonlinear()
copies parts of the payload to the linear part of the skb.
This triggers suboptimal processing in GRO, resulting in reduced throughput.
This patch series addresses this by copying only a lower-bound estimate of
the protocol headers, trying to avoid the payload part. This results in
a significant throughput improvement (detailed results in the specific
patch).
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
Changes in v3:
- Avoid computing headlen when it is not absolutely necessary (e.g., xdp
decides to "consume" the packet) (Dragos Tatulea <dtatulea@nvidia.com> & Jakub Kicinski <kuba@kernel.org>)
- Given the above change, consolidate the check for min3(...) in the new
function to avoid code duplication.
- Make sure local variables are in reverse xmas-tree order.
- Refine comment about why the check for l4_type works as is.
- Link to v2: https://lore.kernel.org/r/20250816-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v2-0-b11b30bc2d10@openai.com
Changes in v2:
- Refine commit-message with more info and testing data
- Make mlx5e_cqe_get_min_hdr_len() return MLX5E_RX_MAX_HEAD when l3_type
is neither IPv4 nor IPv6. Same for the l4_type. That way behavior is
unchanged for other traffic types.
- Rename mlx5e_cqe_get_min_hdr_len to mlx5e_cqe_estimate_hdr_len
- Link to v1: https://lore.kernel.org/r/20250713-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-v1-0-ecaed8c2844e@openai.com
---
Christoph Paasch (2):
net/mlx5: Bring back get_cqe_l3_hdr_type
net/mlx5: Avoid copying payload to the skb's linear part
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
include/linux/mlx5/device.h | 12 +++++-
2 files changed, 59 insertions(+), 2 deletions(-)
---
base-commit: 6e8e6baf16ce7d2310959ae81d0194a56874e0d2
change-id: 20250712-cpaasch-pf-927-netmlx5-avoid-copying-the-payload-to-the-malloced-area-6524917455a6
Best regards,
--
Christoph Paasch <cpaasch@openai.com>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH net-next v3 1/2] net/mlx5: Bring back get_cqe_l3_hdr_type
2025-08-26 3:47 [PATCH net-next v3 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
@ 2025-08-26 3:47 ` Christoph Paasch via B4 Relay
2025-08-26 3:47 ` [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
1 sibling, 0 replies; 8+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-08-26 3:47 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexander Lobakin, Gal Pressman, Dragos Tatulea
Cc: linux-rdma, netdev, Christoph Paasch
From: Christoph Paasch <cpaasch@openai.com>
Commit 66af4fe37119 ("net/mlx5: Remove unused functions") removed
get_cqe_l3_hdr_type. Let's bring it back.
Also, define CQE_L3_HDR_TYPE_* to identify IPv6 and IPv4 packets.
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
include/linux/mlx5/device.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 9d2467f982ad4697f0b36f6975b820c3a41fc78a..5e4a03cff0f1d9b11c5f562c23dbf85c3302f681 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -927,11 +927,16 @@ static inline u8 get_cqe_lro_tcppsh(struct mlx5_cqe64 *cqe)
return (cqe->lro.tcppsh_abort_dupack >> 6) & 1;
}
-static inline u8 get_cqe_l4_hdr_type(struct mlx5_cqe64 *cqe)
+static inline u8 get_cqe_l4_hdr_type(const struct mlx5_cqe64 *cqe)
{
return (cqe->l4_l3_hdr_type >> 4) & 0x7;
}
+static inline u8 get_cqe_l3_hdr_type(const struct mlx5_cqe64 *cqe)
+{
+ return (cqe->l4_l3_hdr_type >> 2) & 0x3;
+}
+
static inline bool cqe_is_tunneled(struct mlx5_cqe64 *cqe)
{
return cqe->tls_outer_l3_tunneled & 0x1;
@@ -1012,6 +1017,11 @@ enum {
CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA = 0x4,
};
+enum {
+ CQE_L3_HDR_TYPE_IPV6 = 0x1,
+ CQE_L3_HDR_TYPE_IPV4 = 0x2,
+};
+
enum {
CQE_RSS_HTYPE_IP = GENMASK(3, 2),
/* cqe->rss_hash_type[3:2] - IP destination selected for hash
--
2.50.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-26 3:47 [PATCH net-next v3 0/2] net/mlx5: Avoid payload in skb's linear part for better GRO-processing Christoph Paasch via B4 Relay
2025-08-26 3:47 ` [PATCH net-next v3 1/2] net/mlx5: Bring back get_cqe_l3_hdr_type Christoph Paasch via B4 Relay
@ 2025-08-26 3:47 ` Christoph Paasch via B4 Relay
2025-08-26 6:38 ` Eric Dumazet
2025-08-27 6:58 ` Dragos Tatulea
1 sibling, 2 replies; 8+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-08-26 3:47 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Alexander Lobakin, Gal Pressman, Dragos Tatulea
Cc: linux-rdma, netdev, Christoph Paasch
From: Christoph Paasch <cpaasch@openai.com>
mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
bytes from the page-pool to the skb's linear part. Those 256 bytes
include part of the payload.
When attempting to do GRO in skb_gro_receive, if headlen > data_offset
(and skb->head_frag is not set), we end up aggregating packets in the
frag_list.
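The relevant decision in skb_gro_receive() looks roughly like this
(heavily condensed sketch, branch bodies elided; not the exact upstream
code):

	/* p: GRO head, skb: newly received packet,
	 * offset: length of the headers GRO has already parsed.
	 */
	if (skb_headlen(skb) <= offset) {
		/* All payload sits in page frags: merge skb's frags
		 * directly into p's frag array (fast path).
		 */
	} else if (skb->head_frag) {
		/* The linear part is itself a page fragment and can
		 * still be attached to p as a frag.
		 */
	} else {
		/* Payload was copied into the kmalloc'ed linear area:
		 * fall back to chaining skb onto p's frag_list.
		 */
	}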
This is of course not good when we are CPU-limited. Also causes a worse
skb->len/truesize ratio,...
So, let's avoid copying parts of the payload to the linear part. The
goal here is to err on the side of caution and prefer to copy too little
instead of copying too much (because once it has been copied over, we
trigger the above described behavior in skb_gro_receive).
So, we can do a rough estimate of the header-space by looking at
cqe_l3/l4_hdr_type. This is now done in mlx5e_cqe_estimate_hdr_len().
We always assume that TCP timestamps are present, as that's the most common
use-case.
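As a rough illustration, for a plain IPv4/TCP packet with timestamps this
estimate works out to 14 (Ethernet) + 20 (IPv4) + 20 (TCP) + 12 (aligned
timestamp option) = 66 bytes, well below MLX5E_RX_MAX_HEAD (256).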
That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
the headlen (which defines what is being copied over). We still
allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
needs to call pskb_may_pull() later on, we don't need to reallocate
memory.
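So a later, deeper pull stays cheap. Roughly (illustrative only, 'needed'
is a placeholder for whatever a protocol handler requires):

	/* Copies the needed bytes from the frags into the already
	 * allocated linear area; no reallocation of skb->head as long
	 * as 'needed' <= MLX5E_RX_MAX_HEAD.
	 */
	if (!pskb_may_pull(skb, needed))
		goto drop;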
This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
LRO enabled):
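(netperf -P 0 suppresses the banner; the columns below are recv socket
size, send socket size, send message size, elapsed time in seconds and
throughput in 10^6 bits/s)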
BEFORE:
=======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144 60.01 32547.82
(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144 60.00 52531.67
AFTER:
======
(netserver pinned to core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144 60.00 52896.06
(netserver pinned to adjacent core receiving interrupts)
$ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
87380 16384 262144 60.00 85094.90
Additional tests across a larger range of parameters w/ and w/o LRO, w/
and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
better performance with this patch.
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..050f3efca34f3b8984c30f335ee43f487fef33ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1991,13 +1991,54 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
} while (data_bcnt);
}
+static u16
+mlx5e_cqe_estimate_hdr_len(const struct mlx5_cqe64 *cqe, u16 cqe_bcnt)
+{
+ u8 l3_type, l4_type;
+ u16 hdr_len;
+
+ hdr_len = sizeof(struct ethhdr);
+
+ if (cqe_has_vlan(cqe))
+ hdr_len += VLAN_HLEN;
+
+ l3_type = get_cqe_l3_hdr_type(cqe);
+ if (l3_type == CQE_L3_HDR_TYPE_IPV4) {
+ hdr_len += sizeof(struct iphdr);
+ } else if (l3_type == CQE_L3_HDR_TYPE_IPV6) {
+ hdr_len += sizeof(struct ipv6hdr);
+ } else {
+ hdr_len = MLX5E_RX_MAX_HEAD;
+ goto out;
+ }
+
+ l4_type = get_cqe_l4_hdr_type(cqe);
+ if (l4_type == CQE_L4_HDR_TYPE_UDP) {
+ hdr_len += sizeof(struct udphdr);
+ } else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
+ CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
+ CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
+ /* ACK_NO_ACK | ACK_NO_DATA | ACK_AND_DATA == 0x7, but
+ * the previous condition checks for _UDP which is 0x2.
+ *
+ * As we know that l4_type != 0x2, we can simply check
+ * if any of the bits of 0x7 is set.
+ */
+ hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
+ } else {
+ hdr_len = MLX5E_RX_MAX_HEAD;
+ }
+
+out:
+ return min3(hdr_len, cqe_bcnt, MLX5E_RX_MAX_HEAD);
+}
+
static struct sk_buff *
mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
u32 page_idx)
{
struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
- u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
struct mlx5e_frag_page *head_page = frag_page;
struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
u32 frag_offset = head_offset;
@@ -2009,6 +2050,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
u32 linear_frame_sz;
u16 linear_data_len;
u16 linear_hr;
+ u16 headlen;
void *va;
prog = rcu_dereference(rq->xdp_prog);
@@ -2039,6 +2081,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
net_prefetchw(va); /* xdp_frame data area */
net_prefetchw(skb->data);
+ headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
+
frag_offset += headlen;
byte_cnt -= headlen;
linear_hr = skb_headroom(skb);
@@ -2115,6 +2159,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
pagep->frags++;
while (++pagep < frag_page);
}
+
+ headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
+
__pskb_pull_tail(skb, headlen);
} else {
dma_addr_t addr;
--
2.50.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-26 3:47 ` [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
@ 2025-08-26 6:38 ` Eric Dumazet
2025-08-26 20:31 ` Christoph Paasch
2025-08-27 6:58 ` Dragos Tatulea
1 sibling, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-08-26 6:38 UTC (permalink / raw)
To: cpaasch
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
Alexander Lobakin, Gal Pressman, Dragos Tatulea, linux-rdma,
netdev
On Mon, Aug 25, 2025 at 8:47 PM Christoph Paasch via B4 Relay
<devnull+cpaasch.openai.com@kernel.org> wrote:
>
> From: Christoph Paasch <cpaasch@openai.com>
>
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
>
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
>
> This is of course not good when we are CPU-limited. Also causes a worse
> skb->len/truesize ratio,...
>
> So, let's avoid copying parts of the payload to the linear part. The
> goal here is to err on the side of caution and prefer to copy too little
> instead of copying too much (because once it has been copied over, we
> trigger the above described behavior in skb_gro_receive).
>
> So, we can do a rough estimate of the header-space by looking at
> cqe_l3/l4_hdr_type. This is now done in mlx5e_cqe_estimate_hdr_len().
> We always assume that TCP timestamps are present, as that's the most common
> use-case.
>
> That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> the headlen (which defines what is being copied over). We still
> allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
>
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
>
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.01 32547.82
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52531.67
>
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52896.06
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 85094.90
>
> Additional tests across a larger range of parameters w/ and w/o LRO, w/
> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> better performance with this patch.
>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
> 1 file changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..050f3efca34f3b8984c30f335ee43f487fef33ac 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1991,13 +1991,54 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> } while (data_bcnt);
> }
>
> +static u16
> +mlx5e_cqe_estimate_hdr_len(const struct mlx5_cqe64 *cqe, u16 cqe_bcnt)
> +{
> + u8 l3_type, l4_type;
> + u16 hdr_len;
> +
> + hdr_len = sizeof(struct ethhdr);
> +
> + if (cqe_has_vlan(cqe))
> + hdr_len += VLAN_HLEN;
> +
> + l3_type = get_cqe_l3_hdr_type(cqe);
> + if (l3_type == CQE_L3_HDR_TYPE_IPV4) {
> + hdr_len += sizeof(struct iphdr);
> + } else if (l3_type == CQE_L3_HDR_TYPE_IPV6) {
> + hdr_len += sizeof(struct ipv6hdr);
> + } else {
> + hdr_len = MLX5E_RX_MAX_HEAD;
> + goto out;
> + }
> +
> + l4_type = get_cqe_l4_hdr_type(cqe);
> + if (l4_type == CQE_L4_HDR_TYPE_UDP) {
> + hdr_len += sizeof(struct udphdr);
> + } else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> + CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> + CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
> + /* ACK_NO_ACK | ACK_NO_DATA | ACK_AND_DATA == 0x7, but
> + * the previous condition checks for _UDP which is 0x2.
> + *
> + * As we know that l4_type != 0x2, we can simply check
> + * if any of the bits of 0x7 is set.
> + */
> + hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> + } else {
> + hdr_len = MLX5E_RX_MAX_HEAD;
> + }
> +
> +out:
> + return min3(hdr_len, cqe_bcnt, MLX5E_RX_MAX_HEAD);
> +}
> +
Hi Christoph
I wonder if you have tried to use eth_get_headlen() instead of yet
another dissector ?
I doubt you will see a performance difference.
commit cfecec56ae7c7c40f23fbdac04acee027ca3bd66
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Sep 5 18:29:45 2014 -0700
mlx4: only pull headers into skb head
Use the new fancy eth_get_headlen() to pull exactly the headers
into skb->head.
This speeds up GRE traffic (or more generally tunneled traffuc),
as GRO can aggregate up to 17 MSS per GRO packet instead of 8.
(Pulling too much data was forcing GRO to keep 2 frags per MSS)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
> static struct sk_buff *
> mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> u32 page_idx)
> {
> struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> - u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> struct mlx5e_frag_page *head_page = frag_page;
> struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> u32 frag_offset = head_offset;
> @@ -2009,6 +2050,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> u32 linear_frame_sz;
> u16 linear_data_len;
> u16 linear_hr;
> + u16 headlen;
> void *va;
>
> prog = rcu_dereference(rq->xdp_prog);
> @@ -2039,6 +2081,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> net_prefetchw(va); /* xdp_frame data area */
> net_prefetchw(skb->data);
>
> + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> +
> frag_offset += headlen;
> byte_cnt -= headlen;
> linear_hr = skb_headroom(skb);
> @@ -2115,6 +2159,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> pagep->frags++;
> while (++pagep < frag_page);
> }
> +
> + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> +
> __pskb_pull_tail(skb, headlen);
> } else {
> dma_addr_t addr;
>
> --
> 2.50.1
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-26 6:38 ` Eric Dumazet
@ 2025-08-26 20:31 ` Christoph Paasch
2025-08-27 7:08 ` Dragos Tatulea
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Paasch @ 2025-08-26 20:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
Alexander Lobakin, Gal Pressman, Dragos Tatulea, linux-rdma,
netdev
On Mon, Aug 25, 2025 at 11:38 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Aug 25, 2025 at 8:47 PM Christoph Paasch via B4 Relay
> <devnull+cpaasch.openai.com@kernel.org> wrote:
> >
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > include part of the payload.
> >
> > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > (and skb->head_frag is not set), we end up aggregating packets in the
> > frag_list.
> >
> > This is of course not good when we are CPU-limited. Also causes a worse
> > skb->len/truesize ratio,...
> >
> > So, let's avoid copying parts of the payload to the linear part. The
> > goal here is to err on the side of caution and prefer to copy too little
> > instead of copying too much (because once it has been copied over, we
> > trigger the above described behavior in skb_gro_receive).
> >
> > So, we can do a rough estimate of the header-space by looking at
> > cqe_l3/l4_hdr_type. This is now done in mlx5e_cqe_estimate_hdr_len().
> > We always assume that TCP timestamps are present, as that's the most common
> > use-case.
> >
> > That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> > the headlen (which defines what is being copied over). We still
> > allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> > needs to call pskb_may_pull() later on, we don't need to reallocate
> > memory.
> >
> > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > LRO enabled):
> >
> > BEFORE:
> > =======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > 87380 16384 262144 60.01 32547.82
> >
> > (netserver pinned to adjacent core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > 87380 16384 262144 60.00 52531.67
> >
> > AFTER:
> > ======
> > (netserver pinned to core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > 87380 16384 262144 60.00 52896.06
> >
> > (netserver pinned to adjacent core receiving interrupts)
> > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > 87380 16384 262144 60.00 85094.90
> >
> > Additional tests across a larger range of parameters w/ and w/o LRO, w/
> > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> > better performance with this patch.
> >
> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > ---
> > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
> > 1 file changed, 48 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..050f3efca34f3b8984c30f335ee43f487fef33ac 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -1991,13 +1991,54 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> > } while (data_bcnt);
> > }
> >
> > +static u16
> > +mlx5e_cqe_estimate_hdr_len(const struct mlx5_cqe64 *cqe, u16 cqe_bcnt)
> > +{
> > + u8 l3_type, l4_type;
> > + u16 hdr_len;
> > +
> > + hdr_len = sizeof(struct ethhdr);
> > +
> > + if (cqe_has_vlan(cqe))
> > + hdr_len += VLAN_HLEN;
> > +
> > + l3_type = get_cqe_l3_hdr_type(cqe);
> > + if (l3_type == CQE_L3_HDR_TYPE_IPV4) {
> > + hdr_len += sizeof(struct iphdr);
> > + } else if (l3_type == CQE_L3_HDR_TYPE_IPV6) {
> > + hdr_len += sizeof(struct ipv6hdr);
> > + } else {
> > + hdr_len = MLX5E_RX_MAX_HEAD;
> > + goto out;
> > + }
> > +
> > + l4_type = get_cqe_l4_hdr_type(cqe);
> > + if (l4_type == CQE_L4_HDR_TYPE_UDP) {
> > + hdr_len += sizeof(struct udphdr);
> > + } else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> > + CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> > + CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
> > + /* ACK_NO_ACK | ACK_NO_DATA | ACK_AND_DATA == 0x7, but
> > + * the previous condition checks for _UDP which is 0x2.
> > + *
> > + * As we know that l4_type != 0x2, we can simply check
> > + * if any of the bits of 0x7 is set.
> > + */
> > + hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> > + } else {
> > + hdr_len = MLX5E_RX_MAX_HEAD;
> > + }
> > +
> > +out:
> > + return min3(hdr_len, cqe_bcnt, MLX5E_RX_MAX_HEAD);
> > +}
> > +
>
> Hi Christoph
>
> I wonder if you have tried to use eth_get_headlen() instead of yet
> another dissector ?
I just tried eth_get_headlen() out - and indeed, no measurable perf difference.
I will submit a new version.
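Roughly, the idea is to replace the CQE-based estimate with the generic
helper, something like (sketch only; 'hdr_va' is a placeholder for
wherever the first bytes of the frame are mapped, exact plumbing may
differ in the next version):

	/* Let the flow dissector compute how many bytes of actual
	 * headers to copy into the linear part, capped as before.
	 */
	headlen = eth_get_headlen(rq->netdev, hdr_va,
				  min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt));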
Christoph
>
> I doubt you will see a performance difference.
>
> commit cfecec56ae7c7c40f23fbdac04acee027ca3bd66
> Author: Eric Dumazet <edumazet@google.com>
> Date: Fri Sep 5 18:29:45 2014 -0700
>
> mlx4: only pull headers into skb head
>
> Use the new fancy eth_get_headlen() to pull exactly the headers
> into skb->head.
>
> This speeds up GRE traffic (or more generally tunneled traffuc),
> as GRO can aggregate up to 17 MSS per GRO packet instead of 8.
>
> (Pulling too much data was forcing GRO to keep 2 frags per MSS)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Amir Vadai <amirv@mellanox.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
>
> > static struct sk_buff *
> > mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> > struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> > u32 page_idx)
> > {
> > struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> > - u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> > struct mlx5e_frag_page *head_page = frag_page;
> > struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> > u32 frag_offset = head_offset;
> > @@ -2009,6 +2050,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > u32 linear_frame_sz;
> > u16 linear_data_len;
> > u16 linear_hr;
> > + u16 headlen;
> > void *va;
> >
> > prog = rcu_dereference(rq->xdp_prog);
> > @@ -2039,6 +2081,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > net_prefetchw(va); /* xdp_frame data area */
> > net_prefetchw(skb->data);
> >
> > + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> > +
> > frag_offset += headlen;
> > byte_cnt -= headlen;
> > linear_hr = skb_headroom(skb);
> > @@ -2115,6 +2159,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > pagep->frags++;
> > while (++pagep < frag_page);
> > }
> > +
> > + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> > +
> > __pskb_pull_tail(skb, headlen);
> > } else {
> > dma_addr_t addr;
> >
> > --
> > 2.50.1
> >
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-26 3:47 ` [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-08-26 6:38 ` Eric Dumazet
@ 2025-08-27 6:58 ` Dragos Tatulea
1 sibling, 0 replies; 8+ messages in thread
From: Dragos Tatulea @ 2025-08-27 6:58 UTC (permalink / raw)
To: cpaasch, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Alexander Lobakin, Gal Pressman
Cc: linux-rdma, netdev
On Mon, Aug 25, 2025 at 08:47:13PM -0700, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
>
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
>
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
>
> This is of course not good when we are CPU-limited. Also causes a worse
> skb->len/truesize ratio,...
>
> So, let's avoid copying parts of the payload to the linear part. The
> goal here is to err on the side of caution and prefer to copy too little
> instead of copying too much (because once it has been copied over, we
> trigger the above described behavior in skb_gro_receive).
>
> So, we can do a rough estimate of the header-space by looking at
> cqe_l3/l4_hdr_type. This is now done in mlx5e_cqe_estimate_hdr_len().
> We always assume that TCP timestamps are present, as that's the most common
> use-case.
>
> That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> the headlen (which defines what is being copied over). We still
> allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
>
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
>
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.01 32547.82
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52531.67
>
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52896.06
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 85094.90
>
> Additional tests across a larger range of parameters w/ and w/o LRO, w/
> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> better performance with this patch.
>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
> 1 file changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..050f3efca34f3b8984c30f335ee43f487fef33ac 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1991,13 +1991,54 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> } while (data_bcnt);
> }
>
> +static u16
> +mlx5e_cqe_estimate_hdr_len(const struct mlx5_cqe64 *cqe, u16 cqe_bcnt)
> +{
> + u8 l3_type, l4_type;
> + u16 hdr_len;
> +
> + hdr_len = sizeof(struct ethhdr);
> +
> + if (cqe_has_vlan(cqe))
> + hdr_len += VLAN_HLEN;
> +
> + l3_type = get_cqe_l3_hdr_type(cqe);
> + if (l3_type == CQE_L3_HDR_TYPE_IPV4) {
> + hdr_len += sizeof(struct iphdr);
> + } else if (l3_type == CQE_L3_HDR_TYPE_IPV6) {
> + hdr_len += sizeof(struct ipv6hdr);
> + } else {
> + hdr_len = MLX5E_RX_MAX_HEAD;
> + goto out;
> + }
> +
> + l4_type = get_cqe_l4_hdr_type(cqe);
> + if (l4_type == CQE_L4_HDR_TYPE_UDP) {
> + hdr_len += sizeof(struct udphdr);
> + } else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> + CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> + CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
> + /* ACK_NO_ACK | ACK_NO_DATA | ACK_AND_DATA == 0x7, but
> + * the previous condition checks for _UDP which is 0x2.
> + *
> + * As we know that l4_type != 0x2, we can simply check
> + * if any of the bits of 0x7 is set.
> + */
> + hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> + } else {
> + hdr_len = MLX5E_RX_MAX_HEAD;
> + }
> +
> +out:
> + return min3(hdr_len, cqe_bcnt, MLX5E_RX_MAX_HEAD);
> +}
> +
> static struct sk_buff *
> mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> u32 page_idx)
> {
> struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> - u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> struct mlx5e_frag_page *head_page = frag_page;
> struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> u32 frag_offset = head_offset;
> @@ -2009,6 +2050,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> u32 linear_frame_sz;
> u16 linear_data_len;
> u16 linear_hr;
> + u16 headlen;
> void *va;
>
> prog = rcu_dereference(rq->xdp_prog);
> @@ -2039,6 +2081,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> net_prefetchw(va); /* xdp_frame data area */
> net_prefetchw(skb->data);
>
> + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> +
> frag_offset += headlen;
> byte_cnt -= headlen;
> linear_hr = skb_headroom(skb);
> @@ -2115,6 +2159,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> pagep->frags++;
> while (++pagep < frag_page);
> }
> +
> + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> +
You moved it even further down. Nice!
Patch LGTM. But I would like to wait for Tariq's input as well. He'll be
back next week.
Thanks,
Dragos
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-26 20:31 ` Christoph Paasch
@ 2025-08-27 7:08 ` Dragos Tatulea
2025-08-27 7:50 ` Eric Dumazet
0 siblings, 1 reply; 8+ messages in thread
From: Dragos Tatulea @ 2025-08-27 7:08 UTC (permalink / raw)
To: Christoph Paasch, Eric Dumazet
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
Alexander Lobakin, Gal Pressman, linux-rdma, netdev
On Tue, Aug 26, 2025 at 01:31:44PM -0700, Christoph Paasch wrote:
> On Mon, Aug 25, 2025 at 11:38 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 8:47 PM Christoph Paasch via B4 Relay
> > <devnull+cpaasch.openai.com@kernel.org> wrote:
> > >
> > > From: Christoph Paasch <cpaasch@openai.com>
> > >
> > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > > include part of the payload.
> > >
> > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > > (and skb->head_frag is not set), we end up aggregating packets in the
> > > frag_list.
> > >
> > > This is of course not good when we are CPU-limited. Also causes a worse
> > > skb->len/truesize ratio,...
> > >
> > > So, let's avoid copying parts of the payload to the linear part. The
> > > goal here is to err on the side of caution and prefer to copy too little
> > > instead of copying too much (because once it has been copied over, we
> > > trigger the above described behavior in skb_gro_receive).
> > >
> > > So, we can do a rough estimate of the header-space by looking at
> > > cqe_l3/l4_hdr_type. This is now done in mlx5e_cqe_estimate_hdr_len().
> > > We always assume that TCP timestamps are present, as that's the most common
> > > use-case.
> > >
> > > That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> > > the headlen (which defines what is being copied over). We still
> > > allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> > > needs to call pskb_may_pull() later on, we don't need to reallocate
> > > memory.
> > >
> > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > > LRO enabled):
> > >
> > > BEFORE:
> > > =======
> > > (netserver pinned to core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.01 32547.82
> > >
> > > (netserver pinned to adjacent core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.00 52531.67
> > >
> > > AFTER:
> > > ======
> > > (netserver pinned to core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.00 52896.06
> > >
> > > (netserver pinned to adjacent core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.00 85094.90
> > >
> > > Additional tests across a larger range of parameters w/ and w/o LRO, w/
> > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> > > better performance with this patch.
> > >
> > > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > > ---
> > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
> > > 1 file changed, 48 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..050f3efca34f3b8984c30f335ee43f487fef33ac 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > @@ -1991,13 +1991,54 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> > > } while (data_bcnt);
> > > }
> > >
> > > +static u16
> > > +mlx5e_cqe_estimate_hdr_len(const struct mlx5_cqe64 *cqe, u16 cqe_bcnt)
> > > +{
> > > + u8 l3_type, l4_type;
> > > + u16 hdr_len;
> > > +
> > > + hdr_len = sizeof(struct ethhdr);
> > > +
> > > + if (cqe_has_vlan(cqe))
> > > + hdr_len += VLAN_HLEN;
> > > +
> > > + l3_type = get_cqe_l3_hdr_type(cqe);
> > > + if (l3_type == CQE_L3_HDR_TYPE_IPV4) {
> > > + hdr_len += sizeof(struct iphdr);
> > > + } else if (l3_type == CQE_L3_HDR_TYPE_IPV6) {
> > > + hdr_len += sizeof(struct ipv6hdr);
> > > + } else {
> > > + hdr_len = MLX5E_RX_MAX_HEAD;
> > > + goto out;
> > > + }
> > > +
> > > + l4_type = get_cqe_l4_hdr_type(cqe);
> > > + if (l4_type == CQE_L4_HDR_TYPE_UDP) {
> > > + hdr_len += sizeof(struct udphdr);
> > > + } else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> > > + CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> > > + CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
> > > + /* ACK_NO_ACK | ACK_NO_DATA | ACK_AND_DATA == 0x7, but
> > > + * the previous condition checks for _UDP which is 0x2.
> > > + *
> > > + * As we know that l4_type != 0x2, we can simply check
> > > + * if any of the bits of 0x7 is set.
> > > + */
> > > + hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> > > + } else {
> > > + hdr_len = MLX5E_RX_MAX_HEAD;
> > > + }
> > > +
> > > +out:
> > > + return min3(hdr_len, cqe_bcnt, MLX5E_RX_MAX_HEAD);
> > > +}
> > > +
> >
> > Hi Christoph
> >
> > I wonder if you have tried to use eth_get_headlen() instead of yet
> > another dissector ?
>
> I just tried eth_get_headlen() out - and indeed, no measurable perf difference.
>
> I will submit a new version.
>
What are the advantages of using eth_get_headlen() besides the fact that
it is more exhaustive? It seems quite expensive compared to reading some
bits in the CQE and doing a few comparisons. Even if this cost is amortized
by the benefits in the good cases, in the non-aggregation cases it seems
more costly. What am I missing here?
Thanks,
Dragos
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-27 7:08 ` Dragos Tatulea
@ 2025-08-27 7:50 ` Eric Dumazet
0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2025-08-27 7:50 UTC (permalink / raw)
To: Dragos Tatulea
Cc: Christoph Paasch, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Andrew Lunn, David S. Miller, Jakub Kicinski,
Paolo Abeni, Alexander Lobakin, Gal Pressman, linux-rdma, netdev
On Wed, Aug 27, 2025 at 12:08 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> What are the advantages of using eth_get_headlen() besides the fact that
> it is more exhaustive? It seems quite expensive compared to reading some
> bits in the CQE and doing a few comparisons. Even if this cost is amortized
> by the benefits in the good cases, in the non-aggregation cases it seems
> more costly. What am I missing here?
Let's not reinvent the wheel, and add more potential bugs. Please ?
Why spend hours and end up with some ugly/wrong things like 'TCP
headers have at least 12 bytes of options'?
} else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
+ CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
+ CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
+ /* ACK_NO_ACK | ACK_NO_DATA | ACK_AND_DATA == 0x7, but
+ * the previous condition checks for _UDP which is 0x2.
+ *
+ * As we know that l4_type != 0x2, we can simply check
+ * if any of the bits of 0x7 is set.
+ */
+ hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
Either do not pull anything and let upper stacks pull one header at a time,
or call the generic helper to make a single memcpy()
Real costs are the cache line misses; analyzing the 'CQE' will not help at all.
Really, vendors need to stop adding useless stuff to their receive
descriptors; there is no way they can cover all encapsulations in modern
networking.
I find it very annoying that Mellanox in 2025 still does the overpull
thing in mlx5, considering I fixed the mlx4 driver in 2014.
This is the major and well known issue.
^ permalink raw reply [flat|nested] 8+ messages in thread