* [PATCH net] net/mlx5: Correctly set gso_size when LRO is used
From: christoph.paasch @ 2025-07-10 18:26 UTC
  To: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Amir Vadai
  Cc: Christoph Paasch, netdev, linux-rdma

From: Christoph Paasch <cpaasch@openai.com>

The networking stack expects gso_size to be the size of the payload
alone (that is, excluding the Ethernet/IP/TCP headers). However,
cqe_bcnt is the length of the full frame, headers included, so dividing
cqe_bcnt by lro_num_seg yields a gso_size that is too large.

For example, when tracing with bpftrace higher up in the TCP stack (at
tcp_event_data_recv), we commonly see gso_size set to 1450 or 1451 even
though the payload was actually only 1448 bytes.
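
The value can be observed with a bpftrace probe along these lines (a
rough sketch, assuming BTF is available, that tcp_event_data_recv() is
not inlined, and a 64-bit kernel where skb->end is an offset from
skb->head):

    bpftrace -e 'kprobe:tcp_event_data_recv {
        $skb = (struct sk_buff *)arg1;
        $shinfo = (struct skb_shared_info *)($skb->head + $skb->end);
        printf("len=%u gso_size=%u\n", $skb->len, $shinfo->gso_size);
    }'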

So, subtract the protocol header length from cqe_bcnt and divide only
the payload by lro_num_seg to get the real gso_size.
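
To make the arithmetic concrete, take an illustrative case of 66 header
bytes (14 Ethernet + 20 IPv4 + 32 TCP including timestamps) and 44
coalesced segments of 1448 payload bytes each:

    cqe_bcnt = 66 + 44 * 1448             = 63778
    before:  DIV_ROUND_UP(63778, 44)      = 1450  (too big)
    after:   DIV_ROUND_UP(63778 - 66, 44) = 1448  (the real MSS)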

Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 20 +++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 84b1ab8233b8..e23bb80b0e0d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1154,12 +1154,14 @@ static void mlx5e_lro_update_tcp_hdr(struct mlx5_cqe64 *cqe, struct tcphdr *tcp)
 	}
 }
 
-static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
-				 u32 cqe_bcnt)
+static unsigned int mlx5e_lro_update_hdr(struct sk_buff *skb,
+					 struct mlx5_cqe64 *cqe,
+					 u32 cqe_bcnt)
 {
 	struct ethhdr	*eth = (struct ethhdr *)(skb->data);
 	struct tcphdr	*tcp;
 	int network_depth = 0;
+	unsigned int hdrlen;
 	__wsum check;
 	__be16 proto;
 	u16 tot_len;
@@ -1169,11 +1171,14 @@ static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
 
 	tot_len = cqe_bcnt - network_depth;
 	ip_p = skb->data + network_depth;
+	hdrlen = network_depth;
 
 	if (proto == htons(ETH_P_IP)) {
 		struct iphdr *ipv4 = ip_p;
 
 		tcp = ip_p + sizeof(struct iphdr);
+		hdrlen += sizeof(struct iphdr);
+
 		skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
 
 		ipv4->ttl               = cqe->lro.min_ttl;
@@ -1193,6 +1198,8 @@ static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
 		struct ipv6hdr *ipv6 = ip_p;
 
 		tcp = ip_p + sizeof(struct ipv6hdr);
+		hdrlen += sizeof(struct ipv6hdr);
+
 		skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
 
 		ipv6->hop_limit         = cqe->lro.min_ttl;
@@ -1205,6 +1212,10 @@ static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
 		tcp->check = tcp_v6_check(payload_len, &ipv6->saddr,
 					  &ipv6->daddr, check);
 	}
+
+	hdrlen += tcp->doff * 4;
+
+	return hdrlen;
 }
 
 static void *mlx5e_shampo_get_packet_hd(struct mlx5e_rq *rq, u16 header_index)
@@ -1561,8 +1572,9 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 		mlx5e_macsec_offload_handle_rx_skb(netdev, skb, cqe);
 
 	if (lro_num_seg > 1) {
-		mlx5e_lro_update_hdr(skb, cqe, cqe_bcnt);
-		skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt, lro_num_seg);
+		unsigned int hdrlen = mlx5e_lro_update_hdr(skb, cqe, cqe_bcnt);
+
+		skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt - hdrlen, lro_num_seg);
 		/* Subtract one since we already counted this as one
 		 * "regular" packet in mlx5e_complete_rx_cqe()
 		 */
-- 
2.49.0



* Re: [PATCH net] net/mlx5: Correctly set gso_size when LRO is used
From: Tariq Toukan @ 2025-07-14  6:49 UTC
  To: cpaasch, Saeed Mahameed, Tariq Toukan, Mark Bloch,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Amir Vadai
  Cc: netdev, linux-rdma



On 10/07/2025 21:26, christoph.paasch@gmail.com wrote:
> From: Christoph Paasch <cpaasch@openai.com>
> 
> [...]
> 
>   	}
> +
> +	hdrlen += tcp->doff * 4;
> +


Thanks for your patch!

Calculations seem correct.
Wouldn't it be simpler to just return the below?

(void *)tcp + tcp->doff * 4 - skb->data
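
That is, tcp already sits at skb->data + network_depth + the IP header,
so the function tail could collapse to something like (untested):

	return (void *)tcp + tcp->doff * 4 - skb->data;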

> +	return hdrlen;
>   }
> 
> [...]




* Re: [PATCH net] net/mlx5: Correctly set gso_size when LRO is used
From: Gal Pressman @ 2025-07-14  8:23 UTC
  To: cpaasch, Saeed Mahameed, Tariq Toukan, Mark Bloch,
	Leon Romanovsky, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Amir Vadai
  Cc: netdev, linux-rdma

Hi Christoph,

On 10/07/2025 21:26, christoph.paasch@gmail.com wrote:
> From: Christoph Paasch <cpaasch@openai.com>
> 
> The networking stack expects gso_size to be the size of the payload
> alone (that is, excluding the Ethernet/IP/TCP headers). However,
> cqe_bcnt is the length of the full frame, headers included, so dividing
> cqe_bcnt by lro_num_seg yields a gso_size that is too large.
> 
> For example, when tracing with bpftrace higher up in the TCP stack (at
> tcp_event_data_recv), we commonly see gso_size set to 1450 or 1451 even
> though the payload was actually only 1448 bytes.
Other than introspecting the wrong gso_size value, is there any
functional breakage that can be observed?


* Re: [PATCH net] net/mlx5: Correctly set gso_size when LRO is used
From: Christoph Paasch @ 2025-07-14 16:49 UTC
  To: Tariq Toukan
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Amir Vadai, netdev, linux-rdma

On Sun, Jul 13, 2025 at 11:49 PM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
> On 10/07/2025 21:26, christoph.paasch@gmail.com wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > [...]
> >
> >       }
> > +
> > +     hdrlen += tcp->doff * 4;
> > +
>
>
> Thanks for your patch!
>
> Calculations seem correct.
> Wouldn't it be simpler to just return the below?
>
> (void *)tcp + tcp->doff * 4 - skb->data

Absolutely! I can do that!


Christoph


* Re: [PATCH net] net/mlx5: Correctly set gso_size when LRO is used
From: Christoph Paasch @ 2025-07-14 16:54 UTC
  To: Gal Pressman
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Amir Vadai, netdev, linux-rdma

On Mon, Jul 14, 2025 at 1:24 AM Gal Pressman <gal@nvidia.com> wrote:
>
> Hi Christoph,
>
> On 10/07/2025 21:26, christoph.paasch@gmail.com wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > [...]
> >
> > For example, when tracing with bpftrace higher up in the TCP stack (at
> > tcp_event_data_recv), we commonly see gso_size set to 1450 or 1451 even
> > though the payload was actually only 1448 bytes.
> Other than introspecting the wrong gso_size value, is there any
> functional breakage that can be observed?

I wouldn't call it "functional breakage", but there are definitely
unintended consequences / lower perf:
- In tcp_measure_rcv_mss(), len will be, for example, 1450, but rcv_mss
will be 1448 (because tp->advmss is 1448). Thus, we will recompute
scaling_ratio each time an LRO packet is received.
- In tcp_gro_receive(), it will interfere with the decision whether or
not to flush, and thus potentially result in fewer GRO'ed packets.
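
To make the first point concrete, the check in tcp_measure_rcv_mss()
(net/ipv4/tcp_input.c) is shaped roughly like the below (paraphrased
and heavily trimmed here, so treat it as a sketch of the logic rather
than the exact upstream code):

	len = skb_shinfo(skb)->gso_size ? : skb->len;
	if (len >= icsk->icsk_ack.rcv_mss) {
		if (unlikely(len != icsk->icsk_ack.rcv_mss)) {
			/* ... recompute tp->scaling_ratio ... */
		}
		icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
					       tp->advmss);
	}

Since rcv_mss is capped at tp->advmss (1448) while the inflated
gso_size makes len 1450, the inner branch is taken for every LRO'ed
skb.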


Christoph


* Re: [PATCH net] net/mlx5: Correctly set gso_size when LRO is used
From: Gal Pressman @ 2025-07-15 15:31 UTC
  To: Christoph Paasch
  Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Amir Vadai, netdev, linux-rdma

On 14/07/2025 19:54, Christoph Paasch wrote:
> On Mon, Jul 14, 2025 at 1:24 AM Gal Pressman <gal@nvidia.com> wrote:
>>
>> Hi Christoph,
>>
>> On 10/07/2025 21:26, christoph.paasch@gmail.com wrote:
>>> [...]
>> Other than introspecting the wrong gso_size value, is there any
>> functional breakage that can be observed?
> 
> I wouldn't call it "functional breakage", but there are definitely
> unintended consequences / lower perf:
> - In tcp_measure_rcv_mss(), len will be, for example, 1450, but rcv_mss
> will be 1448 (because tp->advmss is 1448). Thus, we will recompute
> scaling_ratio each time an LRO packet is received.
> - In tcp_gro_receive(), it will interfere with the decision whether or
> not to flush, and thus potentially result in fewer GRO'ed packets.

Thanks!
Please put that in the commit message.
