[PATCH net] xfrm: Force software GSO only in tunnel mode

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net] xfrm: Force software GSO only in tunnel mode
@ 2025-03-17 10:32 Mark Bloch
  2025-03-17 19:30 ` Xin Long
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Mark Bloch @ 2025-03-17 10:32 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: Christian Hopps, Cosmin Ratiu, Yael Chemla, wangfe, Xin Long,
	Dragos Tatulea, Mark Bloch, netdev

From: Cosmin Ratiu <cratiu@nvidia.com>

The cited commit fixed a software GSO bug with VXLAN + IPSec in tunnel
mode. Unfortunately, it is slightly broader than necessary, as it also
severely affects performance for Geneve + IPSec transport mode over a
device capable of both HW GSO and IPSec crypto offload. In this case,
xfrm_output unnecessarily triggers software GSO instead of letting the
HW do it. In simple iperf3 tests over Geneve + IPSec transport mode over
a back-2-back pair of NICs with MTU 1500, the performance was observed
to be up to 6x worse when doing software GSO compared to leaving it to
the hardware.

This commit makes xfrm_output only trigger software GSO in crypto
offload cases for already encapsulated packets in tunnel mode, as not
doing so would then cause the inner tunnel skb->inner_networking_header
to be overwritten and break software GSO for that packet later if the
device turns out to not be capable of HW GSO.

Taking a closer look at the conditions for the original bug, to better
understand the reasons for this change:
- vxlan_build_skb -> iptunnel_handle_offloads sets inner_protocol and
  inner network header.
- then, udp_tunnel_xmit_skb -> ip_tunnel_xmit adds outer transport and
  network headers.
- later in the xmit path, xfrm_output -> xfrm_outer_mode_output ->
  xfrm4_prepare_output -> xfrm4_tunnel_encap_add overwrites the inner
  network header with the one set in ip_tunnel_xmit before adding the
  second outer header.
- __dev_queue_xmit -> validate_xmit_skb checks whether GSO segmentation
  needs to happen based on dev features. In the original bug, the hw
  couldn't segment the packets, so skb_gso_segment was invoked.
- deep in the .gso_segment callback machinery, __skb_udp_tunnel_segment
  tries to use the wrong inner network header, expecting the one set in
  iptunnel_handle_offloads but getting the one set by xfrm instead.
- a bit later, ipv6_gso_segment accesses the wrong memory based on that
  wrong inner network header.

With the new change, the original bug (or similar ones) cannot happen
again, as xfrm will now trigger software GSO before applying a tunnel.
This concern doesn't exist in packet offload mode, when the HW adds
encapsulation headers. For the non-offloaded packets (crypto in SW),
software GSO is still done unconditionally in the else branch.

Fixes: a204aef9fd77 ("xfrm: call xfrm_output_gso when inner_protocol is set in xfrm_output")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
---
 net/xfrm/xfrm_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index f7abd42c077d..42f1ca513879 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -758,7 +758,7 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
 		skb->encapsulation = 1;

 		if (skb_is_gso(skb)) {
-			if (skb->inner_protocol)
+			if (skb->inner_protocol && x->props.mode == XFRM_MODE_TUNNEL)
 				return xfrm_output_gso(net, sk, skb);

 			skb_shinfo(skb)->gso_type |= SKB_GSO_ESP;
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH net] xfrm: Force software GSO only in tunnel mode
  2025-03-17 10:32 [PATCH net] xfrm: Force software GSO only in tunnel mode Mark Bloch
@ 2025-03-17 19:30 ` Xin Long
  2025-03-20 15:24   ` Cosmin Ratiu
  2025-03-18 12:37 ` Mark Bloch
  2025-03-20 14:50 ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 6+ messages in thread
From: Xin Long @ 2025-03-17 19:30 UTC (permalink / raw)
  To: Mark Bloch
  Cc: Steffen Klassert, Herbert Xu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Christian Hopps,
	Cosmin Ratiu, Yael Chemla, wangfe, Dragos Tatulea, netdev

On Mon, Mar 17, 2025 at 6:32 AM Mark Bloch <mbloch@nvidia.com> wrote:
>
> From: Cosmin Ratiu <cratiu@nvidia.com>
>
> The cited commit fixed a software GSO bug with VXLAN + IPSec in tunnel
> mode. Unfortunately, it is slightly broader than necessary, as it also
> severely affects performance for Geneve + IPSec transport mode over a
> device capable of both HW GSO and IPSec crypto offload. In this case,
> xfrm_output unnecessarily triggers software GSO instead of letting the
> HW do it. In simple iperf3 tests over Geneve + IPSec transport mode over
> a back-2-back pair of NICs with MTU 1500, the performance was observed
> to be up to 6x worse when doing software GSO compared to leaving it to
> the hardware.
>
> This commit makes xfrm_output only trigger software GSO in crypto
> offload cases for already encapsulated packets in tunnel mode, as not
> doing so would then cause the inner tunnel skb->inner_networking_header
> to be overwritten and break software GSO for that packet later if the
> device turns out to not be capable of HW GSO.
>
For UDP tunnels, there are two types:

- ENCAP_TYPE_ETHER encaps an ether packet (e.g., VXLAN, Geneve).
- ENCAP_TYPE_IPPROTO encaps an ipproto packet (e.g., SCTP over UDP).

When performing GSO via skb_udp_tunnel_segment():

- ENCAP_TYPE_ETHER relies on inner_network_header to locate the
  network header.
- ENCAP_TYPE_IPPROTO relies on inner_transport_header to locate
  the transport header.

However, both IPsec transport and tunnel modes modify
inner_transport_header. This patch raises a concern that GSO may
not work correctly for ENCAP_TYPE_IPPROTO UDP tunnels over IPsec
in transport mode.

> Taking a closer look at the conditions for the original bug, to better
> understand the reasons for this change:
> - vxlan_build_skb -> iptunnel_handle_offloads sets inner_protocol and
>   inner network header.
> - then, udp_tunnel_xmit_skb -> ip_tunnel_xmit adds outer transport and
>   network headers.
> - later in the xmit path, xfrm_output -> xfrm_outer_mode_output ->
>   xfrm4_prepare_output -> xfrm4_tunnel_encap_add overwrites the inner
>   network header with the one set in ip_tunnel_xmit before adding the
>   second outer header.
> - __dev_queue_xmit -> validate_xmit_skb checks whether GSO segmentation
>   needs to happen based on dev features. In the original bug, the hw
>   couldn't segment the packets, so skb_gso_segment was invoked.
> - deep in the .gso_segment callback machinery, __skb_udp_tunnel_segment
>   tries to use the wrong inner network header, expecting the one set in
>   iptunnel_handle_offloads but getting the one set by xfrm instead.
> - a bit later, ipv6_gso_segment accesses the wrong memory based on that
>   wrong inner network header.
>
> With the new change, the original bug (or similar ones) cannot happen
> again, as xfrm will now trigger software GSO before applying a tunnel.
> This concern doesn't exist in packet offload mode, when the HW adds
> encapsulation headers. For the non-offloaded packets (crypto in SW),
> software GSO is still done unconditionally in the else branch.
>
> Fixes: a204aef9fd77 ("xfrm: call xfrm_output_gso when inner_protocol is set in xfrm_output")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
> Reviewed-by: Yael Chemla <ychemla@nvidia.com>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> ---
>  net/xfrm/xfrm_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> index f7abd42c077d..42f1ca513879 100644
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -758,7 +758,7 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
>                 skb->encapsulation = 1;
>
>                 if (skb_is_gso(skb)) {
> -                       if (skb->inner_protocol)
> +                       if (skb->inner_protocol && x->props.mode == XFRM_MODE_TUNNEL)
>                                 return xfrm_output_gso(net, sk, skb);
>
>                         skb_shinfo(skb)->gso_type |= SKB_GSO_ESP;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] xfrm: Force software GSO only in tunnel mode
  2025-03-17 10:32 [PATCH net] xfrm: Force software GSO only in tunnel mode Mark Bloch
  2025-03-17 19:30 ` Xin Long
@ 2025-03-18 12:37 ` Mark Bloch
  2025-03-20 14:50 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 6+ messages in thread
From: Mark Bloch @ 2025-03-18 12:37 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: Christian Hopps, Cosmin Ratiu, Yael Chemla, wangfe, Xin Long,
	Dragos Tatulea, netdev

Apologies for the confusion, this patch was already sent
and accepted by Steffen into the IPsec tree. Our testing
didn’t detect it landing in net, which led to the
mistaken resubmission with the “send it out before the
merge window is open”.

We’re fine with waiting for it to land via the IPsec PR,
so please disregard this version.

Mark

On 17/03/2025 12:32, Mark Bloch wrote:
> From: Cosmin Ratiu <cratiu@nvidia.com>
> 
> The cited commit fixed a software GSO bug with VXLAN + IPSec in tunnel
> mode. Unfortunately, it is slightly broader than necessary, as it also
> severely affects performance for Geneve + IPSec transport mode over a
> device capable of both HW GSO and IPSec crypto offload. In this case,
> xfrm_output unnecessarily triggers software GSO instead of letting the
> HW do it. In simple iperf3 tests over Geneve + IPSec transport mode over
> a back-2-back pair of NICs with MTU 1500, the performance was observed
> to be up to 6x worse when doing software GSO compared to leaving it to
> the hardware.
> 
> This commit makes xfrm_output only trigger software GSO in crypto
> offload cases for already encapsulated packets in tunnel mode, as not
> doing so would then cause the inner tunnel skb->inner_networking_header
> to be overwritten and break software GSO for that packet later if the
> device turns out to not be capable of HW GSO.
> 
> Taking a closer look at the conditions for the original bug, to better
> understand the reasons for this change:
> - vxlan_build_skb -> iptunnel_handle_offloads sets inner_protocol and
>   inner network header.
> - then, udp_tunnel_xmit_skb -> ip_tunnel_xmit adds outer transport and
>   network headers.
> - later in the xmit path, xfrm_output -> xfrm_outer_mode_output ->
>   xfrm4_prepare_output -> xfrm4_tunnel_encap_add overwrites the inner
>   network header with the one set in ip_tunnel_xmit before adding the
>   second outer header.
> - __dev_queue_xmit -> validate_xmit_skb checks whether GSO segmentation
>   needs to happen based on dev features. In the original bug, the hw
>   couldn't segment the packets, so skb_gso_segment was invoked.
> - deep in the .gso_segment callback machinery, __skb_udp_tunnel_segment
>   tries to use the wrong inner network header, expecting the one set in
>   iptunnel_handle_offloads but getting the one set by xfrm instead.
> - a bit later, ipv6_gso_segment accesses the wrong memory based on that
>   wrong inner network header.
> 
> With the new change, the original bug (or similar ones) cannot happen
> again, as xfrm will now trigger software GSO before applying a tunnel.
> This concern doesn't exist in packet offload mode, when the HW adds
> encapsulation headers. For the non-offloaded packets (crypto in SW),
> software GSO is still done unconditionally in the else branch.
> 
> Fixes: a204aef9fd77 ("xfrm: call xfrm_output_gso when inner_protocol is set in xfrm_output")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
> Reviewed-by: Yael Chemla <ychemla@nvidia.com>
> Signed-off-by: Mark Bloch <mbloch@nvidia.com>
> ---
>  net/xfrm/xfrm_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> index f7abd42c077d..42f1ca513879 100644
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -758,7 +758,7 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
>  		skb->encapsulation = 1;
>  
>  		if (skb_is_gso(skb)) {
> -			if (skb->inner_protocol)
> +			if (skb->inner_protocol && x->props.mode == XFRM_MODE_TUNNEL)
>  				return xfrm_output_gso(net, sk, skb);
>  
>  			skb_shinfo(skb)->gso_type |= SKB_GSO_ESP;


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] xfrm: Force software GSO only in tunnel mode
  2025-03-17 10:32 [PATCH net] xfrm: Force software GSO only in tunnel mode Mark Bloch
  2025-03-17 19:30 ` Xin Long
  2025-03-18 12:37 ` Mark Bloch
@ 2025-03-20 14:50 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 6+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-03-20 14:50 UTC (permalink / raw)
  To: Mark Bloch
  Cc: steffen.klassert, herbert, davem, edumazet, kuba, pabeni, horms,
	chopps, cratiu, ychemla, wangfe, lucien.xin, dtatulea, netdev

Hello:

This patch was applied to netdev/net.git (main)
by Steffen Klassert <steffen.klassert@secunet.com>:

On Mon, 17 Mar 2025 12:32:05 +0200 you wrote:
> From: Cosmin Ratiu <cratiu@nvidia.com>
> 
> The cited commit fixed a software GSO bug with VXLAN + IPSec in tunnel
> mode. Unfortunately, it is slightly broader than necessary, as it also
> severely affects performance for Geneve + IPSec transport mode over a
> device capable of both HW GSO and IPSec crypto offload. In this case,
> xfrm_output unnecessarily triggers software GSO instead of letting the
> HW do it. In simple iperf3 tests over Geneve + IPSec transport mode over
> a back-2-back pair of NICs with MTU 1500, the performance was observed
> to be up to 6x worse when doing software GSO compared to leaving it to
> the hardware.
> 
> [...]

Here is the summary with links:
  - [net] xfrm: Force software GSO only in tunnel mode
    https://git.kernel.org/netdev/net/c/0aae2867aa60

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] xfrm: Force software GSO only in tunnel mode
  2025-03-17 19:30 ` Xin Long
@ 2025-03-20 15:24   ` Cosmin Ratiu
  2025-03-20 18:37     ` Xin Long
  0 siblings, 1 reply; 6+ messages in thread
From: Cosmin Ratiu @ 2025-03-20 15:24 UTC (permalink / raw)
  To: Mark Bloch, lucien.xin@gmail.com
  Cc: chopps@labn.net, davem@davemloft.net, herbert@gondor.apana.org.au,
	Dragos Tatulea, steffen.klassert@secunet.com,
	netdev@vger.kernel.org, pabeni@redhat.com, horms@kernel.org,
	edumazet@google.com, kuba@kernel.org, Yael Chemla,
	wangfe@google.com

On Mon, 2025-03-17 at 15:30 -0400, Xin Long wrote:
> For UDP tunnels, there are two types:
> 
> - ENCAP_TYPE_ETHER encaps an ether packet (e.g., VXLAN, Geneve).
> - ENCAP_TYPE_IPPROTO encaps an ipproto packet (e.g., SCTP over UDP).
> 
> When performing GSO via skb_udp_tunnel_segment():
> 
> - ENCAP_TYPE_ETHER relies on inner_network_header to locate the
>   network header.
> - ENCAP_TYPE_IPPROTO relies on inner_transport_header to locate
>   the transport header.
> 
> However, both IPsec transport and tunnel modes modify
> inner_transport_header. This patch raises a concern that GSO may
> not work correctly for ENCAP_TYPE_IPPROTO UDP tunnels over IPsec
> in transport mode.

skb_udp_tunnel_segment -> __skb_udp_tunnel_segment does:

tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb);
__skb_pull(skb, tnl_hlen);
skb_reset_mac_header(skb);
skb_set_network_header(skb, skb_inner_network_offset(skb));
skb_set_transport_header(skb, skb_inner_transport_offset(skb));

As a concrete example, in case of TCP over Geneve over IPsec in
transport mode, this is the sequence of header manipulations done:
geneve_build_skb: mh 190 nh 204 th 224 imh 190 inh 204 ith 224 data 182
iptunnel_xmit: mh 190 nh 154 th 174 imh 190 inh 204 ith 224 data 154
xfrm4_transport_output: mh 147 nh 138 th 158 imh 190 inh 204 ith 174
data 174
skb_mac_gso_segment: mh 124 nh 138 th 158 imh 190 inh 204 ith 174 data
124
inet_gso_segment: mh 124 nh 138 th 158 imh 190 inh 204 ith 174 data 138
esp4_gso_segment: mh 124 nh 138 th 158 imh 190 inh 204 ith 174 data 158
__skb_udp_tunnel_segment: mh 124 nh 138 th 174 imh 190 inh 204 ith 174
data 174
__skb_udp_tunnel_segment: tnl_hlen 16
__skb_udp_tunnel_segment: inner skb mh 190 nh 204 th 174 imh 190 inh
204 ith 174 data 190
skb_mac_gso_segment: mh 190 nh 204 th 174 imh 190 inh 204 ith 174 data
190
inet_gso_segment: mh 190 nh 204 th 174 imh 190 inh 204 ith 174 data 204
tcp_gso_segment: mh 190 nh 204 th 224 imh 190 inh 204 ith 174 data 224

All numbers are offsets from skb->head printed at function start
(except for '__skb_udp_tunnel_segment: inner', printed after the code
block mentioned above).
I see that xfrm4_transport_output moves the inner transport header
forward (to 174) and that __skb_udp_tunnel_segment incorrectly sets
transport header to it, but fortunately inet_gso_segment resets it to
the correct value after pulling the ip header.

In case of ENCAP_TYPE_IPPROTO, inet_gso_segment/ipv6_gso_segment would
be invoked directly by __skb_udp_tunnel_segment and it would see the
network header set correctly. But both compute nhoff like this:
nhoff = skb_network_header(skb) - skb_mac_header(skb);
Which would be 0 given mac_header is set to the same offset as the ip
header.
But that only makes the pskb_may_pull check & the skb_pull not do
anything. The functions then proceed to set up the transport header
correctly
I think the code might still work but I haven't verified with a
ENCAP_TYPE_IPPROTO protocol.

In general, while staring at this code, I got the impression that these
functions are brittle, relying on assumptions made in completely
different areas that might easily be broken given a different
combination of protocols.

I think that the code block in __skb_udp_tunnel_segment could be made
less brittle if it stops relying on the saved inner headers (which
might not be set to the actual inner protocols about to be handled),
and instead parse the mac or network headers and set the next headers
accordingly. This might even allow back HW GSO for XFRM in tunnel mode
with crypto offload, like before your original 2020 patch. WDYT?

Cosmin.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net] xfrm: Force software GSO only in tunnel mode
  2025-03-20 15:24   ` Cosmin Ratiu
@ 2025-03-20 18:37     ` Xin Long
  0 siblings, 0 replies; 6+ messages in thread
From: Xin Long @ 2025-03-20 18:37 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: Mark Bloch, chopps@labn.net, davem@davemloft.net,
	herbert@gondor.apana.org.au, Dragos Tatulea,
	steffen.klassert@secunet.com, netdev@vger.kernel.org,
	pabeni@redhat.com, horms@kernel.org, edumazet@google.com,
	kuba@kernel.org, Yael Chemla, wangfe@google.com

On Thu, Mar 20, 2025 at 11:24 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
>
> On Mon, 2025-03-17 at 15:30 -0400, Xin Long wrote:
> > For UDP tunnels, there are two types:
> >
> > - ENCAP_TYPE_ETHER encaps an ether packet (e.g., VXLAN, Geneve).
> > - ENCAP_TYPE_IPPROTO encaps an ipproto packet (e.g., SCTP over UDP).
> >
> > When performing GSO via skb_udp_tunnel_segment():
> >
> > - ENCAP_TYPE_ETHER relies on inner_network_header to locate the
> >   network header.
> > - ENCAP_TYPE_IPPROTO relies on inner_transport_header to locate
> >   the transport header.
> >
> > However, both IPsec transport and tunnel modes modify
> > inner_transport_header. This patch raises a concern that GSO may
> > not work correctly for ENCAP_TYPE_IPPROTO UDP tunnels over IPsec
> > in transport mode.
>
> skb_udp_tunnel_segment -> __skb_udp_tunnel_segment does:
>
> tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb);
> __skb_pull(skb, tnl_hlen);
> skb_reset_mac_header(skb);
> skb_set_network_header(skb, skb_inner_network_offset(skb));
> skb_set_transport_header(skb, skb_inner_transport_offset(skb));
>
> As a concrete example, in case of TCP over Geneve over IPsec in
> transport mode, this is the sequence of header manipulations done:
> geneve_build_skb: mh 190 nh 204 th 224 imh 190 inh 204 ith 224 data 182
> iptunnel_xmit: mh 190 nh 154 th 174 imh 190 inh 204 ith 224 data 154
> xfrm4_transport_output: mh 147 nh 138 th 158 imh 190 inh 204 ith 174
> data 174
> skb_mac_gso_segment: mh 124 nh 138 th 158 imh 190 inh 204 ith 174 data
> 124
> inet_gso_segment: mh 124 nh 138 th 158 imh 190 inh 204 ith 174 data 138
> esp4_gso_segment: mh 124 nh 138 th 158 imh 190 inh 204 ith 174 data 158
> __skb_udp_tunnel_segment: mh 124 nh 138 th 174 imh 190 inh 204 ith 174
> data 174
> __skb_udp_tunnel_segment: tnl_hlen 16
> __skb_udp_tunnel_segment: inner skb mh 190 nh 204 th 174 imh 190 inh
> 204 ith 174 data 190
> skb_mac_gso_segment: mh 190 nh 204 th 174 imh 190 inh 204 ith 174 data
> 190
> inet_gso_segment: mh 190 nh 204 th 174 imh 190 inh 204 ith 174 data 204
> tcp_gso_segment: mh 190 nh 204 th 224 imh 190 inh 204 ith 174 data 224
>
> All numbers are offsets from skb->head printed at function start
> (except for '__skb_udp_tunnel_segment: inner', printed after the code
> block mentioned above).
> I see that xfrm4_transport_output moves the inner transport header
> forward (to 174) and that __skb_udp_tunnel_segment incorrectly sets
> transport header to it, but fortunately inet_gso_segment resets it to
> the correct value after pulling the ip header.
>

I agree that it works for ENCAP_TYPE_ETHER tunnels in transport mode
since they do not rely on the incorrect transport header set in
__skb_udp_tunnel_segment().

> In case of ENCAP_TYPE_IPPROTO, inet_gso_segment/ipv6_gso_segment would
> be invoked directly by __skb_udp_tunnel_segment and it would see the
> network header set correctly. But both compute nhoff like this:
> nhoff = skb_network_header(skb) - skb_mac_header(skb);
> Which would be 0 given mac_header is set to the same offset as the ip
> header.
> But that only makes the pskb_may_pull check & the skb_pull not do
> anything. The functions then proceed to set up the transport header
> correctly
> I think the code might still work but I haven't verified with a
> ENCAP_TYPE_IPPROTO protocol.

For ENCAP_TYPE_IPPROTO, I believe it does not invoke inet_gso_segment()
or ipv6_gso_segment(). Instead, it directly calls the transport layer's
.gso_segment function, such as sctp_gso_segment() for SCTP over a UDP
tunnel. This behavior can be seen in skb_udp_tunnel_segment():

  case ENCAP_TYPE_IPPROTO:
      offloads = is_ipv6 ? inet6_offloads : inet_offloads;
      ops = rcu_dereference(offloads[skb->inner_ipproto]);

skb->inner_ipproto == IPPROTO_SCTP and
inet(6)_offloads[IPPROTO_SCTP] == sctp_gso_segment

sctp_gso_segment() assumes the transport header is correctly set and
retrieves the SCTP header using skb_transport_header(skb). However,
the transport header is incorrectly set earlier in the code:

  skb_set_network_header(skb, skb_inner_network_offset(skb));
  skb_set_transport_header(skb, skb_inner_transport_offset(skb));  <---
  ...
  segs = gso_inner_segment(skb, features);

With an incorrect sctphdr, this could lead to unexpected behavior or
even a crash.

I haven't had time to verify it either, but there are perhaps other
ENCAP_TYPE_IPPROTO tunnels easier to set up for testing.

>
> In general, while staring at this code, I got the impression that these
> functions are brittle, relying on assumptions made in completely
> different areas that might easily be broken given a different
> combination of protocols.
>
> I think that the code block in __skb_udp_tunnel_segment could be made
> less brittle if it stops relying on the saved inner headers (which
> might not be set to the actual inner protocols about to be handled),
> and instead parse the mac or network headers and set the next headers
> accordingly. This might even allow back HW GSO for XFRM in tunnel mode
> with crypto offload, like before your original 2020 patch. WDYT?
>

The inner_network_header and inner_transport_header were introduced
to support hardware-offloaded encapsulation. skb_udp_tunnel_segment()
may be just simply mimicking the hardware behavior and wasn’t
designed to handle nested encapsulation.

From what I can see:

For ENCAP_TYPE_ETHER, these headers help avoid extra CPU work by
skipping link-layer parsing to locate the network header, making
the link layer transparent to tunnel segmentation.

For ENCAP_TYPE_IPPROTO, without inner_transport_header, identifying
the transport header solely by parsing the inner packet may not be
possible. For example, distinguishing between these two UDP tunnels
in tunnel segmentation:

  FOU: ETH | IP | UDP | IPPROTO_XXX
  GUE: ETH | IP | UDP | GUE HDR | IPPROTO_XXX

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-03-20 18:37 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-17 10:32 [PATCH net] xfrm: Force software GSO only in tunnel mode Mark Bloch
2025-03-17 19:30 ` Xin Long
2025-03-20 15:24   ` Cosmin Ratiu
2025-03-20 18:37     ` Xin Long
2025-03-18 12:37 ` Mark Bloch
2025-03-20 14:50 ` patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).