[PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer


* [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer
@ 2025-05-05 19:58 Paul Chaignon
  2025-05-05 19:58 ` [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer Paul Chaignon
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Paul Chaignon @ 2025-05-05 19:58 UTC
  To: bpf; +Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko

When bpf_redirect_peer is used to redirect packets to a device in
another network namespace, the skb isn't scrubbed. That can lead to skb
information from one namespace being "misused" in another namespace.

As one example, this is causing Cilium to drop traffic when using
bpf_redirect_peer to redirect packets that just went through IPsec
decryption to a container namespace. The following pwru trace shows (1)
the packet path from the host's XFRM layer to the container's XFRM
layer where it's dropped and (2) the bitmask of active skb extensions at
each function.

    NETNS       MARK  IFACE  TUPLE                                FUNC
    4026533547  d00   eth0   10.244.3.124:35473->10.244.2.158:53  xfrm_rcv_cb
                             .active_extensions = (__u8)2,
    4026533547  d00   eth0   10.244.3.124:35473->10.244.2.158:53  xfrm4_rcv_cb
                             .active_extensions = (__u8)2,
    4026533547  d00   eth0   10.244.3.124:35473->10.244.2.158:53  gro_cells_receive
                             .active_extensions = (__u8)2,
    [...]
    4026533547  0     eth0   10.244.3.124:35473->10.244.2.158:53  skb_do_redirect
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  ip_rcv
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  ip_rcv_core
                             .active_extensions = (__u8)2,
    [...]
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  udp_queue_rcv_one_skb
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  __xfrm_policy_check
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  __xfrm_decode_session
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  security_xfrm_decode_session
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  kfree_skb_reason(SKB_DROP_REASON_XFRM_POLICY)
                             .active_extensions = (__u8)2,

In this case, there are no XFRM policies in the container's network
namespace, so the drop is unexpected. When we decrypt the IPsec packet,
the XFRM state used for decryption is recorded in the skb extensions.
This information is preserved across the netns switch. When we reach the
XFRM policy check in the container's netns, __xfrm_policy_check drops
the packet and increments LINUX_MIB_XFRMINNOPOLS because no
(container-side) XFRM policy can be found that matches the (host-side)
XFRM state used for decryption.

This patch fixes this by scrubbing the packet when using
bpf_redirect_peer, as is done on typical netns switches via veth
devices, except that skb->mark and skb->tstamp are not zeroed.

Fixes: 9aa1206e8f482 ("bpf: Add redirect_peer helper")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
---
Changes in v2:
  - Avoid scrubbing skb->mark and skb->tstamp as suggested by Daniel
    because existing applications may already use those.
  - Add second commit to note the above in the helper's description.

 net/core/filter.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 79cab4d78dc3..577a4504e26f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2509,6 +2509,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			goto out_drop;
 		skb->dev = dev;
 		dev_sw_netstats_rx_add(dev, skb->len);
+		skb_scrub_packet(skb, false);
 		return -EAGAIN;
 	}
 	return flags & BPF_F_NEIGH ?
-- 
2.43.0
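
For readers following along, the effect of the one-line change can be
sketched with a small userspace model. This is only an illustration
under stated assumptions, not kernel code: struct toy_skb, the
TOY_EXT_SEC_PATH bit, toy_scrub_packet() and toy_policy_check() are
hypothetical names made up for this sketch, and the real
__xfrm_policy_check has more conditions than the single one modelled
here. The bit value is picked only to match the active_extensions = 2
seen in the trace above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit for the XFRM state extension; chosen to match the
 * active_extensions = 2 value in the trace, not taken from the kernel. */
#define TOY_EXT_SEC_PATH (1u << 1)

/* Toy stand-in for the few sk_buff fields discussed in this thread. */
struct toy_skb {
	uint32_t mark;
	uint64_t tstamp;
	uint8_t active_extensions; /* bitmask of attached extensions */
};

/* Models the behaviour described in the commit message: extensions are
 * always dropped, mark and tstamp are zeroed only when xnet is true. */
void toy_scrub_packet(struct toy_skb *skb, bool xnet)
{
	assert(skb != NULL);
	skb->active_extensions = 0;
	if (!xnet)
		return;
	skb->mark = 0;
	skb->tstamp = 0;
}

/* Simplified container-side check: with no matching policy, a packet
 * that still carries host-side XFRM state is rejected. */
bool toy_policy_check(const struct toy_skb *skb, bool have_policy)
{
	assert(skb != NULL);
	if (!have_policy && (skb->active_extensions & TOY_EXT_SEC_PATH))
		return false; /* corresponds to the unexpected drop */
	return true;
}
```

In this model, a packet with TOY_EXT_SEC_PATH set fails
toy_policy_check(skb, false) before the scrub and passes it after
toy_scrub_packet(skb, false), while mark and tstamp keep their values,
which is the behaviour the v2 changelog asks for.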



* [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer
  2025-05-05 19:58 [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer Paul Chaignon
@ 2025-05-05 19:58 ` Paul Chaignon
  2025-05-05 21:29   ` Daniel Borkmann
  2025-05-06 19:19   ` Martin KaFai Lau
  2025-05-05 21:28 ` [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer Daniel Borkmann
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 7+ messages in thread
From: Paul Chaignon @ 2025-05-05 19:58 UTC
  To: bpf; +Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko

When switching network namespaces with the bpf_redirect_peer helper, the
skb->mark and skb->tstamp fields are not zeroed out like they can be on
a typical netns switch. This patch clarifies that in the helper
description.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
---
 include/uapi/linux/bpf.h       | 3 +++
 tools/include/uapi/linux/bpf.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 28705ae67784..fd404729b115 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4968,6 +4968,9 @@ union bpf_attr {
  * 		the netns switch takes place from ingress to ingress without
  * 		going through the CPU's backlog queue.
  *
+ * 		*skb*\ **->mark** and *skb*\ **->tstamp** are not cleared during
+ * 		the netns switch.
+ *
  * 		The *flags* argument is reserved and must be 0. The helper is
  * 		currently only supported for tc BPF program types at the
  * 		ingress hook and for veth and netkit target device types. The
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 28705ae67784..fd404729b115 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4968,6 +4968,9 @@ union bpf_attr {
  * 		the netns switch takes place from ingress to ingress without
  * 		going through the CPU's backlog queue.
  *
+ * 		*skb*\ **->mark** and *skb*\ **->tstamp** are not cleared during
+ * 		the netns switch.
+ *
  * 		The *flags* argument is reserved and must be 0. The helper is
  * 		currently only supported for tc BPF program types at the
  * 		ingress hook and for veth and netkit target device types. The
-- 
2.43.0



* Re: [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer
  2025-05-05 19:58 ` [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer Paul Chaignon
@ 2025-05-05 21:29   ` Daniel Borkmann
  2025-05-06 19:19   ` Martin KaFai Lau
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Borkmann @ 2025-05-05 21:29 UTC
  To: Paul Chaignon, bpf; +Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko

On 5/5/25 9:58 PM, Paul Chaignon wrote:
> When switching network namespaces with the bpf_redirect_peer helper, the
> skb->mark and skb->tstamp fields are not zeroed out like they can be on
> a typical netns switch. This patch clarifies that in the helper
> description.
> 
> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer
  2025-05-05 19:58 ` [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer Paul Chaignon
  2025-05-05 21:29   ` Daniel Borkmann
@ 2025-05-06 19:19   ` Martin KaFai Lau
  1 sibling, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2025-05-06 19:19 UTC
  To: Paul Chaignon, Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	Network Development

On 5/5/25 12:58 PM, Paul Chaignon wrote:
> When switching network namespaces with the bpf_redirect_peer helper, the
> skb->mark and skb->tstamp fields are not zeroed out like they can be on
> a typical netns switch. This patch clarifies that in the helper
> description.
> 
> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>

Jakub, could you help to land them to the net tree? Thanks.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>



* Re: [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer
  2025-05-05 19:58 [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer Paul Chaignon
  2025-05-05 19:58 ` [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer Paul Chaignon
@ 2025-05-05 21:28 ` Daniel Borkmann
  2025-05-06 19:17 ` Martin KaFai Lau
  2025-05-08  1:40 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 7+ messages in thread
From: Daniel Borkmann @ 2025-05-05 21:28 UTC
  To: Paul Chaignon, bpf; +Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko

On 5/5/25 9:58 PM, Paul Chaignon wrote:
> When bpf_redirect_peer is used to redirect packets to a device in
> another network namespace, the skb isn't scrubbed.
> 
> [...]
> 
> Fixes: 9aa1206e8f482 ("bpf: Add redirect_peer helper")
> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer
  2025-05-05 19:58 [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer Paul Chaignon
  2025-05-05 19:58 ` [PATCH bpf v2 2/2] bpf: Clarify handling of mark and tstamp by redirect_peer Paul Chaignon
  2025-05-05 21:28 ` [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer Daniel Borkmann
@ 2025-05-06 19:17 ` Martin KaFai Lau
  2025-05-08  1:40 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2025-05-06 19:17 UTC
  To: Paul Chaignon
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Jakub Kicinski, Network Development, bpf

On 5/5/25 12:58 PM, Paul Chaignon wrote:
> When bpf_redirect_peer is used to redirect packets to a device in
> another network namespace, the skb isn't scrubbed.
> 
> [...]
> 
> Fixes: 9aa1206e8f482 ("bpf: Add redirect_peer helper")
> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>



* Re: [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer
  2025-05-05 19:58 [PATCH bpf v2 1/2] bpf: Scrub packet on bpf_redirect_peer Paul Chaignon
                   ` (2 preceding siblings ...)
  2025-05-06 19:17 ` Martin KaFai Lau
@ 2025-05-08  1:40 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 7+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-05-08  1:40 UTC
  To: Paul Chaignon; +Cc: bpf, martin.lau, ast, daniel, andrii

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 5 May 2025 21:58:04 +0200 you wrote:
> When bpf_redirect_peer is used to redirect packets to a device in
> another network namespace, the skb isn't scrubbed. That can lead skb
> information from one namespace to be "misused" in another namespace.
> 
> As one example, this is causing Cilium to drop traffic when using
> bpf_redirect_peer to redirect packets that just went through IPsec
> decryption to a container namespace. The following pwru trace shows (1)
> the packet path from the host's XFRM layer to the container's XFRM
> layer where it's dropped and (2) the number of active skb extensions at
> each function.
> 
> [...]

Here is the summary with links:
  - [bpf,v2,1/2] bpf: Scrub packet on bpf_redirect_peer
    https://git.kernel.org/netdev/net/c/c43272299488
  - [bpf,v2,2/2] bpf: Clarify handling of mark and tstamp by redirect_peer
    https://git.kernel.org/netdev/net/c/f5c79ffdc250

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


