netdev.vger.kernel.org archive mirror
* [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
@ 2025-10-03  7:34 Daniel Borkmann
  2025-10-03  9:42 ` Simon Horman
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Daniel Borkmann @ 2025-10-03  7:34 UTC (permalink / raw)
  To: bpf
  Cc: netdev, Yusuke Suzuki, Julian Wiedmann, Martin KaFai Lau,
	Jakub Kicinski, Jordan Rife

Cilium has a BPF egress gateway feature which forces outgoing K8s Pod
traffic to pass through dedicated egress gateways which then SNAT the
traffic in order to interact with stable IPs outside the cluster.

The traffic is directed to the gateway via a vxlan tunnel in collect md
mode. A recent BPF change used the bpf_redirect_neigh() helper to
forward packets after arrival and decap on vxlan. Over time, it turned
out that the kmalloc-256 slab usage in the kernel was ever-increasing.

The issue was that vxlan allocates the metadata_dst object and attaches
it through a fake dst entry to the skb. The latter was never released,
though, because bpf_redirect_neigh() merely set the new dst entry via
skb_dst_set() without dropping the existing one first.

Fixes: b4ab31414970 ("bpf: Add redirect_neigh helper as redirect drop-in")
Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
Reported-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jordan Rife <jrife@google.com>
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index b005363f482c..c3c0b5a37504 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2281,6 +2281,7 @@ static int __bpf_redirect_neigh_v6(struct sk_buff *skb, struct net_device *dev,
 		if (IS_ERR(dst))
 			goto out_drop;
 
+		skb_dst_drop(skb);
 		skb_dst_set(skb, dst);
 	} else if (nh->nh_family != AF_INET6) {
 		goto out_drop;
@@ -2389,6 +2390,7 @@ static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev,
 			goto out_drop;
 		}
 
+		skb_dst_drop(skb);
 		skb_dst_set(skb, &rt->dst);
 	}
 
-- 
2.43.0
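
[Editor's sketch] The leak described in the commit message above can be
modeled in plain userspace C. This is a simplified illustration, not
kernel code: every model_* name below is hypothetical, and the
single-owner reference count only approximates how the kernel tracks dst
entries. It assumes, as the commit message states, that setting a new
dst overwrites the skb's pointer without releasing the previous entry.

```c
/* Userspace model of the dst leak; NOT kernel code, all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_dst {
	int refcnt;
};

struct model_skb {
	struct model_dst *dst;
};

/* Number of dst objects allocated and not yet freed. */
static int live_dsts;

static struct model_dst *model_dst_alloc(void)
{
	struct model_dst *dst = calloc(1, sizeof(*dst));

	assert(dst);
	dst->refcnt = 1;	/* the reference handed to the skb */
	live_dsts++;
	return dst;
}

static void model_dst_release(struct model_dst *dst)
{
	if (dst && --dst->refcnt == 0) {
		free(dst);
		live_dsts--;
	}
}

/* Stands in for skb_dst_set(): overwrites the pointer, releases nothing. */
static void model_skb_dst_set(struct model_skb *skb, struct model_dst *dst)
{
	skb->dst = dst;
}

/* Stands in for skb_dst_drop(): releases the current entry and clears it. */
static void model_skb_dst_drop(struct model_skb *skb)
{
	model_dst_release(skb->dst);
	skb->dst = NULL;
}

/* Returns how many dst objects are still allocated once the skb is gone. */
int model_redirect(bool drop_first)
{
	struct model_skb skb = { 0 };

	live_dsts = 0;

	/* Tunnel receive path attaches its metadata dst. */
	model_skb_dst_set(&skb, model_dst_alloc());

	/* The fix: release the old entry before installing the new one. */
	if (drop_first)
		model_skb_dst_drop(&skb);

	/* Redirect path installs the route's dst. */
	model_skb_dst_set(&skb, model_dst_alloc());

	/* Freeing the skb releases only the entry it currently points to. */
	model_skb_dst_drop(&skb);

	return live_dsts;
}
```

Under these assumptions, model_redirect(false) leaves one object
allocated per packet, which matches the steadily growing slab usage,
while model_redirect(true) leaves none.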



* Re: [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
  2025-10-03  7:34 [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6} Daniel Borkmann
@ 2025-10-03  9:42 ` Simon Horman
  2025-10-03 15:01 ` Jordan Rife
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Simon Horman @ 2025-10-03  9:42 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Yusuke Suzuki, Julian Wiedmann, Martin KaFai Lau,
	Jakub Kicinski, Jordan Rife

On Fri, Oct 03, 2025 at 09:34:18AM +0200, Daniel Borkmann wrote:
> Cilium has a BPF egress gateway feature which forces outgoing K8s Pod
> traffic to pass through dedicated egress gateways which then SNAT the
> traffic in order to interact with stable IPs outside the cluster.
> 
> The traffic is directed to the gateway via a vxlan tunnel in collect md
> mode. A recent BPF change used the bpf_redirect_neigh() helper to
> forward packets after arrival and decap on vxlan. Over time, it turned
> out that the kmalloc-256 slab usage in the kernel was ever-increasing.
> 
> The issue was that vxlan allocates the metadata_dst object and attaches
> it through a fake dst entry to the skb. The latter was never released,
> though, because bpf_redirect_neigh() merely set the new dst entry via
> skb_dst_set() without dropping the existing one first.
> 
> Fixes: b4ab31414970 ("bpf: Add redirect_neigh helper as redirect drop-in")
> Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> Reported-by: Julian Wiedmann <jwi@isovalent.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Martin KaFai Lau <martin.lau@kernel.org>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Jordan Rife <jrife@google.com>

Reviewed-by: Simon Horman <horms@kernel.org>



* Re: [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
  2025-10-03  7:34 [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6} Daniel Borkmann
  2025-10-03  9:42 ` Simon Horman
@ 2025-10-03 15:01 ` Jordan Rife
  2025-10-03 16:02 ` Jakub Kicinski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Jordan Rife @ 2025-10-03 15:01 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Yusuke Suzuki, Julian Wiedmann, Martin KaFai Lau,
	Jakub Kicinski

On Fri, Oct 03, 2025 at 09:34:18AM +0200, Daniel Borkmann wrote:
> Cilium has a BPF egress gateway feature which forces outgoing K8s Pod
> traffic to pass through dedicated egress gateways which then SNAT the
> traffic in order to interact with stable IPs outside the cluster.
> 
> The traffic is directed to the gateway via a vxlan tunnel in collect md
> mode. A recent BPF change used the bpf_redirect_neigh() helper to
> forward packets after arrival and decap on vxlan. Over time, it turned
> out that the kmalloc-256 slab usage in the kernel was ever-increasing.
> 
> The issue was that vxlan allocates the metadata_dst object and attaches
> it through a fake dst entry to the skb. The latter was never released,
> though, because bpf_redirect_neigh() merely set the new dst entry via
> skb_dst_set() without dropping the existing one first.
> 
> Fixes: b4ab31414970 ("bpf: Add redirect_neigh helper as redirect drop-in")
> Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> Reported-by: Julian Wiedmann <jwi@isovalent.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Martin KaFai Lau <martin.lau@kernel.org>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Jordan Rife <jrife@google.com>
> ---
>  net/core/filter.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b005363f482c..c3c0b5a37504 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2281,6 +2281,7 @@ static int __bpf_redirect_neigh_v6(struct sk_buff *skb, struct net_device *dev,
>  		if (IS_ERR(dst))
>  			goto out_drop;
>  
> +		skb_dst_drop(skb);
>  		skb_dst_set(skb, dst);
>  	} else if (nh->nh_family != AF_INET6) {
>  		goto out_drop;
> @@ -2389,6 +2390,7 @@ static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev,
>  			goto out_drop;
>  		}
>  
> +		skb_dst_drop(skb);
>  		skb_dst_set(skb, &rt->dst);
>  	}
>  
> -- 
> 2.43.0
>

Nice catch!

Reviewed-by: Jordan Rife <jrife@google.com>


* Re: [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
  2025-10-03  7:34 [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6} Daniel Borkmann
  2025-10-03  9:42 ` Simon Horman
  2025-10-03 15:01 ` Jordan Rife
@ 2025-10-03 16:02 ` Jakub Kicinski
  2025-10-03 19:24   ` Daniel Borkmann
  2025-10-03 18:50 ` Martin KaFai Lau
  2025-10-07  4:30 ` patchwork-bot+netdevbpf
  4 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2025-10-03 16:02 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Yusuke Suzuki, Julian Wiedmann, Martin KaFai Lau,
	Jordan Rife

On Fri,  3 Oct 2025 09:34:18 +0200 Daniel Borkmann wrote:
> Cilium has a BPF egress gateway feature which forces outgoing K8s Pod
> traffic to pass through dedicated egress gateways which then SNAT the
> traffic in order to interact with stable IPs outside the cluster.

Nice! The warning Stan added at work?

Reviewed-by: Jakub Kicinski <kuba@kernel.org>


* Re: [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
  2025-10-03  7:34 [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6} Daniel Borkmann
                   ` (2 preceding siblings ...)
  2025-10-03 16:02 ` Jakub Kicinski
@ 2025-10-03 18:50 ` Martin KaFai Lau
  2025-10-07  4:30 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2025-10-03 18:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Yusuke Suzuki, Julian Wiedmann, Martin KaFai Lau,
	Jakub Kicinski, Jordan Rife

On 10/3/25 12:34 AM, Daniel Borkmann wrote:
> Cilium has a BPF egress gateway feature which forces outgoing K8s Pod
> traffic to pass through dedicated egress gateways which then SNAT the
> traffic in order to interact with stable IPs outside the cluster.
> 
> The traffic is directed to the gateway via a vxlan tunnel in collect md
> mode. A recent BPF change used the bpf_redirect_neigh() helper to
> forward packets after arrival and decap on vxlan. Over time, it turned
> out that the kmalloc-256 slab usage in the kernel was ever-increasing.
> 
> The issue was that vxlan allocates the metadata_dst object and attaches
> it through a fake dst entry to the skb. The latter was never released,
> though, because bpf_redirect_neigh() merely set the new dst entry via
> skb_dst_set() without dropping the existing one first.

Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>


* Re: [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
  2025-10-03 16:02 ` Jakub Kicinski
@ 2025-10-03 19:24   ` Daniel Borkmann
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Borkmann @ 2025-10-03 19:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, Yusuke Suzuki, Julian Wiedmann, Martin KaFai Lau,
	Jordan Rife

On 10/3/25 6:02 PM, Jakub Kicinski wrote:
> On Fri,  3 Oct 2025 09:34:18 +0200 Daniel Borkmann wrote:
>> Cilium has a BPF egress gateway feature which forces outgoing K8s Pod
>> traffic to pass through dedicated egress gateways which then SNAT the
>> traffic in order to interact with stable IPs outside the cluster.
> 
> Nice! The warning Stan added at work?

We should add CONFIG_DEBUG_NET to our Cilium CI test kernels actually.
The memory leak was observed on AWS nodes with the above Cilium config,
so bpftrace came to the rescue in the end.

Thanks,
Daniel


* Re: [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
  2025-10-03  7:34 [PATCH bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6} Daniel Borkmann
                   ` (3 preceding siblings ...)
  2025-10-03 18:50 ` Martin KaFai Lau
@ 2025-10-07  4:30 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 7+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-10-07  4:30 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: bpf, netdev, yusuke.suzuki, jwi, martin.lau, kuba, jrife

Hello:

This patch was applied to bpf/bpf.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Fri,  3 Oct 2025 09:34:18 +0200 you wrote:
> Cilium has a BPF egress gateway feature which forces outgoing K8s Pod
> traffic to pass through dedicated egress gateways which then SNAT the
> traffic in order to interact with stable IPs outside the cluster.
> 
> The traffic is directed to the gateway via a vxlan tunnel in collect md
> mode. A recent BPF change used the bpf_redirect_neigh() helper to
> forward packets after arrival and decap on vxlan. Over time, it turned
> out that the kmalloc-256 slab usage in the kernel was ever-increasing.
> 
> [...]

Here is the summary with links:
  - [bpf] bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
    https://git.kernel.org/bpf/bpf/c/23f3770e1a53

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



