netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
@ 2025-11-11  6:43 Chuang Wang
  2025-11-12  7:28 ` Ido Schimmel
  2025-11-12 14:50 ` patchwork-bot+netdevbpf
  0 siblings, 2 replies; 4+ messages in thread
From: Chuang Wang @ 2025-11-11  6:43 UTC (permalink / raw)
  Cc: Chuang Wang, stable, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, linux-kernel

The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
update_or_create_fnhe(), which lead to fnhe_remove_oldest() being called
to delete entries exceeding FNHE_RECLAIM_DEPTH+random.

The race window is between fnhe_remove_oldest() selecting fnheX for
deletion and the subsequent kfree_rcu(). During this time, the
concurrent path's __mkroute_output() -> find_exception() can fetch the
soon-to-be-deleted fnheX, and rt_bind_exception() then binds it with a
new dst using a dst_hold(). When the original fnheX is freed via RCU,
the dst reference remains permanently leaked.

CPU 0                             CPU 1
__mkroute_output()
  find_exception() [fnheX]
                                  update_or_create_fnhe()
                                    fnhe_remove_oldest() [fnheX]
  rt_bind_exception() [bind dst]
                                  RCU callback [fnheX freed, dst leak]

This issue manifests as a device reference count leak and a warning in
dmesg when unregistering the net device:

  unregister_netdevice: waiting for sitX to become free. Usage count = N

Ido Schimmel provided the simple test validation method [1].

The fix clears 'oldest->fnhe_daddr' before calling fnhe_flush_routes().
Since rt_bind_exception() checks this field, setting it to zero prevents
the stale fnhe from being reused and bound to a new dst just before it
is freed.

[1]
ip netns add ns1
ip -n ns1 link set dev lo up
ip -n ns1 address add 192.0.2.1/32 dev lo
ip -n ns1 link add name dummy1 up type dummy
ip -n ns1 route add 192.0.2.2/32 dev dummy1
ip -n ns1 link add name gretap1 up arp off type gretap \
    local 192.0.2.1 remote 192.0.2.2
ip -n ns1 route add 198.51.0.0/16 dev gretap1
taskset -c 0 ip netns exec ns1 mausezahn gretap1 \
    -A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
taskset -c 2 ip netns exec ns1 mausezahn gretap1 \
    -A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
sleep 10
ip netns pids ns1 | xargs kill
ip netns del ns1

Cc: stable@vger.kernel.org
Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
---
v0 -> v1:
- Expanded commit description to fully document the race condition,
  including the sit driver's call chain and stack trace.
- Added Ido Schimmel's validation method.
---

 net/ipv4/route.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6d27d3610c1c..b549d6a57307 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -607,6 +607,11 @@ static void fnhe_remove_oldest(struct fnhe_hash_bucket *hash)
 			oldest_p = fnhe_p;
 		}
 	}
+
+	/* Clear oldest->fnhe_daddr to prevent this fnhe from being
+	 * rebound with new dsts in rt_bind_exception().
+	 */
+	oldest->fnhe_daddr = 0;
 	fnhe_flush_routes(oldest);
 	*oldest_p = oldest->fnhe_next;
 	kfree_rcu(oldest, rcu);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH net v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-11  6:43 [PATCH net v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe Chuang Wang
@ 2025-11-12  7:28 ` Ido Schimmel
  2025-11-12  9:00   ` Eric Dumazet
  2025-11-12 14:50 ` patchwork-bot+netdevbpf
  1 sibling, 1 reply; 4+ messages in thread
From: Ido Schimmel @ 2025-11-12  7:28 UTC (permalink / raw)
  To: Chuang Wang
  Cc: stable, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, linux-kernel

On Tue, Nov 11, 2025 at 02:43:24PM +0800, Chuang Wang wrote:
> The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
> update_or_create_fnhe(), which lead to fnhe_remove_oldest() being called
> to delete entries exceeding FNHE_RECLAIM_DEPTH+random.
> 
> The race window is between fnhe_remove_oldest() selecting fnheX for
> deletion and the subsequent kfree_rcu(). During this time, the
> concurrent path's __mkroute_output() -> find_exception() can fetch the
> soon-to-be-deleted fnheX, and rt_bind_exception() then binds it with a
> new dst using a dst_hold(). When the original fnheX is freed via RCU,
> the dst reference remains permanently leaked.
> 
> CPU 0                             CPU 1
> __mkroute_output()
>   find_exception() [fnheX]
>                                   update_or_create_fnhe()
>                                     fnhe_remove_oldest() [fnheX]
>   rt_bind_exception() [bind dst]
>                                   RCU callback [fnheX freed, dst leak]
> 
> This issue manifests as a device reference count leak and a warning in
> dmesg when unregistering the net device:
> 
>   unregister_netdevice: waiting for sitX to become free. Usage count = N
> 
> Ido Schimmel provided the simple test validation method [1].
> 
> The fix clears 'oldest->fnhe_daddr' before calling fnhe_flush_routes().
> Since rt_bind_exception() checks this field, setting it to zero prevents
> the stale fnhe from being reused and bound to a new dst just before it
> is freed.
> 
> [1]
> ip netns add ns1
> ip -n ns1 link set dev lo up
> ip -n ns1 address add 192.0.2.1/32 dev lo
> ip -n ns1 link add name dummy1 up type dummy
> ip -n ns1 route add 192.0.2.2/32 dev dummy1
> ip -n ns1 link add name gretap1 up arp off type gretap \
>     local 192.0.2.1 remote 192.0.2.2
> ip -n ns1 route add 198.51.0.0/16 dev gretap1
> taskset -c 0 ip netns exec ns1 mausezahn gretap1 \
>     -A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
> taskset -c 2 ip netns exec ns1 mausezahn gretap1 \
>     -A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
> sleep 10
> ip netns pids ns1 | xargs kill
> ip netns del ns1
> 
> Cc: stable@vger.kernel.org
> Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
> Signed-off-by: Chuang Wang <nashuiliang@gmail.com>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH net v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-12  7:28 ` Ido Schimmel
@ 2025-11-12  9:00   ` Eric Dumazet
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Dumazet @ 2025-11-12  9:00 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Chuang Wang, stable, David S. Miller, David Ahern, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev, linux-kernel

On Tue, Nov 11, 2025 at 11:28 PM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Tue, Nov 11, 2025 at 02:43:24PM +0800, Chuang Wang wrote:
> > The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
> > update_or_create_fnhe(), which lead to fnhe_remove_oldest() being called
> > to delete entries exceeding FNHE_RECLAIM_DEPTH+random.
...
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>

Thanks !

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH net v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-11  6:43 [PATCH net v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe Chuang Wang
  2025-11-12  7:28 ` Ido Schimmel
@ 2025-11-12 14:50 ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 4+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-11-12 14:50 UTC (permalink / raw)
  To: Chuang Wang
  Cc: stable, davem, dsahern, edumazet, kuba, pabeni, horms, netdev,
	linux-kernel

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 11 Nov 2025 14:43:24 +0800 you wrote:
> The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
> update_or_create_fnhe(), which lead to fnhe_remove_oldest() being called
> to delete entries exceeding FNHE_RECLAIM_DEPTH+random.
> 
> The race window is between fnhe_remove_oldest() selecting fnheX for
> deletion and the subsequent kfree_rcu(). During this time, the
> concurrent path's __mkroute_output() -> find_exception() can fetch the
> soon-to-be-deleted fnheX, and rt_bind_exception() then binds it with a
> new dst using a dst_hold(). When the original fnheX is freed via RCU,
> the dst reference remains permanently leaked.
> 
> [...]

Here is the summary with links:
  - [net,v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
    https://git.kernel.org/netdev/net/c/ac1499fcd40f

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-11-12 14:50 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-11  6:43 [PATCH net v1] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe Chuang Wang
2025-11-12  7:28 ` Ido Schimmel
2025-11-12  9:00   ` Eric Dumazet
2025-11-12 14:50 ` patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).