[PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
@ 2025-11-04  3:09 chuang
  2025-11-05 14:21 ` Ido Schimmel
  2025-11-05 14:26 ` Eric Dumazet
  0 siblings, 2 replies; 8+ messages in thread
From: chuang @ 2025-11-04  3:09 UTC (permalink / raw)
  Cc: Chuang Wang, stable, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Networking, open list

From 35dbc9abd8da820007391b707bd2c1a9c99ee67d Mon Sep 17 00:00:00 2001
From: Chuang Wang <nashuiliang@gmail.com>
Date: Tue, 4 Nov 2025 02:52:11 +0000
Subject: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding
 stale fnhe

A race condition exists between fnhe_remove_oldest() and
rt_bind_exception() where a fnhe that is scheduled for removal can be
rebound to a new dst.

The issue occurs when fnhe_remove_oldest() selects an fnhe (fnheX)
for deletion, but before it can be flushed and freed via RCU,
CPU 0 enters rt_bind_exception() and attempts to reuse the entry.

CPU 0                             CPU 1
__mkroute_output()
  find_exception() [fnheX]
                                  update_or_create_fnhe()
                                    fnhe_remove_oldest() [fnheX]
  rt_bind_exception() [bind dst]
                                  RCU callback [fnheX freed, dst leak]

If rt_bind_exception() successfully binds fnheX to a new dst, the
newly bound dst will never be properly freed because fnheX will
soon be released by the RCU callback, leading to a permanent
reference count leak on the old dst and the device.

This issue manifests as a device reference count leak and a
warning in dmesg when unregistering the net device:

  unregister_netdevice: waiting for ethX to become free. Usage count = N

Fix this race by clearing 'oldest->fnhe_daddr' before calling
fnhe_flush_routes(). Since rt_bind_exception() checks this field,
setting it to zero prevents the stale fnhe from being reused and
bound to a new dst just before it is freed.

Cc: stable@vger.kernel.org
Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
---
 net/ipv4/route.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6d27d3610c1c..b549d6a57307 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -607,6 +607,11 @@ static void fnhe_remove_oldest(struct
fnhe_hash_bucket *hash)
                        oldest_p = fnhe_p;
                }
        }
+
+       /* Clear oldest->fnhe_daddr to prevent this fnhe from being
+        * rebound with new dsts in rt_bind_exception().
+        */
+       oldest->fnhe_daddr = 0;
        fnhe_flush_routes(oldest);
        *oldest_p = oldest->fnhe_next;
        kfree_rcu(oldest, rcu);
--

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-04  3:09 [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe chuang
@ 2025-11-05 14:21 ` Ido Schimmel
  2025-11-05 14:26 ` Eric Dumazet
  1 sibling, 0 replies; 8+ messages in thread
From: Ido Schimmel @ 2025-11-05 14:21 UTC (permalink / raw)
  To: chuang
  Cc: stable, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Networking, open list

On Tue, Nov 04, 2025 at 11:09:08AM +0800, chuang wrote:
> From 35dbc9abd8da820007391b707bd2c1a9c99ee67d Mon Sep 17 00:00:00 2001
> From: Chuang Wang <nashuiliang@gmail.com>
> Date: Tue, 4 Nov 2025 02:52:11 +0000
> Subject: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding
>  stale fnhe

I'm unable to apply the patch:

$ b4 am -o - CACueBy7yNo4jq4HbiLXn0ez14w8CUTtTpPHmpSB-Ou6jhhNypA@mail.gmail.com | git am
[...]
Applying: ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
error: corrupt patch at line 10
Patch failed at 0001 ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe

Did you use git send-email?

> 
> A race condition exists between fnhe_remove_oldest() and
> rt_bind_exception() where a fnhe that is scheduled for removal can be
> rebound to a new dst.
> 
> The issue occurs when fnhe_remove_oldest() selects an fnhe (fnheX)
> for deletion, but before it can be flushed and freed via RCU,
> CPU 0 enters rt_bind_exception() and attempts to reuse the entry.
> 
> CPU 0                             CPU 1
> __mkroute_output()
>   find_exception() [fnheX]
>                                   update_or_create_fnhe()
>                                     fnhe_remove_oldest() [fnheX]
>   rt_bind_exception() [bind dst]
>                                   RCU callback [fnheX freed, dst leak]
> 
> If rt_bind_exception() successfully binds fnheX to a new dst, the
> newly bound dst will never be properly freed because fnheX will
> soon be released by the RCU callback, leading to a permanent
> reference count leak on the old dst and the device.
> 
> This issue manifests as a device reference count leak and a
> warning in dmesg when unregistering the net device:
> 
>   unregister_netdevice: waiting for ethX to become free. Usage count = N

Can you say more about how you debugged this? It seems like a very rare
race condition. I expect netdevice_tracker to only show that a dst entry
took a reference on the net device, but it wouldn't show who took a
reference on the dst entry.

> 
> Fix this race by clearing 'oldest->fnhe_daddr' before calling
> fnhe_flush_routes(). Since rt_bind_exception() checks this field,
> setting it to zero prevents the stale fnhe from being reused and
> bound to a new dst just before it is freed.

Seems safe given that both fnhe_remove_oldest() and rt_bind_exception()
access 'fnhe_daddr' while holding 'fnhe_lock'. Same trick was used in
commit ee60ad219f5c ("route: set the deleted fnhe fnhe_daddr to 0 in
ip_del_fnhe to fix a race").

> 
> Cc: stable@vger.kernel.org
> Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
> Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
> ---
>  net/ipv4/route.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6d27d3610c1c..b549d6a57307 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -607,6 +607,11 @@ static void fnhe_remove_oldest(struct
> fnhe_hash_bucket *hash)
>                         oldest_p = fnhe_p;
>                 }
>         }
> +
> +       /* Clear oldest->fnhe_daddr to prevent this fnhe from being
> +        * rebound with new dsts in rt_bind_exception().
> +        */
> +       oldest->fnhe_daddr = 0;
>         fnhe_flush_routes(oldest);
>         *oldest_p = oldest->fnhe_next;
>         kfree_rcu(oldest, rcu);
> --
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-04  3:09 [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe chuang
  2025-11-05 14:21 ` Ido Schimmel
@ 2025-11-05 14:26 ` Eric Dumazet
  2025-11-05 15:34   ` Ido Schimmel
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-11-05 14:26 UTC (permalink / raw)
  To: chuang
  Cc: stable, David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Networking, open list

On Mon, Nov 3, 2025 at 7:09 PM chuang <nashuiliang@gmail.com> wrote:
>
> From 35dbc9abd8da820007391b707bd2c1a9c99ee67d Mon Sep 17 00:00:00 2001
> From: Chuang Wang <nashuiliang@gmail.com>
> Date: Tue, 4 Nov 2025 02:52:11 +0000
> Subject: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding
>  stale fnhe
>
> A race condition exists between fnhe_remove_oldest() and
> rt_bind_exception() where a fnhe that is scheduled for removal can be
> rebound to a new dst.
>
> The issue occurs when fnhe_remove_oldest() selects an fnhe (fnheX)
> for deletion, but before it can be flushed and freed via RCU,
> CPU 0 enters rt_bind_exception() and attempts to reuse the entry.
>
> CPU 0                             CPU 1
> __mkroute_output()
>   find_exception() [fnheX]
>                                   update_or_create_fnhe()
>                                     fnhe_remove_oldest() [fnheX]
>   rt_bind_exception() [bind dst]
>                                   RCU callback [fnheX freed, dst leak]
>
> If rt_bind_exception() successfully binds fnheX to a new dst, the
> newly bound dst will never be properly freed because fnheX will
> soon be released by the RCU callback, leading to a permanent
> reference count leak on the old dst and the device.
>
> This issue manifests as a device reference count leak and a
> warning in dmesg when unregistering the net device:
>
>   unregister_netdevice: waiting for ethX to become free. Usage count = N
>
> Fix this race by clearing 'oldest->fnhe_daddr' before calling
> fnhe_flush_routes(). Since rt_bind_exception() checks this field,
> setting it to zero prevents the stale fnhe from being reused and
> bound to a new dst just before it is freed.
>
> Cc: stable@vger.kernel.org
> Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")

I do not see how this commit added the bug you are looking at ?

> Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
> ---
>  net/ipv4/route.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6d27d3610c1c..b549d6a57307 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -607,6 +607,11 @@ static void fnhe_remove_oldest(struct
> fnhe_hash_bucket *hash)
>                         oldest_p = fnhe_p;
>                 }
>         }
> +
> +       /* Clear oldest->fnhe_daddr to prevent this fnhe from being
> +        * rebound with new dsts in rt_bind_exception().
> +        */
> +       oldest->fnhe_daddr = 0;
>         fnhe_flush_routes(oldest);
>         *oldest_p = oldest->fnhe_next;
>         kfree_rcu(oldest, rcu);
> --

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-05 14:26 ` Eric Dumazet
@ 2025-11-05 15:34   ` Ido Schimmel
  2025-11-05 15:50     ` Eric Dumazet
  2025-11-06  0:31     ` chuang
  0 siblings, 2 replies; 8+ messages in thread
From: Ido Schimmel @ 2025-11-05 15:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: chuang, stable, David S. Miller, David Ahern, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Networking, open list

On Wed, Nov 05, 2025 at 06:26:22AM -0800, Eric Dumazet wrote:
> On Mon, Nov 3, 2025 at 7:09 PM chuang <nashuiliang@gmail.com> wrote:
> >
> > From 35dbc9abd8da820007391b707bd2c1a9c99ee67d Mon Sep 17 00:00:00 2001
> > From: Chuang Wang <nashuiliang@gmail.com>
> > Date: Tue, 4 Nov 2025 02:52:11 +0000
> > Subject: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding
> >  stale fnhe
> >
> > A race condition exists between fnhe_remove_oldest() and
> > rt_bind_exception() where a fnhe that is scheduled for removal can be
> > rebound to a new dst.
> >
> > The issue occurs when fnhe_remove_oldest() selects an fnhe (fnheX)
> > for deletion, but before it can be flushed and freed via RCU,
> > CPU 0 enters rt_bind_exception() and attempts to reuse the entry.
> >
> > CPU 0                             CPU 1
> > __mkroute_output()
> >   find_exception() [fnheX]
> >                                   update_or_create_fnhe()
> >                                     fnhe_remove_oldest() [fnheX]
> >   rt_bind_exception() [bind dst]
> >                                   RCU callback [fnheX freed, dst leak]
> >
> > If rt_bind_exception() successfully binds fnheX to a new dst, the
> > newly bound dst will never be properly freed because fnheX will
> > soon be released by the RCU callback, leading to a permanent
> > reference count leak on the old dst and the device.
> >
> > This issue manifests as a device reference count leak and a
> > warning in dmesg when unregistering the net device:
> >
> >   unregister_netdevice: waiting for ethX to become free. Usage count = N
> >
> > Fix this race by clearing 'oldest->fnhe_daddr' before calling
> > fnhe_flush_routes(). Since rt_bind_exception() checks this field,
> > setting it to zero prevents the stale fnhe from being reused and
> > bound to a new dst just before it is freed.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
> 
> I do not see how this commit added the bug you are looking at ?

Not the author, but my understanding is that the issue is that an
exception entry which is queued for deletion allows a dst entry to be
bound to it. As such, nobody will ever release the reference from the
dst entry and the associated net device.

Before 67d6d681e15b, exception entries were only queued for deletion by
ip_del_fnhe() and it prevented dst entries from binding themselves to
the deleted exception entry by clearing 'fnhe->fnhe_daddr' which is
checked in rt_bind_exception(). See ee60ad219f5c7.

67d6d681e15b added another point in the code that queues exception
entries for deletion, but without clearing 'fnhe->fnhe_daddr' first.
Therefore, it added another instance of the bug that was fixed in
ee60ad219f5c7.

> 
> > Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
> > ---
> >  net/ipv4/route.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 6d27d3610c1c..b549d6a57307 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -607,6 +607,11 @@ static void fnhe_remove_oldest(struct
> > fnhe_hash_bucket *hash)
> >                         oldest_p = fnhe_p;
> >                 }
> >         }
> > +
> > +       /* Clear oldest->fnhe_daddr to prevent this fnhe from being
> > +        * rebound with new dsts in rt_bind_exception().
> > +        */
> > +       oldest->fnhe_daddr = 0;
> >         fnhe_flush_routes(oldest);
> >         *oldest_p = oldest->fnhe_next;
> >         kfree_rcu(oldest, rcu);
> > --
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-05 15:34   ` Ido Schimmel
@ 2025-11-05 15:50     ` Eric Dumazet
  2025-11-06  0:31     ` chuang
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2025-11-05 15:50 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: chuang, stable, David S. Miller, David Ahern, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Networking, open list

On Wed, Nov 5, 2025 at 7:34 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Wed, Nov 05, 2025 at 06:26:22AM -0800, Eric Dumazet wrote:
> > On Mon, Nov 3, 2025 at 7:09 PM chuang <nashuiliang@gmail.com> wrote:
> > >
> > > From 35dbc9abd8da820007391b707bd2c1a9c99ee67d Mon Sep 17 00:00:00 2001
> > > From: Chuang Wang <nashuiliang@gmail.com>
> > > Date: Tue, 4 Nov 2025 02:52:11 +0000
> > > Subject: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding
> > >  stale fnhe
> > >
> > > A race condition exists between fnhe_remove_oldest() and
> > > rt_bind_exception() where a fnhe that is scheduled for removal can be
> > > rebound to a new dst.
> > >
> > > The issue occurs when fnhe_remove_oldest() selects an fnhe (fnheX)
> > > for deletion, but before it can be flushed and freed via RCU,
> > > CPU 0 enters rt_bind_exception() and attempts to reuse the entry.
> > >
> > > CPU 0                             CPU 1
> > > __mkroute_output()
> > >   find_exception() [fnheX]
> > >                                   update_or_create_fnhe()
> > >                                     fnhe_remove_oldest() [fnheX]
> > >   rt_bind_exception() [bind dst]
> > >                                   RCU callback [fnheX freed, dst leak]
> > >
> > > If rt_bind_exception() successfully binds fnheX to a new dst, the
> > > newly bound dst will never be properly freed because fnheX will
> > > soon be released by the RCU callback, leading to a permanent
> > > reference count leak on the old dst and the device.
> > >
> > > This issue manifests as a device reference count leak and a
> > > warning in dmesg when unregistering the net device:
> > >
> > >   unregister_netdevice: waiting for ethX to become free. Usage count = N
> > >
> > > Fix this race by clearing 'oldest->fnhe_daddr' before calling
> > > fnhe_flush_routes(). Since rt_bind_exception() checks this field,
> > > setting it to zero prevents the stale fnhe from being reused and
> > > bound to a new dst just before it is freed.
> > >
> > > Cc: stable@vger.kernel.org
> > > Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
> >
> > I do not see how this commit added the bug you are looking at ?
>
> Not the author, but my understanding is that the issue is that an
> exception entry which is queued for deletion allows a dst entry to be
> bound to it. As such, nobody will ever release the reference from the
> dst entry and the associated net device.
>
> Before 67d6d681e15b, exception entries were only queued for deletion by
> ip_del_fnhe() and it prevented dst entries from binding themselves to
> the deleted exception entry by clearing 'fnhe->fnhe_daddr' which is
> checked in rt_bind_exception(). See ee60ad219f5c7.
>
> 67d6d681e15b added another point in the code that queues exception
> entries for deletion, but without clearing 'fnhe->fnhe_daddr' first.
> Therefore, it added another instance of the bug that was fixed in
> ee60ad219f5c7.
>

Thanks for the clarification.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-05 15:34   ` Ido Schimmel
  2025-11-05 15:50     ` Eric Dumazet
@ 2025-11-06  0:31     ` chuang
  2025-11-07  9:53       ` chuang
  1 sibling, 1 reply; 8+ messages in thread
From: chuang @ 2025-11-06  0:31 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Eric Dumazet, stable, David S. Miller, David Ahern,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Networking, open list

Thanks, your analysis is excellent and makes perfect sense. I can
briefly describe the issue.

This problem took quite some time to analyze overall — we enabled
netdev refcnt, added dst tracepoints, and eventually captured a race
condition between fnhe deletion and rt_bind_exception.

In our environment, we use the sit driver(ip tunnel). During the xmit
path, it records the PMTU for each destination, creating or updating
fnhe entries (even when the MTU is already appropriate). Because there
are many data flows, the sit driver updates PMTU very frequently,
which leads to the race condition mentioned above.

Sorry for the brief summary — I’ll provide a more detailed explanation
later, along with the patch verification method.

On Wed, Nov 5, 2025 at 23:34 Ido Schimmel <idosch@idosch.org> wrote:
>
> On Wed, Nov 05, 2025 at 06:26:22AM -0800, Eric Dumazet wrote:
> > On Mon, Nov 3, 2025 at 7:09 PM chuang <nashuiliang@gmail.com> wrote:
> > >
> > > From 35dbc9abd8da820007391b707bd2c1a9c99ee67d Mon Sep 17 00:00:00 2001
> > > From: Chuang Wang <nashuiliang@gmail.com>
> > > Date: Tue, 4 Nov 2025 02:52:11 +0000
> > > Subject: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding
> > >  stale fnhe
> > >
> > > A race condition exists between fnhe_remove_oldest() and
> > > rt_bind_exception() where a fnhe that is scheduled for removal can be
> > > rebound to a new dst.
> > >
> > > The issue occurs when fnhe_remove_oldest() selects an fnhe (fnheX)
> > > for deletion, but before it can be flushed and freed via RCU,
> > > CPU 0 enters rt_bind_exception() and attempts to reuse the entry.
> > >
> > > CPU 0                             CPU 1
> > > __mkroute_output()
> > >   find_exception() [fnheX]
> > >                                   update_or_create_fnhe()
> > >                                     fnhe_remove_oldest() [fnheX]
> > >   rt_bind_exception() [bind dst]
> > >                                   RCU callback [fnheX freed, dst leak]
> > >
> > > If rt_bind_exception() successfully binds fnheX to a new dst, the
> > > newly bound dst will never be properly freed because fnheX will
> > > soon be released by the RCU callback, leading to a permanent
> > > reference count leak on the old dst and the device.
> > >
> > > This issue manifests as a device reference count leak and a
> > > warning in dmesg when unregistering the net device:
> > >
> > >   unregister_netdevice: waiting for ethX to become free. Usage count = N
> > >
> > > Fix this race by clearing 'oldest->fnhe_daddr' before calling
> > > fnhe_flush_routes(). Since rt_bind_exception() checks this field,
> > > setting it to zero prevents the stale fnhe from being reused and
> > > bound to a new dst just before it is freed.
> > >
> > > Cc: stable@vger.kernel.org
> > > Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
> >
> > I do not see how this commit added the bug you are looking at ?
>
> Not the author, but my understanding is that the issue is that an
> exception entry which is queued for deletion allows a dst entry to be
> bound to it. As such, nobody will ever release the reference from the
> dst entry and the associated net device.
>
> Before 67d6d681e15b, exception entries were only queued for deletion by
> ip_del_fnhe() and it prevented dst entries from binding themselves to
> the deleted exception entry by clearing 'fnhe->fnhe_daddr' which is
> checked in rt_bind_exception(). See ee60ad219f5c7.
>
> 67d6d681e15b added another point in the code that queues exception
> entries for deletion, but without clearing 'fnhe->fnhe_daddr' first.
> Therefore, it added another instance of the bug that was fixed in
> ee60ad219f5c7.
>
> >
> > > Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
> > > ---
> > >  net/ipv4/route.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index 6d27d3610c1c..b549d6a57307 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -607,6 +607,11 @@ static void fnhe_remove_oldest(struct
> > > fnhe_hash_bucket *hash)
> > >                         oldest_p = fnhe_p;
> > >                 }
> > >         }
> > > +
> > > +       /* Clear oldest->fnhe_daddr to prevent this fnhe from being
> > > +        * rebound with new dsts in rt_bind_exception().
> > > +        */
> > > +       oldest->fnhe_daddr = 0;
> > >         fnhe_flush_routes(oldest);
> > >         *oldest_p = oldest->fnhe_next;
> > >         kfree_rcu(oldest, rcu);
> > > --
> >

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-06  0:31     ` chuang
@ 2025-11-07  9:53       ` chuang
  2025-11-09 15:48         ` Ido Schimmel
  0 siblings, 1 reply; 8+ messages in thread
From: chuang @ 2025-11-07  9:53 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Eric Dumazet, stable, David S. Miller, David Ahern,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Networking, open list

Thanks for reviewing the patch. I'm providing the detailed analysis
and debugging traces below to confirm the root cause and exact
location of the reference count leak.

1. Environment and Symptom

The issue was consistently reproduced when routing TCP traffic through
a Software IP Tunnel interface (sit0). The traffic flow  is:

  APP -> sit0 (IP tunnel) -> outside

This leads to a reference count leak that prevents the device from
being freed during unregistration, resulting in the kernel log
warning:

  unregister_netdevice: waiting for sit0 to become free. Usage count = N

2. Enable refcnt_tracer

Live-crash analysis identified a stale dst entry retaining a reference
to sit0. With CONFIG_NET_DEV_REFCNT_TRACKER enabled, the allocation
stack for the leaked reference was identified:

[1279559.416854] leaked reference.
[1279559.416955]  dst_init+0x48/0x100
[1279559.416965]  dst_alloc+0x66/0xd0
[1279559.416966]  rt_dst_alloc+0x3c/0xd0
[1279559.416974]  ip_route_output_key_hash_rcu+0x1d7/0x940
[1279559.416978]  ip_route_output_key_hash+0x6d/0xa0
[1279559.416979]  ip_route_output_flow+0x1f/0x70
[1279559.416980]  __ip_queue_xmit+0x415/0x480
[1279559.416984]  ip_queue_xmit+0x15/0x20
[1279559.416986]  __tcp_transmit_skb+0xad4/0xc50

3. Pinpointing the Unmatched dst_hold()

To pinpoint the specific reference not released, we added tracepoints
to all dst_hold/put functions and used eBPF to record the full
lifecycle. The tracing identified a hold operation with the following
call stack:

do_trace_dst_entry_inc+0x45
rt_set_nexthop.constprop.0+0x376      /* <<<<<<<<<<<<<<<<<<<< HERE */
__mkroute_output+0x2B7
ip_route_output_key_hash_rcu+0xBD
ip_route_output_key_hash+0x6D
ip_route_output_flow+0x1F
inet_sk_rebuild_header+0x19C
__tcp_retransmit_skb+0x7E
tcp_retransmit_skb+0x19
tcp_retransmit_timer+0x3DF

The address rt_set_nexthop.constprop.0+0x376 corresponds to the
dst_hold() call inside rt_bind_exception().

4. Root Cause Analysis

The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
... -> update_or_create_fnhe(), which lead to fnhe_remove_oldest()
being called to delete entries exceeding the
FNHE_RECLAIM_DEPTH+random.

The race window is between fnhe_remove_oldest() selecting fnheX for
deletion and the subsequent kfree_rcu(). During this time, the
concurrent path's __mkroute_output() -> find_exception() can fetch the
soon-to-be-deleted fnheX, and rt_bind_exception() then binds it with a
new dst using a dst_hold(). When the original fnheX is freed via RCU,
the dst reference remains permanently leaked.

5. Fix Validation with eBPF

The patch mitigates this by zeroing fnhe_daddr before the
RCU-protected deletion steps. This prevents rt_bind_exception() from
attempting to reuse the entry.
The fix was validated by probing the rt_bind_exception path (which in
my environment is optimized to rt_set_nexthop.constprop.0) to catch
any zeroed but active FNHEs being processed:

bpftrace -e 'kprobe:rt_set_nexthop.constprop.0
{
    $rt = (struct rtable *)arg0;
    $fnhe = (struct fib_nh_exception *)arg3;
    $fi = (struct flowi *)arg4;

    /* Check for an FNHE that is marked for deletion (daddr == 0)
     * but is still visible/valid (fnhe_expires != 0 and not expired).
     */
    if ($fi != 0 && $fnhe != 0 && $fnhe->fnhe_daddr == 0 &&
$fnhe->fnhe_expires != 0 && $fnhe->fnhe_expires >= jiffies) {
        printf("rt: %llx, dev: %s, will leak before this patch\n",
$rt, $rt->dst.dev->name);
    }
}'



On Thu, Nov 6, 2025 at 8:31 AM chuang <nashuiliang@gmail.com> wrote:
>
> Thanks, your analysis is excellent and makes perfect sense. I can
> briefly describe the issue.
>
> This problem took quite some time to analyze overall — we enabled
> netdev refcnt, added dst tracepoints, and eventually captured a race
> condition between fnhe deletion and rt_bind_exception.
>
> In our environment, we use the sit driver(ip tunnel). During the xmit
> path, it records the PMTU for each destination, creating or updating
> fnhe entries (even when the MTU is already appropriate). Because there
> are many data flows, the sit driver updates PMTU very frequently,
> which leads to the race condition mentioned above.
>
> Sorry for the brief summary — I’ll provide a more detailed explanation
> later, along with the patch verification method.
>
> On Wed, Nov 5, 2025 at 23:34 Ido Schimmel <idosch@idosch.org> wrote:
> >
> > On Wed, Nov 05, 2025 at 06:26:22AM -0800, Eric Dumazet wrote:
> > > On Mon, Nov 3, 2025 at 7:09 PM chuang <nashuiliang@gmail.com> wrote:
> > > >
> > > > From 35dbc9abd8da820007391b707bd2c1a9c99ee67d Mon Sep 17 00:00:00 2001
> > > > From: Chuang Wang <nashuiliang@gmail.com>
> > > > Date: Tue, 4 Nov 2025 02:52:11 +0000
> > > > Subject: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding
> > > >  stale fnhe
> > > >
> > > > A race condition exists between fnhe_remove_oldest() and
> > > > rt_bind_exception() where a fnhe that is scheduled for removal can be
> > > > rebound to a new dst.
> > > >
> > > > The issue occurs when fnhe_remove_oldest() selects an fnhe (fnheX)
> > > > for deletion, but before it can be flushed and freed via RCU,
> > > > CPU 0 enters rt_bind_exception() and attempts to reuse the entry.
> > > >
> > > > CPU 0                             CPU 1
> > > > __mkroute_output()
> > > >   find_exception() [fnheX]
> > > >                                   update_or_create_fnhe()
> > > >                                     fnhe_remove_oldest() [fnheX]
> > > >   rt_bind_exception() [bind dst]
> > > >                                   RCU callback [fnheX freed, dst leak]
> > > >
> > > > If rt_bind_exception() successfully binds fnheX to a new dst, the
> > > > newly bound dst will never be properly freed because fnheX will
> > > > soon be released by the RCU callback, leading to a permanent
> > > > reference count leak on the old dst and the device.
> > > >
> > > > This issue manifests as a device reference count leak and a
> > > > warning in dmesg when unregistering the net device:
> > > >
> > > >   unregister_netdevice: waiting for ethX to become free. Usage count = N
> > > >
> > > > Fix this race by clearing 'oldest->fnhe_daddr' before calling
> > > > fnhe_flush_routes(). Since rt_bind_exception() checks this field,
> > > > setting it to zero prevents the stale fnhe from being reused and
> > > > bound to a new dst just before it is freed.
> > > >
> > > > Cc: stable@vger.kernel.org
> > > > Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
> > >
> > > I do not see how this commit added the bug you are looking at ?
> >
> > Not the author, but my understanding is that the issue is that an
> > exception entry which is queued for deletion allows a dst entry to be
> > bound to it. As such, nobody will ever release the reference from the
> > dst entry and the associated net device.
> >
> > Before 67d6d681e15b, exception entries were only queued for deletion by
> > ip_del_fnhe() and it prevented dst entries from binding themselves to
> > the deleted exception entry by clearing 'fnhe->fnhe_daddr' which is
> > checked in rt_bind_exception(). See ee60ad219f5c7.
> >
> > 67d6d681e15b added another point in the code that queues exception
> > entries for deletion, but without clearing 'fnhe->fnhe_daddr' first.
> > Therefore, it added another instance of the bug that was fixed in
> > ee60ad219f5c7.
> >
> > >
> > > > Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
> > > > ---
> > > >  net/ipv4/route.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > > index 6d27d3610c1c..b549d6a57307 100644
> > > > --- a/net/ipv4/route.c
> > > > +++ b/net/ipv4/route.c
> > > > @@ -607,6 +607,11 @@ static void fnhe_remove_oldest(struct
> > > > fnhe_hash_bucket *hash)
> > > >                         oldest_p = fnhe_p;
> > > >                 }
> > > >         }
> > > > +
> > > > +       /* Clear oldest->fnhe_daddr to prevent this fnhe from being
> > > > +        * rebound with new dsts in rt_bind_exception().
> > > > +        */
> > > > +       oldest->fnhe_daddr = 0;
> > > >         fnhe_flush_routes(oldest);
> > > >         *oldest_p = oldest->fnhe_next;
> > > >         kfree_rcu(oldest, rcu);
> > > > --
> > >

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
  2025-11-07  9:53       ` chuang
@ 2025-11-09 15:48         ` Ido Schimmel
  0 siblings, 0 replies; 8+ messages in thread
From: Ido Schimmel @ 2025-11-09 15:48 UTC (permalink / raw)
  To: chuang
  Cc: Eric Dumazet, stable, David S. Miller, David Ahern,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Networking, open list

On Fri, Nov 07, 2025 at 05:53:51PM +0800, chuang wrote:
> Thanks for reviewing the patch. I'm providing the detailed analysis
> and debugging traces below to confirm the root cause and exact
> location of the reference count leak.
> 
> 1. Environment and Symptom
> 
> The issue was consistently reproduced when routing TCP traffic through
> a Software IP Tunnel interface (sit0). The traffic flow  is:
> 
>   APP -> sit0 (IP tunnel) -> outside
> 
> This leads to a reference count leak that prevents the device from
> being freed during unregistration, resulting in the kernel log
> warning:
> 
>   unregister_netdevice: waiting for sit0 to become free. Usage count = N
> 
> 2. Enable refcnt_tracer
> 
> Live-crash analysis identified a stale dst entry retaining a reference
> to sit0. With CONFIG_NET_DEV_REFCNT_TRACKER enabled, the allocation
> stack for the leaked reference was identified:
> 
> [1279559.416854] leaked reference.
> [1279559.416955]  dst_init+0x48/0x100
> [1279559.416965]  dst_alloc+0x66/0xd0
> [1279559.416966]  rt_dst_alloc+0x3c/0xd0
> [1279559.416974]  ip_route_output_key_hash_rcu+0x1d7/0x940
> [1279559.416978]  ip_route_output_key_hash+0x6d/0xa0
> [1279559.416979]  ip_route_output_flow+0x1f/0x70
> [1279559.416980]  __ip_queue_xmit+0x415/0x480
> [1279559.416984]  ip_queue_xmit+0x15/0x20
> [1279559.416986]  __tcp_transmit_skb+0xad4/0xc50
> 
> 3. Pinpointing the Unmatched dst_hold()
> 
> To pinpoint the specific reference not released, we added tracepoints
> to all dst_hold/put functions and used eBPF to record the full
> lifecycle. The tracing identified a hold operation with the following
> call stack:
> 
> do_trace_dst_entry_inc+0x45
> rt_set_nexthop.constprop.0+0x376      /* <<<<<<<<<<<<<<<<<<<< HERE */
> __mkroute_output+0x2B7
> ip_route_output_key_hash_rcu+0xBD
> ip_route_output_key_hash+0x6D
> ip_route_output_flow+0x1F
> inet_sk_rebuild_header+0x19C
> __tcp_retransmit_skb+0x7E
> tcp_retransmit_skb+0x19
> tcp_retransmit_timer+0x3DF
> 
> The address rt_set_nexthop.constprop.0+0x376 corresponds to the
> dst_hold() call inside rt_bind_exception().
> 
> 4. Root Cause Analysis
> 
> The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
> ... -> update_or_create_fnhe(), which lead to fnhe_remove_oldest()
> being called to delete entries exceeding the
> FNHE_RECLAIM_DEPTH+random.
> 
> The race window is between fnhe_remove_oldest() selecting fnheX for
> deletion and the subsequent kfree_rcu(). During this time, the
> concurrent path's __mkroute_output() -> find_exception() can fetch the
> soon-to-be-deleted fnheX, and rt_bind_exception() then binds it with a
> new dst using a dst_hold(). When the original fnheX is freed via RCU,
> the dst reference remains permanently leaked.
> 
> 5. Fix Validation with eBPF
> 
> The patch mitigates this by zeroing fnhe_daddr before the
> RCU-protected deletion steps. This prevents rt_bind_exception() from
> attempting to reuse the entry.
> The fix was validated by probing the rt_bind_exception path (which in
> my environment is optimized to rt_set_nexthop.constprop.0) to catch
> any zeroed but active FNHEs being processed:
> 
> bpftrace -e 'kprobe:rt_set_nexthop.constprop.0
> {
>     $rt = (struct rtable *)arg0;
>     $fnhe = (struct fib_nh_exception *)arg3;
>     $fi = (struct flowi *)arg4;
> 
>     /* Check for an FNHE that is marked for deletion (daddr == 0)
>      * but is still visible/valid (fnhe_expires != 0 and not expired).
>      */
>     if ($fi != 0 && $fnhe != 0 && $fnhe->fnhe_daddr == 0 &&
> $fnhe->fnhe_expires != 0 && $fnhe->fnhe_expires >= jiffies) {
>         printf("rt: %llx, dev: %s, will leak before this patch\n",
> $rt, $rt->dst.dev->name);
>     }
> }'

Thanks for the details. I was able to reproduce the issue with [1] and I
can confirm that it does not reproduce with your fix.

Are you going to submit v2?

[1]
#!/bin/bash

ip netns add ns1
ip -n ns1 link set dev lo up
ip -n ns1 address add 192.0.2.1/32 dev lo
ip -n ns1 link add name dummy1 up type dummy
ip -n ns1 route add 192.0.2.2/32 dev dummy1
ip -n ns1 link add name gretap1 up arp off type gretap local 192.0.2.1 remote 192.0.2.2
ip -n ns1 route add 198.51.0.0/16 dev gretap1
taskset -c 0 ip netns exec ns1 mausezahn gretap1 -A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
taskset -c 2 ip netns exec ns1 mausezahn gretap1 -A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
sleep 10
ip netns pids ns1 | xargs kill
ip netns del ns1

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-11-09 15:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-04  3:09 [PATCH net] ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe chuang
2025-11-05 14:21 ` Ido Schimmel
2025-11-05 14:26 ` Eric Dumazet
2025-11-05 15:34   ` Ido Schimmel
2025-11-05 15:50     ` Eric Dumazet
2025-11-06  0:31     ` chuang
2025-11-07  9:53       ` chuang
2025-11-09 15:48         ` Ido Schimmel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).