* [BUG nexthop] refcount leak in "struct nexthop" handling
@ 2025-12-20 14:57 Tetsuo Handa
  2025-12-20 17:54 ` David Ahern
  0 siblings, 1 reply; 3+ messages in thread
From: Tetsuo Handa @ 2025-12-20 14:57 UTC
  To: David Ahern, David S. Miller, Kuniyuki Iwashima, Eric Dumazet,
	Jakub Kicinski, Network Development

[-- Attachment #1: Type: text/plain, Size: 2335 bytes --]

syzbot is reporting a refcount leak in "struct nexthop" handling,
which manifests as a hang with the message below.

  unregister_netdevice: waiting for lo to become free. Usage count = 2
  ref_tracker: netdev@ffff88803a65e618 has 1/1 users at
       __netdev_tracker_alloc include/linux/netdevice.h:4400 [inline]
       netdev_tracker_alloc include/linux/netdevice.h:4412 [inline]
       netdev_get_by_index+0x7c/0xb0 net/core/dev.c:1008
       fib6_nh_init+0x791/0x1fb0 net/ipv6/route.c:3590
       nh_create_ipv6 net/ipv4/nexthop.c:2875 [inline]
       nexthop_create net/ipv4/nexthop.c:2926 [inline]
       nexthop_add net/ipv4/nexthop.c:2963 [inline]
       rtm_new_nexthop+0x244b/0x87d0 net/ipv4/nexthop.c:3277
       rtnetlink_rcv_msg+0x95e/0xe90 net/core/rtnetlink.c:6958
       netlink_rcv_skb+0x158/0x420 net/netlink/af_netlink.c:2550
       netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
       netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
       netlink_sendmsg+0x8c8/0xdd0 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:727 [inline]
       __sock_sendmsg net/socket.c:742 [inline]
       ____sys_sendmsg+0xa5d/0xc30 net/socket.c:2592
       ___sys_sendmsg+0x134/0x1d0 net/socket.c:2646
       __sys_sendmsg+0x16d/0x220 net/socket.c:2678
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

Commit ab84be7e54fc ("net: Initial nexthop code") says

  Nexthop notifications are sent when a nexthop is added or deleted,
  but NOT if the delete is due to a device event or network namespace
  teardown (which also involves device events).

from which I gather that it is intended behavior that
nexthop_notify(RTM_DELNEXTHOP) is not called from remove_nexthop() on the
cleanup_net() -> ops_undo_list() -> nexthop_net_exit_rtnl() ->
flush_all_nexthops() path, because nlinfo is NULL on that path.

However, as the attached reproducer demonstrates, it is unavoidable that a
userspace process terminates and its network namespace is then torn down
automatically, without an RTM_DELNEXTHOP request ever being issued. The
kernel is not currently prepared for such a scenario. How should this
problem be fixed?

Link: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84

[-- Attachment #2: repro.c --]
[-- Type: text/plain, Size: 5370 bytes --]

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static void execute1(const int fd)
{
	//  sendmsg$nl_route arguments: [
	//    fd: sock_nl_route (resource)
	//    msg: ptr[in, msghdr_netlink[netlink_msg_route]] {
	//      msghdr_netlink[netlink_msg_route] {
	//        addr: nil
	//        addrlen: len = 0x0 (4 bytes)
	//        pad = 0x0 (4 bytes)
	//        vec: ptr[in, iovec[in, netlink_msg_route]] {
	//          iovec[in, netlink_msg_route] {
	//            addr: ptr[in, netlink_msg_route] {
	//              union netlink_msg_route {
	//                ipv6_newnexthop: netlink_msg_t[const[RTM_NEWNEXTHOP, int16],
	//                nhmsg_new[AF_INET6], rtm_nh_policy_new] {
	//                  len: len = 0x1c (4 bytes)
	//                  type: const = 0x68 (2 bytes)
	//                  flags: netlink_msg_flags = 0x5fb9a818fb7378e9 (2 bytes)
	//                  seq: int32 = 0x0 (4 bytes)
	//                  pid: int32 = 0x0 (4 bytes)
	//                  payload: nhmsg[AF_INET6, flags[rtm_protocol, int8],
	//                  flags[rtnh_flags, int32]] {
	//                    nh_family: const = 0xa (1 bytes)
	//                    nh_scope: const = 0x0 (1 bytes)
	//                    nh_protocol: rtm_protocol = 0x0 (1 bytes)
	//                    resvd: const = 0x0 (1 bytes)
	//                    nh_flags: rtnh_flags = 0x0 (4 bytes)
	//                  }
	//                  attrs: array[rtm_nh_policy_new] {
	//                    union rtm_nh_policy_new {
	//                      NHA_BLACKHOLE: nlattr_t[const[NHA_BLACKHOLE, int16],
	//                      void] {
	//                        nla_len: offsetof = 0x4 (2 bytes)
	//                        nla_type: const = 0x4 (2 bytes)
	//                        payload: buffer: {} (length 0x0)
	//                        size: buffer: {} (length 0x0)
	//                      }
	//                    }
	//                  }
	//                }
	//              }
	//            }
	//            len: len = 0x1c (8 bytes)
	//          }
	//        }
	//        vlen: const = 0x1 (8 bytes)
	//        ctrl: const = 0x0 (8 bytes)
	//        ctrllen: const = 0x0 (8 bytes)
	//        f: send_flags = 0x0 (4 bytes)
	//        pad = 0x0 (4 bytes)
	//      }
	//    }
	//    f: send_flags = 0x0 (8 bytes)
	//  ]
	*(uint64_t*)0x200000000100 = 0;
	*(uint32_t*)0x200000000108 = 0;
	*(uint64_t*)0x200000000110 = 0x2000000000c0;
	*(uint64_t*)0x2000000000c0 = 0x200000000140;
	*(uint32_t*)0x200000000140 = 0x1c;
	*(uint16_t*)0x200000000144 = 0x68;
	*(uint16_t*)0x200000000146 = 0x78e9;
	*(uint32_t*)0x200000000148 = 0;
	*(uint32_t*)0x20000000014c = 0;
	*(uint8_t*)0x200000000150 = 0xa;
	*(uint8_t*)0x200000000151 = 0;
	*(uint8_t*)0x200000000152 = 0;
	*(uint8_t*)0x200000000153 = 0;
	*(uint32_t*)0x200000000154 = 0;
	*(uint16_t*)0x200000000158 = 4;
	*(uint16_t*)0x20000000015a = 4;
	*(uint64_t*)0x2000000000c8 = 0x1c;
	*(uint64_t*)0x200000000118 = 1;
	*(uint64_t*)0x200000000120 = 0;
	*(uint64_t*)0x200000000128 = 0;
	*(uint32_t*)0x200000000130 = 0;
	syscall(__NR_sendmsg, /*fd=*/fd, /*msg=*/0x200000000100ul, /*f=*/0ul);
}

static void execute2(const int fd)
{
	//  sendmsg$nl_route arguments: [
	//    fd: sock_nl_route (resource)
	//    msg: ptr[in, msghdr_netlink[netlink_msg_route]] {
	//      msghdr_netlink[netlink_msg_route] {
	//        addr: nil
	//        addrlen: len = 0x0 (4 bytes)
	//        pad = 0x0 (4 bytes)
	//        vec: ptr[in, iovec[in, netlink_msg_route]] {
	//          iovec[in, netlink_msg_route] {
	//            addr: ptr[inout, array[ANYUNION]] {
	//              array[ANYUNION] {
	//                union ANYUNION {
	//                  ANYBLOB: buffer: {30 00 00 00 18 00 dd 8d 00 00 00 00 00
	//                  00 00 00 02 00 00 00 00 00 00 06 00 00 00 00 08 00 1e 00
	//                  02} (length 0x21)
	//                }
	//              }
	//            }
	//            len: len = 0x30 (8 bytes)
	//          }
	//        }
	//        vlen: const = 0x1 (8 bytes)
	//        ctrl: const = 0x0 (8 bytes)
	//        ctrllen: const = 0x0 (8 bytes)
	//        f: send_flags = 0x0 (4 bytes)
	//        pad = 0x0 (4 bytes)
	//      }
	//    }
	//    f: send_flags = 0x4090 (8 bytes)
	//  ]
	*(uint64_t*)0x200000000180 = 0;
	*(uint32_t*)0x200000000188 = 0;
	*(uint64_t*)0x200000000190 = 0x200000000780;
	*(uint64_t*)0x200000000780 = 0x200000000380;
	memcpy((void*)0x200000000380,
	       "\x30\x00\x00\x00\x18\x00\xdd\x8d\x00\x00\x00\x00\x00\x00\x00\x00\x02"
	       "\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x08\x00\x1e\x00\x02",
	       33);
	*(uint64_t*)0x200000000788 = 0x30;
	*(uint64_t*)0x200000000198 = 1;
	*(uint64_t*)0x2000000001a0 = 0;
	*(uint64_t*)0x2000000001a8 = 0;
	*(uint32_t*)0x2000000001b0 = 0;
	syscall(__NR_sendmsg, /*fd=*/fd, /*msg=*/0x200000000180ul,
		/*f=MSG_PROBE|MSG_NOSIGNAL|MSG_EOR*/ 0x4090ul);
}

int main(int argc, char *argv[])
{
	syscall(__NR_mmap, /*addr=*/0x200000000000ul, /*len=*/0x1000000ul,
		/*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
		/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul,
		/*fd=*/(intptr_t)-1, /*offset=*/0ul);
	if (unshare(CLONE_NEWNET))
		return 1;
	int fd = syscall(__NR_socket, /*domain=*/0x10ul, /*type=*/3ul, /*proto=*/0);
	if (fd == -1)
		return 1;
	execute1(fd);
	execute1(fd);
	execute2(fd);
	return 0;
}


* Re: [BUG nexthop] refcount leak in "struct nexthop" handling
  2025-12-20 14:57 [BUG nexthop] refcount leak in "struct nexthop" handling Tetsuo Handa
@ 2025-12-20 17:54 ` David Ahern
  2025-12-20 18:14   ` Ido Schimmel
  0 siblings, 1 reply; 3+ messages in thread
From: David Ahern @ 2025-12-20 17:54 UTC
  To: Tetsuo Handa, David S. Miller, Kuniyuki Iwashima, Eric Dumazet,
	Jakub Kicinski, Network Development, Ido Schimmel

On 12/20/25 7:57 AM, Tetsuo Handa wrote:
> syzbot is reporting refcount leak in "struct nexthop" handling
> which manifests as a hung up with below message.
> 

...

> 
> Commit ab84be7e54fc ("net: Initial nexthop code") says
> 
>   Nexthop notifications are sent when a nexthop is added or deleted,
>   but NOT if the delete is due to a device event or network namespace
>   teardown (which also involves device events).
> 
> which I guess that it is an intended behavior that
> nexthop_notify(RTM_DELNEXTHOP) is not called from remove_nexthop() from
> flush_all_nexthops() from nexthop_net_exit_rtnl() from ops_undo_list()
>  from cleanup_net() because remove_nexthop() passes nlinfo == NULL.
> 
> However, like the attached reproducer demonstrates, it is inevitable that
> a userspace process terminates and network namespace teardown automatically
> happens without explicitly invoking RTM_DELNEXTHOP request. The kernel is
> not currently prepared for such scenario. How to fix this problem?
> 
> Link: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84

Thanks for the report and the reproducer. I am about to go offline for a
week, so I will not have time to look at this until the last few days of
December. Adding Ido in case he has time before then.


* Re: [BUG nexthop] refcount leak in "struct nexthop" handling
  2025-12-20 17:54 ` David Ahern
@ 2025-12-20 18:14   ` Ido Schimmel
  0 siblings, 0 replies; 3+ messages in thread
From: Ido Schimmel @ 2025-12-20 18:14 UTC
  To: David Ahern
  Cc: Tetsuo Handa, David S. Miller, Kuniyuki Iwashima, Eric Dumazet,
	Jakub Kicinski, Network Development

On Sat, Dec 20, 2025 at 10:54:27AM -0700, David Ahern wrote:
> On 12/20/25 7:57 AM, Tetsuo Handa wrote:
> > syzbot is reporting refcount leak in "struct nexthop" handling
> > which manifests as a hung up with below message.
> > 
> 
> ...
> 
> thanks for the report and a reproducer. I am about to go offline for a
> week, so I will not have time to take a look until the last few days of
> December. Adding Ido in case he has time between now and then.

Thanks for the detailed report. The following commands, derived from the
C reproducer, reproduce the issue for me:

ip netns add ns1
ip -n ns1 -6 nexthop add id 1 blackhole
ip -n ns1 route add blackhole 0.0.0.0/0 nhid 1
ip netns del ns1

Nexthops are flushed when the network namespace is dismantled, but the
error route that uses the nexthop (the blackhole route above) does not
release its reference on the nexthop. Therefore, the nexthop is not freed
and does not release its reference on its nexthop device (lo in this case).

The following fixes the issue for me:

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 59a6f0a9638f..7e2c17fec3fc 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2053,10 +2053,11 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 				continue;
 			}
 
-			/* Do not flush error routes if network namespace is
-			 * not being dismantled
+			/* When not flushing the entire table, skip error
+			 * routes that are not marked for deletion.
 			 */
-			if (!flush_all && fib_props[fa->fa_type].error) {
+			if (!flush_all && fib_props[fa->fa_type].error &&
+			    !(fi->fib_flags & RTNH_F_DEAD)) {
 				slen = fa->fa_slen;
 				continue;
 			}

Will post it later this week assuming I don't find problems with it.

