* [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues.
@ 2026-04-25 6:04 Kuniyuki Iwashima
2026-04-25 6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima
2026-04-25 6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima
0 siblings, 2 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2026-04-25 6:04 UTC (permalink / raw)
To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky
Cc: David Ahern, Kuniyuki Iwashima, Kuniyuki Iwashima, linux-rdma
Patch 1 fixes racy allocation/destruction of per-netns UDP
tunnel sockets.
Patch 2 fixes unsafe access to the socket in rxe_find_route6().
Changes:
v2:
Patch 1: Set up UDP tunnels in __net_init instead of adding mutex.
v1: https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/
Kuniyuki Iwashima (2):
RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6().
drivers/infiniband/sw/rxe/rxe.c | 6 --
drivers/infiniband/sw/rxe/rxe_net.c | 137 +++-------------------------
drivers/infiniband/sw/rxe/rxe_net.h | 5 +-
drivers/infiniband/sw/rxe/rxe_ns.c | 97 ++++++++------------
drivers/infiniband/sw/rxe/rxe_ns.h | 1 -
5 files changed, 56 insertions(+), 190 deletions(-)
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
^ permalink raw reply [flat|nested] 22+ messages in thread* [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-25 6:04 [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues Kuniyuki Iwashima @ 2026-04-25 6:04 ` Kuniyuki Iwashima 2026-04-25 15:47 ` David Ahern 2026-04-25 21:25 ` Zhu Yanjun 2026-04-25 6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima 1 sibling, 2 replies; 22+ messages in thread From: Kuniyuki Iwashima @ 2026-04-25 6:04 UTC (permalink / raw) To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky Cc: David Ahern, Kuniyuki Iwashima, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] The problem is ->newlink() and ->dellink() can be called concurrently with no synchronisation, leading sk leak or double free, etc. We defer UDP tunnel allocation to the first device creation, but this would requrie per-netns locking. Let's allocate UDP tunnels in the __init_net hook. Now extra sock_hold() and __sock_put() are no longer needed. Note that rxe_ns_pernet_sk6() is broken and will be fixed in the following patch. [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc000000000d: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] CPU: 3 UID: 0 PID: 12652 Comm: syz.7.1709 Tainted: G L syzkaller #0 PREEMPT(full) Tainted: [L]=SOFTLOCKUP Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:kernel_sock_shutdown+0x47/0x70 net/socket.c:3785 Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 33 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 20 49 8d 7c 24 68 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 1a 49 8b 44 24 68 89 ee 48 89 df 5b 5d 41 5c e9 46 RSP: 0018:ffffc9000566f180 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff888058587240 RCX: 0000000000000000 RDX: 000000000000000d RSI: ffffffff895ced12 RDI: 0000000000000068 RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1006d98945 R10: ffff888036cc4a2b R11: 0000003683c25c00 R12: 0000000000000000 R13: ffff88805c998000 R14: 0000000000000002 R15: 0000000000000018 FS: 00007f1306d976c0(0000) GS:ffff8880d65db000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1306d97d58 CR3: 00000000404f1000 CR4: 0000000000352ef0 DR0: ffffffffffffffff DR1: 00000000000001f8 DR2: 0000000000000002 DR3: ffffffffefffff15 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <TASK> udp_tunnel_sock_release+0x68/0x80 net/ipv4/udp_tunnel_core.c:202 rxe_release_udp_tunnel drivers/infiniband/sw/rxe/rxe_net.c:294 [inline] rxe_sock_put+0xae/0x130 drivers/infiniband/sw/rxe/rxe_net.c:639 rxe_net_del+0x83/0x120 drivers/infiniband/sw/rxe/rxe_net.c:660 rxe_dellink+0x15/0x20 drivers/infiniband/sw/rxe/rxe.c:254 nldev_dellink+0x289/0x3c0 drivers/infiniband/core/nldev.c:1849 rdma_nl_rcv_msg+0x392/0x6f0 drivers/infiniband/core/netlink.c:195 rdma_nl_rcv_skb.constprop.0.isra.0+0x2cb/0x410 drivers/infiniband/core/netlink.c:239 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x585/0x850 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:787 [inline] __sock_sendmsg net/socket.c:802 [inline] ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2698 ___sys_sendmsg+0x190/0x1e0 net/socket.c:2752 __sys_sendmsg+0x170/0x220 net/socket.c:2784 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x10b/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1305f9c819 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f1306d97028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f1306216090 RCX: 00007f1305f9c819 RDX: 0000000000000000 RSI: 00002000000002c0 RDI: 0000000000000003 RBP: 00007f1306032c91 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f1306216128 R14: 00007f1306216090 R15: 00007ffd8ecad288 </TASK> Modules linked in: Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace") Reported-by: syzbot+d8f76778263ab65c2b21@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69ea344f.a00a0220.17a17.0040.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> --- v2: Set up UDP tunnels in __net_init instead of adding mutex. v1: https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/ --- drivers/infiniband/sw/rxe/rxe.c | 6 -- drivers/infiniband/sw/rxe/rxe_net.c | 126 ++-------------------------- drivers/infiniband/sw/rxe/rxe_net.h | 5 +- drivers/infiniband/sw/rxe/rxe_ns.c | 90 +++++++++----------- drivers/infiniband/sw/rxe/rxe_ns.h | 1 - 5 files changed, 47 insertions(+), 181 deletions(-) diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c index b0714f9abe3d..111ba4e57261 100644 --- a/drivers/infiniband/sw/rxe/rxe.c +++ b/drivers/infiniband/sw/rxe/rxe.c @@ -236,10 +236,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev) goto err; } - err = rxe_net_init(ndev); - if (err) - return err; - err = rxe_net_add(ibdev_name, ndev); if (err) { rxe_err("failed to add %s\n", ndev->name); @@ -251,8 +247,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev) static int rxe_dellink(struct ib_device *dev) { - rxe_net_del(dev); - return 0; } diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c index 50a2cb5405e2..9080d4c893a1 100644 --- a/drivers/infiniband/sw/rxe/rxe_net.c +++ b/drivers/infiniband/sw/rxe/rxe_net.c @@ -256,8 +256,8 @@ static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb) return 0; } -static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port, - bool ipv6) +struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port, + bool ipv6) { int err; struct socket *sock; @@ -285,13 +285,12 @@ static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port, /* Setup UDP tunnel */ setup_udp_tunnel_sock(net, sock, &tnl_cfg); - return sock; + return sock->sk; } -static void rxe_release_udp_tunnel(struct socket *sk) +void rxe_release_udp_tunnel(struct sock *sk) { - if (sk) - udp_tunnel_sock_release(sk); + udp_tunnel_sock_release(sk->sk_socket); } static void prepare_udp_hdr(struct sk_buff *skb, __be16 src_port, @@ -629,43 +628,6 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev) return 0; } -static void rxe_sock_put(struct sock *sk, - void (*set_sk)(struct net *, struct sock *), - struct net *net) -{ - if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) { - __sock_put(sk); - } else { - rxe_release_udp_tunnel(sk->sk_socket); - sk = NULL; - set_sk(net, sk); - } -} - -void rxe_net_del(struct ib_device *dev) -{ - struct rxe_dev *rxe = container_of(dev, struct rxe_dev, ib_dev); - struct net_device *ndev; - struct sock *sk; - struct net *net; - - ndev = rxe_ib_device_get_netdev(&rxe->ib_dev); - if (!ndev) - return; - - net = dev_net(ndev); - - sk = rxe_ns_pernet_sk4(net); - if (sk) - rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net); - - sk = rxe_ns_pernet_sk6(net); - if (sk) - rxe_sock_put(sk, rxe_ns_pernet_set_sk6, net); - - dev_put(ndev); -} - static void rxe_port_event(struct rxe_dev *rxe, enum ib_event_type event) { @@ -722,7 +684,6 @@ static int rxe_notify(struct notifier_block *not_blk, switch (event) { case NETDEV_UNREGISTER: ib_unregister_device_queued(&rxe->ib_dev); - rxe_net_del(&rxe->ib_dev); break; case NETDEV_CHANGEMTU: rxe_dbg_dev(rxe, "%s changed mtu to %d\n", ndev->name, ndev->mtu); @@ -752,56 +713,6 @@ static struct notifier_block rxe_net_notifier = { .notifier_call = rxe_notify, }; -static int rxe_net_ipv4_init(struct net *net) -{ - struct sock *sk; - struct socket *sock; - - sk = rxe_ns_pernet_sk4(net); - if (sk) { - sock_hold(sk); - return 0; - } - - sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false); - if (IS_ERR(sock)) { - pr_err("Failed to create IPv4 UDP tunnel\n"); - return -1; - } - rxe_ns_pernet_set_sk4(net, sock->sk); - - return 0; -} - -static int rxe_net_ipv6_init(struct net *net) -{ -#if IS_ENABLED(CONFIG_IPV6) - struct sock *sk; - struct socket *sock; - - sk = rxe_ns_pernet_sk6(net); - if (sk) { - sock_hold(sk); - return 0; - } - - sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true); - if (PTR_ERR(sock) == -EAFNOSUPPORT) { - pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n"); - return 0; - } - - if (IS_ERR(sock)) { - pr_err("Failed to create IPv6 UDP tunnel\n"); - return -1; - } - - rxe_ns_pernet_set_sk6(net, sock->sk); - -#endif - return 0; -} - int rxe_register_notifier(void) { int err; @@ -819,30 +730,3 @@ void rxe_net_exit(void) { unregister_netdevice_notifier(&rxe_net_notifier); } - -int rxe_net_init(struct net_device *ndev) -{ - struct net *net; - struct sock *sk; - int err; - - net = dev_net(ndev); - - err = rxe_net_ipv4_init(net); - if (err) - return err; - - err = rxe_net_ipv6_init(net); - if (err) - goto err_out; - - return 0; - -err_out: - /* If ipv6 error, release ipv4 resource */ - sk = rxe_ns_pernet_sk4(net); - if (sk) - rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net); - - return err; -} diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h index 56249677d692..592b0e577f32 100644 --- a/drivers/infiniband/sw/rxe/rxe_net.h +++ b/drivers/infiniband/sw/rxe/rxe_net.h @@ -11,11 +11,12 @@ #include <net/if_inet6.h> #include <linux/module.h> +struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port, bool ipv6); +void rxe_release_udp_tunnel(struct sock *sk); + int rxe_net_add(const char *ibdev_name, struct net_device *ndev); -void rxe_net_del(struct ib_device *dev); int rxe_register_notifier(void); -int rxe_net_init(struct net_device *ndev); void rxe_net_exit(void); #endif /* RXE_NET_H */ diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c index 8b9d734229b2..06eb2e2387a1 100644 --- a/drivers/infiniband/sw/rxe/rxe_ns.c +++ b/drivers/infiniband/sw/rxe/rxe_ns.c @@ -7,8 +7,10 @@ #include <linux/skbuff.h> #include <linux/pid_namespace.h> #include <net/udp_tunnel.h> +#include <rdma/ib_verbs.h> #include "rxe_ns.h" +#include "rxe_net.h" /* * Per network namespace data @@ -23,40 +25,54 @@ struct rxe_ns_sock { */ static unsigned int rxe_pernet_id; -/* - * Called for every existing and added network namespaces - */ -static int rxe_ns_init(struct net *net) +static __net_init int rxe_ns_init(struct net *net) { - /* defer socket create in the namespace to the first - * device create. - */ + struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); + struct sock *sk; + int err = 0; + + sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false); + if (IS_ERR(sk)) { + err = PTR_ERR(sk); + goto out; + } + + RCU_INIT_POINTER(ns_sk->rxe_sk4, sk); + +#if IS_ENABLED(CONFIG_IPV6) + sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true); + if (IS_ERR(sk)) { + err = PTR_ERR(sk); + if (err == -EAFNOSUPPORT) { + err = 0; + goto out; + } + + sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1); + rxe_release_udp_tunnel(sk); + goto out; + } - return 0; + RCU_INIT_POINTER(ns_sk->rxe_sk6, sk); +#endif +out: + return err; } -static void rxe_ns_exit(struct net *net) +static __net_exit void rxe_ns_exit(struct net *net) { - /* called when the network namespace is removed - */ struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); struct sock *sk; - rcu_read_lock(); - sk = rcu_dereference(ns_sk->rxe_sk4); - rcu_read_unlock(); - if (sk) { - rcu_assign_pointer(ns_sk->rxe_sk4, NULL); - udp_tunnel_sock_release(sk->sk_socket); - } + sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1); + RCU_INIT_POINTER(ns_sk->rxe_sk4, NULL); + rxe_release_udp_tunnel(sk); #if IS_ENABLED(CONFIG_IPV6) - rcu_read_lock(); - sk = rcu_dereference(ns_sk->rxe_sk6); - rcu_read_unlock(); + sk = rcu_dereference_protected(ns_sk->rxe_sk6, 1); if (sk) { - rcu_assign_pointer(ns_sk->rxe_sk6, NULL); - udp_tunnel_sock_release(sk->sk_socket); + RCU_INIT_POINTER(ns_sk->rxe_sk6, NULL); + rxe_release_udp_tunnel(sk); } #endif } @@ -71,26 +87,6 @@ static struct pernet_operations rxe_net_ops = { .size = sizeof(struct rxe_ns_sock), }; -struct sock *rxe_ns_pernet_sk4(struct net *net) -{ - struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); - struct sock *sk; - - rcu_read_lock(); - sk = rcu_dereference(ns_sk->rxe_sk4); - rcu_read_unlock(); - - return sk; -} - -void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk) -{ - struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); - - rcu_assign_pointer(ns_sk->rxe_sk4, sk); - synchronize_rcu(); -} - #if IS_ENABLED(CONFIG_IPV6) struct sock *rxe_ns_pernet_sk6(struct net *net) { @@ -103,14 +99,6 @@ struct sock *rxe_ns_pernet_sk6(struct net *net) return sk; } - -void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk) -{ - struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); - - rcu_assign_pointer(ns_sk->rxe_sk6, sk); - synchronize_rcu(); -} #endif /* IPV6 */ int rxe_namespace_init(void) diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h index 4da2709e6b71..7f48d624fa05 100644 --- a/drivers/infiniband/sw/rxe/rxe_ns.h +++ b/drivers/infiniband/sw/rxe/rxe_ns.h @@ -3,7 +3,6 @@ #ifndef RXE_NS_H #define RXE_NS_H -struct sock *rxe_ns_pernet_sk4(struct net *net); void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk); #if IS_ENABLED(CONFIG_IPV6) -- 2.54.0.rc2.544.gc7ae2d5bb8-goog ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-25 6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima @ 2026-04-25 15:47 ` David Ahern 2026-04-25 20:55 ` Kuniyuki Iwashima 2026-04-25 21:25 ` Zhu Yanjun 1 sibling, 1 reply; 22+ messages in thread From: David Ahern @ 2026-04-25 15:47 UTC (permalink / raw) To: Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky Cc: Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On 4/25/26 12:04 AM, Kuniyuki Iwashima wrote: > syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] > > The problem is ->newlink() and ->dellink() can be called > concurrently with no synchronisation, leading sk leak or > double free, etc. My expectation is that the synchronization is managed by: rdma_nl_rcv_msg() down_read(&rdma_nl_types[index].sem); as the RTNL equivalent. > > We defer UDP tunnel allocation to the first device creation, > but this would requrie per-netns locking. typo: s/requrie/require/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-25 15:47 ` David Ahern @ 2026-04-25 20:55 ` Kuniyuki Iwashima 2026-04-26 16:40 ` David Ahern 0 siblings, 1 reply; 22+ messages in thread From: Kuniyuki Iwashima @ 2026-04-25 20:55 UTC (permalink / raw) To: David Ahern Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On Sat, Apr 25, 2026 at 8:47 AM David Ahern <dsahern@kernel.org> wrote: > > On 4/25/26 12:04 AM, Kuniyuki Iwashima wrote: > > syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] > > > > The problem is ->newlink() and ->dellink() can be called > > concurrently with no synchronisation, leading sk leak or > > double free, etc. > > My expectation is that the synchronization is managed by: > > rdma_nl_rcv_msg() > down_read(&rdma_nl_types[index].sem); > > as the RTNL equivalent. but down_read() is a shared lock and does not work as per-netns exclusive locking. > > > > > We defer UDP tunnel allocation to the first device creation, > > but this would requrie per-netns locking. > > typo: s/requrie/require/ Will fix. Thanks ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-25 20:55 ` Kuniyuki Iwashima @ 2026-04-26 16:40 ` David Ahern 0 siblings, 0 replies; 22+ messages in thread From: David Ahern @ 2026-04-26 16:40 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On 4/25/26 2:55 PM, Kuniyuki Iwashima wrote: > On Sat, Apr 25, 2026 at 8:47 AM David Ahern <dsahern@kernel.org> wrote: >> >> On 4/25/26 12:04 AM, Kuniyuki Iwashima wrote: >>> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] >>> >>> The problem is ->newlink() and ->dellink() can be called >>> concurrently with no synchronisation, leading sk leak or >>> double free, etc. >> >> My expectation is that the synchronization is managed by: >> >> rdma_nl_rcv_msg() >> down_read(&rdma_nl_types[index].sem); >> >> as the RTNL equivalent. > > but down_read() is a shared lock and does not work as > per-netns exclusive locking. Well, that's a face palm moment for me; skimmed that code a bit too quickly when reviewing the rxe patches. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-25 6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima 2026-04-25 15:47 ` David Ahern @ 2026-04-25 21:25 ` Zhu Yanjun 2026-04-26 16:42 ` David Ahern 1 sibling, 1 reply; 22+ messages in thread From: Zhu Yanjun @ 2026-04-25 21:25 UTC (permalink / raw) To: Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, yanjun.zhu@linux.dev Cc: David Ahern, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 在 2026/4/24 23:04, Kuniyuki Iwashima 写道: > syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] > > The problem is ->newlink() and ->dellink() can be called > concurrently with no synchronisation, leading sk leak or > double free, etc. > > We defer UDP tunnel allocation to the first device creation, > but this would requrie per-netns locking. > > Let's allocate UDP tunnels in the __init_net hook. > > Now extra sock_hold() and __sock_put() are no longer needed. > > Note that rxe_ns_pernet_sk6() is broken and will be fixed > in the following patch. > > [0]: > Oops: general protection fault, probably for non-canonical address 0xdffffc000000000d: 0000 [#1] SMP KASAN NOPTI > KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] > CPU: 3 UID: 0 PID: 12652 Comm: syz.7.1709 Tainted: G L syzkaller #0 PREEMPT(full) > Tainted: [L]=SOFTLOCKUP > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > RIP: 0010:kernel_sock_shutdown+0x47/0x70 net/socket.c:3785 > Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 33 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 20 49 8d 7c 24 68 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 1a 49 8b 44 24 68 89 ee 48 89 df 5b 5d 41 5c e9 46 > RSP: 0018:ffffc9000566f180 EFLAGS: 00010202 > RAX: dffffc0000000000 RBX: ffff888058587240 RCX: 0000000000000000 > RDX: 000000000000000d RSI: ffffffff895ced12 RDI: 0000000000000068 > RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1006d98945 > R10: ffff888036cc4a2b R11: 0000003683c25c00 R12: 0000000000000000 > R13: ffff88805c998000 R14: 0000000000000002 R15: 0000000000000018 > FS: 00007f1306d976c0(0000) GS:ffff8880d65db000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f1306d97d58 CR3: 00000000404f1000 CR4: 0000000000352ef0 > DR0: ffffffffffffffff DR1: 00000000000001f8 DR2: 0000000000000002 > DR3: ffffffffefffff15 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > udp_tunnel_sock_release+0x68/0x80 net/ipv4/udp_tunnel_core.c:202 > rxe_release_udp_tunnel drivers/infiniband/sw/rxe/rxe_net.c:294 [inline] > rxe_sock_put+0xae/0x130 drivers/infiniband/sw/rxe/rxe_net.c:639 > rxe_net_del+0x83/0x120 drivers/infiniband/sw/rxe/rxe_net.c:660 > rxe_dellink+0x15/0x20 drivers/infiniband/sw/rxe/rxe.c:254 > nldev_dellink+0x289/0x3c0 drivers/infiniband/core/nldev.c:1849 > rdma_nl_rcv_msg+0x392/0x6f0 drivers/infiniband/core/netlink.c:195 > rdma_nl_rcv_skb.constprop.0.isra.0+0x2cb/0x410 drivers/infiniband/core/netlink.c:239 > netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] > netlink_unicast+0x585/0x850 net/netlink/af_netlink.c:1344 > netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894 > sock_sendmsg_nosec net/socket.c:787 [inline] > __sock_sendmsg net/socket.c:802 [inline] > ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2698 > ___sys_sendmsg+0x190/0x1e0 net/socket.c:2752 > __sys_sendmsg+0x170/0x220 net/socket.c:2784 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0x10b/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f1305f9c819 > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007f1306d97028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e > RAX: ffffffffffffffda RBX: 00007f1306216090 RCX: 00007f1305f9c819 > RDX: 0000000000000000 RSI: 00002000000002c0 RDI: 0000000000000003 > RBP: 00007f1306032c91 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 > R13: 00007f1306216128 R14: 00007f1306216090 R15: 00007ffd8ecad288 > </TASK> > Modules linked in: > > Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace") > Reported-by: syzbot+d8f76778263ab65c2b21@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/all/69ea344f.a00a0220.17a17.0040.GAE@google.com/ > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > --- > v2: Set up UDP tunnels in __net_init instead of adding mutex. > v1: https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/ > --- > drivers/infiniband/sw/rxe/rxe.c | 6 -- > drivers/infiniband/sw/rxe/rxe_net.c | 126 ++-------------------------- > drivers/infiniband/sw/rxe/rxe_net.h | 5 +- > drivers/infiniband/sw/rxe/rxe_ns.c | 90 +++++++++----------- > drivers/infiniband/sw/rxe/rxe_ns.h | 1 - > 5 files changed, 47 insertions(+), 181 deletions(-) > All the commits are functionally correct, but I noticed some regressions when running: make -C tools/testing/selftests/rdma/ TARGET=rdma run_tests After applying this commit, the UDP port 4791 starts listening in both init_net and all other net namespaces as soon as modprobe rdma_rxe is executed. This breaks tests that expect the port to be unoccupied until a device is actually created. I have a workaround to fix the current test failures. However, I think we should also add a new test case to the RDMA selftests. This test should explicitly verify that port 4791 is correctly listening in all namespaces immediately after the module is loaded, reflecting the new architectural change. The workaround: ----------------------------Begin-------------------------------------- diff --git a/tools/testing/selftests/rdma/rxe_ipv6.sh b/tools/testing/selftests/rdma/rxe_ipv6.sh index b7059bfd6d7c..e808d9829752 100755 --- a/tools/testing/selftests/rdma/rxe_ipv6.sh +++ b/tools/testing/selftests/rdma/rxe_ipv6.sh @@ -56,8 +56,8 @@ echo "Verified: Port $PORT is active." echo "Deleting RDMA link..." ip netns exec "$NS_NAME" rdma link del "$RXE_NAME" -if ip netns exec "$NS_NAME" ss -Hul6n sport = :$PORT | grep -q ":$PORT"; then - echo "Error: UDP port $PORT still active after link deletion." +if ! ip netns exec "$NS_NAME" ss -Hul6n sport = :$PORT | grep -q ":$PORT"; then + echo "Error: UDP port $PORT is not active after link deletion." exit 1 fi echo "Verified: Port $PORT closed successfully." diff --git a/tools/testing/selftests/rdma/rxe_socket_with_netns.sh b/tools/testing/selftests/rdma/rxe_socket_with_netns.sh index 002e5098f751..0ad4a8d4d755 100755 --- a/tools/testing/selftests/rdma/rxe_socket_with_netns.sh +++ b/tools/testing/selftests/rdma/rxe_socket_with_netns.sh @@ -68,8 +68,8 @@ echo "Deleting rxe0..." rdma link del rxe0 # Port should now be gone -if ss -Huln sport = :$PORT | grep -q ":$PORT"; then - echo "Error: UDP port $PORT still exists after all links deleted" +if ! ss -Huln sport = :$PORT | grep -q ":$PORT"; then + echo "Error: UDP port $PORT does not exist after all links deleted" exit 1 fi diff --git a/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh b/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh index 021ca451499d..07efe9ea6a71 100755 --- a/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh +++ b/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh @@ -55,8 +55,8 @@ if rdma link show "$RXE_NAME" 2>/dev/null; then exit 1 fi -if ss -Huln sport == :$RDMA_PORT | grep -q ":$RDMA_PORT"; then - echo "Error: UDP port $RDMA_PORT still listening after netdev removal." +if ! ss -Huln sport == :$RDMA_PORT | grep -q ":$RDMA_PORT"; then + echo "Error: UDP port $RDMA_PORT is not listening after netdev removal." exit 1 fi ---------------------------End----------------------------------------- The new testcase is like the following: ----------------------------Begin-------------------------------------- # Load module modprobe rdma_rxe # Check init_net ss -lnup | grep -q ":4791" || echo "Test Failed: Port 4791 not listening in init_net" # Check in a new namespace ip netns add rxe_test ip netns exec rxe_test ss -lnup | grep -q ":4791" || echo "Test Failed: Port 4791 not listening in netns" ---------------------------End----------------------------------------- Just my 2 cent suggestion. It is up to you about how to fix it. To now I am fine with these 2 commits. Please David Ahern, Leon and Jason comment. Thanks a lot. Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Zhu Yanjun > diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c > index b0714f9abe3d..111ba4e57261 100644 > --- a/drivers/infiniband/sw/rxe/rxe.c > +++ b/drivers/infiniband/sw/rxe/rxe.c > @@ -236,10 +236,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev) > goto err; > } > > - err = rxe_net_init(ndev); > - if (err) > - return err; > - > err = rxe_net_add(ibdev_name, ndev); > if (err) { > rxe_err("failed to add %s\n", ndev->name); > @@ -251,8 +247,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev) > > static int rxe_dellink(struct ib_device *dev) > { > - rxe_net_del(dev); > - > return 0; > } > > diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c > index 50a2cb5405e2..9080d4c893a1 100644 > --- a/drivers/infiniband/sw/rxe/rxe_net.c > +++ b/drivers/infiniband/sw/rxe/rxe_net.c > @@ -256,8 +256,8 @@ static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb) > return 0; > } > > -static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port, > - bool ipv6) > +struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port, > + bool ipv6) > { > int err; > struct socket *sock; > @@ -285,13 +285,12 @@ static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port, > /* Setup UDP tunnel */ > setup_udp_tunnel_sock(net, sock, &tnl_cfg); > > - return sock; > + return sock->sk; > } > > -static void rxe_release_udp_tunnel(struct socket *sk) > +void rxe_release_udp_tunnel(struct sock *sk) > { > - if (sk) > - udp_tunnel_sock_release(sk); > + udp_tunnel_sock_release(sk->sk_socket); > } > > static void prepare_udp_hdr(struct sk_buff *skb, __be16 src_port, > @@ -629,43 +628,6 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev) > return 0; > } > > -static void rxe_sock_put(struct sock *sk, > - void (*set_sk)(struct net *, struct sock *), > - struct net *net) > -{ > - if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) { > - __sock_put(sk); > - } else { > - rxe_release_udp_tunnel(sk->sk_socket); > - sk = NULL; > - set_sk(net, sk); > - } > -} > - > -void rxe_net_del(struct ib_device *dev) > -{ > - struct rxe_dev *rxe = container_of(dev, struct rxe_dev, ib_dev); > - struct net_device *ndev; > - struct sock *sk; > - struct net *net; > - > - ndev = rxe_ib_device_get_netdev(&rxe->ib_dev); > - if (!ndev) > - return; > - > - net = dev_net(ndev); > - > - sk = rxe_ns_pernet_sk4(net); > - if (sk) > - rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net); > - > - sk = rxe_ns_pernet_sk6(net); > - if (sk) > - rxe_sock_put(sk, rxe_ns_pernet_set_sk6, net); > - > - dev_put(ndev); > -} > - > static void rxe_port_event(struct rxe_dev *rxe, > enum ib_event_type event) > { > @@ -722,7 +684,6 @@ static int rxe_notify(struct notifier_block *not_blk, > switch (event) { > case NETDEV_UNREGISTER: > ib_unregister_device_queued(&rxe->ib_dev); > - rxe_net_del(&rxe->ib_dev); > break; > case NETDEV_CHANGEMTU: > rxe_dbg_dev(rxe, "%s changed mtu to %d\n", ndev->name, ndev->mtu); > @@ -752,56 +713,6 @@ static struct notifier_block rxe_net_notifier = { > .notifier_call = rxe_notify, > }; > > -static int rxe_net_ipv4_init(struct net *net) > -{ > - struct sock *sk; > - struct socket *sock; > - > - sk = rxe_ns_pernet_sk4(net); > - if (sk) { > - sock_hold(sk); > - return 0; > - } > - > - sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false); > - if (IS_ERR(sock)) { > - pr_err("Failed to create IPv4 UDP tunnel\n"); > - return -1; > - } > - rxe_ns_pernet_set_sk4(net, sock->sk); > - > - return 0; > -} > - > -static int rxe_net_ipv6_init(struct net *net) > -{ > -#if IS_ENABLED(CONFIG_IPV6) > - struct sock *sk; > - struct socket *sock; > - > - sk = rxe_ns_pernet_sk6(net); > - if (sk) { > - sock_hold(sk); > - return 0; > - } > - > - sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true); > - if (PTR_ERR(sock) == -EAFNOSUPPORT) { > - pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n"); > - return 0; > - } > - > - if (IS_ERR(sock)) { > - pr_err("Failed to create IPv6 UDP tunnel\n"); > - return -1; > - } > - > - rxe_ns_pernet_set_sk6(net, sock->sk); > - > -#endif > - return 0; > -} > - > int rxe_register_notifier(void) > { > int err; > @@ -819,30 +730,3 @@ void rxe_net_exit(void) > { > unregister_netdevice_notifier(&rxe_net_notifier); > } > - > -int rxe_net_init(struct net_device *ndev) > -{ > - struct net *net; > - struct sock *sk; > - int err; > - > - net = dev_net(ndev); > - > - err = rxe_net_ipv4_init(net); > - if (err) > - return err; > - > - err = rxe_net_ipv6_init(net); > - if (err) > - goto err_out; > - > - return 0; > - > -err_out: > - /* If ipv6 error, release ipv4 resource */ > - sk = rxe_ns_pernet_sk4(net); > - if (sk) > - rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net); > - > - return err; > -} > diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h > index 56249677d692..592b0e577f32 100644 > --- a/drivers/infiniband/sw/rxe/rxe_net.h > +++ b/drivers/infiniband/sw/rxe/rxe_net.h > @@ -11,11 +11,12 @@ > #include <net/if_inet6.h> > #include <linux/module.h> > > +struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port, bool ipv6); > +void rxe_release_udp_tunnel(struct sock *sk); > + > int rxe_net_add(const char *ibdev_name, struct net_device *ndev); > -void rxe_net_del(struct ib_device *dev); > > int rxe_register_notifier(void); > -int rxe_net_init(struct net_device *ndev); > void rxe_net_exit(void); > > #endif /* RXE_NET_H */ > diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c > index 8b9d734229b2..06eb2e2387a1 100644 > --- a/drivers/infiniband/sw/rxe/rxe_ns.c > +++ b/drivers/infiniband/sw/rxe/rxe_ns.c > @@ -7,8 +7,10 @@ > #include <linux/skbuff.h> > #include <linux/pid_namespace.h> > #include <net/udp_tunnel.h> > +#include <rdma/ib_verbs.h> > > #include "rxe_ns.h" > +#include "rxe_net.h" > > /* > * Per network namespace data > @@ -23,40 +25,54 @@ struct rxe_ns_sock { > */ > static unsigned int rxe_pernet_id; > > -/* > - * Called for every existing and added network namespaces > - */ > -static int rxe_ns_init(struct net *net) > +static __net_init int rxe_ns_init(struct net *net) > { > - /* defer socket create in the namespace to the first > - * device create. > - */ > + struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); > + struct sock *sk; > + int err = 0; > + > + sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false); > + if (IS_ERR(sk)) { > + err = PTR_ERR(sk); > + goto out; > + } > + > + RCU_INIT_POINTER(ns_sk->rxe_sk4, sk); > + > +#if IS_ENABLED(CONFIG_IPV6) > + sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true); > + if (IS_ERR(sk)) { > + err = PTR_ERR(sk); > + if (err == -EAFNOSUPPORT) { > + err = 0; > + goto out; > + } > + > + sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1); > + rxe_release_udp_tunnel(sk); > + goto out; > + } > > - return 0; > + RCU_INIT_POINTER(ns_sk->rxe_sk6, sk); > +#endif > +out: > + return err; > } > > -static void rxe_ns_exit(struct net *net) > +static __net_exit void rxe_ns_exit(struct net *net) > { > - /* called when the network namespace is removed > - */ > struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); > struct sock *sk; > > - rcu_read_lock(); > - sk = rcu_dereference(ns_sk->rxe_sk4); > - rcu_read_unlock(); > - if (sk) { > - rcu_assign_pointer(ns_sk->rxe_sk4, NULL); > - udp_tunnel_sock_release(sk->sk_socket); > - } > + sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1); > + RCU_INIT_POINTER(ns_sk->rxe_sk4, NULL); > + rxe_release_udp_tunnel(sk); > > #if IS_ENABLED(CONFIG_IPV6) > - rcu_read_lock(); > - sk = rcu_dereference(ns_sk->rxe_sk6); > - rcu_read_unlock(); > + sk = rcu_dereference_protected(ns_sk->rxe_sk6, 1); > if (sk) { > - rcu_assign_pointer(ns_sk->rxe_sk6, NULL); > - udp_tunnel_sock_release(sk->sk_socket); > + RCU_INIT_POINTER(ns_sk->rxe_sk6, NULL); > + rxe_release_udp_tunnel(sk); > } > #endif > } > @@ -71,26 +87,6 @@ static struct pernet_operations rxe_net_ops = { > .size = sizeof(struct rxe_ns_sock), > }; > > -struct sock *rxe_ns_pernet_sk4(struct net *net) > -{ > - struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); > - struct sock *sk; > - > - rcu_read_lock(); > - sk = rcu_dereference(ns_sk->rxe_sk4); > - rcu_read_unlock(); > - > - return sk; > -} > - > -void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk) > -{ > - struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); > - > - rcu_assign_pointer(ns_sk->rxe_sk4, sk); > - synchronize_rcu(); > -} > - > #if IS_ENABLED(CONFIG_IPV6) > struct sock *rxe_ns_pernet_sk6(struct net *net) > { > @@ -103,14 +99,6 @@ struct sock *rxe_ns_pernet_sk6(struct net *net) > > return sk; > } > - > -void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk) > -{ > - struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); > - > - rcu_assign_pointer(ns_sk->rxe_sk6, sk); > - synchronize_rcu(); > -} > #endif /* IPV6 */ > > int rxe_namespace_init(void) > diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h > index 4da2709e6b71..7f48d624fa05 100644 > --- a/drivers/infiniband/sw/rxe/rxe_ns.h > +++ b/drivers/infiniband/sw/rxe/rxe_ns.h > @@ -3,7 +3,6 @@ > #ifndef RXE_NS_H > #define RXE_NS_H > > -struct sock *rxe_ns_pernet_sk4(struct net *net); > void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk); > > #if IS_ENABLED(CONFIG_IPV6) ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-25 21:25 ` Zhu Yanjun @ 2026-04-26 16:42 ` David Ahern 2026-04-27 2:57 ` Zhu Yanjun 0 siblings, 1 reply; 22+ messages in thread From: David Ahern @ 2026-04-26 16:42 UTC (permalink / raw) To: Zhu Yanjun, Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky Cc: Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On 4/25/26 3:25 PM, Zhu Yanjun wrote: > 在 2026/4/24 23:04, Kuniyuki Iwashima 写道: >> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] >> >> The problem is ->newlink() and ->dellink() can be called >> concurrently with no synchronisation, leading sk leak or >> double free, etc. >> >> We defer UDP tunnel allocation to the first device creation, >> but this would requrie per-netns locking. >> >> Let's allocate UDP tunnels in the __init_net hook. >> >> Now extra sock_hold() and __sock_put() are no longer needed. >> >> Note that rxe_ns_pernet_sk6() is broken and will be fixed >> in the following patch. >> ... > > All the commits are functionally correct, but I noticed some regressions > when running: > make -C tools/testing/selftests/rdma/ TARGET=rdma run_tests > > After applying this commit, the UDP port 4791 starts listening in both > init_net and all other net namespaces as soon as modprobe rdma_rxe is > executed. This breaks tests that expect the port to be unoccupied until > a device is actually created. Not opening the port until an rxe device is created in that namespace needs to be kept. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-26 16:42 ` David Ahern @ 2026-04-27 2:57 ` Zhu Yanjun 2026-04-27 3:10 ` Kuniyuki Iwashima 0 siblings, 1 reply; 22+ messages in thread From: Zhu Yanjun @ 2026-04-27 2:57 UTC (permalink / raw) To: David Ahern, Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, yanjun.zhu@linux.dev Cc: Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 在 2026/4/26 9:42, David Ahern 写道: > On 4/25/26 3:25 PM, Zhu Yanjun wrote: >> 在 2026/4/24 23:04, Kuniyuki Iwashima 写道: >>> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] >>> >>> The problem is ->newlink() and ->dellink() can be called >>> concurrently with no synchronisation, leading sk leak or >>> double free, etc. >>> >>> We defer UDP tunnel allocation to the first device creation, >>> but this would requrie per-netns locking. >>> >>> Let's allocate UDP tunnels in the __init_net hook. >>> >>> Now extra sock_hold() and __sock_put() are no longer needed. >>> >>> Note that rxe_ns_pernet_sk6() is broken and will be fixed >>> in the following patch. >>> > ... >> >> All the commits are functionally correct, but I noticed some regressions >> when running: >> make -C tools/testing/selftests/rdma/ TARGET=rdma run_tests >> >> After applying this commit, the UDP port 4791 starts listening in both >> init_net and all other net namespaces as soon as modprobe rdma_rxe is >> executed. This breaks tests that expect the port to be unoccupied until >> a device is actually created. > > Not opening the port until an rxe device is created in that namespace > needs to be kept. This change alters the user-visible behavior: the UDP port is now always listening per netns, even without an rxe device. Is this intentional, or should we preserve the previous lazy allocation semantics? Zhu Yanjun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-27 2:57 ` Zhu Yanjun @ 2026-04-27 3:10 ` Kuniyuki Iwashima 2026-04-27 3:53 ` Zhu Yanjun 0 siblings, 1 reply; 22+ messages in thread From: Kuniyuki Iwashima @ 2026-04-27 3:10 UTC (permalink / raw) To: Zhu Yanjun Cc: David Ahern, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On Sun, Apr 26, 2026 at 7:57 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: > > 在 2026/4/26 9:42, David Ahern 写道: > > On 4/25/26 3:25 PM, Zhu Yanjun wrote: > >> 在 2026/4/24 23:04, Kuniyuki Iwashima 写道: > >>> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] > >>> > >>> The problem is ->newlink() and ->dellink() can be called > >>> concurrently with no synchronisation, leading sk leak or > >>> double free, etc. > >>> > >>> We defer UDP tunnel allocation to the first device creation, > >>> but this would requrie per-netns locking. > >>> > >>> Let's allocate UDP tunnels in the __init_net hook. > >>> > >>> Now extra sock_hold() and __sock_put() are no longer needed. > >>> > >>> Note that rxe_ns_pernet_sk6() is broken and will be fixed > >>> in the following patch. > >>> > > ... > >> > >> All the commits are functionally correct, but I noticed some regressions > >> when running: > >> make -C tools/testing/selftests/rdma/ TARGET=rdma run_tests > >> > >> After applying this commit, the UDP port 4791 starts listening in both > >> init_net and all other net namespaces as soon as modprobe rdma_rxe is > >> executed. This breaks tests that expect the port to be unoccupied until > >> a device is actually created. > > > > Not opening the port until an rxe device is created in that namespace > > needs to be kept. > > This change alters the user-visible behavior: > the UDP port is now always listening per netns, > even without an rxe device. > > Is this intentional, or should we preserve the > previous lazy allocation semantics? I did it intentionally because before f1327abd6abe the port was reserved while loading the module, and if the tunnel creation failed, the module was unloaded. Rather I feel f1327abd6abe introduced the user-visible change. That said, I don't have a strong preference, so up to you maintainers. Simple lockless solution vs per-newlink/dellink locking. FWIW, there is another example that reserves a port in every netns while loading the module, see rds_tcp_listen_init(). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-27 3:10 ` Kuniyuki Iwashima @ 2026-04-27 3:53 ` Zhu Yanjun 2026-04-27 14:38 ` David Ahern 0 siblings, 1 reply; 22+ messages in thread From: Zhu Yanjun @ 2026-04-27 3:53 UTC (permalink / raw) To: Kuniyuki Iwashima, Zhu Yanjun Cc: David Ahern, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 在 2026/4/26 20:10, Kuniyuki Iwashima 写道: > On Sun, Apr 26, 2026 at 7:57 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: >> 在 2026/4/26 9:42, David Ahern 写道: >>> On 4/25/26 3:25 PM, Zhu Yanjun wrote: >>>> 在 2026/4/24 23:04, Kuniyuki Iwashima 写道: >>>>> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0] >>>>> >>>>> The problem is ->newlink() and ->dellink() can be called >>>>> concurrently with no synchronisation, leading sk leak or >>>>> double free, etc. >>>>> >>>>> We defer UDP tunnel allocation to the first device creation, >>>>> but this would requrie per-netns locking. >>>>> >>>>> Let's allocate UDP tunnels in the __init_net hook. >>>>> >>>>> Now extra sock_hold() and __sock_put() are no longer needed. >>>>> >>>>> Note that rxe_ns_pernet_sk6() is broken and will be fixed >>>>> in the following patch. >>>>> >>> ... >>>> All the commits are functionally correct, but I noticed some regressions >>>> when running: >>>> make -C tools/testing/selftests/rdma/ TARGET=rdma run_tests >>>> >>>> After applying this commit, the UDP port 4791 starts listening in both >>>> init_net and all other net namespaces as soon as modprobe rdma_rxe is >>>> executed. This breaks tests that expect the port to be unoccupied until >>>> a device is actually created. >>> Not opening the port until an rxe device is created in that namespace >>> needs to be kept. >> This change alters the user-visible behavior: >> the UDP port is now always listening per netns, >> even without an rxe device. >> >> Is this intentional, or should we preserve the >> previous lazy allocation semantics? > I did it intentionally because before f1327abd6abe > the port was reserved while loading the module, and > if the tunnel creation failed, the module was unloaded. > > Rather I feel f1327abd6abe introduced the user-visible > change. > > That said, I don't have a strong preference, so up to you > maintainers. Simple lockless solution vs per-newlink/dellink > locking. To be honest, I do not have a strong preference here, though I lean slightly toward the per-newlink/dellink locking approach. Both the simple lockless solution and the locking approach seem reasonable, depending on whether we prioritize simplicity or explicit synchronization and lifecycle clarity. It would be helpful to get feedback from David Ahern, Leon, and Jason to converge on a final direction. Zhu Yanjun > > FWIW, there is another example that reserves a port in every > netns while loading the module, see rds_tcp_listen_init(). > -- Best Regards, Yanjun.Zhu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-27 3:53 ` Zhu Yanjun @ 2026-04-27 14:38 ` David Ahern 2026-04-27 20:20 ` yanjun.zhu 0 siblings, 1 reply; 22+ messages in thread From: David Ahern @ 2026-04-27 14:38 UTC (permalink / raw) To: Zhu Yanjun, Kuniyuki Iwashima Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On 4/26/26 9:53 PM, Zhu Yanjun wrote: >> That said, I don't have a strong preference, so up to you >> maintainers. Simple lockless solution vs per-newlink/dellink >> locking. > To be honest, I do not have a strong preference here, though > > I lean slightly toward the per-newlink/dellink locking approach. > > Both the simple lockless solution and the locking approach seem > > reasonable, depending on whether we prioritize simplicity or > > explicit synchronization and lifecycle clarity. > > It would be helpful to get feedback from David Ahern, Leon, > > and Jason to converge on a final direction. Going in circles. I have said from the beginning of network namespace support for rxe do not open the port until first rxe device create and once opened leave it open. Simple design that limits socket churn to users wanting to leverage rxe in a network namespace. Zhu Yanjun, if you are going to be a maintainer of a feature you need to have a consistent stance on the architecture and code design. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-27 14:38 ` David Ahern @ 2026-04-27 20:20 ` yanjun.zhu 2026-04-28 0:52 ` Kuniyuki Iwashima 0 siblings, 1 reply; 22+ messages in thread From: yanjun.zhu @ 2026-04-27 20:20 UTC (permalink / raw) To: David Ahern, Kuniyuki Iwashima, Zhu Yanjun Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On 4/27/26 7:38 AM, David Ahern wrote: > On 4/26/26 9:53 PM, Zhu Yanjun wrote: >>> That said, I don't have a strong preference, so up to you >>> maintainers. Simple lockless solution vs per-newlink/dellink >>> locking. >> To be honest, I do not have a strong preference here, though >> >> I lean slightly toward the per-newlink/dellink locking approach. >> >> Both the simple lockless solution and the locking approach seem >> >> reasonable, depending on whether we prioritize simplicity or >> >> explicit synchronization and lifecycle clarity. >> >> It would be helpful to get feedback from David Ahern, Leon, >> >> and Jason to converge on a final direction. > > > Going in circles. I have said from the beginning of network namespace > support for rxe do not open the port until first rxe device create and > once opened leave it open. Simple design that limits socket churn to > users wanting to leverage rxe in a network namespace. Zhu Yanjun, if you > are going to be a maintainer of a feature you need to have a consistent > stance on the architecture and code design. Thanks for the clarification — I see your point about keeping the design simple and avoiding unnecessary socket churn. I agree that deferring the port open until the first RXE device is created, and keeping it open afterwards, provides a cleaner and more predictable model. It also better scopes the overhead to users who actually rely on RXE within a network namespace. My earlier responses were not consistent, and that’s on me. I was exploring different approaches, but I understand that as a maintainer I need to converge on a clear and stable architectural direction. Unless there are strong objections, I will align the implementation with this model and make sure the behavior is consistent across the code paths. I’ll also revisit the current patches to ensure they follow this design principle. Please let me know if there are additional constraints or edge cases I should consider. @Kuniyuki Iwashima, Please follow the per-newlink/dellink locking approach. Thanks a lot. Zhu Yanjun ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-27 20:20 ` yanjun.zhu @ 2026-04-28 0:52 ` Kuniyuki Iwashima 2026-04-28 0:58 ` David Ahern 0 siblings, 1 reply; 22+ messages in thread From: Kuniyuki Iwashima @ 2026-04-28 0:52 UTC (permalink / raw) To: yanjun.zhu Cc: David Ahern, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On Mon, Apr 27, 2026 at 1:20 PM yanjun.zhu <yanjun.zhu@linux.dev> wrote: > > On 4/27/26 7:38 AM, David Ahern wrote: > > On 4/26/26 9:53 PM, Zhu Yanjun wrote: > >>> That said, I don't have a strong preference, so up to you > >>> maintainers. Simple lockless solution vs per-newlink/dellink > >>> locking. > >> To be honest, I do not have a strong preference here, though > >> > >> I lean slightly toward the per-newlink/dellink locking approach. > >> > >> Both the simple lockless solution and the locking approach seem > >> > >> reasonable, depending on whether we prioritize simplicity or > >> > >> explicit synchronization and lifecycle clarity. > >> > >> It would be helpful to get feedback from David Ahern, Leon, > >> > >> and Jason to converge on a final direction. > > > > > > Going in circles. I have said from the beginning of network namespace > > support for rxe do not open the port until first rxe device create and > > once opened leave it open. Simple design that limits socket churn to > > users wanting to leverage rxe in a network namespace. Zhu Yanjun, if you > > are going to be a maintainer of a feature you need to have a consistent > > stance on the architecture and code design. > > Thanks for the clarification — I see your point about keeping the design > simple and avoiding unnecessary socket churn. > > I agree that deferring the port open until the first RXE device is > created, and keeping it open afterwards, provides a cleaner and more > predictable model. It also better scopes the overhead to users who > actually rely on RXE within a network namespace. > > My earlier responses were not consistent, and that’s on me. I was > exploring different approaches, but I understand that as a maintainer I > need to converge on a clear and stable architectural direction. > > Unless there are strong objections, I will align the implementation with > this model and make sure the behavior is consistent across the code > paths. I’ll also revisit the current patches to ensure they follow this > design principle. > > Please let me know if there are additional constraints or edge cases I > should consider. > > @Kuniyuki Iwashima, Please follow the per-newlink/dellink locking > approach. Thanks a lot. To be clear, you meant implementing David' idea, right ? I'm asking because dellink won't need locking then. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-28 0:52 ` Kuniyuki Iwashima @ 2026-04-28 0:58 ` David Ahern 2026-04-28 2:15 ` Zhu Yanjun 0 siblings, 1 reply; 22+ messages in thread From: David Ahern @ 2026-04-28 0:58 UTC (permalink / raw) To: Kuniyuki Iwashima, yanjun.zhu Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On 4/27/26 6:52 PM, Kuniyuki Iwashima wrote: > > To be clear, you meant implementing David' idea, right ? > I'm asking because dellink won't need locking then. dellink is not needed with my suggestion. It was added to manage basically a refcount on the socket to close on last rxe delete in the namespace. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-28 0:58 ` David Ahern @ 2026-04-28 2:15 ` Zhu Yanjun 2026-04-28 5:12 ` Zhu Yanjun 0 siblings, 1 reply; 22+ messages in thread From: Zhu Yanjun @ 2026-04-28 2:15 UTC (permalink / raw) To: David Ahern, Kuniyuki Iwashima Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 在 2026/4/27 17:58, David Ahern 写道: > On 4/27/26 6:52 PM, Kuniyuki Iwashima wrote: >> To be clear, you meant implementing David' idea, right ? >> I'm asking because dellink won't need locking then. > dellink is not needed with my suggestion. It was added to manage > basically a refcount on the socket to close on last rxe delete in the This is my original implementation. @Kuniyuki Iwashima, can you reproduce this problem in your local host or other test environments? If yes, can you make tests after applying the commit in the link: https://patchwork.kernel.org/project/linux-rdma/patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ Thanks a lot. Zhu Yanjun > namespace. -- Best Regards, Yanjun.Zhu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-28 2:15 ` Zhu Yanjun @ 2026-04-28 5:12 ` Zhu Yanjun 2026-04-28 5:22 ` Kuniyuki Iwashima 0 siblings, 1 reply; 22+ messages in thread From: Zhu Yanjun @ 2026-04-28 5:12 UTC (permalink / raw) To: David Ahern, Kuniyuki Iwashima, yanjun.zhu@linux.dev Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 在 2026/4/27 19:15, Zhu Yanjun 写道: > > 在 2026/4/27 17:58, David Ahern 写道: >> On 4/27/26 6:52 PM, Kuniyuki Iwashima wrote: >>> To be clear, you meant implementing David' idea, right ? >>> I'm asking because dellink won't need locking then. >> dellink is not needed with my suggestion. It was added to manage >> basically a refcount on the socket to close on last rxe delete in the > > This is my original implementation. > > @Kuniyuki Iwashima, can you reproduce this problem in your local host or > other test environments? > > If yes, can you make tests after applying the commit in the link: > https://patchwork.kernel.org/project/linux-rdma/ > patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ > > Thanks a lot. Hi, David && Kuniyuki I read the call trace again. If net namespace has already released socket in A thread, then rdma link del command is called in B thread to release socket. So A thread has released socket firstly, then B thread also release socket. The similar call trace would appear. The followiing is the explanation to the commit https://patchwork.kernel.org/project/linux-rdma/patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ The double-free occurs as follows: CPU 0 (Net NameSpace cleanup) CPU 1 (RDMA device removal) --------------------- --------------------------- rxe_ns_exit() rxe_link_delete() (rdma link del ) -> sk = ns_sk->rxe_sk4 -> sk = ns_sk->rxe_sk4 -> udp_tunnel_sock_release(sk) [Success: First Free] -> udp_tunnel_sock_release(sk) [Crash: Double Free] After removing the socket release logic from rxe_ns_exit(), we ensure that only the device destruction path (rxe_link_delete) is responsible for freeing the tunnel sockets, effectively eliminating the double-free problem. I am not sure if this is the root cause or not. Please comment. Thanks a lot. Zhu Yanjun > > Zhu Yanjun > >> namespace. > -- Best Regards, Yanjun.Zhu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-28 5:12 ` Zhu Yanjun @ 2026-04-28 5:22 ` Kuniyuki Iwashima 2026-04-28 6:30 ` Zhu Yanjun 0 siblings, 1 reply; 22+ messages in thread From: Kuniyuki Iwashima @ 2026-04-28 5:22 UTC (permalink / raw) To: Zhu Yanjun Cc: David Ahern, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On Mon, Apr 27, 2026 at 10:12 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: > > > > 在 2026/4/27 19:15, Zhu Yanjun 写道: > > > > 在 2026/4/27 17:58, David Ahern 写道: > >> On 4/27/26 6:52 PM, Kuniyuki Iwashima wrote: > >>> To be clear, you meant implementing David' idea, right ? > >>> I'm asking because dellink won't need locking then. > >> dellink is not needed with my suggestion. It was added to manage > >> basically a refcount on the socket to close on last rxe delete in the > > > > This is my original implementation. > > > > @Kuniyuki Iwashima, can you reproduce this problem in your local host or > > other test environments? The syzbot does not have a repro, but I think it can be reproduced by calling newlink and dellink with multiple threads. newlink would trigger kmemleak splat while dellink trigger KASAN splat. > > > > If yes, can you make tests after applying the commit in the link: > > https://patchwork.kernel.org/project/linux-rdma/ > > patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ > > > > Thanks a lot. > > Hi, David && Kuniyuki > > I read the call trace again. > > If net namespace has already released socket in A thread, then rdma link > del command is called in B thread to release socket. > > So A thread has released socket firstly, then B thread also release socket. > > The similar call trace would appear. > > The followiing is the explanation to the commit > https://patchwork.kernel.org/project/linux-rdma/patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ > > The double-free occurs as follows: > > CPU 0 (Net NameSpace cleanup) CPU 1 (RDMA device removal) > --------------------- --------------------------- > rxe_ns_exit() rxe_link_delete() (rdma link del ) If rxe_link_delete() is in progress, it means the user thread is alive, holding the netns refcount, and rxe_ns_exit() cannot be called. So, dellink() never races with rxe_ns_exit(), and it races only with the concurrent dellink(). And when that occurs, the number of threads is not limited to two, theoretically triple-free, quad-free, ... are possible. > -> sk = ns_sk->rxe_sk4 -> sk = ns_sk->rxe_sk4 > -> udp_tunnel_sock_release(sk) > [Success: First Free] -> udp_tunnel_sock_release(sk) > [Crash: Double Free] > > After removing the socket release logic from rxe_ns_exit(), we ensure > that only the device destruction path (rxe_link_delete) is responsible > for freeing the tunnel sockets, effectively eliminating the double-free > problem. > > I am not sure if this is the root cause or not. > > Please comment. > > Thanks a lot. > Zhu Yanjun > > > > > Zhu Yanjun > > > >> namespace. > > > > -- > Best Regards, > Yanjun.Zhu > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-28 5:22 ` Kuniyuki Iwashima @ 2026-04-28 6:30 ` Zhu Yanjun 2026-04-28 6:39 ` Kuniyuki Iwashima 0 siblings, 1 reply; 22+ messages in thread From: Zhu Yanjun @ 2026-04-28 6:30 UTC (permalink / raw) To: Kuniyuki Iwashima, yanjun.zhu@linux.dev Cc: David Ahern, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 在 2026/4/27 22:22, Kuniyuki Iwashima 写道: > On Mon, Apr 27, 2026 at 10:12 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: >> >> >> >> 在 2026/4/27 19:15, Zhu Yanjun 写道: >>> >>> 在 2026/4/27 17:58, David Ahern 写道: >>>> On 4/27/26 6:52 PM, Kuniyuki Iwashima wrote: >>>>> To be clear, you meant implementing David' idea, right ? >>>>> I'm asking because dellink won't need locking then. >>>> dellink is not needed with my suggestion. It was added to manage >>>> basically a refcount on the socket to close on last rxe delete in the >>> >>> This is my original implementation. >>> >>> @Kuniyuki Iwashima, can you reproduce this problem in your local host or >>> other test environments? > > The syzbot does not have a repro, but I think it can be > reproduced by calling newlink and dellink with multiple > threads. > > newlink would trigger kmemleak splat while dellink trigger > KASAN splat. > > >>> >>> If yes, can you make tests after applying the commit in the link: >>> https://patchwork.kernel.org/project/linux-rdma/ >>> patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ >>> >>> Thanks a lot. >> >> Hi, David && Kuniyuki >> >> I read the call trace again. >> >> If net namespace has already released socket in A thread, then rdma link >> del command is called in B thread to release socket. >> >> So A thread has released socket firstly, then B thread also release socket. >> >> The similar call trace would appear. >> >> The followiing is the explanation to the commit >> https://patchwork.kernel.org/project/linux-rdma/patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ >> >> The double-free occurs as follows: >> >> CPU 0 (Net NameSpace cleanup) CPU 1 (RDMA device removal) >> --------------------- --------------------------- >> rxe_ns_exit() rxe_link_delete() (rdma link del ) > > If rxe_link_delete() is in progress, it means the user thread is > alive, holding the netns refcount, and rxe_ns_exit() cannot be > called. > > So, dellink() never races with rxe_ns_exit(), and it races only > with the concurrent dellink(). > > And when that occurs, the number of threads is not limited to > two, theoretically triple-free, quad-free, ... are possible. Thread 1: rdma link del Thread 2: rdma link del (User A calls dellink) (User B calls dellink) | | (1) Get Socket Pointer (2) Get Socket Pointer sk = ns_sk->rxe_sk4 sk = ns_sk->rxe_sk4 | | (3) Release Socket (4) Release Socket udp_tunnel_sock_release(sk) udp_tunnel_sock_release(sk) | | [ FIRST FREE ] | | [ DOUBLE FREE! ] v v (Memory freed) (Kernel Panic / Crash) I think the above should explain your idea. If so, your solution makes senses to add a per-netns mutex to synchronise. Let us use the first solution https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/ BTW, 1) add mutex_destroy 2) take into account of rdma link add. I am not sure if it is OK or not. @David Ahern Thanks. Zhu Yanjun > > >> -> sk = ns_sk->rxe_sk4 -> sk = ns_sk->rxe_sk4 >> -> udp_tunnel_sock_release(sk) >> [Success: First Free] -> udp_tunnel_sock_release(sk) >> [Crash: Double Free] >> >> After removing the socket release logic from rxe_ns_exit(), we ensure >> that only the device destruction path (rxe_link_delete) is responsible >> for freeing the tunnel sockets, effectively eliminating the double-free >> problem. >> >> I am not sure if this is the root cause or not. >> >> Please comment. >> >> Thanks a lot. >> Zhu Yanjun >> >>> >>> Zhu Yanjun >>> >>>> namespace. >>> >> >> -- >> Best Regards, >> Yanjun.Zhu >> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-28 6:30 ` Zhu Yanjun @ 2026-04-28 6:39 ` Kuniyuki Iwashima 2026-04-28 16:56 ` yanjun.zhu 0 siblings, 1 reply; 22+ messages in thread From: Kuniyuki Iwashima @ 2026-04-28 6:39 UTC (permalink / raw) To: Zhu Yanjun Cc: David Ahern, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On Mon, Apr 27, 2026 at 11:30 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: > > 在 2026/4/27 22:22, Kuniyuki Iwashima 写道: > > On Mon, Apr 27, 2026 at 10:12 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: > >> > >> > >> > >> 在 2026/4/27 19:15, Zhu Yanjun 写道: > >>> > >>> 在 2026/4/27 17:58, David Ahern 写道: > >>>> On 4/27/26 6:52 PM, Kuniyuki Iwashima wrote: > >>>>> To be clear, you meant implementing David' idea, right ? > >>>>> I'm asking because dellink won't need locking then. > >>>> dellink is not needed with my suggestion. It was added to manage > >>>> basically a refcount on the socket to close on last rxe delete in the > >>> > >>> This is my original implementation. > >>> > >>> @Kuniyuki Iwashima, can you reproduce this problem in your local host or > >>> other test environments? > > > > The syzbot does not have a repro, but I think it can be > > reproduced by calling newlink and dellink with multiple > > threads. > > > > newlink would trigger kmemleak splat while dellink trigger > > KASAN splat. > > > > > >>> > >>> If yes, can you make tests after applying the commit in the link: > >>> https://patchwork.kernel.org/project/linux-rdma/ > >>> patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ > >>> > >>> Thanks a lot. > >> > >> Hi, David && Kuniyuki > >> > >> I read the call trace again. > >> > >> If net namespace has already released socket in A thread, then rdma link > >> del command is called in B thread to release socket. > >> > >> So A thread has released socket firstly, then B thread also release socket. > >> > >> The similar call trace would appear. > >> > >> The followiing is the explanation to the commit > >> https://patchwork.kernel.org/project/linux-rdma/patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ > >> > >> The double-free occurs as follows: > >> > >> CPU 0 (Net NameSpace cleanup) CPU 1 (RDMA device removal) > >> --------------------- --------------------------- > >> rxe_ns_exit() rxe_link_delete() (rdma link del ) > > > > If rxe_link_delete() is in progress, it means the user thread is > > alive, holding the netns refcount, and rxe_ns_exit() cannot be > > called. > > > > So, dellink() never races with rxe_ns_exit(), and it races only > > with the concurrent dellink(). > > > > And when that occurs, the number of threads is not limited to > > two, theoretically triple-free, quad-free, ... are possible. > > Thread 1: rdma link del Thread 2: rdma link del > (User A calls dellink) (User B calls dellink) > | | > (1) Get Socket Pointer (2) Get Socket Pointer > sk = ns_sk->rxe_sk4 sk = ns_sk->rxe_sk4 > | | > (3) Release Socket (4) Release Socket > udp_tunnel_sock_release(sk) udp_tunnel_sock_release(sk) > | | > [ FIRST FREE ] | > | [ DOUBLE FREE! ] > v v > (Memory freed) (Kernel Panic / Crash) > > I think the above should explain your idea. If so, your solution makes > senses to add a per-netns mutex to synchronise. > > Let us use the first solution > https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/ > > BTW, 1) add mutex_destroy 2) take into account of rdma link add. > > I am not sure if it is OK or not. @David Ahern No, newlink is still racy and the same kind of race leaks the udp tunnel. If we defer allocation, there are two options: 1. David's idea, allocate on first use, and no free until netns destruction (newlink can add a fast path like check the pointer and only take mutex when it's NULL, and check again under mutex and allcoate a tunnel if not yet allocated) 2. Manage refcount properly. (If we allocate a dedicated refcount for each tunnel socket in rxe_ns_sock, we can implement a similar fast path for newlink, and dellink will be lockless thanks to atomic) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown(). 2026-04-28 6:39 ` Kuniyuki Iwashima @ 2026-04-28 16:56 ` yanjun.zhu 0 siblings, 0 replies; 22+ messages in thread From: yanjun.zhu @ 2026-04-28 16:56 UTC (permalink / raw) To: Kuniyuki Iwashima, Zhu Yanjun Cc: David Ahern, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21 On 4/27/26 11:39 PM, Kuniyuki Iwashima wrote: > On Mon, Apr 27, 2026 at 11:30 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: >> >> 在 2026/4/27 22:22, Kuniyuki Iwashima 写道: >>> On Mon, Apr 27, 2026 at 10:12 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote: >>>> >>>> >>>> >>>> 在 2026/4/27 19:15, Zhu Yanjun 写道: >>>>> >>>>> 在 2026/4/27 17:58, David Ahern 写道: >>>>>> On 4/27/26 6:52 PM, Kuniyuki Iwashima wrote: >>>>>>> To be clear, you meant implementing David' idea, right ? >>>>>>> I'm asking because dellink won't need locking then. >>>>>> dellink is not needed with my suggestion. It was added to manage >>>>>> basically a refcount on the socket to close on last rxe delete in the >>>>> >>>>> This is my original implementation. >>>>> >>>>> @Kuniyuki Iwashima, can you reproduce this problem in your local host or >>>>> other test environments? >>> >>> The syzbot does not have a repro, but I think it can be >>> reproduced by calling newlink and dellink with multiple >>> threads. >>> >>> newlink would trigger kmemleak splat while dellink trigger >>> KASAN splat. >>> >>> >>>>> >>>>> If yes, can you make tests after applying the commit in the link: >>>>> https://patchwork.kernel.org/project/linux-rdma/ >>>>> patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ >>>>> >>>>> Thanks a lot. >>>> >>>> Hi, David && Kuniyuki >>>> >>>> I read the call trace again. >>>> >>>> If net namespace has already released socket in A thread, then rdma link >>>> del command is called in B thread to release socket. >>>> >>>> So A thread has released socket firstly, then B thread also release socket. >>>> >>>> The similar call trace would appear. >>>> >>>> The followiing is the explanation to the commit >>>> https://patchwork.kernel.org/project/linux-rdma/patch/20260424043522.22901-1-yanjun.zhu@linux.dev/ >>>> >>>> The double-free occurs as follows: >>>> >>>> CPU 0 (Net NameSpace cleanup) CPU 1 (RDMA device removal) >>>> --------------------- --------------------------- >>>> rxe_ns_exit() rxe_link_delete() (rdma link del ) >>> >>> If rxe_link_delete() is in progress, it means the user thread is >>> alive, holding the netns refcount, and rxe_ns_exit() cannot be >>> called. >>> >>> So, dellink() never races with rxe_ns_exit(), and it races only >>> with the concurrent dellink(). >>> >>> And when that occurs, the number of threads is not limited to >>> two, theoretically triple-free, quad-free, ... are possible. >> >> Thread 1: rdma link del Thread 2: rdma link del >> (User A calls dellink) (User B calls dellink) >> | | >> (1) Get Socket Pointer (2) Get Socket Pointer >> sk = ns_sk->rxe_sk4 sk = ns_sk->rxe_sk4 >> | | >> (3) Release Socket (4) Release Socket >> udp_tunnel_sock_release(sk) udp_tunnel_sock_release(sk) >> | | >> [ FIRST FREE ] | >> | [ DOUBLE FREE! ] >> v v >> (Memory freed) (Kernel Panic / Crash) >> >> I think the above should explain your idea. If so, your solution makes >> senses to add a per-netns mutex to synchronise. >> >> Let us use the first solution >> https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/ >> >> BTW, 1) add mutex_destroy 2) take into account of rdma link add. >> >> I am not sure if it is OK or not. @David Ahern > > No, newlink is still racy and the same kind of race leaks > the udp tunnel. > > If we defer allocation, there are two options: > > 1. David's idea, allocate on first use, and no free > until netns destruction (newlink can add a fast path > like check the pointer and only take mutex when it's > NULL, and check again under mutex and allcoate a > tunnel if not yet allocated) > > 2. Manage refcount properly. (If we allocate a dedicated > refcount for each tunnel socket in rxe_ns_sock, we > can implement a similar fast path for newlink, and dellink > will be lockless thanks to atomic) I suggest we stick with Option 2 (proper refcounting) but move away from a purely lockless dellink. By protecting the tunnel destruction with a mutex, we can effectively close the race window and ensure the UDP tunnel is cleaned up reliably without compromising the efficiency of the fast path in newlink. Zhu Yanjun ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6(). 2026-04-25 6:04 [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues Kuniyuki Iwashima 2026-04-25 6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima @ 2026-04-25 6:04 ` Kuniyuki Iwashima 2026-04-25 21:26 ` Zhu Yanjun 1 sibling, 1 reply; 22+ messages in thread From: Kuniyuki Iwashima @ 2026-04-25 6:04 UTC (permalink / raw) To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky Cc: David Ahern, Kuniyuki Iwashima, Kuniyuki Iwashima, linux-rdma rxe_ns_pernet_sk6() is fundamentally broken. rcu_read_lock() only silences rcu_dereference() splat. The returned socket is no longer protected, and it may be freed during ip6_dst_lookup_flow(). Let's call rxe_ns_pernet_sk6() and ip6_dst_lookup_flow() under RCU. Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> --- drivers/infiniband/sw/rxe/rxe_net.c | 11 ++++++++--- drivers/infiniband/sw/rxe/rxe_ns.c | 7 +------ 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c index 9080d4c893a1..8fca5c24c8b1 100644 --- a/drivers/infiniband/sw/rxe/rxe_net.c +++ b/drivers/infiniband/sw/rxe/rxe_net.c @@ -133,16 +133,21 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp, struct in6_addr *saddr, struct in6_addr *daddr) { - struct dst_entry *ndst; + struct dst_entry *ndst = NULL; struct flowi6 fl6 = {}; + struct sock *sk; fl6.flowi6_oif = ndev->ifindex; memcpy(&fl6.saddr, saddr, sizeof(*saddr)); memcpy(&fl6.daddr, daddr, sizeof(*daddr)); fl6.flowi6_proto = IPPROTO_UDP; - ndst = ip6_dst_lookup_flow(net, rxe_ns_pernet_sk6(net), &fl6, NULL); - if (IS_ERR(ndst)) { + rcu_read_lock(); + sk = rxe_ns_pernet_sk6(net); + if (sk) + ndst = ip6_dst_lookup_flow(net, sk, &fl6, NULL); + rcu_read_unlock(); + if (IS_ERR_OR_NULL(ndst)) { rxe_dbg_qp(qp, "no route to %pI6\n", daddr); return NULL; } diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c index 06eb2e2387a1..ef408ffc0558 100644 --- a/drivers/infiniband/sw/rxe/rxe_ns.c +++ b/drivers/infiniband/sw/rxe/rxe_ns.c @@ -91,13 +91,8 @@ static struct pernet_operations rxe_net_ops = { struct sock *rxe_ns_pernet_sk6(struct net *net) { struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); - struct sock *sk; - - rcu_read_lock(); - sk = rcu_dereference(ns_sk->rxe_sk6); - rcu_read_unlock(); - return sk; + return rcu_dereference(ns_sk->rxe_sk6); } #endif /* IPV6 */ -- 2.54.0.rc2.544.gc7ae2d5bb8-goog ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6(). 2026-04-25 6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima @ 2026-04-25 21:26 ` Zhu Yanjun 0 siblings, 0 replies; 22+ messages in thread From: Zhu Yanjun @ 2026-04-25 21:26 UTC (permalink / raw) To: Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, yanjun.zhu@linux.dev Cc: David Ahern, Kuniyuki Iwashima, linux-rdma 在 2026/4/24 23:04, Kuniyuki Iwashima 写道: > rxe_ns_pernet_sk6() is fundamentally broken. > > rcu_read_lock() only silences rcu_dereference() splat. > > The returned socket is no longer protected, and it may be > freed during ip6_dst_lookup_flow(). > > Let's call rxe_ns_pernet_sk6() and ip6_dst_lookup_flow() > under RCU. > > Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace") > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Thanks a lot. Please David Ahern, Leon and Jason comment. Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Zhu Yanjun > --- > drivers/infiniband/sw/rxe/rxe_net.c | 11 ++++++++--- > drivers/infiniband/sw/rxe/rxe_ns.c | 7 +------ > 2 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c > index 9080d4c893a1..8fca5c24c8b1 100644 > --- a/drivers/infiniband/sw/rxe/rxe_net.c > +++ b/drivers/infiniband/sw/rxe/rxe_net.c > @@ -133,16 +133,21 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp, > struct in6_addr *saddr, > struct in6_addr *daddr) > { > - struct dst_entry *ndst; > + struct dst_entry *ndst = NULL; > struct flowi6 fl6 = {}; > + struct sock *sk; > > fl6.flowi6_oif = ndev->ifindex; > memcpy(&fl6.saddr, saddr, sizeof(*saddr)); > memcpy(&fl6.daddr, daddr, sizeof(*daddr)); > fl6.flowi6_proto = IPPROTO_UDP; > > - ndst = ip6_dst_lookup_flow(net, rxe_ns_pernet_sk6(net), &fl6, NULL); > - if (IS_ERR(ndst)) { > + rcu_read_lock(); > + sk = rxe_ns_pernet_sk6(net); > + if (sk) > + ndst = ip6_dst_lookup_flow(net, sk, &fl6, NULL); > + rcu_read_unlock(); > + if (IS_ERR_OR_NULL(ndst)) { > rxe_dbg_qp(qp, "no route to %pI6\n", daddr); > return NULL; > } > diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c > index 06eb2e2387a1..ef408ffc0558 100644 > --- a/drivers/infiniband/sw/rxe/rxe_ns.c > +++ b/drivers/infiniband/sw/rxe/rxe_ns.c > @@ -91,13 +91,8 @@ static struct pernet_operations rxe_net_ops = { > struct sock *rxe_ns_pernet_sk6(struct net *net) > { > struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id); > - struct sock *sk; > - > - rcu_read_lock(); > - sk = rcu_dereference(ns_sk->rxe_sk6); > - rcu_read_unlock(); > > - return sk; > + return rcu_dereference(ns_sk->rxe_sk6); > } > #endif /* IPV6 */ > ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2026-04-28 16:56 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-25 6:04 [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues Kuniyuki Iwashima 2026-04-25 6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima 2026-04-25 15:47 ` David Ahern 2026-04-25 20:55 ` Kuniyuki Iwashima 2026-04-26 16:40 ` David Ahern 2026-04-25 21:25 ` Zhu Yanjun 2026-04-26 16:42 ` David Ahern 2026-04-27 2:57 ` Zhu Yanjun 2026-04-27 3:10 ` Kuniyuki Iwashima 2026-04-27 3:53 ` Zhu Yanjun 2026-04-27 14:38 ` David Ahern 2026-04-27 20:20 ` yanjun.zhu 2026-04-28 0:52 ` Kuniyuki Iwashima 2026-04-28 0:58 ` David Ahern 2026-04-28 2:15 ` Zhu Yanjun 2026-04-28 5:12 ` Zhu Yanjun 2026-04-28 5:22 ` Kuniyuki Iwashima 2026-04-28 6:30 ` Zhu Yanjun 2026-04-28 6:39 ` Kuniyuki Iwashima 2026-04-28 16:56 ` yanjun.zhu 2026-04-25 6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima 2026-04-25 21:26 ` Zhu Yanjun
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.