[PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace

public inbox for linux-rdma@vger.kernel.org

* [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
@ 2026-03-04  4:16 Zhu Yanjun
  2026-03-04  4:44 ` Zhu Yanjun
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Zhu Yanjun @ 2026-03-04  4:16 UTC
  To: jgg, leon, zyjzyj2000, dsahern, linux-rdma, yanjun.zhu

When the "rdma link add" command is run to add an rxe rdma link in a
net namespace, the rxe rdma link normally cannot work in that net
namespace.

The root cause is that the sock listening on UDP port 4791 is created
in init_net when the rdma_rxe module is loaded into the kernel. It is
difficult for other net namespaces to use this sock.

The following commits solve this problem.

The first commit moves the creation of the sock listening on UDP port
4791 from the module_init function to the rdma link creation functions.
That is, the sock is no longer created when the rdma_rxe module is
loaded. It is created when the "rdma link add ..." command is run. So
when an rdma link is created in a net namespace, the sock is created
in that net namespace.

In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
check whether the sock already exists in the net namespace. If it does,
the rdma link increases the reference count of this sock and continues
with the other jobs instead of creating a new sock to listen on UDP
port 4791. Since the network notifier is global, it is registered when
the rdma_rxe module is loaded.
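
The reuse-or-create step described above can be sketched in userspace
C. This is only a model, not the kernel code: struct model_sock,
struct model_netns, model_link_add() and MODEL_TUNNEL_REFS are
hypothetical names, and the baseline count of 2 is taken from the
SK_REF_FOR_TUNNEL value used in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel sock: only the refcount is modeled. */
struct model_sock {
	int refcnt;
};

/* Hypothetical per-namespace slot for the listener on UDP port 4791. */
struct model_netns {
	struct model_sock *sk4;
};

/* Assumed baseline references held for the tunnel itself. */
#define MODEL_TUNNEL_REFS 2

/*
 * Model of one "rdma link add" in namespace ns: take an extra reference
 * if the listener already exists, otherwise create it at the baseline
 * count. Returns 0 on success, -1 if the allocation fails.
 */
static int model_link_add(struct model_netns *ns)
{
	assert(ns != NULL);

	if (ns->sk4) {
		ns->sk4->refcnt++;
		return 0;
	}

	ns->sk4 = malloc(sizeof(*ns->sk4));
	if (!ns->sk4)
		return -1;

	ns->sk4->refcnt = MODEL_TUNNEL_REFS;
	return 0;
}
```

In this model, two links added in the same namespace share one listener,
while a second namespace keeps its own empty slot until a link is added
there.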

After the rdma link is created, the command "rdma link del" deletes the
rdma link and, at the same time, checks the sock. If the reference
count of this sock is greater than the sock reference count needed by
the udp tunnel, the sock reference count is decreased by one. If it is
equal, this rdma link is the last one, so the udp tunnel is shut down
and the sock is closed. This work should be implemented in a dellink
function, but rxe currently has no dellink function. So the 3rd commit
adds the dellink function pointer, and the 4th commit implements the
dellink function in rxe.
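
The delete-side rule above (drop one reference, or close the listener
when only the tunnel's own references remain) can be modeled in
userspace C. The names model_sock, model_netns, model_link_del() and
MODEL_TUNNEL_REFS are hypothetical, and the threshold of 2 mirrors
SK_REF_FOR_TUNNEL from this patch. It is a sketch of the counting rule
only and ignores locking and concurrency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel sock: only the refcount is modeled. */
struct model_sock {
	int refcnt;
};

/* Hypothetical per-namespace slot for the listener on UDP port 4791. */
struct model_netns {
	struct model_sock *sk4;
};

/* Assumed number of references the tunnel itself needs. */
#define MODEL_TUNNEL_REFS 2

/*
 * Model of one "rdma link del" in namespace ns.
 * Returns 1 if this was the last link and the listener was closed,
 * 0 if a reference was dropped or there was no listener.
 */
static int model_link_del(struct model_netns *ns)
{
	assert(ns != NULL);

	if (!ns->sk4)
		return 0;

	if (ns->sk4->refcnt > MODEL_TUNNEL_REFS) {
		/* Other links still use the listener: drop one reference. */
		ns->sk4->refcnt--;
		return 0;
	}

	/* Last link: close the listener and clear the namespace slot. */
	free(ns->sk4);
	ns->sk4 = NULL;
	return 1;
}
```

With a starting count of MODEL_TUNNEL_REFS + 1 (two links), the first
delete only decrements and the second delete closes the listener.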

At this point, it is no longer necessary to keep a global variable to
store the sock listening on UDP port 4791. This global variable can be
replaced entirely by the functions udp4_lib_lookup and udp6_lib_lookup.
Because the function udp6_lib_lookup is in the fast path, a member
variable l_sk6 is added to store the sock. If l_sk6 is NULL,
udp6_lib_lookup is called to look up the sock, and the sock is stored
in l_sk6 so that it can be used directly in the future.
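
The l_sk6 caching idea can be illustrated with a small userspace model.
Everything here is hypothetical (model_table, model_slow_lookup(),
model_get_sk6()); model_slow_lookup() merely stands in for an expensive
lookup such as udp6_lib_lookup and does not reproduce its signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sock, identified only by its port. */
struct model_sock {
	unsigned short port;
};

/* Hypothetical table of listening socks in one namespace. */
static struct model_sock model_table[] = { { 53 }, { 4791 } };

/* Counts how often the slow path runs, to make the caching visible. */
static int model_lookup_calls;

/* Slow path: linear search standing in for a real sock lookup. */
static struct model_sock *model_slow_lookup(unsigned short port)
{
	size_t i;

	model_lookup_calls++;
	for (i = 0; i < sizeof(model_table) / sizeof(model_table[0]); i++) {
		if (model_table[i].port == port)
			return &model_table[i];
	}
	return NULL;
}

/* Hypothetical device with a cached pointer, like the l_sk6 member. */
struct model_dev {
	struct model_sock *l_sk6;
};

/* Fast path: look up once, then reuse the cached pointer. */
static struct model_sock *model_get_sk6(struct model_dev *dev)
{
	assert(dev != NULL);

	if (!dev->l_sk6)
		dev->l_sk6 = model_slow_lookup(4791);
	return dev->l_sk6;
}
```

Calling model_get_sk6() twice on the same device runs the slow lookup
only once. The model does not address invalidating the cached pointer
when the sock goes away.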

All the above work has been done in init_net, and it can also work in
other net namespaces. So init_net is replaced by the individual net
namespace. This is what the 6th commit does. Because an rxe device
depends on the net device and the sock listening on UDP port 4791,
every rxe device is in exclusive mode in its individual net namespace.
Other rdma netns operations will be considered in the future.

The 7th commit adds the register_pernet_subsys/unregister_pernet_subsys
functions. When a new net namespace is created, the init function
initializes the sk4 and sk6 socks. The 2 socks are released when the
net namespace is destroyed. The functions
rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6
handle sk6. sk4 and sk6 are then used in the previous commits.
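
The per-namespace lifecycle described above (init clears the slots,
accessors get and set them, exit releases what is left) can be modeled
in userspace C. All names below are hypothetical, and the model omits
RCU and the net_generic() indirection that the patch uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sock: only an open/closed flag is modeled. */
struct model_sock {
	int open;
};

/* Hypothetical per-namespace data, shaped like struct rxe_ns_sock. */
struct model_ns_sock {
	struct model_sock *sk4;
	struct model_sock *sk6;
};

/* Hypothetical namespace that embeds the per-namespace data directly. */
struct model_net {
	struct model_ns_sock rxe;
};

/* Runs when a namespace is created: both slots start out empty. */
static void model_ns_init(struct model_net *net)
{
	assert(net != NULL);
	net->rxe.sk4 = NULL;
	net->rxe.sk6 = NULL;
}

static struct model_sock *model_pernet_sk4(const struct model_net *net)
{
	return net->rxe.sk4;
}

static void model_pernet_set_sk4(struct model_net *net, struct model_sock *sk)
{
	net->rxe.sk4 = sk;
}

static struct model_sock *model_pernet_sk6(const struct model_net *net)
{
	return net->rxe.sk6;
}

static void model_pernet_set_sk6(struct model_net *net, struct model_sock *sk)
{
	net->rxe.sk6 = sk;
}

/*
 * Runs when a namespace is destroyed: close any sock still stored and
 * clear its slot. Returns the number of socks closed.
 */
static int model_ns_exit(struct model_net *net)
{
	int closed = 0;

	if (net->rxe.sk4) {
		net->rxe.sk4->open = 0;
		net->rxe.sk4 = NULL;
		closed++;
	}
	if (net->rxe.sk6) {
		net->rxe.sk6->open = 0;
		net->rxe.sk6 = NULL;
		closed++;
	}
	return closed;
}
```

A sock stored in one namespace is not visible through the accessors of
another namespace, which is the isolation this patch relies on.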

As the sk4 and sk6 in the pernet namespace can be accessed, it is not
necessary to add a new l_sk6. As such, the 8th commit replaces l_sk6
with the sk6 in the pernet namespace.

Test steps:
1) Suppose that 2 NICs are in 2 different net namespaces.

  # ip netns exec net0 ip link
  3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
     link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
     altname enp5s0

  # ip netns exec net1 ip link
  4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
     link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff

2) Add an rdma link in each net namespace.
    net0:
    # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2

    net1:
    # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3

3) Run rping test.
    net0
    # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
    [1] 1737
    # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
    verbose
    count 1
    ...
    ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
    ...

4) Remove the rdma links from the net namespaces.
    net0:
    # ip netns exec net0 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
    UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
    UNCONN    0         0         [::]:4791             [::]:*

    # ip netns exec net0 rdma link del rxe0

    # ip netns exec net0 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process

    net1:
    # ip netns exec net1 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
    UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
    UNCONN    0         0         [::]:4791             [::]:*

    # ip netns exec net1 rdma link del rxe1

    # ip netns exec net1 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
1. Use net_generic;
2. Add ipv6 check;
3. Use a function to handle ipv4/6 socket;
---
 drivers/infiniband/core/nldev.c     |   6 ++
 drivers/infiniband/sw/rxe/Makefile  |   3 +-
 drivers/infiniband/sw/rxe/rxe.c     |  32 +++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 132 +++++++++++++++++------
 drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
 drivers/infiniband/sw/rxe/rxe_ns.c  | 156 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_ns.h  |  17 +++
 include/rdma/rdma_netlink.h         |   2 +
 8 files changed, 316 insertions(+), 41 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 2220a2dfab24..48684930660a 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -1824,6 +1824,12 @@ static int nldev_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EINVAL;
 	}
 
+	if (device->link_ops) {
+		err = device->link_ops->dellink(device);
+		if (err)
+			return err;
+	}
+
 	ib_unregister_device_and_put(device);
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
index 93134f1d1d0c..3977f4f13258 100644
--- a/drivers/infiniband/sw/rxe/Makefile
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -22,6 +22,7 @@ rdma_rxe-y := \
 	rxe_mcast.o \
 	rxe_task.o \
 	rxe_net.o \
-	rxe_hw_counters.o
+	rxe_hw_counters.o \
+	rxe_ns.o
 
 rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index e891199cbdef..165155f9be6d 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -8,6 +8,8 @@
 #include <net/addrconf.h>
 #include "rxe.h"
 #include "rxe_loc.h"
+#include "rxe_net.h"
+#include "rxe_ns.h"
 
 MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
 MODULE_DESCRIPTION("Soft RDMA transport");
@@ -200,6 +202,8 @@ void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
 	port->mtu_cap = ib_mtu_enum_to_int(mtu);
 }
 
+static struct rdma_link_ops rxe_link_ops;
+
 /* called by ifc layer to create new rxe device.
  * The caller should allocate memory for rxe by calling ib_alloc_device.
  */
@@ -208,6 +212,7 @@ int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name,
 {
 	rxe_init(rxe, ndev);
 	rxe_set_mtu(rxe, mtu);
+	rxe->ib_dev.link_ops = &rxe_link_ops;
 
 	return rxe_register_device(rxe, ibdev_name, ndev);
 }
@@ -231,6 +236,10 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
+	err = rxe_net_init(ndev);
+	if (err)
+		return err;
+
 	err = rxe_net_add(ibdev_name, ndev);
 	if (err) {
 		rxe_err("failed to add %s\n", ndev->name);
@@ -240,9 +249,17 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 	return err;
 }
 
+static int rxe_dellink(struct ib_device *dev)
+{
+	rxe_net_del(dev);
+
+	return 0;
+}
+
 static struct rdma_link_ops rxe_link_ops = {
 	.type = "rxe",
 	.newlink = rxe_newlink,
+	.dellink = rxe_dellink,
 };
 
 static int __init rxe_module_init(void)
@@ -253,13 +270,20 @@ static int __init rxe_module_init(void)
 	if (err)
 		return err;
 
-	err = rxe_net_init();
+	rdma_link_register(&rxe_link_ops);
+	err = rxe_register_notifier();
 	if (err) {
+		pr_err("Failed to register netdev notifier\n");
 		rxe_destroy_wq();
-		return err;
+		return -1;
+	}
+
+	err = rxe_namespace_init();
+	if (err) {
+		pr_err("Failed to register net namespace notifier\n");
+		return -1;
 	}
 
-	rdma_link_register(&rxe_link_ops);
 	pr_info("loaded\n");
 	return 0;
 }
@@ -271,6 +295,8 @@ static void __exit rxe_module_exit(void)
 	rxe_net_exit();
 	rxe_destroy_wq();
 
+	rxe_namespace_exit();
+
 	pr_info("unloaded\n");
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 0bd0902b11f7..bf50a298c9ba 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -17,8 +17,7 @@
 #include "rxe.h"
 #include "rxe_net.h"
 #include "rxe_loc.h"
-
-static struct rxe_recv_sockets recv_sockets;
+#include "rxe_ns.h"
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /*
@@ -106,7 +105,7 @@ static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 					 struct in_addr *daddr)
 {
 	struct rtable *rt;
-	struct flowi4 fl = { { 0 } };
+	struct flowi4 fl = {};
 
 	memset(&fl, 0, sizeof(fl));
 	fl.flowi4_oif = ndev->ifindex;
@@ -114,7 +113,7 @@ static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 	memcpy(&fl.daddr, daddr, sizeof(*daddr));
 	fl.flowi4_proto = IPPROTO_UDP;
 
-	rt = ip_route_output_key(&init_net, &fl);
+	rt = ip_route_output_key(dev_net(ndev), &fl);
 	if (IS_ERR(rt)) {
 		rxe_dbg_qp(qp, "no route to %pI4\n", &daddr->s_addr);
 		return NULL;
@@ -130,7 +129,7 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 					 struct in6_addr *daddr)
 {
 	struct dst_entry *ndst;
-	struct flowi6 fl6 = { { 0 } };
+	struct flowi6 fl6 = {};
 
 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = ndev->ifindex;
@@ -138,8 +137,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
 	fl6.flowi6_proto = IPPROTO_UDP;
 
-	ndst = ipv6_stub->ipv6_dst_lookup_flow(sock_net(recv_sockets.sk6->sk),
-					       recv_sockets.sk6->sk, &fl6,
+	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
+					       rxe_ns_pernet_sk6(dev_net(ndev)), &fl6,
 					       NULL);
 	if (IS_ERR(ndst)) {
 		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
@@ -624,6 +623,50 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 	return 0;
 }
 
+#define SK_REF_FOR_TUNNEL	2
+
+static void rxe_sock_put(struct sock *sk,
+					void (*set_sk)(struct net *, struct sock *),
+					struct net_device *ndev)
+{
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
+		__sock_put(sk);
+	} else {
+		rxe_release_udp_tunnel(sk->sk_socket);
+		sk = NULL;
+		set_sk(dev_net(ndev), sk);
+	}
+}
+
+void rxe_net_del(struct ib_device *dev)
+{
+	struct sock *sk;
+	struct rxe_dev *rxe;
+	struct net_device *ndev;
+
+	rxe = container_of(dev, struct rxe_dev, ib_dev);
+
+	ndev = rxe_ib_device_get_netdev(&rxe->ib_dev);
+	if (!ndev)
+		return;
+
+	sk = rxe_ns_pernet_sk4(dev_net(ndev));
+	if (!sk)
+		goto err_out;
+
+	rxe_sock_put(sk, rxe_ns_pernet_set_sk4, ndev);
+
+	sk = rxe_ns_pernet_sk6(dev_net(ndev));
+	if (!sk)
+		goto err_out;
+
+	rxe_sock_put(sk, rxe_ns_pernet_set_sk6, ndev);
+
+err_out:
+	dev_put(ndev);
+}
+#undef SK_REF_FOR_TUNNEL
+
 static void rxe_port_event(struct rxe_dev *rxe,
 			   enum ib_event_type event)
 {
@@ -680,6 +723,7 @@ static int rxe_notify(struct notifier_block *not_blk,
 	switch (event) {
 	case NETDEV_UNREGISTER:
 		ib_unregister_device_queued(&rxe->ib_dev);
+		rxe_net_del(&rxe->ib_dev);
 		break;
 	case NETDEV_CHANGEMTU:
 		rxe_dbg_dev(rxe, "%s changed mtu to %d\n", ndev->name, ndev->mtu);
@@ -709,66 +753,92 @@ static struct notifier_block rxe_net_notifier = {
 	.notifier_call = rxe_notify,
 };
 
-static int rxe_net_ipv4_init(void)
+static int rxe_net_ipv4_init(struct net_device *ndev)
 {
-	recv_sockets.sk4 = rxe_setup_udp_tunnel(&init_net,
-				htons(ROCE_V2_UDP_DPORT), false);
-	if (IS_ERR(recv_sockets.sk4)) {
-		recv_sockets.sk4 = NULL;
+	struct sock *sk;
+	struct socket *sock;
+
+	sk = rxe_ns_pernet_sk4(dev_net(ndev));
+	if (sk) {
+		sock_hold(sk);
+		return 0;
+	}
+
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
+	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
 		return -1;
 	}
+	rxe_ns_pernet_set_sk4(dev_net(ndev), sock->sk);
 
 	return 0;
 }
 
-static int rxe_net_ipv6_init(void)
+static int rxe_net_ipv6_init(struct net_device *ndev)
 {
 #if IS_ENABLED(CONFIG_IPV6)
+	struct sock *sk;
+	struct socket *sock;
+
+	sk = rxe_ns_pernet_sk6(dev_net(ndev));
+	if (sk) {
+		sock_hold(sk);
+		return 0;
+	}
 
-	recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
-						htons(ROCE_V2_UDP_DPORT), true);
-	if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
-		recv_sockets.sk6 = NULL;
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
+	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
 	}
 
-	if (IS_ERR(recv_sockets.sk6)) {
-		recv_sockets.sk6 = NULL;
+	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
+
+	rxe_ns_pernet_set_sk6(dev_net(ndev), sock->sk);
+
 #endif
 	return 0;
 }
 
+int rxe_register_notifier(void)
+{
+	int err;
+
+	err = register_netdevice_notifier(&rxe_net_notifier);
+	if (err) {
+		pr_err("Failed to register netdev notifier\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 void rxe_net_exit(void)
 {
-	rxe_release_udp_tunnel(recv_sockets.sk6);
-	rxe_release_udp_tunnel(recv_sockets.sk4);
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
 
-int rxe_net_init(void)
+int rxe_net_init(struct net_device *ndev)
 {
 	int err;
 
-	recv_sockets.sk6 = NULL;
-
-	err = rxe_net_ipv4_init();
+	err = rxe_net_ipv4_init(ndev);
 	if (err)
 		return err;
-	err = rxe_net_ipv6_init();
+
+	err = rxe_net_ipv6_init(ndev);
 	if (err)
 		goto err_out;
-	err = register_netdevice_notifier(&rxe_net_notifier);
-	if (err) {
-		pr_err("Failed to register netdev notifier\n");
-		goto err_out;
-	}
+
 	return 0;
+
 err_out:
+	/* If ipv6 error, release ipv4 resource */
+	udp_tunnel_sock_release(rxe_ns_pernet_sk4(dev_net(ndev))->sk_socket);
+	rxe_ns_pernet_set_sk4(dev_net(ndev), NULL);
 	rxe_net_exit();
 	return err;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 45d80d00f86b..56249677d692 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -11,14 +11,11 @@
 #include <net/if_inet6.h>
 #include <linux/module.h>
 
-struct rxe_recv_sockets {
-	struct socket *sk4;
-	struct socket *sk6;
-};
-
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
+void rxe_net_del(struct ib_device *dev);
 
-int rxe_net_init(void);
+int rxe_register_notifier(void);
+int rxe_net_init(struct net_device *ndev);
 void rxe_net_exit(void);
 
 #endif /* RXE_NET_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
new file mode 100644
index 000000000000..1ff34167a295
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_ns.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#include <net/sock.h>
+#include <net/netns/generic.h>
+#include <net/net_namespace.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/pid_namespace.h>
+#include <net/udp_tunnel.h>
+
+#include "rxe_ns.h"
+
+/*
+ * Per network namespace data
+ */
+struct rxe_ns_sock {
+	struct sock __rcu *rxe_sk4;
+	struct sock __rcu *rxe_sk6;
+};
+
+/*
+ * Index to store custom data for each network namespace.
+ */
+static unsigned int rxe_pernet_id;
+
+/*
+ * Called for every existing and added network namespaces
+ */
+static int __net_init rxe_ns_init(struct net *net)
+{
+	/*
+	 * create (if not present) and access data item in network namespace
+	 * (net) using the id (net_id)
+	 */
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk4, NULL); /* initialize sock 4 socket */
+#if IS_ENABLED(CONFIG_IPV6)
+	rcu_assign_pointer(ns_sk->rxe_sk6, NULL); /* initialize sock 6 socket */
+#endif /* IPV6 */
+	synchronize_rcu();
+
+	return 0;
+}
+
+static void __net_exit rxe_ns_exit(struct net *net)
+{
+	/*
+	 * called when the network namespace is removed
+	 */
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *rxe_sk4 = NULL;
+#if IS_ENABLED(CONFIG_IPV6)
+	struct sock *rxe_sk6 = NULL;
+#endif
+
+	rcu_read_lock();
+	rxe_sk4 = rcu_dereference(ns_sk->rxe_sk4);
+#if IS_ENABLED(CONFIG_IPV6)
+	rxe_sk6 = rcu_dereference(ns_sk->rxe_sk6);
+#endif
+	rcu_read_unlock();
+
+	/* close socket */
+	if (rxe_sk4 && rxe_sk4->sk_socket) {
+		udp_tunnel_sock_release(rxe_sk4->sk_socket);
+		rcu_assign_pointer(ns_sk->rxe_sk4, NULL);
+		synchronize_rcu();
+	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (rxe_sk6 && rxe_sk6->sk_socket) {
+		udp_tunnel_sock_release(rxe_sk6->sk_socket);
+		rcu_assign_pointer(ns_sk->rxe_sk6, NULL);
+		synchronize_rcu();
+	}
+#endif
+}
+
+/*
+ * callback to make the module network namespace aware
+ */
+static struct pernet_operations rxe_net_ops __net_initdata = {
+	.init = rxe_ns_init,
+	.exit = rxe_ns_exit,
+	.id = &rxe_pernet_id,
+	.size = sizeof(struct rxe_ns_sock),
+};
+
+struct sock *rxe_ns_pernet_sk4(struct net *net)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = rcu_dereference(ns_sk->rxe_sk4);
+	rcu_read_unlock();
+
+	return sk;
+}
+
+void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk4, sk);
+	synchronize_rcu();
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+struct sock *rxe_ns_pernet_sk6(struct net *net)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = rcu_dereference(ns_sk->rxe_sk6);
+	rcu_read_unlock();
+
+	return sk;
+}
+
+void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk6, sk);
+	synchronize_rcu();
+}
+
+#else /* IPV6 */
+
+struct sock *rxe_ns_pernet_sk6(struct net *net)
+{
+	return NULL;
+}
+
+void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
+{
+}
+
+#endif /* IPV6 */
+
+int __init rxe_namespace_init(void)
+{
+	return register_pernet_subsys(&rxe_net_ops);
+}
+
+void __exit rxe_namespace_exit(void)
+{
+	unregister_pernet_subsys(&rxe_net_ops);
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h
new file mode 100644
index 000000000000..da5bfcea1274
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_ns.h
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#ifndef RXE_NS_H
+#define RXE_NS_H
+
+struct sock *rxe_ns_pernet_sk4(struct net *net);
+struct sock *rxe_ns_pernet_sk6(struct net *net);
+void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk);
+void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk);
+int __init rxe_namespace_init(void);
+void __exit rxe_namespace_exit(void);
+
+#endif /* RXE_NS_H */
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index 326deaf56d5d..2fd1358ea57d 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -5,6 +5,7 @@
 
 #include <linux/netlink.h>
 #include <uapi/rdma/rdma_netlink.h>
+#include <rdma/ib_verbs.h>
 
 struct ib_device;
 
@@ -126,6 +127,7 @@ struct rdma_link_ops {
 	struct list_head list;
 	const char *type;
 	int (*newlink)(const char *ibdev_name, struct net_device *ndev);
+	int (*dellink)(struct ib_device *dev);
 };
 
 void rdma_link_register(struct rdma_link_ops *ops);
-- 
2.43.0



* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-04  4:16 [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace Zhu Yanjun
@ 2026-03-04  4:44 ` Zhu Yanjun
  2026-03-04 19:29   ` David Ahern
  2026-03-05 18:54 ` David Ahern
  2026-03-06  2:58 ` kernel test robot
  2 siblings, 1 reply; 12+ messages in thread
From: Zhu Yanjun @ 2026-03-04  4:44 UTC
  To: jgg, leon, zyjzyj2000, dsahern, linux-rdma

On 2026/3/3 20:16, Zhu Yanjun wrote:
> When the "rdma link add" command is run to add an rxe rdma link in a
> net namespace, the rxe rdma link normally cannot work in that net
> namespace.
> 
> The root cause is that the sock listening on UDP port 4791 is created
> in init_net when the rdma_rxe module is loaded into the kernel. It is
> difficult for other net namespaces to use this sock.
> 
> The following commits solve this problem.
> 
> The first commit moves the creation of the sock listening on UDP port
> 4791 from the module_init function to the rdma link creation functions.
> That is, the sock is no longer created when the rdma_rxe module is
> loaded. It is created when the "rdma link add ..." command is run. So
> when an rdma link is created in a net namespace, the sock is created
> in that net namespace.
> 
> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
> check whether the sock already exists in the net namespace. If it does,
> the rdma link increases the reference count of this sock and continues
> with the other jobs instead of creating a new sock to listen on UDP
> port 4791. Since the network notifier is global, it is registered when
> the rdma_rxe module is loaded.
> 
> After the rdma link is created, the command "rdma link del" deletes the
> rdma link and, at the same time, checks the sock. If the reference
> count of this sock is greater than the sock reference count needed by
> the udp tunnel, the sock reference count is decreased by one. If it is
> equal, this rdma link is the last one, so the udp tunnel is shut down
> and the sock is closed. This work should be implemented in a dellink
> function, but rxe currently has no dellink function. So the 3rd commit
> adds the dellink function pointer, and the 4th commit implements the
> dellink function in rxe.
> 
> At this point, it is no longer necessary to keep a global variable to
> store the sock listening on UDP port 4791. This global variable can be
> replaced entirely by the functions udp4_lib_lookup and udp6_lib_lookup.
> Because the function udp6_lib_lookup is in the fast path, a member
> variable l_sk6 is added to store the sock. If l_sk6 is NULL,
> udp6_lib_lookup is called to look up the sock, and the sock is stored
> in l_sk6 so that it can be used directly in the future.
> 
> All the above work has been done in init_net, and it can also work in
> other net namespaces. So init_net is replaced by the individual net
> namespace. This is what the 6th commit does. Because an rxe device
> depends on the net device and the sock listening on UDP port 4791,
> every rxe device is in exclusive mode in its individual net namespace.
> Other rdma netns operations will be considered in the future.
> 
> The 7th commit adds the register_pernet_subsys/unregister_pernet_subsys
> functions. When a new net namespace is created, the init function
> initializes the sk4 and sk6 socks. The 2 socks are released when the
> net namespace is destroyed. The functions
> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6
> handle sk6. sk4 and sk6 are then used in the previous commits.
> 
> As the sk4 and sk6 in the pernet namespace can be accessed, it is not
> necessary to add a new l_sk6. As such, the 8th commit replaces l_sk6
> with the sk6 in the pernet namespace.
> 
> Test steps:
> 1) Suppose that 2 NICs are in 2 different net namespaces.
> 
>    # ip netns exec net0 ip link
>    3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>       link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>       altname enp5s0
> 
>    # ip netns exec net1 ip link
>    4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>       link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
> 
> 2) Add an rdma link in each net namespace.
>      net0:
>      # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
> 
>      net1:
>      # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
> 
> 3) Run rping test.
>      net0
>      # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>      [1] 1737
>      # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>      verbose
>      count 1
>      ...
>      ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>      ...
> 
> 4) Remove the rdma links from the net namespaces.
>      net0:
>      # ip netns exec net0 ss -lu
>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>      UNCONN    0         0         [::]:4791             [::]:*
> 
>      # ip netns exec net0 rdma link del rxe0
> 
>      # ip netns exec net0 ss -lu
>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> 
>      net1:
>      # ip netns exec net1 ss -lu
>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>      UNCONN    0         0         [::]:4791             [::]:*
> 
>      # ip netns exec net1 rdma link del rxe1
> 
>      # ip netns exec net1 ss -lu
>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Hi, all

The script at
https://github.com/zhuyj/linux/blob/6.19-net-namespace/net_ns_rxe.sh
can be used to run the tests on Linux distributions.

BTW, please disable the firewall before running the tests.

Zhu Yanjun

> ---
> 1. Use net_generic;
> 2. Add ipv6 check;
> 3. Use a function to handle ipv4/6 socket;
> ---
>   drivers/infiniband/core/nldev.c     |   6 ++
>   drivers/infiniband/sw/rxe/Makefile  |   3 +-
>   drivers/infiniband/sw/rxe/rxe.c     |  32 +++++-
>   drivers/infiniband/sw/rxe/rxe_net.c | 132 +++++++++++++++++------
>   drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>   drivers/infiniband/sw/rxe/rxe_ns.c  | 156 ++++++++++++++++++++++++++++
>   drivers/infiniband/sw/rxe/rxe_ns.h  |  17 +++
>   include/rdma/rdma_netlink.h         |   2 +
>   8 files changed, 316 insertions(+), 41 deletions(-)
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
> 
> diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
> index 2220a2dfab24..48684930660a 100644
> --- a/drivers/infiniband/core/nldev.c
> +++ b/drivers/infiniband/core/nldev.c
> @@ -1824,6 +1824,12 @@ static int nldev_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
>   		return -EINVAL;
>   	}
>   
> +	if (device->link_ops) {
> +		err = device->link_ops->dellink(device);
> +		if (err)
> +			return err;
> +	}
> +
>   	ib_unregister_device_and_put(device);
>   	return 0;
>   }
> diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
> index 93134f1d1d0c..3977f4f13258 100644
> --- a/drivers/infiniband/sw/rxe/Makefile
> +++ b/drivers/infiniband/sw/rxe/Makefile
> @@ -22,6 +22,7 @@ rdma_rxe-y := \
>   	rxe_mcast.o \
>   	rxe_task.o \
>   	rxe_net.o \
> -	rxe_hw_counters.o
> +	rxe_hw_counters.o \
> +	rxe_ns.o
>   
>   rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index e891199cbdef..165155f9be6d 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -8,6 +8,8 @@
>   #include <net/addrconf.h>
>   #include "rxe.h"
>   #include "rxe_loc.h"
> +#include "rxe_net.h"
> +#include "rxe_ns.h"
>   
>   MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
>   MODULE_DESCRIPTION("Soft RDMA transport");
> @@ -200,6 +202,8 @@ void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
>   	port->mtu_cap = ib_mtu_enum_to_int(mtu);
>   }
>   
> +static struct rdma_link_ops rxe_link_ops;
> +
>   /* called by ifc layer to create new rxe device.
>    * The caller should allocate memory for rxe by calling ib_alloc_device.
>    */
> @@ -208,6 +212,7 @@ int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name,
>   {
>   	rxe_init(rxe, ndev);
>   	rxe_set_mtu(rxe, mtu);
> +	rxe->ib_dev.link_ops = &rxe_link_ops;
>   
>   	return rxe_register_device(rxe, ibdev_name, ndev);
>   }
> @@ -231,6 +236,10 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>   		goto err;
>   	}
>   
> +	err = rxe_net_init(ndev);
> +	if (err)
> +		return err;
> +
>   	err = rxe_net_add(ibdev_name, ndev);
>   	if (err) {
>   		rxe_err("failed to add %s\n", ndev->name);
> @@ -240,9 +249,17 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>   	return err;
>   }
>   
> +static int rxe_dellink(struct ib_device *dev)
> +{
> +	rxe_net_del(dev);
> +
> +	return 0;
> +}
> +
>   static struct rdma_link_ops rxe_link_ops = {
>   	.type = "rxe",
>   	.newlink = rxe_newlink,
> +	.dellink = rxe_dellink,
>   };
>   
>   static int __init rxe_module_init(void)
> @@ -253,13 +270,20 @@ static int __init rxe_module_init(void)
>   	if (err)
>   		return err;
>   
> -	err = rxe_net_init();
> +	rdma_link_register(&rxe_link_ops);
> +	err = rxe_register_notifier();
>   	if (err) {
> +		pr_err("Failed to register netdev notifier\n");
>   		rxe_destroy_wq();
> -		return err;
> +		return -1;
> +	}
> +
> +	err = rxe_namespace_init();
> +	if (err) {
> +		pr_err("Failed to register net namespace notifier\n");
> +		return -1;
>   	}
>   
> -	rdma_link_register(&rxe_link_ops);
>   	pr_info("loaded\n");
>   	return 0;
>   }
> @@ -271,6 +295,8 @@ static void __exit rxe_module_exit(void)
>   	rxe_net_exit();
>   	rxe_destroy_wq();
>   
> +	rxe_namespace_exit();
> +
>   	pr_info("unloaded\n");
>   }
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 0bd0902b11f7..bf50a298c9ba 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -17,8 +17,7 @@
>   #include "rxe.h"
>   #include "rxe_net.h"
>   #include "rxe_loc.h"
> -
> -static struct rxe_recv_sockets recv_sockets;
> +#include "rxe_ns.h"
>   
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   /*
> @@ -106,7 +105,7 @@ static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
>   					 struct in_addr *daddr)
>   {
>   	struct rtable *rt;
> -	struct flowi4 fl = { { 0 } };
> +	struct flowi4 fl = {};
>   
>   	memset(&fl, 0, sizeof(fl));
>   	fl.flowi4_oif = ndev->ifindex;
> @@ -114,7 +113,7 @@ static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
>   	memcpy(&fl.daddr, daddr, sizeof(*daddr));
>   	fl.flowi4_proto = IPPROTO_UDP;
>   
> -	rt = ip_route_output_key(&init_net, &fl);
> +	rt = ip_route_output_key(dev_net(ndev), &fl);
>   	if (IS_ERR(rt)) {
>   		rxe_dbg_qp(qp, "no route to %pI4\n", &daddr->s_addr);
>   		return NULL;
> @@ -130,7 +129,7 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   					 struct in6_addr *daddr)
>   {
>   	struct dst_entry *ndst;
> -	struct flowi6 fl6 = { { 0 } };
> +	struct flowi6 fl6 = {};
>   
>   	memset(&fl6, 0, sizeof(fl6));
>   	fl6.flowi6_oif = ndev->ifindex;
> @@ -138,8 +137,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
>   	fl6.flowi6_proto = IPPROTO_UDP;
>   
> -	ndst = ipv6_stub->ipv6_dst_lookup_flow(sock_net(recv_sockets.sk6->sk),
> -					       recv_sockets.sk6->sk, &fl6,
> +	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
> +					       rxe_ns_pernet_sk6(dev_net(ndev)), &fl6,
>   					       NULL);
>   	if (IS_ERR(ndst)) {
>   		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
> @@ -624,6 +623,50 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
>   	return 0;
>   }
>   
> +#define SK_REF_FOR_TUNNEL	2
> +
> +static void rxe_sock_put(struct sock *sk,
> +					void (*set_sk)(struct net *, struct sock *),
> +					struct net_device *ndev)
> +{
> +	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
> +		__sock_put(sk);
> +	} else {
> +		rxe_release_udp_tunnel(sk->sk_socket);
> +		sk = NULL;
> +		set_sk(dev_net(ndev), sk);
> +	}
> +}
> +
> +void rxe_net_del(struct ib_device *dev)
> +{
> +	struct sock *sk;
> +	struct rxe_dev *rxe;
> +	struct net_device *ndev;
> +
> +	rxe = container_of(dev, struct rxe_dev, ib_dev);
> +
> +	ndev = rxe_ib_device_get_netdev(&rxe->ib_dev);
> +	if (!ndev)
> +		return;
> +
> +	sk = rxe_ns_pernet_sk4(dev_net(ndev));
> +	if (!sk)
> +		goto err_out;
> +
> +	rxe_sock_put(sk, rxe_ns_pernet_set_sk4, ndev);
> +
> +	sk = rxe_ns_pernet_sk6(dev_net(ndev));
> +	if (!sk)
> +		goto err_out;
> +
> +	rxe_sock_put(sk, rxe_ns_pernet_set_sk6, ndev);
> +
> +err_out:
> +	dev_put(ndev);
> +}
> +#undef SK_REF_FOR_TUNNEL
> +
>   static void rxe_port_event(struct rxe_dev *rxe,
>   			   enum ib_event_type event)
>   {
> @@ -680,6 +723,7 @@ static int rxe_notify(struct notifier_block *not_blk,
>   	switch (event) {
>   	case NETDEV_UNREGISTER:
>   		ib_unregister_device_queued(&rxe->ib_dev);
> +		rxe_net_del(&rxe->ib_dev);
>   		break;
>   	case NETDEV_CHANGEMTU:
>   		rxe_dbg_dev(rxe, "%s changed mtu to %d\n", ndev->name, ndev->mtu);
> @@ -709,66 +753,92 @@ static struct notifier_block rxe_net_notifier = {
>   	.notifier_call = rxe_notify,
>   };
>   
> -static int rxe_net_ipv4_init(void)
> +static int rxe_net_ipv4_init(struct net_device *ndev)
>   {
> -	recv_sockets.sk4 = rxe_setup_udp_tunnel(&init_net,
> -				htons(ROCE_V2_UDP_DPORT), false);
> -	if (IS_ERR(recv_sockets.sk4)) {
> -		recv_sockets.sk4 = NULL;
> +	struct sock *sk;
> +	struct socket *sock;
> +
> +	sk = rxe_ns_pernet_sk4(dev_net(ndev));
> +	if (sk) {
> +		sock_hold(sk);
> +		return 0;
> +	}
> +
> +	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
> +	if (IS_ERR(sock)) {
>   		pr_err("Failed to create IPv4 UDP tunnel\n");
>   		return -1;
>   	}
> +	rxe_ns_pernet_set_sk4(dev_net(ndev), sock->sk);
>   
>   	return 0;
>   }
>   
> -static int rxe_net_ipv6_init(void)
> +static int rxe_net_ipv6_init(struct net_device *ndev)
>   {
>   #if IS_ENABLED(CONFIG_IPV6)
> +	struct sock *sk;
> +	struct socket *sock;
> +
> +	sk = rxe_ns_pernet_sk6(dev_net(ndev));
> +	if (sk) {
> +		sock_hold(sk);
> +		return 0;
> +	}
>   
> -	recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
> -						htons(ROCE_V2_UDP_DPORT), true);
> -	if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
> -		recv_sockets.sk6 = NULL;
> +	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
> +	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
>   		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
>   		return 0;
>   	}
>   
> -	if (IS_ERR(recv_sockets.sk6)) {
> -		recv_sockets.sk6 = NULL;
> +	if (IS_ERR(sock)) {
>   		pr_err("Failed to create IPv6 UDP tunnel\n");
>   		return -1;
>   	}
> +
> +	rxe_ns_pernet_set_sk6(dev_net(ndev), sock->sk);
> +
>   #endif
>   	return 0;
>   }
>   
> +int rxe_register_notifier(void)
> +{
> +	int err;
> +
> +	err = register_netdevice_notifier(&rxe_net_notifier);
> +	if (err) {
> +		pr_err("Failed to register netdev notifier\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
>   void rxe_net_exit(void)
>   {
> -	rxe_release_udp_tunnel(recv_sockets.sk6);
> -	rxe_release_udp_tunnel(recv_sockets.sk4);
>   	unregister_netdevice_notifier(&rxe_net_notifier);
>   }
>   
> -int rxe_net_init(void)
> +int rxe_net_init(struct net_device *ndev)
>   {
>   	int err;
>   
> -	recv_sockets.sk6 = NULL;
> -
> -	err = rxe_net_ipv4_init();
> +	err = rxe_net_ipv4_init(ndev);
>   	if (err)
>   		return err;
> -	err = rxe_net_ipv6_init();
> +
> +	err = rxe_net_ipv6_init(ndev);
>   	if (err)
>   		goto err_out;
> -	err = register_netdevice_notifier(&rxe_net_notifier);
> -	if (err) {
> -		pr_err("Failed to register netdev notifier\n");
> -		goto err_out;
> -	}
> +
>   	return 0;
> +
>   err_out:
> +	/* If ipv6 error, release ipv4 resource */
> +	udp_tunnel_sock_release(rxe_ns_pernet_sk4(dev_net(ndev))->sk_socket);
> +	rxe_ns_pernet_set_sk4(dev_net(ndev), NULL);
>   	rxe_net_exit();
>   	return err;
>   }
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
> index 45d80d00f86b..56249677d692 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.h
> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
> @@ -11,14 +11,11 @@
>   #include <net/if_inet6.h>
>   #include <linux/module.h>
>   
> -struct rxe_recv_sockets {
> -	struct socket *sk4;
> -	struct socket *sk6;
> -};
> -
>   int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
> +void rxe_net_del(struct ib_device *dev);
>   
> -int rxe_net_init(void);
> +int rxe_register_notifier(void);
> +int rxe_net_init(struct net_device *ndev);
>   void rxe_net_exit(void);
>   
>   #endif /* RXE_NET_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
> new file mode 100644
> index 000000000000..1ff34167a295
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_ns.c
> @@ -0,0 +1,156 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#include <net/sock.h>
> +#include <net/netns/generic.h>
> +#include <net/net_namespace.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/pid_namespace.h>
> +#include <net/udp_tunnel.h>
> +
> +#include "rxe_ns.h"
> +
> +/*
> + * Per network namespace data
> + */
> +struct rxe_ns_sock {
> +	struct sock __rcu *rxe_sk4;
> +	struct sock __rcu *rxe_sk6;
> +};
> +
> +/*
> + * Index to store custom data for each network namespace.
> + */
> +static unsigned int rxe_pernet_id;
> +
> +/*
> + * Called for every existing and added network namespaces
> + */
> +static int __net_init rxe_ns_init(struct net *net)
> +{
> +	/*
> +	 * create (if not present) and access data item in network namespace
> +	 * (net) using the id (net_id)
> +	 */
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +
> +	rcu_assign_pointer(ns_sk->rxe_sk4, NULL); /* initialize sock 4 socket */
> +#if IS_ENABLED(CONFIG_IPV6)
> +	rcu_assign_pointer(ns_sk->rxe_sk6, NULL); /* initialize sock 6 socket */
> +#endif /* IPV6 */
> +	synchronize_rcu();
> +
> +	return 0;
> +}
> +
> +static void __net_exit rxe_ns_exit(struct net *net)
> +{
> +	/*
> +	 * called when the network namespace is removed
> +	 */
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +	struct sock *rxe_sk4 = NULL;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	struct sock *rxe_sk6 = NULL;
> +#endif
> +
> +	rcu_read_lock();
> +	rxe_sk4 = rcu_dereference(ns_sk->rxe_sk4);
> +#if IS_ENABLED(CONFIG_IPV6)
> +	rxe_sk6 = rcu_dereference(ns_sk->rxe_sk6);
> +#endif
> +	rcu_read_unlock();
> +
> +	/* close socket */
> +	if (rxe_sk4 && rxe_sk4->sk_socket) {
> +		udp_tunnel_sock_release(rxe_sk4->sk_socket);
> +		rcu_assign_pointer(ns_sk->rxe_sk4, NULL);
> +		synchronize_rcu();
> +	}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +	if (rxe_sk6 && rxe_sk6->sk_socket) {
> +		udp_tunnel_sock_release(rxe_sk6->sk_socket);
> +		rcu_assign_pointer(ns_sk->rxe_sk6, NULL);
> +		synchronize_rcu();
> +	}
> +#endif
> +}
> +
> +/*
> + * callback to make the module network namespace aware
> + */
> +static struct pernet_operations rxe_net_ops __net_initdata = {
> +	.init = rxe_ns_init,
> +	.exit = rxe_ns_exit,
> +	.id = &rxe_pernet_id,
> +	.size = sizeof(struct rxe_ns_sock),
> +};
> +
> +struct sock *rxe_ns_pernet_sk4(struct net *net)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +	struct sock *sk;
> +
> +	rcu_read_lock();
> +	sk = rcu_dereference(ns_sk->rxe_sk4);
> +	rcu_read_unlock();
> +
> +	return sk;
> +}
> +
> +void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +
> +	rcu_assign_pointer(ns_sk->rxe_sk4, sk);
> +	synchronize_rcu();
> +}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +struct sock *rxe_ns_pernet_sk6(struct net *net)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +	struct sock *sk;
> +
> +	rcu_read_lock();
> +	sk = rcu_dereference(ns_sk->rxe_sk6);
> +	rcu_read_unlock();
> +
> +	return sk;
> +}
> +
> +void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
> +{
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +
> +	rcu_assign_pointer(ns_sk->rxe_sk6, sk);
> +	synchronize_rcu();
> +}
> +
> +#else /* IPV6 */
> +
> +struct sock *rxe_ns_pernet_sk6(struct net *net)
> +{
> +	return NULL;
> +}
> +
> +void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
> +{
> +}
> +
> +#endif /* IPV6 */
> +
> +int __init rxe_namespace_init(void)
> +{
> +	return register_pernet_subsys(&rxe_net_ops);
> +}
> +
> +void __exit rxe_namespace_exit(void)
> +{
> +	unregister_pernet_subsys(&rxe_net_ops);
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h
> new file mode 100644
> index 000000000000..da5bfcea1274
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_ns.h
> @@ -0,0 +1,17 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#ifndef RXE_NS_H
> +#define RXE_NS_H
> +
> +struct sock *rxe_ns_pernet_sk4(struct net *net);
> +struct sock *rxe_ns_pernet_sk6(struct net *net);
> +void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk);
> +void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk);
> +int __init rxe_namespace_init(void);
> +void __exit rxe_namespace_exit(void);
> +
> +#endif /* RXE_NS_H */
> diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
> index 326deaf56d5d..2fd1358ea57d 100644
> --- a/include/rdma/rdma_netlink.h
> +++ b/include/rdma/rdma_netlink.h
> @@ -5,6 +5,7 @@
>   
>   #include <linux/netlink.h>
>   #include <uapi/rdma/rdma_netlink.h>
> +#include <rdma/ib_verbs.h>
>   
>   struct ib_device;
>   
> @@ -126,6 +127,7 @@ struct rdma_link_ops {
>   	struct list_head list;
>   	const char *type;
>   	int (*newlink)(const char *ibdev_name, struct net_device *ndev);
> +	int (*dellink)(struct ib_device *dev);
>   };
>   
>   void rdma_link_register(struct rdma_link_ops *ops);


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-04  4:44 ` Zhu Yanjun
@ 2026-03-04 19:29   ` David Ahern
  2026-03-05  3:29     ` Zhu Yanjun
  0 siblings, 1 reply; 12+ messages in thread
From: David Ahern @ 2026-03-04 19:29 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

On 3/3/26 9:44 PM, Zhu Yanjun wrote:
> The script in the link
> https://github.com/zhuyj/linux/blob/6.19-net-namespace/net_ns_rxe.sh can
> make tests in linux distributions.

I have not read the patch, but I did look at the test script referenced
here. Comments:

1. drop the sleeps. They should never be needed. If you need to wait for
some resource, then wait for that resource explicitly with a timeout.

2. tests should cover the range of features in the patch meaning IPv6,
and if you keep the attempts to delete the socket after the rxe devices
are deleted, then tests should include variations of this theme. e.g.,
per network namespace:

a. no devices = no socket

b. 1 device, sockets work, delete device, no socket

c. 2 devices, sockets work, delete 1 device, socket still works, delete
second device, no socket.

3. the script can be added to tools/testing/selftests/{infiniband,rdma}
-- whatever directory seems most appropriate. Adding it here and fitting
within kernel selftests means it can be run by CI as commits are done.

> 
> BTW, please disable firewall before making tests.

That should not be needed. The test script should be internal to a host
using only namespaces you control and configure.


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-04 19:29   ` David Ahern
@ 2026-03-05  3:29     ` Zhu Yanjun
  2026-03-05 18:58       ` David Ahern
  0 siblings, 1 reply; 12+ messages in thread
From: Zhu Yanjun @ 2026-03-05  3:29 UTC (permalink / raw)
  To: David Ahern, jgg, leon, zyjzyj2000, linux-rdma


On 2026/3/4 11:29, David Ahern wrote:
> On 3/3/26 9:44 PM, Zhu Yanjun wrote:
>> The script in the link
>> https://github.com/zhuyj/linux/blob/6.19-net-namespace/net_ns_rxe.sh can
>> make tests in linux distributions.
> I have not read the patch, but I did look at the test script referenced
> here. Comments
>
> 1. drop the sleeps. They should never be needed. If you need to wait for
> some resource, then wait for that resource explicitly with a timeout.
Thanks a lot. The sleep statements have been removed.
>
> 2. tests should cover the range of features in the patch meaning IPv6,
IPv6 functionality is now covered. Please check the link: 
https://github.com/zhuyj/linux/blob/6.19-net-namespace/net_ns_rxe.sh
> and if you keep the attempts to delete the socket after the rxe devices
> are deleted, then tests should include variations of this theme. e.g.,
> per network namespace:
>
> a. no devices = no socket
>
> b. 1 device, sockets work, delete device, no socket
>
> c. 2 devices, sockets work, delete 1 device, socket still works, delete
> second device, no socket.
The scenarios mentioned previously (a, b, c) have been fully tested. The 
link to the test script is: 
https://github.com/zhuyj/linux/blob/6.19-net-namespace/net_ns_rxe.sh
>
> 3. the script can be added to tools/testing/selftests/{infiniband,rdma}
> -- whatever directory seems most appropriate. Adding it here and fitting
> within kernel selftests means it can be run by CI as commits are done.

The script has been added to tools/testing/selftests/rdma. The commit is

https://github.com/zhuyj/linux/commit/0fa99629c1a656592b7b2011dc5cad16de2320fd

It can be tested by running:

make -C tools/testing/selftests TARGETS=rdma run_tests

Please let me know if there are any additional concerns or suggestions.

Thanks,

Zhu Yanjun

>
>> BTW, please disable firewall before making tests.
> That should not be needed. The test script should be internal to a host
> using only namespaces you control and configure.
>
-- 
Best Regards,
Yanjun.Zhu



* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-05  3:29     ` Zhu Yanjun
@ 2026-03-05 18:58       ` David Ahern
  2026-03-05 23:15         ` Yanjun.Zhu
  0 siblings, 1 reply; 12+ messages in thread
From: David Ahern @ 2026-03-05 18:58 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

On 3/4/26 8:29 PM, Zhu Yanjun wrote:
> The script has been added to tools/testing/selftests/rdma. The commit is
> 
> https://github.com/zhuyj/linux/commit/0fa99629c1a656592b7b2011dc5cad16de2320fd
> 
> It can be tested by running:
> 
> make -C tools/testing/selftests TARGETS=rdma run_tests
> 
> Please let me know if there are any additional concerns or suggestions.
> 

Thanks for the enhancements to the testing.

Progress and success / fail on what has been tested at each step would
improve the user experience. See any number of test scripts under
tools/testing/selftests/net/ from me - e.g.,
tools/testing/selftests/net/fib_nexthops.sh walks through permutations
of an API, tools/testing/selftests/net/icmp_redirect.sh is a much
simpler example.


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-05 18:58       ` David Ahern
@ 2026-03-05 23:15         ` Yanjun.Zhu
  2026-03-05 23:41           ` David Ahern
  0 siblings, 1 reply; 12+ messages in thread
From: Yanjun.Zhu @ 2026-03-05 23:15 UTC (permalink / raw)
  To: David Ahern, jgg, leon, zyjzyj2000, linux-rdma

On 3/5/26 10:58 AM, David Ahern wrote:
> On 3/4/26 8:29 PM, Zhu Yanjun wrote:
>> The script has been added to tools/testing/selftests/rdma. The commit is
>>
>> https://github.com/zhuyj/linux/commit/0fa99629c1a656592b7b2011dc5cad16de2320fd
>>
>> It can be tested by running:
>>
>> make -C tools/testing/selftests TARGETS=rdma run_tests
>>
>> Please let me know if there are any additional concerns or suggestions.
>>
>
> Thanks for the enhancements to the testing.

“

# make -C tools/testing/selftests TARGETS=rdma run_tests

make: Entering directory '/root/Development/linux/tools/testing/selftests'
make[1]: Nothing to be done for 'all'.
TAP version 13
1..3
# timeout set to 45
# selftests: rdma: rping_between_netns.sh
# server DISCONNECT EVENT...
# wait for RDMA_READ_ADV state 10
ok 1 selftests: rdma: rping_between_netns.sh
# timeout set to 45
# selftests: rdma: rxe_ipv6.sh
ok 2 selftests: rdma: rxe_ipv6.sh
# timeout set to 45
# selftests: rdma: socket_with_rxe.sh
ok 3 selftests: rdma: socket_with_rxe.sh

make: Leaving directory '/root/Development/linux/tools/testing/selftests'

”

I ran the three test cases, and the output is shown above. I would like 
to confirm whether this output format looks appropriate for the RDMA 
selftests.

If the format is acceptable, I plan to keep it as is and continue 
expanding the RDMA selftests based on this structure.

Please let me know if there are any suggestions or preferred conventions 
for the output format.

Zhu Yanjun

>
> Progress and success / fail on what has been tested at each step would
> improve the user experience. See any number of test scripts under
> tools/testing/selftests/net/ from me - e.g.,
> tools/testing/selftests/net/fib_nexthops.sh walks through permutations
> of an API, tools/testing/selftests/net/icmp_redirect.sh is a much
> simpler example.
>


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-05 23:15         ` Yanjun.Zhu
@ 2026-03-05 23:41           ` David Ahern
  2026-03-06 21:29             ` yanjun.zhu
  0 siblings, 1 reply; 12+ messages in thread
From: David Ahern @ 2026-03-05 23:41 UTC (permalink / raw)
  To: Yanjun.Zhu, jgg, leon, zyjzyj2000, linux-rdma

On 3/5/26 4:15 PM, Yanjun.Zhu wrote:
> I ran the three test cases, and the output is shown above. I would like
> to confirm whether this output format looks appropriate for the RDMA
> selftests.
> 
> If the format is acceptable, I plan to keep it as is and continue
> expanding the RDMA selftests based on this structure.

Having tests is the important piece. I pointed out the existing tests as
a way of making things more user friendly. Simple output, easy to follow
what was tested.


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-05 23:41           ` David Ahern
@ 2026-03-06 21:29             ` yanjun.zhu
  0 siblings, 0 replies; 12+ messages in thread
From: yanjun.zhu @ 2026-03-06 21:29 UTC (permalink / raw)
  To: David Ahern, jgg, leon, zyjzyj2000, linux-rdma

On 3/5/26 3:41 PM, David Ahern wrote:
> On 3/5/26 4:15 PM, Yanjun.Zhu wrote:
>> I ran the three test cases, and the output is shown above. I would like
>> to confirm whether this output format looks appropriate for the RDMA
>> selftests.
>>
>> If the format is acceptable, I plan to keep it as is and continue
>> expanding the RDMA selftests based on this structure.
> 
> Having tests is the important piece. I pointed out the existing tests as
> a way of making things more user friendly. Simple output, easy to follow
> what was tested.

Thanks a lot. I will add more tests.

Zhu Yanjun


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-04  4:16 [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace Zhu Yanjun
  2026-03-04  4:44 ` Zhu Yanjun
@ 2026-03-05 18:54 ` David Ahern
  2026-03-06  2:38   ` Yanjun.Zhu
  2026-03-06  2:58 ` kernel test robot
  2 siblings, 1 reply; 12+ messages in thread
From: David Ahern @ 2026-03-05 18:54 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 1332 bytes --]

On 3/3/26 9:16 PM, Zhu Yanjun wrote:
> When run "ip link add" command to add a rxe rdma link in a net
> namespace, normally this rxe rdma link can not work in a net
> name space.
> 
> The root cause is that a sock listening on udp port 4791 is created
> in init_net when the rdma_rxe module is loaded into kernel. That is,
> the sock listening on udp port 4791 is created in init_net. Other net
> namespace is difficult to use this sock.
> 
> The following commits will solve this problem.

you squashed all of the changes into 1 commit, so either the commit
message needs to be fixed up or you need to do the patch series.

That said, I still think the optimizations around tracking the number of
devices in the namespace and closing the sockets are unnecessary at this
time. It brings in complications to your set which is delaying the merge
of namespace support. If it is done, the socket refcount approach seems
problematic; how about a simpler approach with a device count in the
net_generic struct used per namespace (again, I really believe this
should be a follow on patch).
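
The device-count idea above can be sketched in userspace C. This is only
a minimal model under stated assumptions: struct model_rxe_ns stands in
for the per-namespace net_generic() data, and model_ns_link_add() and
model_ns_link_del() are hypothetical names, not functions from the patch.
Real code would also need locking around the counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-namespace state; stands in for the net_generic() data. */
struct model_rxe_ns {
	unsigned int dev_count;	/* number of rxe links in this namespace */
	bool sock_open;		/* models the UDP tunnel socket being present */
};

/* Link added: open the socket only when the first device appears. */
static void model_ns_link_add(struct model_rxe_ns *ns)
{
	if (ns->dev_count++ == 0)
		ns->sock_open = true;
}

/* Link deleted: close the socket only when the last device goes away. */
static void model_ns_link_del(struct model_rxe_ns *ns)
{
	assert(ns->dev_count > 0);
	if (--ns->dev_count == 0)
		ns->sock_open = false;
}
```

With an explicit counter, the decision to close the socket no longer
depends on how many references the networking core itself holds on it.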

Also, Claude has some comments about this patch. See attached. At this
point surely you have access to some AI model that can do code reviews.
There are prompts here https://github.com/masoncl/review-prompts.git
that can be leveraged as I did here.

[-- Attachment #2: review-inline.txt --]
[-- Type: text/plain, Size: 5346 bytes --]

commit 04b0ab25bd94c6100401d52ad669fd21c4065d2f
Author: Zhu Yanjun <yanjun.zhu@linux.dev>

RDMA/rxe: Add the support that rxe can work in net namespace

This commit adds per-namespace UDP socket management for the rxe
soft RoCE driver, moving socket creation from module init to per-link
creation and adding pernet_operations to track sockets per namespace.

> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c

[ ... ]

> +#define SK_REF_FOR_TUNNEL	2
> +
> +static void rxe_sock_put(struct sock *sk,
> +				void (*set_sk)(struct net *, struct sock *),
> +				struct net_device *ndev)
> +{
> +	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
> +		__sock_put(sk);
> +	} else {
> +		rxe_release_udp_tunnel(sk->sk_socket);
> +		sk = NULL;
> +		set_sk(dev_net(ndev), sk);
> +	}
> +}

Does SK_REF_FOR_TUNNEL have the right value here?

The socket created by rxe_setup_udp_tunnel() starts with sk_refcnt=1
(sk_alloc sets refcount_set(&sk->sk_refcnt, 1), and setup_udp_tunnel_sock
does not add a reference). The second link calling sock_hold() in
rxe_net_ipv4_init() brings the count to 2.

With the threshold at 2, when the first of two active links is deleted:

    rxe_sock_put():
        refcount_read(&sk->sk_refcnt) == 2
        2 > SK_REF_FOR_TUNNEL(2) is false
        => rxe_release_udp_tunnel(sk->sk_socket)

This calls udp_tunnel_sock_release(), which clears the encap_rcv handler
and calls sock_release(). The second link's socket is now shut down while
still in use.

Shouldn't SK_REF_FOR_TUNNEL be 1 so that the tunnel is only released when
the last link is removed?
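
The arithmetic in this comment can be replayed with a small userspace
model. This is a sketch, not kernel code: struct model_sock,
model_link_add(), model_link_del() and survives_first_delete() are
hypothetical names. The reference count a freshly created socket starts
with is deliberately a parameter, because whether that value is 1 or
something larger is exactly what the comment assumes and what would need
to be checked against the UDP code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a struct sock; not a kernel type. */
struct model_sock {
	int refcnt;	/* models sk->sk_refcnt */
	bool released;	/* set when the tunnel would be torn down */
	bool present;	/* models the per-net pointer being non-NULL */
};

/* Models rxe_net_ipv4_init(): create on first link, take a ref afterwards. */
static void model_link_add(struct model_sock *sk, int initial_refcnt)
{
	if (sk->present) {
		sk->refcnt++;
		return;
	}
	sk->refcnt = initial_refcnt;
	sk->released = false;
	sk->present = true;
}

/* Models rxe_sock_put() with a configurable threshold. */
static void model_link_del(struct model_sock *sk, int threshold)
{
	assert(sk->present);
	if (sk->refcnt > threshold) {
		sk->refcnt--;
	} else {
		sk->released = true;
		sk->present = false;
	}
}

/* True if the socket is still usable after deleting one of two links. */
static bool survives_first_delete(int initial_refcnt, int threshold)
{
	struct model_sock sk = { 0 };

	model_link_add(&sk, initial_refcnt);
	model_link_add(&sk, initial_refcnt);
	model_link_del(&sk, threshold);
	return !sk.released;
}
```

In this model survives_first_delete(1, 2) is false, which is the scenario
described above, while survives_first_delete(2, 2) is true. The right
threshold therefore depends on the socket's real starting count.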

> +void rxe_net_del(struct ib_device *dev)
> +{
> +	struct sock *sk;
> +	struct rxe_dev *rxe;
> +	struct net_device *ndev;
> +
> +	rxe = container_of(dev, struct rxe_dev, ib_dev);
> +
> +	ndev = rxe_ib_device_get_netdev(&rxe->ib_dev);
> +	if (!ndev)
> +		return;
> +
> +	sk = rxe_ns_pernet_sk4(dev_net(ndev));
> +	if (!sk)
> +		goto err_out;
> +
> +	rxe_sock_put(sk, rxe_ns_pernet_set_sk4, ndev);
> +
> +	sk = rxe_ns_pernet_sk6(dev_net(ndev));
> +	if (!sk)
> +		goto err_out;
> +
> +	rxe_sock_put(sk, rxe_ns_pernet_set_sk6, ndev);
> +
> +err_out:
> +	dev_put(ndev);
> +}

When sk4 is NULL, the goto err_out skips the sk6 cleanup entirely. Can
sk6 be non-NULL when sk4 is NULL? For example, after rxe_sock_put() sets
the pernet sk4 to NULL (via set_sk()) while a concurrent operation leaves
sk6 non-NULL.

> -int rxe_net_init(void)
> +int rxe_net_init(struct net_device *ndev)
>  {
>  	int err;
>
> -	recv_sockets.sk6 = NULL;
> -
> -	err = rxe_net_ipv4_init();
> +	err = rxe_net_ipv4_init(ndev);
>  	if (err)
>  		return err;
> -	err = rxe_net_ipv6_init();
> +
> +	err = rxe_net_ipv6_init(ndev);
>  	if (err)
>  		goto err_out;
>  	return 0;
> +
>  err_out:
> +	/* If ipv6 error, release ipv4 resource */
> +	udp_tunnel_sock_release(rxe_ns_pernet_sk4(dev_net(ndev))->sk_socket);
> +	rxe_ns_pernet_set_sk4(dev_net(ndev), NULL);
>  	rxe_net_exit();
>  	return err;
>  }

The call to rxe_net_exit() in the error path looks like it was carried over
from the old code but is no longer correct.

In the old code rxe_net_init() registered the netdev notifier itself, so
calling rxe_net_exit() on failure was the right cleanup. In the new code
the notifier is registered once at module load in rxe_register_notifier():

    rxe_module_init()
        rxe_register_notifier()
            register_netdevice_notifier(&rxe_net_notifier)  /* module-wide */

    rxe_net_init()  /* called per link from rxe_newlink() */
        rxe_net_ipv4_init()
        rxe_net_ipv6_init()  /* fails */
    err_out:
        rxe_net_exit()
            unregister_netdevice_notifier(&rxe_net_notifier)  /* oops */

If rxe_net_ipv6_init() fails for any link creation attempt (e.g., -ENOMEM
from udp_sock_create), the module-level notifier is unregistered. After
that, NETDEV_UNREGISTER, NETDEV_CHANGEMTU and other events are silently
ignored for all existing rxe devices.

Should the err_out path call rxe_net_exit() at all, given that
rxe_net_exit() is now only responsible for unregistering the global
notifier?

The same error path also calls udp_tunnel_sock_release() unconditionally
on the IPv4 socket. When rxe_net_ipv4_init() found an existing socket and
called sock_hold() rather than creating a new one, calling
udp_tunnel_sock_release() tears down a socket that belongs to an already
active link. Should the IPv4 cleanup distinguish between the sock_hold
case and the socket creation case?
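
One way to make that distinction is for the IPv4 step to report whether
it created the socket. The following userspace sketch illustrates the
idea under stated assumptions: struct model_ns and the model_* functions
are hypothetical, and the users field is an explicit per-namespace count
rather than the socket's own refcount.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-namespace IPv4 socket state. */
struct model_ns {
	int users;	/* links currently sharing the socket */
	bool open;	/* socket exists */
};

/* Models rxe_net_ipv4_init(); reports whether this call created the socket. */
static void model_ipv4_init(struct model_ns *ns, bool *created)
{
	if (ns->open) {
		ns->users++;
		*created = false;
		return;
	}
	ns->open = true;
	ns->users = 1;
	*created = true;
}

/* Undo exactly what model_ipv4_init() did for this caller. */
static void model_ipv4_undo(struct model_ns *ns, bool created)
{
	if (created) {
		ns->open = false;	/* we made it, so we tear it down */
		ns->users = 0;
	} else {
		assert(ns->users > 1);
		ns->users--;		/* only drop our own reference */
	}
}

/* Models rxe_net_init(); ipv6_err injects an IPv6 setup failure. */
static int model_net_init(struct model_ns *ns, int ipv6_err)
{
	bool created;

	model_ipv4_init(ns, &created);
	if (ipv6_err) {
		model_ipv4_undo(ns, created);
		return ipv6_err;	/* the global notifier is left alone */
	}
	return 0;
}
```

A failed second link then leaves the first link's socket open, and the
error path never touches the module-wide notifier.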

> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c

[ ... ]

> +	rdma_link_register(&rxe_link_ops);
> +	err = rxe_register_notifier();
>  	if (err) {
> +		pr_err("Failed to register netdev notifier\n");
>  		rxe_destroy_wq();
> -		return err;
> +		return -1;
>  	}
> +
> +	err = rxe_namespace_init();
>  	if (err) {
> +		pr_err("Failed to register net namespace notifier\n");
> +		return -1;
>  	}

If rxe_namespace_init() fails, rdma_link_register() and
rxe_register_notifier() are not undone before the module init returns
an error. Since the kernel does not call module_exit when module_init
fails, rdma_link_unregister(), rxe_net_exit() and rxe_destroy_wq() are
never called.

Also, both error returns use -1 rather than propagating err or a proper
errno. -1 maps to EPERM which is misleading for a registration failure.
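
The usual remedy is goto-based unwinding that undoes each completed step
in reverse order and returns the original error code. The userspace
sketch below shows the pattern only: struct model_state and
model_module_init() are hypothetical, the boolean fields stand in for the
real registrations, and registering the link ops last is a choice made
for this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flags standing in for the real setup steps; all names are hypothetical. */
struct model_state {
	bool wq, notifier, pernet, link_ops;
	int fail_notifier;	/* errno-style value to inject, or 0 */
	int fail_pernet;	/* errno-style value to inject, or 0 */
};

static int model_module_init(struct model_state *s)
{
	int err;

	s->wq = true;			/* models workqueue creation */

	err = s->fail_notifier;		/* models rxe_register_notifier() */
	if (err)
		goto err_destroy_wq;
	s->notifier = true;

	err = s->fail_pernet;		/* models rxe_namespace_init() */
	if (err)
		goto err_unregister_notifier;
	s->pernet = true;

	s->link_ops = true;		/* models rdma_link_register() */
	return 0;

err_unregister_notifier:
	s->notifier = false;		/* models rxe_net_exit() */
err_destroy_wq:
	s->wq = false;			/* models rxe_destroy_wq() */
	return err;			/* propagate the real error code */
}
```

Because the link ops are registered only after every other step has
succeeded, no failure path has to unregister them.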


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-05 18:54 ` David Ahern
@ 2026-03-06  2:38   ` Yanjun.Zhu
  0 siblings, 0 replies; 12+ messages in thread
From: Yanjun.Zhu @ 2026-03-06  2:38 UTC (permalink / raw)
  To: David Ahern, jgg, leon, zyjzyj2000, linux-rdma


On 3/5/26 10:54 AM, David Ahern wrote:
> On 3/3/26 9:16 PM, Zhu Yanjun wrote:
>> When run "ip link add" command to add a rxe rdma link in a net
>> namespace, normally this rxe rdma link can not work in a net
>> name space.
>>
>> The root cause is that a sock listening on udp port 4791 is created
>> in init_net when the rdma_rxe module is loaded into kernel. That is,
>> the sock listening on udp port 4791 is created in init_net. Other net
>> namespace is difficult to use this sock.
>>
>> The following commits will solve this problem.
> you squashed all of the changes into 1 commit, so either the commit
> message needs to be fixed up or you need to do the patch series.
>
> That said, I still think the optimizations around tracking the number of
> devices in the namespace and closing the sockets are unnecessary at this
> time. It brings in complications to your set which is delaying the merge
> of namespace support. If it is done, the socket refcount approach seems
> problematic; how about a simpler approach with a device count in the
> net_generic struct used per namespace (again, I really believe this
> should be a follow on patch).

I changed the definition to "#define SK_REF_FOR_TUNNEL 1" as suggested
in the comments from Claude.

With that change, the warning below appears. If I keep the value at 2,
the warning does not appear.

Please use Claude to analyze this. Thanks a lot.

Mar  5 18:25:18 localhost kernel: ------------[ cut here ]------------
Mar  5 18:25:18 localhost kernel: refcount_t: decrement hit 0; leaking memory.
Mar  5 18:25:18 localhost kernel: WARNING: lib/refcount.c:31 at refcount_warn_saturate+0x22/0x90, CPU#6: kworker/u32:0/12
Mar  5 18:25:18 localhost kernel: Modules linked in: rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm rdma_rxe(-) ib_uverbs ib_core sunrpc qrtr rfkill binfmt_misc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel kvm irqbypass rapl snd_hda_codec_generic iTCO_wdt intel_pmc_bxt snd_hda_intel joydev snd_hda_codec snd_hda_core pcspkr i2c_i801 snd_intel_dspcfg i2c_smbus snd_intel_sdw_acpi snd_hwdep snd_pcm virtio_net snd_timer virtio_balloon snd net_failover soundcore lpc_ich failover dm_multipath loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci zram xfs ghash_clmulni_intel virtio_gpu virtio_dma_buf serio_raw scsi_dh_rdac scsi_dh_emc scsi_dh_alua i2c_dev fuse qemu_fw_cfg [last unloaded: veth]
Mar  5 18:25:18 localhost kernel: CPU: 6 UID: 0 PID: 12 Comm: kworker/u32:0 Not tainted 7.0.0-rc2-net-ns+ #25 PREEMPT(lazy)
Mar  5 18:25:18 localhost kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Mar  5 18:25:18 localhost kernel: Workqueue: netns cleanup_net
Mar  5 18:25:18 localhost kernel: RIP: 0010:refcount_warn_saturate+0x22/0x90
Mar  5 18:25:18 localhost kernel: Code: 90 90 90 90 90 90 90 90 f3 0f 1e fa c7 07 00 00 00 c0 83 fe 02 74 54 76 1b 83 fe 03 74 3c 83 fe 04 75 26 48 8d 3d 0e 31 a6 01 <67> 48 0f b9 3a e9 d4 a4 95 00 85 f6 74 44 48 8d 3d 09 31 a6 01 67
Mar  5 18:25:18 localhost kernel: RSP: 0018:ffffd5544006bd00 EFLAGS: 00010246
Mar  5 18:25:18 localhost kernel: RAX: 0000000000000001 RBX: ffff8de482a58fc0 RCX: ffff8de4824d0000
Mar  5 18:25:18 localhost kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffffbcea51d0
Mar  5 18:25:18 localhost kernel: RBP: ffff8de4824ccd60 R08: 000000097fbbc089 R09: 0000000000000001
Mar  5 18:25:18 localhost kernel: R10: 0000000000000006 R11: 0000000000000000 R12: ffff8de4824ee000
Mar  5 18:25:18 localhost kernel: R13: ffff8de4824ccd6c R14: ffffffffbce81fa0 R15: 0000000000007f00
Mar  5 18:25:18 localhost kernel: FS:  0000000000000000(0000) GS:ffff8de63a3af000(0000) knlGS:0000000000000000
Mar  5 18:25:18 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  5 18:25:18 localhost kernel: CR2: 00007f935f994cec CR3: 00000001036d3002 CR4: 0000000000772ef0
Mar  5 18:25:18 localhost kernel: PKRU: 55555554
Mar  5 18:25:18 localhost kernel: Call Trace:
Mar  5 18:25:18 localhost kernel: <TASK>
Mar  5 18:25:18 localhost kernel: udp_lib_unhash+0x259/0x280
Mar  5 18:25:18 localhost kernel: sk_common_release+0x3a/0x100
Mar  5 18:25:18 localhost kernel: inet_release+0x43/0x80
Mar  5 18:25:18 localhost kernel: sock_release+0x24/0x70
Mar  5 18:25:18 localhost kernel: rxe_ns_exit+0x53/0x90 [rdma_rxe]
Mar  5 18:25:18 localhost kernel: ops_undo_list+0xdb/0x220
Mar  5 18:25:18 localhost kernel: cleanup_net+0x1f6/0x370
Mar  5 18:25:18 localhost kernel: process_one_work+0x192/0x390
Mar  5 18:25:18 localhost kernel: worker_thread+0x196/0x300
Mar  5 18:25:18 localhost kernel: ? __pfx_worker_thread+0x10/0x10
Mar  5 18:25:18 localhost kernel: kthread+0xe3/0x120
Mar  5 18:25:18 localhost kernel: ? __pfx_kthread+0x10/0x10
Mar  5 18:25:18 localhost kernel: ret_from_fork+0x1a1/0x270
Mar  5 18:25:18 localhost kernel: ? __pfx_kthread+0x10/0x10
Mar  5 18:25:18 localhost kernel: ret_from_fork_asm+0x1a/0x30
Mar  5 18:25:18 localhost kernel: </TASK>
Mar  5 18:25:18 localhost kernel: ---[ end trace 0000000000000000 ]---
Mar  5 18:25:18 localhost kernel: ------------[ cut here ]------------
Mar  5 18:25:18 localhost kernel: refcount_t: underflow; use-after-free.
Mar  5 18:25:18 localhost kernel: WARNING: lib/refcount.c:28 at 
refcount_warn_saturate+0x59/0x90, CPU#6: kworker/u32:0/12
Mar  5 18:25:18 localhost kernel: Modules linked in: rpcrdma rdma_ucm 
ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi 
scsi_transport_iscsi rdma_cm iw_cm ib_cm rdma_rxe(-) ib_uverbs ib_core 
sunrpc qrtr rfkill binfmt_misc intel_rapl_msr intel_rapl_common 
intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery 
pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel kvm irqbypass 
rapl snd_hda_codec_generic iTCO_wdt intel_pmc_bxt snd_hda_intel joydev 
snd_hda_codec snd_hda_core pcspkr i2c_i801 snd_intel_dspcfg i2c_smbus 
snd_intel_sdw_acpi snd_hwdep snd_pcm virtio_net snd_timer virtio_balloon 
snd net_failover soundcore lpc_ich failover dm_multipath loop nfnetlink 
vsock_loopback vmw_vsock_virtio_transport_common 
vmw_vsock_vmci_transport vsock vmw_vmci zram xfs ghash_clmulni_intel 
virtio_gpu virtio_dma_buf serio_raw scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua i2c_dev fuse qemu_fw_cfg [last unloaded: veth]
Mar  5 18:25:18 localhost kernel: CPU: 6 UID: 0 PID: 12 Comm: 
kworker/u32:0 Tainted: G        W           7.0.0-rc2-net-ns+ #25 
PREEMPT(lazy)
Mar  5 18:25:18 localhost kernel: Tainted: [W]=WARN
Mar  5 18:25:18 localhost kernel: Hardware name: QEMU Standard PC (Q35 + 
ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Mar  5 18:25:18 localhost kernel: Workqueue: netns cleanup_net
Mar  5 18:25:18 localhost kernel: RIP: 0010:refcount_warn_saturate+0x59/0x90
Mar  5 18:25:18 localhost kernel: Code: 44 48 8d 3d 09 31 a6 01 67 48 0f 
b9 3a e9 bf a4 95 00 48 8d 3d 08 31 a6 01 67 48 0f b9 3a c3 cc cc cc cc 
48 8d 3d 07 31 a6 01 <67> 48 0f b9 3a c3 cc cc cc cc 48 8d 3d 06 31 a6 
01 67 48 0f b9 3a
Mar  5 18:25:18 localhost kernel: RSP: 0018:ffffd5544006bd58 EFLAGS: 
00010246
Mar  5 18:25:18 localhost kernel: RAX: 00000000c0000000 RBX: 
ffff8de482a58fc0 RCX: ffff8de4824d0000
Mar  5 18:25:18 localhost kernel: RDX: 00000000000000ff RSI: 
0000000000000003 RDI: ffffffffbcea5200
Mar  5 18:25:18 localhost kernel: RBP: ffff8de491630000 R08: 
000000097fbbc089 R09: 0000000000000001
Mar  5 18:25:18 localhost kernel: R10: 0000000000000006 R11: 
0000000000000000 R12: ffff8de482a58fc0
Mar  5 18:25:18 localhost kernel: R13: ffffffffbcdce910 R14: 
ffffffffbcdce910 R15: ffffd5544006bdb8
Mar  5 18:25:18 localhost kernel: FS:  0000000000000000(0000) 
GS:ffff8de63a3af000(0000) knlGS:0000000000000000
Mar  5 18:25:18 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Mar  5 18:25:18 localhost kernel: CR2: 00007f935f994cec CR3: 
00000001036d3002 CR4: 0000000000772ef0
Mar  5 18:25:18 localhost kernel: PKRU: 55555554
Mar  5 18:25:18 localhost kernel: Call Trace:
Mar  5 18:25:18 localhost kernel: <TASK>
Mar  5 18:25:18 localhost kernel: inet_release+0x43/0x80
Mar  5 18:25:18 localhost kernel: sock_release+0x24/0x70
Mar  5 18:25:18 localhost kernel: rxe_ns_exit+0x53/0x90 [rdma_rxe]
Mar  5 18:25:18 localhost kernel: ops_undo_list+0xdb/0x220
Mar  5 18:25:18 localhost kernel: cleanup_net+0x1f6/0x370
Mar  5 18:25:18 localhost kernel: process_one_work+0x192/0x390
Mar  5 18:25:18 localhost kernel: worker_thread+0x196/0x300
Mar  5 18:25:18 localhost kernel: ? __pfx_worker_thread+0x10/0x10
Mar  5 18:25:18 localhost kernel: kthread+0xe3/0x120
Mar  5 18:25:18 localhost kernel: ? __pfx_kthread+0x10/0x10
Mar  5 18:25:18 localhost kernel: ret_from_fork+0x1a1/0x270
Mar  5 18:25:18 localhost kernel: ? __pfx_kthread+0x10/0x10
Mar  5 18:25:18 localhost kernel: ret_from_fork_asm+0x1a/0x30
Mar  5 18:25:18 localhost kernel: </TASK>
Mar  5 18:25:18 localhost kernel: ---[ end trace 0000000000000000 ]---
Mar  5 18:25:18 localhost kernel: rdma_rxe: unloaded
Mar  5 18:25:18 localhost NetworkManager[762]: <info> [1772763918.8424] 
manager: (veth1): new Veth device 
(/org/freedesktop/NetworkManager/Devices/7)
Mar  5 18:25:18 localhost NetworkManager[762]: <info> [1772763918.8428] 
manager: (veth0): new Veth device 
(/org/freedesktop/NetworkManager/Devices/8)
Mar  5 18:25:18 localhost kernel: rdma_rxe: loaded
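
For readers following the two warnings above, here is a minimal userspace
sketch of how one unbalanced put could produce both messages in that order.
It is a single-threaded model written from a reading of
include/linux/refcount.h and lib/refcount.c, not the kernel code itself:
the model_* names are hypothetical, plain ints stand in for atomics, and
the starting count of 1 is an assumption about what the trace implies, not
something the log states.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Value a counter is pinned to after a warning; INT_MIN / 2 is 0xc0000000,
 * the value visible in RAX of the second trace. */
#define MODEL_SATURATED (INT_MIN / 2)

enum model_warn { MODEL_WARN_NONE, MODEL_WARN_SUB_UAF, MODEL_WARN_DEC_LEAK };

struct model_ref {
	int refs;                  /* plain int; the kernel uses an atomic */
	enum model_warn last_warn; /* most recent warning, for inspection */
};

/* Pin the counter and report, like refcount_warn_saturate(). */
static void model_warn_saturate(struct model_ref *r, enum model_warn w)
{
	r->refs = MODEL_SATURATED;
	r->last_warn = w;
	printf("refcount_t: %s\n", w == MODEL_WARN_DEC_LEAK ?
	       "decrement hit 0; leaking memory." :
	       "underflow; use-after-free.");
}

/* Models refcount_dec(): a put that must never be the final one. */
static void model_dec(struct model_ref *r)
{
	int old = r->refs;

	r->refs = old - 1;
	if (old <= 1)
		model_warn_saturate(r, MODEL_WARN_DEC_LEAK);
}

/* Models refcount_dec_and_test(): true only when the last reference goes. */
static bool model_dec_and_test(struct model_ref *r)
{
	int old = r->refs;

	r->refs = old - 1;
	if (old == 1)
		return true;
	if (old <= 0)
		model_warn_saturate(r, MODEL_WARN_SUB_UAF);
	return false;
}

/* Hypothetical release path: first the hash table's reference is dropped
 * (as on the udp_lib_unhash() frame), then the owner's reference. */
static bool model_release(struct model_ref *sk)
{
	model_dec(sk);
	return model_dec_and_test(sk);
}
```

In this model, model_release() on a counter that starts at 2 returns true
and prints nothing. Starting at 1, as if one reference had already been
dropped elsewhere, it prints "decrement hit 0; leaking memory." followed by
"underflow; use-after-free." and leaves the counter saturated.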

Zhu Yanjun

>
> Also, claude has some comments about this patch. See attached. At this
> point surely you have access to some AI model that can do code reviews.
> There are prompts here https://github.com/masoncl/review-prompts.git
> that can be leveraged as I did here.


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-04  4:16 [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace Zhu Yanjun
  2026-03-04  4:44 ` Zhu Yanjun
  2026-03-05 18:54 ` David Ahern
@ 2026-03-06  2:58 ` kernel test robot
  2026-03-06 21:28   ` yanjun.zhu
  2 siblings, 1 reply; 12+ messages in thread
From: kernel test robot @ 2026-03-06  2:58 UTC (permalink / raw)
  To: Zhu Yanjun, jgg, leon, zyjzyj2000, dsahern, linux-rdma
  Cc: llvm, oe-kbuild-all

Hi Zhu,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rdma/for-next]
[also build test WARNING on linus/master v7.0-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zhu-Yanjun/RDMA-rxe-Add-the-support-that-rxe-can-work-in-net-namespace/20260304-121951
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next
patch link:    https://lore.kernel.org/r/20260304041607.11685-1-yanjun.zhu%40linux.dev
patch subject: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
config: x86_64-randconfig-014-20260305 (https://download.01.org/0day-ci/archive/20260306/202603061015.zwXUa3OS-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260306/202603061015.zwXUa3OS-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603061015.zwXUa3OS-lkp@intel.com/

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> WARNING: modpost: drivers/infiniband/sw/rxe/rdma_rxe: section mismatch in reference: rxe_namespace_exit+0x7 (section: .exit.text) -> rxe_net_ops (section: .init.data)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
  2026-03-06  2:58 ` kernel test robot
@ 2026-03-06 21:28   ` yanjun.zhu
  0 siblings, 0 replies; 12+ messages in thread
From: yanjun.zhu @ 2026-03-06 21:28 UTC (permalink / raw)
  To: kernel test robot, jgg, leon, zyjzyj2000, dsahern, linux-rdma
  Cc: llvm, oe-kbuild-all

On 3/5/26 6:58 PM, kernel test robot wrote:
> Hi Zhu,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on rdma/for-next]
> [also build test WARNING on linus/master v7.0-rc2]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Zhu-Yanjun/RDMA-rxe-Add-the-support-that-rxe-can-work-in-net-namespace/20260304-121951
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next
> patch link:    https://lore.kernel.org/r/20260304041607.11685-1-yanjun.zhu%40linux.dev
> patch subject: [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace
> config: x86_64-randconfig-014-20260305 (https://download.01.org/0day-ci/archive/20260306/202603061015.zwXUa3OS-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260306/202603061015.zwXUa3OS-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202603061015.zwXUa3OS-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>, old ones prefixed by <<):

I built the latest commits with W=1, and this problem does not occur.

Zhu Yanjun

> 
>>> WARNING: modpost: drivers/infiniband/sw/rxe/rdma_rxe: section mismatch in reference: rxe_namespace_exit+0x7 (section: .exit.text) -> rxe_net_ops (section: .init.data)
> 



Thread overview: 12+ messages
2026-03-04  4:16 [PATCHv2 1/1] RDMA/rxe: Add the support that rxe can work in net namespace Zhu Yanjun
2026-03-04  4:44 ` Zhu Yanjun
2026-03-04 19:29   ` David Ahern
2026-03-05  3:29     ` Zhu Yanjun
2026-03-05 18:58       ` David Ahern
2026-03-05 23:15         ` Yanjun.Zhu
2026-03-05 23:41           ` David Ahern
2026-03-06 21:29             ` yanjun.zhu
2026-03-05 18:54 ` David Ahern
2026-03-06  2:38   ` Yanjun.Zhu
2026-03-06  2:58 ` kernel test robot
2026-03-06 21:28   ` yanjun.zhu
