linux-rdma.vger.kernel.org archive mirror
* [PATCH v6.4-rc1 v5 0/8]  Fix the problem that rxe can not work in net namespace
@ 2023-05-08  7:56 Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun

From: Zhu Yanjun <yanjun.zhu@linux.dev>

When the "rdma link add" command is run to add an rxe rdma link in a
net namespace, the resulting rxe rdma link normally does not work in
that net namespace.

The root cause is that the sock listening on udp port 4791 is created
in init_net when the rdma_rxe module is loaded into the kernel. Other
net namespaces cannot easily use this sock.

The following commits will solve this problem.

The first commit moves the creation of the sock listening on udp port
4791 from the module_init function to the rdma link creation function.
That is, the sock is no longer created when the rdma_rxe module is
loaded. Instead, it is created when the "rdma link add ..." command is
run, so an rdma link created in a net namespace gets its sock in that
net namespace.

In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
check whether the sock already exists in the net namespace. If it does,
the rdma link increases the reference count of this sock and continues
with the remaining work instead of creating a new sock to listen on udp
port 4791. Since the network notifier is global, it is registered when
the rdma_rxe module is loaded.
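
The lookup-then-create step described above can be sketched as a small
userspace model. This is not the kernel code: struct listener, the
listeners table, listener_lookup() and listener_get() are hypothetical
names invented for illustration, and namespaces are plain integers.
Only the port number 4791 comes from the series.

```c
#include <assert.h>
#include <stddef.h>

#define ROCE_V2_UDP_DPORT 4791	/* UDP port named in the cover letter */
#define MAX_LISTENERS 8		/* hypothetical table size for this model */

/* Hypothetical stand-in for a listening sock owned by one net namespace. */
struct listener {
	int netns_id;	/* namespace that owns the listener */
	int port;	/* UDP port; 0 marks a free slot */
	int refcnt;	/* number of rdma links sharing the listener */
};

static struct listener listeners[MAX_LISTENERS];

/* Models the lookup step: find the listener bound to @port inside
 * namespace @netns_id, or return NULL if that namespace has none.
 */
static struct listener *listener_lookup(int netns_id, int port)
{
	size_t i;

	for (i = 0; i < MAX_LISTENERS; i++) {
		if (listeners[i].port == port &&
		    listeners[i].netns_id == netns_id)
			return &listeners[i];
	}
	return NULL;
}

/* Models "rdma link add": reuse the namespace's listener if it exists
 * (taking one more reference), otherwise create it in a free slot.
 * Returns NULL when the table is full.
 */
static struct listener *listener_get(int netns_id, int port)
{
	struct listener *l;
	size_t i;

	assert(port != 0);	/* 0 is reserved for free slots */

	l = listener_lookup(netns_id, port);
	if (l) {
		l->refcnt++;
		return l;
	}

	for (i = 0; i < MAX_LISTENERS; i++) {
		if (listeners[i].port == 0) {
			listeners[i].netns_id = netns_id;
			listeners[i].port = port;
			listeners[i].refcnt = 1;
			return &listeners[i];
		}
	}
	return NULL;
}
```

In this model, two calls to listener_get() for the same namespace return
the same listener with refcnt 2, while a call for another namespace
creates a separate listener.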

After the rdma link is created, the command "rdma link del" deletes the
rdma link and checks the sock at the same time. If the reference count
of this sock is greater than the reference count needed by the udp
tunnel, the reference count is decreased by one. If the two are equal,
this rdma link is the last one, so the udp tunnel is shut down and the
sock is closed. This work belongs in a dellink function, but rxe
currently has none. So the 3rd commit adds the dellink function pointer,
and the 4th commit implements the dellink function in rxe.
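
The reference-count decision made on "rdma link del" can be sketched in
userspace as follows. This is an illustrative model only: struct
model_sock, model_link_add() and model_link_del() are hypothetical
names, and REFS_FOR_TUNNEL simply mirrors the value of
SK_REF_FOR_TUNNEL used in patch 4/8. That the first link leaves the
sock at exactly this count is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors SK_REF_FOR_TUNNEL from patch 4/8: the references the udp
 * tunnel itself is assumed to hold on the sock.
 */
#define REFS_FOR_TUNNEL 2

/* Hypothetical stand-in for the listening sock. */
struct model_sock {
	int refcnt;
	bool open;
};

/* Models "rdma link add": the first link opens the tunnel, every
 * later link only takes one more reference.
 */
static void model_link_add(struct model_sock *sk)
{
	if (!sk->open) {
		sk->open = true;
		sk->refcnt = REFS_FOR_TUNNEL;
	} else {
		sk->refcnt++;
	}
}

/* Models "rdma link del". Returns true if this was the last link and
 * the tunnel was shut down, false if only a reference was dropped.
 */
static bool model_link_del(struct model_sock *sk)
{
	assert(sk->open && sk->refcnt >= REFS_FOR_TUNNEL);

	if (sk->refcnt > REFS_FOR_TUNNEL) {
		sk->refcnt--;
		return false;
	}

	sk->open = false;
	sk->refcnt = 0;
	return true;
}
```

With three links added, the first two deletions only drop a reference
and the third one closes the sock.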

At this point, it is no longer necessary to keep a global variable to
store the sock listening on udp port 4791. The functions udp4_lib_lookup
and udp6_lib_lookup can replace this global variable entirely. This is
what the 5th commit does. Because udp6_lib_lookup would be called in the
fast path, a member variable l_sk6 is added to store the sock. If l_sk6
is NULL, udp6_lib_lookup is called to look up the sock, which is then
stored in l_sk6 so that later calls can use it directly.
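
The l_sk6 caching idea can be sketched as a userspace model. Everything
except the member name l_sk6 is hypothetical: struct model_sock, struct
model_dev, slow_lookup() and dev_get_sk6() exist only for illustration,
and slow_lookup() stands in for the udp6_lib_lookup call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the listening sock. */
struct model_sock {
	int port;
};

/* Hypothetical stand-in for the rxe device with its cached sock. */
struct model_dev {
	struct model_sock *l_sk6;
};

static struct model_sock the_sock = { 4791 };
static int slow_lookups;	/* counts how often the slow path runs */

/* Stands in for the comparatively expensive sock lookup. */
static struct model_sock *slow_lookup(void)
{
	slow_lookups++;
	return &the_sock;
}

/* Fast-path accessor: look the sock up once, then reuse the cached
 * pointer on every later call.
 */
static struct model_sock *dev_get_sk6(struct model_dev *dev)
{
	assert(dev != NULL);

	if (!dev->l_sk6)
		dev->l_sk6 = slow_lookup();

	return dev->l_sk6;
}
```

Calling dev_get_sk6() repeatedly on one device runs slow_lookup() only
on the first call.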

All the above work is done in init_net, but it also works in other net
namespaces. So the 6th commit replaces init_net with the individual net
namespace. Because an rxe device depends on the net device and on the
sock listening on udp port 4791, every rxe device is in exclusive mode
in its individual net namespace. Other rdma netns operations will be
considered in the future.

In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
functions are added. When a new net namespace is created, the init
function initializes the sk4 and sk6 socks. The 2 socks are released
when the net namespace is destroyed. The functions
rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
namespace, and the functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 do
the same for sk6. The code introduced by the previous commits then uses
these sk4 and sk6.
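
The per-namespace get/set pattern can be sketched as a userspace model.
The names below (struct model_net, model_ns_init(), model_ns_exit(),
model_ns_sk4() and so on) are hypothetical and are not the signatures of
the rxe_ns_pernet_* helpers. The sketch only shows that each namespace
carries its own sk4/sk6 pair with an init and an exit step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a listening sock. */
struct model_sock {
	int port;
};

/* Hypothetical per-namespace storage: one sk4/sk6 pair per namespace. */
struct model_net {
	struct model_sock *sk4;
	struct model_sock *sk6;
};

/* Models the init callback run when a namespace is created. */
static void model_ns_init(struct model_net *net)
{
	net->sk4 = NULL;
	net->sk6 = NULL;
}

/* Models the exit callback run when a namespace is destroyed. Real
 * code would release the socks here; the model only forgets them.
 */
static void model_ns_exit(struct model_net *net)
{
	net->sk4 = NULL;
	net->sk6 = NULL;
}

static struct model_sock *model_ns_sk4(const struct model_net *net)
{
	assert(net != NULL);
	return net->sk4;
}

static void model_ns_set_sk4(struct model_net *net, struct model_sock *sk)
{
	assert(net != NULL);
	net->sk4 = sk;
}

static struct model_sock *model_ns_sk6(const struct model_net *net)
{
	assert(net != NULL);
	return net->sk6;
}

static void model_ns_set_sk6(struct model_net *net, struct model_sock *sk)
{
	assert(net != NULL);
	net->sk6 = sk;
}
```

Setting sk4 or sk6 in one struct model_net leaves every other
namespace's pair untouched, which is the isolation the pernet storage
is meant to provide.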

Since the sk4 and sk6 in the pernet namespace can be accessed, the
separate l_sk6 is no longer necessary. As such, the 8th commit replaces
l_sk6 with the sk6 in the pernet namespace.

Test steps:
1) Suppose that 2 NICs are in 2 different net namespaces.

  # ip netns exec net0 ip link
  3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
     link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
     altname enp5s0

  # ip netns exec net1 ip link
  4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
     link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff

2) Add an rdma link in each net namespace.
    net0:
    # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2

    net1:
    # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3

3) Run rping test.
    net0
    # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
    [1] 1737
    # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
    verbose
    count 1
    ...
    ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
    ...

4) Remove the rdma links from the net namespaces.
    net0:
    # ip netns exec net0 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
    UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
    UNCONN    0         0         [::]:4791             [::]:*

    # ip netns exec net0 rdma link del rxe0

    # ip netns exec net0 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process

    net1:
    # ip netns exec net1 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
    UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
    UNCONN    0         0         [::]:4791             [::]:*

    # ip netns exec net1 rdma link del rxe1

    # ip netns exec net1 ss -lu
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process

V4->V5: Rebase the commits to V6.4-rc1

V3->V4: Rebase the commits to rdma-next

V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
           verify rdma link is removed.
        2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
           notifier
        3) Replace l_sk6 with sk6 of pernet_name_space

V1->V2: Add the explicit initialization of sk6.

Zhu Yanjun (8):
  RDMA/rxe: Creating listening sock in newlink function
  RDMA/rxe: Support more rdma links in init_net
  RDMA/nldev: Add dellink function pointer
  RDMA/rxe: Implement dellink in rxe
  RDMA/rxe: Replace global variable with sock lookup functions
  RDMA/rxe: add the support of net namespace
  RDMA/rxe: Add the support of net namespace notifier
  RDMA/rxe: Replace l_sk6 with sk6 in net namespace

 drivers/infiniband/core/nldev.c     |   6 ++
 drivers/infiniband/sw/rxe/Makefile  |   3 +-
 drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
 drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
 drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
 include/rdma/rdma_netlink.h         |   2 +
 8 files changed, 279 insertions(+), 40 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h

-- 
2.27.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-06-20 17:16   ` Bob Pearson
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally, the sock listening on udp port 4791 is created when the
rdma_rxe module is loaded. Move the creation of this listening sock to
the newlink function.

As a result, the sock listening on udp port 4791 is created when the
"rdma link add" command is run.

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 7a7e713de52d..89b24bc34299 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -194,6 +194,10 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
+	err = rxe_net_init();
+	if (err)
+		return err;
+
 	err = rxe_net_add(ibdev_name, ndev);
 	if (err) {
 		rxe_err("failed to add %s\n", ndev->name);
@@ -210,12 +214,6 @@ static struct rdma_link_ops rxe_link_ops = {
 
 static int __init rxe_module_init(void)
 {
-	int err;
-
-	err = rxe_net_init();
-	if (err)
-		return err;
-
 	rdma_link_register(&rxe_link_ops);
 	pr_info("loaded\n");
 	return 0;
-- 
2.27.0



* [PATCH v6.4-rc1 v5 2/8] RDMA/rxe: Support more rdma links in init_net
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-06-20 17:54   ` Bob Pearson
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 3/8] RDMA/nldev: Add dellink function pointer Zhu Yanjun
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

In init_net, when several rdma links are created with the command "rdma
link add", newlink checks whether a sock is already listening on udp
port 4791. If not, a sock listening on udp port 4791 is created. If so,
the reference count of the existing sock is increased.

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     | 12 ++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 55 +++++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_net.h |  1 +
 3 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 89b24bc34299..c15d3c5d7a6f 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -8,6 +8,7 @@
 #include <net/addrconf.h>
 #include "rxe.h"
 #include "rxe_loc.h"
+#include "rxe_net.h"
 
 MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
 MODULE_DESCRIPTION("Soft RDMA transport");
@@ -207,14 +208,23 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 	return err;
 }
 
-static struct rdma_link_ops rxe_link_ops = {
+struct rdma_link_ops rxe_link_ops = {
 	.type = "rxe",
 	.newlink = rxe_newlink,
 };
 
 static int __init rxe_module_init(void)
 {
+	int err;
+
 	rdma_link_register(&rxe_link_ops);
+
+	err = rxe_register_notifier();
+	if (err) {
+		pr_err("Failed to register netdev notifier\n");
+		return -1;
+	}
+
 	pr_info("loaded\n");
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 2bc7361152ea..1b98efa2cf66 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -626,13 +626,23 @@ static struct notifier_block rxe_net_notifier = {
 
 static int rxe_net_ipv4_init(void)
 {
-	recv_sockets.sk4 = rxe_setup_udp_tunnel(&init_net,
-				htons(ROCE_V2_UDP_DPORT), false);
-	if (IS_ERR(recv_sockets.sk4)) {
-		recv_sockets.sk4 = NULL;
+	struct sock *sk;
+	struct socket *sock;
+
+	rcu_read_lock();
+	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
+			     htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (sk)
+		return 0;
+
+	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
+	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
+		recv_sockets.sk4 = NULL;
 		return -1;
 	}
+	recv_sockets.sk4 = sock;
 
 	return 0;
 }
@@ -640,24 +650,46 @@ static int rxe_net_ipv4_init(void)
 static int rxe_net_ipv6_init(void)
 {
 #if IS_ENABLED(CONFIG_IPV6)
+	struct sock *sk;
+	struct socket *sock;
+
+	rcu_read_lock();
+	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
+			     htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (sk)
+		return 0;
 
-	recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
-						htons(ROCE_V2_UDP_DPORT), true);
-	if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
+	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
+	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
 		recv_sockets.sk6 = NULL;
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
 	}
 
-	if (IS_ERR(recv_sockets.sk6)) {
+	if (IS_ERR(sock)) {
 		recv_sockets.sk6 = NULL;
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
+	recv_sockets.sk6 = sock;
 #endif
 	return 0;
 }
 
+int rxe_register_notifier(void)
+{
+	int err;
+
+	err = register_netdevice_notifier(&rxe_net_notifier);
+	if (err) {
+		pr_err("Failed to register netdev notifier\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 void rxe_net_exit(void)
 {
 	rxe_release_udp_tunnel(recv_sockets.sk6);
@@ -669,19 +701,12 @@ int rxe_net_init(void)
 {
 	int err;
 
-	recv_sockets.sk6 = NULL;
-
 	err = rxe_net_ipv4_init();
 	if (err)
 		return err;
 	err = rxe_net_ipv6_init();
 	if (err)
 		goto err_out;
-	err = register_netdevice_notifier(&rxe_net_notifier);
-	if (err) {
-		pr_err("Failed to register netdev notifier\n");
-		goto err_out;
-	}
 	return 0;
 err_out:
 	rxe_net_exit();
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 45d80d00f86b..a222c3eeae12 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -18,6 +18,7 @@ struct rxe_recv_sockets {
 
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 
+int rxe_register_notifier(void);
 int rxe_net_init(void);
 void rxe_net_exit(void);
 
-- 
2.27.0



* [PATCH v6.4-rc1 v5 3/8] RDMA/nldev: Add dellink function pointer
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 4/8] RDMA/rxe: Implement dellink in rxe Zhu Yanjun
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

The newlink function pointer already exists, and the sock listening on
port 4791 is now created in the newlink function. So a dellink function
pointer is needed to remove the sock.

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/core/nldev.c | 6 ++++++
 include/rdma/rdma_netlink.h     | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index d5d3e4f0de77..97a62685ed5b 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -1758,6 +1758,12 @@ static int nldev_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EINVAL;
 	}
 
+	if (device->link_ops) {
+		err = device->link_ops->dellink(device);
+		if (err)
+			return err;
+	}
+
 	ib_unregister_device_and_put(device);
 	return 0;
 }
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index c2a79aeee113..bf9df004061f 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -5,6 +5,7 @@
 
 #include <linux/netlink.h>
 #include <uapi/rdma/rdma_netlink.h>
+#include <rdma/ib_verbs.h>
 
 enum {
 	RDMA_NLDEV_ATTR_EMPTY_STRING = 1,
@@ -114,6 +115,7 @@ struct rdma_link_ops {
 	struct list_head list;
 	const char *type;
 	int (*newlink)(const char *ibdev_name, struct net_device *ndev);
+	int (*dellink)(struct ib_device *dev);
 };
 
 void rdma_link_register(struct rdma_link_ops *ops);
-- 
2.27.0



* [PATCH v6.4-rc1 v5 4/8] RDMA/rxe: Implement dellink in rxe
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (2 preceding siblings ...)
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 3/8] RDMA/nldev: Add dellink function pointer Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 5/8] RDMA/rxe: Replace global variable with sock lookup functions Zhu Yanjun
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

When the "rdma link del" command is run, the dellink function is called.
If the sock refcnt is greater than the refcnt needed for the udp tunnel,
the sock refcnt is decreased by 1.

If the two are equal, the last rdma link is being removed, and the udp
tunnel is destroyed.

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     | 12 +++++++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 17 +++++++++++++++--
 drivers/infiniband/sw/rxe/rxe_net.h |  1 +
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index c15d3c5d7a6f..ac7e7b0a9dc9 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -168,10 +168,12 @@ void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
 /* called by ifc layer to create new rxe device.
  * The caller should allocate memory for rxe by calling ib_alloc_device.
  */
+static struct rdma_link_ops rxe_link_ops;
 int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name)
 {
 	rxe_init(rxe);
 	rxe_set_mtu(rxe, mtu);
+	rxe->ib_dev.link_ops = &rxe_link_ops;
 
 	return rxe_register_device(rxe, ibdev_name);
 }
@@ -208,9 +210,17 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 	return err;
 }
 
-struct rdma_link_ops rxe_link_ops = {
+static int rxe_dellink(struct ib_device *dev)
+{
+	rxe_net_del(dev);
+
+	return 0;
+}
+
+static struct rdma_link_ops rxe_link_ops = {
 	.type = "rxe",
 	.newlink = rxe_newlink,
+	.dellink = rxe_dellink,
 };
 
 static int __init rxe_module_init(void)
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 1b98efa2cf66..6071533d67c8 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -533,6 +533,21 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 	return 0;
 }
 
+#define SK_REF_FOR_TUNNEL	2
+void rxe_net_del(struct ib_device *dev)
+{
+	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(recv_sockets.sk6->sk);
+	else
+		rxe_release_udp_tunnel(recv_sockets.sk6);
+
+	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(recv_sockets.sk4->sk);
+	else
+		rxe_release_udp_tunnel(recv_sockets.sk4);
+}
+#undef SK_REF_FOR_TUNNEL
+
 static void rxe_port_event(struct rxe_dev *rxe,
 			   enum ib_event_type event)
 {
@@ -692,8 +707,6 @@ int rxe_register_notifier(void)
 
 void rxe_net_exit(void)
 {
-	rxe_release_udp_tunnel(recv_sockets.sk6);
-	rxe_release_udp_tunnel(recv_sockets.sk4);
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index a222c3eeae12..f48f22f3353b 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -17,6 +17,7 @@ struct rxe_recv_sockets {
 };
 
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
+void rxe_net_del(struct ib_device *dev);
 
 int rxe_register_notifier(void);
 int rxe_net_init(void);
-- 
2.27.0



* [PATCH v6.4-rc1 v5 5/8] RDMA/rxe: Replace global variable with sock lookup functions
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (3 preceding siblings ...)
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 4/8] RDMA/rxe: Implement dellink in rxe Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 6/8] RDMA/rxe: add the support of net namespace Zhu Yanjun
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally, a global variable keeps the sock listening on udp port
4791. In fact, the sock lookup functions can be used to get the sock
instead.

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c       |  1 +
 drivers/infiniband/sw/rxe/rxe_net.c   | 58 ++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_net.h   |  5 ---
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 +
 4 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index ac7e7b0a9dc9..c9b3125b26d0 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -74,6 +74,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
+	rxe->l_sk6				= NULL;
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 6071533d67c8..87af6a65a291 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -18,8 +18,6 @@
 #include "rxe_net.h"
 #include "rxe_loc.h"
 
-static struct rxe_recv_sockets recv_sockets;
-
 static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 					 struct net_device *ndev,
 					 struct in_addr *saddr,
@@ -51,6 +49,23 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 {
 	struct dst_entry *ndst;
 	struct flowi6 fl6 = { { 0 } };
+	struct rxe_dev *rdev;
+
+	rdev = rxe_get_dev_from_net(ndev);
+	if (!rdev->l_sk6) {
+		struct sock *sk;
+
+		rcu_read_lock();
+		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+		rcu_read_unlock();
+		if (!sk) {
+			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
+			return (struct dst_entry *)sk;
+		}
+		__sock_put(sk);
+		rdev->l_sk6 = sk->sk_socket;
+	}
+
 
 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = ndev->ifindex;
@@ -58,8 +73,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
 	fl6.flowi6_proto = IPPROTO_UDP;
 
-	ndst = ipv6_stub->ipv6_dst_lookup_flow(sock_net(recv_sockets.sk6->sk),
-					       recv_sockets.sk6->sk, &fl6,
+	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
+					       rdev->l_sk6->sk, &fl6,
 					       NULL);
 	if (IS_ERR(ndst)) {
 		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
@@ -536,15 +551,33 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 #define SK_REF_FOR_TUNNEL	2
 void rxe_net_del(struct ib_device *dev)
 {
-	if (refcount_read(&recv_sockets.sk6->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
-		__sock_put(recv_sockets.sk6->sk);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (!sk)
+		return;
+
+	__sock_put(sk);
+
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(sk);
 	else
-		rxe_release_udp_tunnel(recv_sockets.sk6);
+		rxe_release_udp_tunnel(sk->sk_socket);
+
+	rcu_read_lock();
+	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+	rcu_read_unlock();
+	if (!sk)
+		return;
+
+	__sock_put(sk);
 
-	if (refcount_read(&recv_sockets.sk4->sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
-		__sock_put(recv_sockets.sk4->sk);
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+		__sock_put(sk);
 	else
-		rxe_release_udp_tunnel(recv_sockets.sk4);
+		rxe_release_udp_tunnel(sk->sk_socket);
 }
 #undef SK_REF_FOR_TUNNEL
 
@@ -654,10 +687,8 @@ static int rxe_net_ipv4_init(void)
 	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
-		recv_sockets.sk4 = NULL;
 		return -1;
 	}
-	recv_sockets.sk4 = sock;
 
 	return 0;
 }
@@ -677,17 +708,14 @@ static int rxe_net_ipv6_init(void)
 
 	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
-		recv_sockets.sk6 = NULL;
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
 	}
 
 	if (IS_ERR(sock)) {
-		recv_sockets.sk6 = NULL;
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
-	recv_sockets.sk6 = sock;
 #endif
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index f48f22f3353b..027b20e1bab6 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -11,11 +11,6 @@
 #include <net/if_inet6.h>
 #include <linux/module.h>
 
-struct rxe_recv_sockets {
-	struct socket *sk4;
-	struct socket *sk6;
-};
-
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 void rxe_net_del(struct ib_device *dev);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 26a20f088692..0aa3817770a5 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -382,6 +382,7 @@ struct rxe_dev {
 
 	struct rxe_port		port;
 	struct crypto_shash	*tfm;
+	struct socket		*l_sk6;
 };
 
 static inline void rxe_counter_inc(struct rxe_dev *rxe, enum rxe_counters index)
-- 
2.27.0



* [PATCH v6.4-rc1 v5 6/8] RDMA/rxe: add the support of net namespace
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (4 preceding siblings ...)
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 5/8] RDMA/rxe: Replace global variable with sock lookup functions Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 7/8] RDMA/rxe: Add the support of net namespace notifier Zhu Yanjun
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

Originally, init_net is hard-coded as the net namespace. Use the net
namespace of the net device instead, so that more net namespaces are
supported.

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c     |  2 +-
 drivers/infiniband/sw/rxe/rxe_net.c | 33 +++++++++++++++++------------
 drivers/infiniband/sw/rxe/rxe_net.h |  2 +-
 3 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index c9b3125b26d0..ef632be05e38 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -198,7 +198,7 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
-	err = rxe_net_init();
+	err = rxe_net_init(ndev);
 	if (err)
 		return err;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 87af6a65a291..0cf164da8299 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -32,7 +32,7 @@ static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 	memcpy(&fl.daddr, daddr, sizeof(*daddr));
 	fl.flowi4_proto = IPPROTO_UDP;
 
-	rt = ip_route_output_key(&init_net, &fl);
+	rt = ip_route_output_key(dev_net(ndev), &fl);
 	if (IS_ERR(rt)) {
 		rxe_dbg_qp(qp, "no route to %pI4\n", &daddr->s_addr);
 		return NULL;
@@ -56,7 +56,8 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 		struct sock *sk;
 
 		rcu_read_lock();
-		sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+		sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
+				     htons(ROCE_V2_UDP_DPORT), 0);
 		rcu_read_unlock();
 		if (!sk) {
 			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
@@ -552,9 +553,13 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 void rxe_net_del(struct ib_device *dev)
 {
 	struct sock *sk;
+	struct rxe_dev *rdev;
+
+	rdev = container_of(dev, struct rxe_dev, ib_dev);
 
 	rcu_read_lock();
-	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY), htons(ROCE_V2_UDP_DPORT), 0);
+	sk = udp4_lib_lookup(dev_net(rdev->ndev), 0, 0, htonl(INADDR_ANY),
+			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (!sk)
 		return;
@@ -567,7 +572,8 @@ void rxe_net_del(struct ib_device *dev)
 		rxe_release_udp_tunnel(sk->sk_socket);
 
 	rcu_read_lock();
-	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any, htons(ROCE_V2_UDP_DPORT), 0);
+	sk = udp6_lib_lookup(dev_net(rdev->ndev), NULL, 0, &in6addr_any,
+			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (!sk)
 		return;
@@ -639,6 +645,7 @@ static int rxe_notify(struct notifier_block *not_blk,
 	switch (event) {
 	case NETDEV_UNREGISTER:
 		ib_unregister_device_queued(&rxe->ib_dev);
+		rxe_net_del(&rxe->ib_dev);
 		break;
 	case NETDEV_UP:
 		rxe_port_up(rxe);
@@ -672,19 +679,19 @@ static struct notifier_block rxe_net_notifier = {
 	.notifier_call = rxe_notify,
 };
 
-static int rxe_net_ipv4_init(void)
+static int rxe_net_ipv4_init(struct net_device *ndev)
 {
 	struct sock *sk;
 	struct socket *sock;
 
 	rcu_read_lock();
-	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
+	sk = udp4_lib_lookup(dev_net(ndev), 0, 0, htonl(INADDR_ANY),
 			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (sk)
 		return 0;
 
-	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
 		return -1;
@@ -693,20 +700,20 @@ static int rxe_net_ipv4_init(void)
 	return 0;
 }
 
-static int rxe_net_ipv6_init(void)
+static int rxe_net_ipv6_init(struct net_device *ndev)
 {
 #if IS_ENABLED(CONFIG_IPV6)
 	struct sock *sk;
 	struct socket *sock;
 
 	rcu_read_lock();
-	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
+	sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
 			     htons(ROCE_V2_UDP_DPORT), 0);
 	rcu_read_unlock();
 	if (sk)
 		return 0;
 
-	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
+	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
 		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
 		return 0;
@@ -738,14 +745,14 @@ void rxe_net_exit(void)
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
 
-int rxe_net_init(void)
+int rxe_net_init(struct net_device *ndev)
 {
 	int err;
 
-	err = rxe_net_ipv4_init();
+	err = rxe_net_ipv4_init(ndev);
 	if (err)
 		return err;
-	err = rxe_net_ipv6_init();
+	err = rxe_net_ipv6_init(ndev);
 	if (err)
 		goto err_out;
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 027b20e1bab6..56249677d692 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -15,7 +15,7 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
 void rxe_net_del(struct ib_device *dev);
 
 int rxe_register_notifier(void);
-int rxe_net_init(void);
+int rxe_net_init(struct net_device *ndev);
 void rxe_net_exit(void);
 
 #endif /* RXE_NET_H */
-- 
2.27.0



* [PATCH v6.4-rc1 v5 7/8] RDMA/rxe: Add the support of net namespace notifier
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (5 preceding siblings ...)
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 6/8] RDMA/rxe: add the support of net namespace Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace Zhu Yanjun
  2023-06-21 21:09 ` [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work " Bob Pearson
  8 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

The functions register_pernet_subsys/unregister_pernet_subsys register and
unregister the per-net-namespace operations of rxe. When a new net
namespace is created, the init function of rxe is called to initialize
the sk4 and sk6 socks. When a net namespace is destroyed, the exit
function is called to release the sk4 and sk6 socks.

The functions rxe_ns_pernet_sk4 and rxe_ns_pernet_sk6 are used to get
sk4 and sk6 socks.

The functions rxe_ns_pernet_set_sk4 and rxe_ns_pernet_set_sk6 are used
to set sk4 and sk6 socks.
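
As an illustration only, the create-or-reuse and release-on-last-delete
behaviour that these get/set helpers support can be modelled in plain
userspace C. This is a sketch, not the driver code: struct model_ns,
struct model_sock, model_link_add(), model_link_del() and MODEL_BASE_REFS
are hypothetical names, and the baseline of one reference is an assumption
standing in for SK_REF_FOR_TUNNEL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_sock {
	int refcnt;
};

struct model_ns {
	struct model_sock *sk4;	/* plays the role of the per-namespace sk4 */
};

/* Assumed baseline: references held while the sock merely exists. */
#define MODEL_BASE_REFS 1

/* Link add: reuse the namespace's sock if present, else create it. */
static int model_link_add(struct model_ns *ns)
{
	assert(ns != NULL);

	if (ns->sk4) {
		ns->sk4->refcnt++;	/* later links only take a reference */
		return 0;
	}

	ns->sk4 = calloc(1, sizeof(*ns->sk4));
	if (!ns->sk4)
		return -1;
	ns->sk4->refcnt = MODEL_BASE_REFS;
	return 0;
}

/* Link delete: drop one reference, or release the sock on the last link. */
static void model_link_del(struct model_ns *ns)
{
	assert(ns != NULL);

	if (!ns->sk4)
		return;

	if (ns->sk4->refcnt > MODEL_BASE_REFS) {
		ns->sk4->refcnt--;
	} else {
		free(ns->sk4);
		ns->sk4 = NULL;	/* clear the slot, as the set helper does */
	}
}
```

Each struct model_ns keeps its own slot, so adding a link in one modelled
namespace leaves every other namespace untouched.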

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/Makefile  |   3 +-
 drivers/infiniband/sw/rxe/rxe.c     |   9 ++
 drivers/infiniband/sw/rxe/rxe_net.c |  50 +++++------
 drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
 5 files changed, 187 insertions(+), 26 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h

diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
index 5395a581f4bb..8380f97674cb 100644
--- a/drivers/infiniband/sw/rxe/Makefile
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -22,4 +22,5 @@ rdma_rxe-y := \
 	rxe_mcast.o \
 	rxe_task.o \
 	rxe_net.o \
-	rxe_hw_counters.o
+	rxe_hw_counters.o \
+	rxe_ns.o
diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index ef632be05e38..96841c56ff3a 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -9,6 +9,7 @@
 #include "rxe.h"
 #include "rxe_loc.h"
 #include "rxe_net.h"
+#include "rxe_ns.h"
 
 MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
 MODULE_DESCRIPTION("Soft RDMA transport");
@@ -236,6 +237,12 @@ static int __init rxe_module_init(void)
 		return -1;
 	}
 
+	err = rxe_namespace_init();
+	if (err) {
+		pr_err("Failed to register net namespace notifier\n");
+		return -1;
+	}
+
 	pr_info("loaded\n");
 	return 0;
 }
@@ -246,6 +253,8 @@ static void __exit rxe_module_exit(void)
 	ib_unregister_driver(RDMA_DRIVER_RXE);
 	rxe_net_exit();
 
+	rxe_namespace_exit();
+
 	pr_info("unloaded\n");
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 0cf164da8299..28d8171a36e8 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -17,6 +17,7 @@
 #include "rxe.h"
 #include "rxe_net.h"
 #include "rxe_loc.h"
+#include "rxe_ns.h"
 
 static struct dst_entry *rxe_find_route4(struct rxe_qp *qp,
 					 struct net_device *ndev,
@@ -557,33 +558,30 @@ void rxe_net_del(struct ib_device *dev)
 
 	rdev = container_of(dev, struct rxe_dev, ib_dev);
 
-	rcu_read_lock();
-	sk = udp4_lib_lookup(dev_net(rdev->ndev), 0, 0, htonl(INADDR_ANY),
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
+	sk = rxe_ns_pernet_sk4(dev_net(rdev->ndev));
 	if (!sk)
 		return;
 
-	__sock_put(sk);
 
-	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
 		__sock_put(sk);
-	else
+	} else {
 		rxe_release_udp_tunnel(sk->sk_socket);
+		sk = NULL;
+		rxe_ns_pernet_set_sk4(dev_net(rdev->ndev), sk);
+	}
 
-	rcu_read_lock();
-	sk = udp6_lib_lookup(dev_net(rdev->ndev), NULL, 0, &in6addr_any,
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
+	sk = rxe_ns_pernet_sk6(dev_net(rdev->ndev));
 	if (!sk)
 		return;
 
-	__sock_put(sk);
-
-	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL)
+	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
 		__sock_put(sk);
-	else
+	} else {
 		rxe_release_udp_tunnel(sk->sk_socket);
+		sk = NULL;
+		rxe_ns_pernet_set_sk6(dev_net(rdev->ndev), sk);
+	}
 }
 #undef SK_REF_FOR_TUNNEL
 
@@ -684,18 +682,18 @@ static int rxe_net_ipv4_init(struct net_device *ndev)
 	struct sock *sk;
 	struct socket *sock;
 
-	rcu_read_lock();
-	sk = udp4_lib_lookup(dev_net(ndev), 0, 0, htonl(INADDR_ANY),
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
-	if (sk)
+	sk = rxe_ns_pernet_sk4(dev_net(ndev));
+	if (sk) {
+		sock_hold(sk);
 		return 0;
+	}
 
 	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), false);
 	if (IS_ERR(sock)) {
 		pr_err("Failed to create IPv4 UDP tunnel\n");
 		return -1;
 	}
+	rxe_ns_pernet_set_sk4(dev_net(ndev), sock->sk);
 
 	return 0;
 }
@@ -706,12 +704,11 @@ static int rxe_net_ipv6_init(struct net_device *ndev)
 	struct sock *sk;
 	struct socket *sock;
 
-	rcu_read_lock();
-	sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
-			     htons(ROCE_V2_UDP_DPORT), 0);
-	rcu_read_unlock();
-	if (sk)
+	sk = rxe_ns_pernet_sk6(dev_net(ndev));
+	if (sk) {
+		sock_hold(sk);
 		return 0;
+	}
 
 	sock = rxe_setup_udp_tunnel(dev_net(ndev), htons(ROCE_V2_UDP_DPORT), true);
 	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
@@ -723,6 +720,9 @@ static int rxe_net_ipv6_init(struct net_device *ndev)
 		pr_err("Failed to create IPv6 UDP tunnel\n");
 		return -1;
 	}
+
+	rxe_ns_pernet_set_sk6(dev_net(ndev), sock->sk);
+
 #endif
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
new file mode 100644
index 000000000000..29d08899dcda
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_ns.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#include <net/sock.h>
+#include <net/netns/generic.h>
+#include <net/net_namespace.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/pid_namespace.h>
+#include <net/udp_tunnel.h>
+
+#include "rxe_ns.h"
+
+/*
+ * Per network namespace data
+ */
+struct rxe_ns_sock {
+	struct sock __rcu *rxe_sk4;
+	struct sock __rcu *rxe_sk6;
+};
+
+/*
+ * Index to store custom data for each network namespace.
+ */
+static unsigned int rxe_pernet_id;
+
+/*
+ * Called for every existing and added network namespaces
+ */
+static int __net_init rxe_ns_init(struct net *net)
+{
+	/*
+	 * create (if not present) and access data item in network namespace
+	 * (net) using the id (net_id)
+	 */
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk4, NULL); /* initialize sock 4 socket */
+	rcu_assign_pointer(ns_sk->rxe_sk6, NULL); /* initialize sock 6 socket */
+	synchronize_rcu();
+
+	return 0;
+}
+
+static void __net_exit rxe_ns_exit(struct net *net)
+{
+	/*
+	 * called when the network namespace is removed
+	 */
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *rxe_sk4 = NULL;
+	struct sock *rxe_sk6 = NULL;
+
+	rcu_read_lock();
+	rxe_sk4 = rcu_dereference(ns_sk->rxe_sk4);
+	rxe_sk6 = rcu_dereference(ns_sk->rxe_sk6);
+	rcu_read_unlock();
+
+	/* close socket */
+	if (rxe_sk4 && rxe_sk4->sk_socket) {
+		udp_tunnel_sock_release(rxe_sk4->sk_socket);
+		rcu_assign_pointer(ns_sk->rxe_sk4, NULL);
+		synchronize_rcu();
+	}
+
+	if (rxe_sk6 && rxe_sk6->sk_socket) {
+		udp_tunnel_sock_release(rxe_sk6->sk_socket);
+		rcu_assign_pointer(ns_sk->rxe_sk6, NULL);
+		synchronize_rcu();
+	}
+}
+
+/*
+ * callback to make the module network namespace aware
+ */
+static struct pernet_operations rxe_net_ops __net_initdata = {
+	.init = rxe_ns_init,
+	.exit = rxe_ns_exit,
+	.id = &rxe_pernet_id,
+	.size = sizeof(struct rxe_ns_sock),
+};
+
+struct sock *rxe_ns_pernet_sk4(struct net *net)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = rcu_dereference(ns_sk->rxe_sk4);
+	rcu_read_unlock();
+
+	return sk;
+}
+
+void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk4, sk);
+	synchronize_rcu();
+}
+
+struct sock *rxe_ns_pernet_sk6(struct net *net)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *sk;
+
+	rcu_read_lock();
+	sk = rcu_dereference(ns_sk->rxe_sk6);
+	rcu_read_unlock();
+
+	return sk;
+}
+
+void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
+{
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+
+	rcu_assign_pointer(ns_sk->rxe_sk6, sk);
+	synchronize_rcu();
+}
+
+int __init rxe_namespace_init(void)
+{
+	return register_pernet_subsys(&rxe_net_ops);
+}
+
+void __exit rxe_namespace_exit(void)
+{
+	unregister_pernet_subsys(&rxe_net_ops);
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h
new file mode 100644
index 000000000000..da5bfcea1274
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_ns.h
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#ifndef RXE_NS_H
+#define RXE_NS_H
+
+struct sock *rxe_ns_pernet_sk4(struct net *net);
+struct sock *rxe_ns_pernet_sk6(struct net *net);
+void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk);
+void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk);
+int __init rxe_namespace_init(void);
+void __exit rxe_namespace_exit(void);
+
+#endif /* RXE_NS_H */
-- 
2.27.0



* [PATCH v6.4-rc1 v5 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (6 preceding siblings ...)
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 7/8] RDMA/rxe: Add the support of net namespace notifier Zhu Yanjun
@ 2023-05-08  7:56 ` Zhu Yanjun
  2023-06-21 21:09 ` [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work " Bob Pearson
  8 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-05-08  7:56 UTC (permalink / raw)
  To: zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun, Rain River

From: Zhu Yanjun <yanjun.zhu@linux.dev>

The per-net-namespace sk6 sock can now be used directly. As such, l_sk6
can be replaced with it.

Tested-by: Rain River <rain.1986.08.12@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/infiniband/sw/rxe/rxe.c       |  1 -
 drivers/infiniband/sw/rxe/rxe_net.c   | 20 +-------------------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 -
 3 files changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 96841c56ff3a..b1dfba2fdf15 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -75,7 +75,6 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
-	rxe->l_sk6				= NULL;
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 28d8171a36e8..812a0731bece 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -50,24 +50,6 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 {
 	struct dst_entry *ndst;
 	struct flowi6 fl6 = { { 0 } };
-	struct rxe_dev *rdev;
-
-	rdev = rxe_get_dev_from_net(ndev);
-	if (!rdev->l_sk6) {
-		struct sock *sk;
-
-		rcu_read_lock();
-		sk = udp6_lib_lookup(dev_net(ndev), NULL, 0, &in6addr_any,
-				     htons(ROCE_V2_UDP_DPORT), 0);
-		rcu_read_unlock();
-		if (!sk) {
-			pr_info("file: %s +%d, error\n", __FILE__, __LINE__);
-			return (struct dst_entry *)sk;
-		}
-		__sock_put(sk);
-		rdev->l_sk6 = sk->sk_socket;
-	}
-
 
 	memset(&fl6, 0, sizeof(fl6));
 	fl6.flowi6_oif = ndev->ifindex;
@@ -76,7 +58,7 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 	fl6.flowi6_proto = IPPROTO_UDP;
 
 	ndst = ipv6_stub->ipv6_dst_lookup_flow(dev_net(ndev),
-					       rdev->l_sk6->sk, &fl6,
+					       rxe_ns_pernet_sk6(dev_net(ndev)), &fl6,
 					       NULL);
 	if (IS_ERR(ndst)) {
 		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 0aa3817770a5..26a20f088692 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -382,7 +382,6 @@ struct rxe_dev {
 
 	struct rxe_port		port;
 	struct crypto_shash	*tfm;
-	struct socket		*l_sk6;
 };
 
 static inline void rxe_counter_inc(struct rxe_dev *rxe, enum rxe_counters index)
-- 
2.27.0



* Re: [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
@ 2023-06-20 17:16   ` Bob Pearson
  2023-06-20 23:40     ` Zhu Yanjun
  0 siblings, 1 reply; 25+ messages in thread
From: Bob Pearson @ 2023-06-20 17:16 UTC (permalink / raw)
  To: Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer
  Cc: Zhu Yanjun, Rain River

On 5/8/23 02:56, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> Originally when the module rdma_rxe is loaded, the sock listening on udp
> port 4791 is created. Currently moving the creating listening port to
> newlink function.
> 
> So when running "rdma link add" command, the sock listening on udp port
> 4791 is created.
> 
> Tested-by: Rain River <rain.1986.08.12@gmail.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> ---
>  drivers/infiniband/sw/rxe/rxe.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 7a7e713de52d..89b24bc34299 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -194,6 +194,10 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>  		goto err;
>  	}
>  
> +	err = rxe_net_init();
> +	if (err)
> +		return err;
> +
If you put this here you cannot create more than one rxe device.
E.g. if you type

sudo rdma link add rxe0 type rxe netdev enp6s0
sudo rdma link add rxe1 type rxe netdev lo

the second call will fail. This worked before this patch. Maybe you will fix later but
by itself this patch breaks the driver.

Bob
>  	err = rxe_net_add(ibdev_name, ndev);
>  	if (err) {
>  		rxe_err("failed to add %s\n", ndev->name);
> @@ -210,12 +214,6 @@ static struct rdma_link_ops rxe_link_ops = {
>  
>  static int __init rxe_module_init(void)
>  {
> -	int err;
> -
> -	err = rxe_net_init();
> -	if (err)
> -		return err;
> -
>  	rdma_link_register(&rxe_link_ops);
>  	pr_info("loaded\n");
>  	return 0;
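
Editorial illustration of the failure reported in this reply. With patch
1/8 alone, every newlink call runs rxe_net_init(), which sets up the UDP
tunnel on port 4791 again. One plausible explanation for the second
"rdma link add" failing, which the thread itself does not state, is that
the second bind to the same port is refused. The userspace sketch below
shows only that general socket behaviour; demo_bind_udp() and
demo_bound_port() are hypothetical helpers, not part of rxe.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a UDP socket to 127.0.0.1:port.
 * Returns the fd on success or a negative errno value on failure.
 */
static int demo_bind_udp(unsigned short port)
{
	struct sockaddr_in addr;
	int fd, err;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -errno;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/* Hypothetical helper: report which port the kernel assigned to fd. */
static int demo_bound_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	assert(fd >= 0);

	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return -errno;
	return ntohs(addr.sin_port);
}
```

Binding once to port 0 and then calling demo_bind_udp() again with the
port returned by demo_bound_port() is expected to yield -EADDRINUSE on
Linux, because neither socket sets SO_REUSEADDR or SO_REUSEPORT.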



* Re: [PATCH v6.4-rc1 v5 2/8] RDMA/rxe: Support more rdma links in init_net
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
@ 2023-06-20 17:54   ` Bob Pearson
  2023-06-20 23:51     ` Zhu Yanjun
  0 siblings, 1 reply; 25+ messages in thread
From: Bob Pearson @ 2023-06-20 17:54 UTC (permalink / raw)
  To: Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer
  Cc: Zhu Yanjun, Rain River

On 5/8/23 02:56, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> In init_net, when several rdma links are created with the command "rdma
> link add", newlink will check whether the udp port 4791 is listening or
> not.
> If not, creating a sock listening on udp port 4791. If yes, increasing the
> reference count of the sock.
> 
> Tested-by: Rain River <rain.1986.08.12@gmail.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> ---
>  drivers/infiniband/sw/rxe/rxe.c     | 12 ++++++-
>  drivers/infiniband/sw/rxe/rxe_net.c | 55 +++++++++++++++++++++--------
>  drivers/infiniband/sw/rxe/rxe_net.h |  1 +
>  3 files changed, 52 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 89b24bc34299..c15d3c5d7a6f 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -8,6 +8,7 @@
>  #include <net/addrconf.h>
>  #include "rxe.h"
>  #include "rxe_loc.h"
> +#include "rxe_net.h"
>  
>  MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
>  MODULE_DESCRIPTION("Soft RDMA transport");
> @@ -207,14 +208,23 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>  	return err;
>  }
>  
> -static struct rdma_link_ops rxe_link_ops = {
> +struct rdma_link_ops rxe_link_ops = {
>  	.type = "rxe",
>  	.newlink = rxe_newlink,
>  };
>  
>  static int __init rxe_module_init(void)
>  {
> +	int err;
> +
>  	rdma_link_register(&rxe_link_ops);
> +
> +	err = rxe_register_notifier();
> +	if (err) {
> +		pr_err("Failed to register netdev notifier\n");
> +		return -1;
> +	}
> +
>  	pr_info("loaded\n");
>  	return 0;
>  }
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 2bc7361152ea..1b98efa2cf66 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -626,13 +626,23 @@ static struct notifier_block rxe_net_notifier = {
>  
>  static int rxe_net_ipv4_init(void)
>  {
> -	recv_sockets.sk4 = rxe_setup_udp_tunnel(&init_net,
> -				htons(ROCE_V2_UDP_DPORT), false);
> -	if (IS_ERR(recv_sockets.sk4)) {
> -		recv_sockets.sk4 = NULL;
> +	struct sock *sk;
> +	struct socket *sock;
> +
> +	rcu_read_lock();
> +	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
> +			     htons(ROCE_V2_UDP_DPORT), 0);
> +	rcu_read_unlock();
> +	if (sk)
> +		return 0;
After this patch 2/8 attempting to execute
sudo rdma link add rxe[n] type rxe netdev exxxx
more than once now succeeds and both devices show up.
I would suggest that you merge patch 1/8 and 2/8 so patches don't break the
driver.

Bob
> +
> +	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
> +	if (IS_ERR(sock)) {
>  		pr_err("Failed to create IPv4 UDP tunnel\n");
> +		recv_sockets.sk4 = NULL;
>  		return -1;
>  	}
> +	recv_sockets.sk4 = sock;
>  
>  	return 0;
>  }
> @@ -640,24 +650,46 @@ static int rxe_net_ipv4_init(void)
>  static int rxe_net_ipv6_init(void)
>  {
>  #if IS_ENABLED(CONFIG_IPV6)
> +	struct sock *sk;
> +	struct socket *sock;
> +
> +	rcu_read_lock();
> +	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
> +			     htons(ROCE_V2_UDP_DPORT), 0);
> +	rcu_read_unlock();
> +	if (sk)
> +		return 0;
>  
> -	recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
> -						htons(ROCE_V2_UDP_DPORT), true);
> -	if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
> +	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
> +	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
>  		recv_sockets.sk6 = NULL;
>  		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
>  		return 0;
>  	}
>  
> -	if (IS_ERR(recv_sockets.sk6)) {
> +	if (IS_ERR(sock)) {
>  		recv_sockets.sk6 = NULL;
>  		pr_err("Failed to create IPv6 UDP tunnel\n");
>  		return -1;
>  	}
> +	recv_sockets.sk6 = sock;
>  #endif
>  	return 0;
>  }
>  
> +int rxe_register_notifier(void)
> +{
> +	int err;
> +
> +	err = register_netdevice_notifier(&rxe_net_notifier);
> +	if (err) {
> +		pr_err("Failed to register netdev notifier\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
>  void rxe_net_exit(void)
>  {
>  	rxe_release_udp_tunnel(recv_sockets.sk6);
> @@ -669,19 +701,12 @@ int rxe_net_init(void)
>  {
>  	int err;
>  
> -	recv_sockets.sk6 = NULL;
> -
>  	err = rxe_net_ipv4_init();
>  	if (err)
>  		return err;
>  	err = rxe_net_ipv6_init();
>  	if (err)
>  		goto err_out;
> -	err = register_netdevice_notifier(&rxe_net_notifier);
> -	if (err) {
> -		pr_err("Failed to register netdev notifier\n");
> -		goto err_out;
> -	}
>  	return 0;
>  err_out:
>  	rxe_net_exit();
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
> index 45d80d00f86b..a222c3eeae12 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.h
> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
> @@ -18,6 +18,7 @@ struct rxe_recv_sockets {
>  
>  int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
>  
> +int rxe_register_notifier(void);
>  int rxe_net_init(void);
>  void rxe_net_exit(void);
>  



* Re: [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function
  2023-06-20 17:16   ` Bob Pearson
@ 2023-06-20 23:40     ` Zhu Yanjun
  0 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-20 23:40 UTC (permalink / raw)
  To: Bob Pearson, Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav,
	lehrer
  Cc: Rain River


On 2023/6/21 1:16, Bob Pearson wrote:
> On 5/8/23 02:56, Zhu Yanjun wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> Originally when the module rdma_rxe is loaded, the sock listening on udp
>> port 4791 is created. Currently moving the creating listening port to
>> newlink function.
>>
>> So when running "rdma link add" command, the sock listening on udp port
>> 4791 is created.
>>
>> Tested-by: Rain River <rain.1986.08.12@gmail.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>> ---
>>   drivers/infiniband/sw/rxe/rxe.c | 10 ++++------
>>   1 file changed, 4 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
>> index 7a7e713de52d..89b24bc34299 100644
>> --- a/drivers/infiniband/sw/rxe/rxe.c
>> +++ b/drivers/infiniband/sw/rxe/rxe.c
>> @@ -194,6 +194,10 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>>   		goto err;
>>   	}
>>   
>> +	err = rxe_net_init();
>> +	if (err)
>> +		return err;
>> +
> If you put this here you cannot create more than one rxe device.
> E.g. if you type
>
> sudo rdma link add rxe0 type rxe netdev enp6s0
> sudo rdma link add rxe1 type rxe netdev lo
>
> the second call will fail. This worked before this patch. Maybe you will fix later but
> by itself this patch breaks the driver.

Hi, Bob

Thanks a lot for your code review.

I ran some tests; the results follow. When a second device, rxe1, is
added, it is created successfully.

# rdma link add rxe0 type rxe netdev eno12399np0
# rdma link add rxe1 type rxe netdev ens7f1np1
# rdma link

link rxe0/1 state ACTIVE physical_state LINK_UP netdev eno12399np0
link rxe1/1 state ACTIVE physical_state LINK_UP netdev ens7f1np1

The following shows the socks on port 4791 after the rxe devices are created.

# ss -lun
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
...
UNCONN  0       0       0.0.0.0:4791        0.0.0.0:*
...
UNCONN  0       0       [::]:4791           [::]:*
...

# rdma link del rxe0
# rdma link del rxe1

After the rxe devices are removed, nothing is listening on port 4791 any more.

# ss -lun | grep 4791
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process

Zhu Yanjun

>
> Bob
>>   	err = rxe_net_add(ibdev_name, ndev);
>>   	if (err) {
>>   		rxe_err("failed to add %s\n", ndev->name);
>> @@ -210,12 +214,6 @@ static struct rdma_link_ops rxe_link_ops = {
>>   
>>   static int __init rxe_module_init(void)
>>   {
>> -	int err;
>> -
>> -	err = rxe_net_init();
>> -	if (err)
>> -		return err;
>> -
>>   	rdma_link_register(&rxe_link_ops);
>>   	pr_info("loaded\n");
>>   	return 0;


* Re: [PATCH v6.4-rc1 v5 2/8] RDMA/rxe: Support more rdma links in init_net
  2023-06-20 17:54   ` Bob Pearson
@ 2023-06-20 23:51     ` Zhu Yanjun
  0 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-20 23:51 UTC (permalink / raw)
  To: Bob Pearson, Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav,
	lehrer
  Cc: Rain River


On 2023/6/21 1:54, Bob Pearson wrote:
> On 5/8/23 02:56, Zhu Yanjun wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> In init_net, when several rdma links are created with the command "rdma
>> link add", newlink will check whether the udp port 4791 is listening or
>> not.
>> If not, creating a sock listening on udp port 4791. If yes, increasing the
>> reference count of the sock.
>>
>> Tested-by: Rain River <rain.1986.08.12@gmail.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>> ---
>>   drivers/infiniband/sw/rxe/rxe.c     | 12 ++++++-
>>   drivers/infiniband/sw/rxe/rxe_net.c | 55 +++++++++++++++++++++--------
>>   drivers/infiniband/sw/rxe/rxe_net.h |  1 +
>>   3 files changed, 52 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
>> index 89b24bc34299..c15d3c5d7a6f 100644
>> --- a/drivers/infiniband/sw/rxe/rxe.c
>> +++ b/drivers/infiniband/sw/rxe/rxe.c
>> @@ -8,6 +8,7 @@
>>   #include <net/addrconf.h>
>>   #include "rxe.h"
>>   #include "rxe_loc.h"
>> +#include "rxe_net.h"
>>   
>>   MODULE_AUTHOR("Bob Pearson, Frank Zago, John Groves, Kamal Heib");
>>   MODULE_DESCRIPTION("Soft RDMA transport");
>> @@ -207,14 +208,23 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>>   	return err;
>>   }
>>   
>> -static struct rdma_link_ops rxe_link_ops = {
>> +struct rdma_link_ops rxe_link_ops = {
>>   	.type = "rxe",
>>   	.newlink = rxe_newlink,
>>   };
>>   
>>   static int __init rxe_module_init(void)
>>   {
>> +	int err;
>> +
>>   	rdma_link_register(&rxe_link_ops);
>> +
>> +	err = rxe_register_notifier();
>> +	if (err) {
>> +		pr_err("Failed to register netdev notifier\n");
>> +		return -1;
>> +	}
>> +
>>   	pr_info("loaded\n");
>>   	return 0;
>>   }
>> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
>> index 2bc7361152ea..1b98efa2cf66 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_net.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
>> @@ -626,13 +626,23 @@ static struct notifier_block rxe_net_notifier = {
>>   
>>   static int rxe_net_ipv4_init(void)
>>   {
>> -	recv_sockets.sk4 = rxe_setup_udp_tunnel(&init_net,
>> -				htons(ROCE_V2_UDP_DPORT), false);
>> -	if (IS_ERR(recv_sockets.sk4)) {
>> -		recv_sockets.sk4 = NULL;
>> +	struct sock *sk;
>> +	struct socket *sock;
>> +
>> +	rcu_read_lock();
>> +	sk = udp4_lib_lookup(&init_net, 0, 0, htonl(INADDR_ANY),
>> +			     htons(ROCE_V2_UDP_DPORT), 0);
>> +	rcu_read_unlock();
>> +	if (sk)
>> +		return 0;
> After this patch 2/8 attempting to execute
> sudo rdma link add rxe[n] type rxe netdev exxxx
> more than once now succeeds and both devices show up.
> I would suggest that you merge patch 1/8 and 2/8 so patches don't break the
> driver.

I split the net namespace work into several commits so that each step
is visible on its own and reviewers can follow the steps easily.

Your suggestion seems to make sense. Let me consider reorganizing the
commits.

Zhu Yanjun

>
> Bob
>> +
>> +	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), false);
>> +	if (IS_ERR(sock)) {
>>   		pr_err("Failed to create IPv4 UDP tunnel\n");
>> +		recv_sockets.sk4 = NULL;
>>   		return -1;
>>   	}
>> +	recv_sockets.sk4 = sock;
>>   
>>   	return 0;
>>   }
>> @@ -640,24 +650,46 @@ static int rxe_net_ipv4_init(void)
>>   static int rxe_net_ipv6_init(void)
>>   {
>>   #if IS_ENABLED(CONFIG_IPV6)
>> +	struct sock *sk;
>> +	struct socket *sock;
>> +
>> +	rcu_read_lock();
>> +	sk = udp6_lib_lookup(&init_net, NULL, 0, &in6addr_any,
>> +			     htons(ROCE_V2_UDP_DPORT), 0);
>> +	rcu_read_unlock();
>> +	if (sk)
>> +		return 0;
>>   
>> -	recv_sockets.sk6 = rxe_setup_udp_tunnel(&init_net,
>> -						htons(ROCE_V2_UDP_DPORT), true);
>> -	if (PTR_ERR(recv_sockets.sk6) == -EAFNOSUPPORT) {
>> +	sock = rxe_setup_udp_tunnel(&init_net, htons(ROCE_V2_UDP_DPORT), true);
>> +	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
>>   		recv_sockets.sk6 = NULL;
>>   		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
>>   		return 0;
>>   	}
>>   
>> -	if (IS_ERR(recv_sockets.sk6)) {
>> +	if (IS_ERR(sock)) {
>>   		recv_sockets.sk6 = NULL;
>>   		pr_err("Failed to create IPv6 UDP tunnel\n");
>>   		return -1;
>>   	}
>> +	recv_sockets.sk6 = sock;
>>   #endif
>>   	return 0;
>>   }
>>   
>> +int rxe_register_notifier(void)
>> +{
>> +	int err;
>> +
>> +	err = register_netdevice_notifier(&rxe_net_notifier);
>> +	if (err) {
>> +		pr_err("Failed to register netdev notifier\n");
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   void rxe_net_exit(void)
>>   {
>>   	rxe_release_udp_tunnel(recv_sockets.sk6);
>> @@ -669,19 +701,12 @@ int rxe_net_init(void)
>>   {
>>   	int err;
>>   
>> -	recv_sockets.sk6 = NULL;
>> -
>>   	err = rxe_net_ipv4_init();
>>   	if (err)
>>   		return err;
>>   	err = rxe_net_ipv6_init();
>>   	if (err)
>>   		goto err_out;
>> -	err = register_netdevice_notifier(&rxe_net_notifier);
>> -	if (err) {
>> -		pr_err("Failed to register netdev notifier\n");
>> -		goto err_out;
>> -	}
>>   	return 0;
>>   err_out:
>>   	rxe_net_exit();
>> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
>> index 45d80d00f86b..a222c3eeae12 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_net.h
>> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
>> @@ -18,6 +18,7 @@ struct rxe_recv_sockets {
>>   
>>   int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
>>   
>> +int rxe_register_notifier(void);
>>   int rxe_net_init(void);
>>   void rxe_net_exit(void);
>>   


* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
                   ` (7 preceding siblings ...)
  2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace Zhu Yanjun
@ 2023-06-21 21:09 ` Bob Pearson
  2023-06-21 21:27   ` Bob Pearson
                     ` (3 more replies)
  8 siblings, 4 replies; 25+ messages in thread
From: Bob Pearson @ 2023-06-21 21:09 UTC (permalink / raw)
  To: Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun

On 5/8/23 02:56, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> When run "ip link add" command to add a rxe rdma link in a net
> namespace, normally this rxe rdma link can not work in a net
> name space.
> 
> The root cause is that a sock listening on udp port 4791 is created
> in init_net when the rdma_rxe module is loaded into kernel. That is,
> the sock listening on udp port 4791 is created in init_net. Other net
> namespace is difficult to use this sock.
> 
> The following commits will solve this problem.
> 
> The first commit moves the creation of the sock listening on udp port
> 4791 from the module_init function to the rdma link creation function.
> That is, the sock is no longer created when the rdma_rxe module is
> loaded; it is created when the "rdma link add ..." command is run. So
> when an rdma link is created in a net namespace, the sock is created
> in that net namespace.
> 
> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
> check whether the sock already exists in the net namespace. If it does,
> the rdma link increases the reference count of this sock and continues
> with its other work instead of creating a new sock to listen on udp port
> 4791. Since the network notifier is global, it is registered when the
> rdma_rxe module is loaded.
> 
> After the rdma link is created, the command "rdma link del" deletes the
> rdma link, and the sock is checked at the same time. If the reference
> count of this sock is greater than the reference count needed by the
> udp tunnel, the reference count is decreased by one. If it is equal,
> this rdma link is the last one, so the udp tunnel is shut down and the
> sock is closed. This work should be implemented in a dellink function,
> but rxe currently has none. So the 3rd commit adds the dellink function
> pointer, and the 4th commit implements the dellink function in rxe.
> 
> At this point, it is no longer necessary to keep a global variable to
> store the sock listening on udp port 4791. This global variable can be
> replaced entirely by the functions udp4_lib_lookup and udp6_lib_lookup.
> Because udp6_lib_lookup is in the fast path, a member variable l_sk6
> is added to cache the sock. If l_sk6 is NULL, udp6_lib_lookup is called
> to look up the sock, which is then stored in l_sk6 so that it can be
> used directly afterwards.
> 
> All the above work has been done in init_net, and it can also work in
> other net namespaces. So init_net is replaced by the individual net
> namespace. This is what the 6th commit does. Because an rxe device
> depends on the net device and the sock listening on udp port 4791,
> every rxe device is in exclusive mode in its individual net namespace.
> Other rdma netns operations will be considered in the future.
> 
> In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
> functions are added. When a new net namespace is created, the init
> function will initialize the sk4 and sk6 socks. Then the 2 socks will
> be released when the net namespace is destroyed. The functions
> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net
> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will
> handle sk6. Then sk4 and sk6 are used in the previous commits.
> 
> As the sk4 and sk6 in pernet namespace can be accessed, it is not
> necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
> replaced with the sk6 in pernet namespace.
> 
> Test steps:
> 1) Suppose that 2 NICs are in 2 different net namespaces.
> 
>   # ip netns exec net0 ip link
>   3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>      link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>      altname enp5s0
> 
>   # ip netns exec net1 ip link
>   4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>      link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
> 
> 2) Add rdma link in the different net namespace
>     net0:
>     # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
> 
>     net1:
>     # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
> 
> 3) Run rping test.
>     net0
>     # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>     [1] 1737
>     # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>     verbose
>     count 1
>     ...
>     ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>     ...
> 
> 4) Remove the rdma links from the net namespaces.
>     net0:
>     # ip netns exec net0 ss -lu
>     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>     UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>     UNCONN    0         0         [::]:4791             [::]:*
> 
>     # ip netns exec net0 rdma link del rxe0
> 
>     # ip netns exec net0 ss -lu
>     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> 
>     net1:
>     # ip netns exec net1 ss -lu
>     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>     UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>     UNCONN    0         0         [::]:4791             [::]:*
> 
>     # ip netns exec net1 rdma link del rxe1
> 
>     # ip netns exec net1 ss -lu
>     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> 
> V4->V5: Rebase the commits to V6.4-rc1
> 
> V3->V4: Rebase the commits to rdma-next;
> 
> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
>            verify rdma link is removed.
>         2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
>         3) Replace l_sk6 with sk6 of pernet_name_space
> 
> V1->V2: Add the explicit initialization of sk6.
> 
> Zhu Yanjun (8):
>   RDMA/rxe: Creating listening sock in newlink function
>   RDMA/rxe: Support more rdma links in init_net
>   RDMA/nldev: Add dellink function pointer
>   RDMA/rxe: Implement dellink in rxe
>   RDMA/rxe: Replace global variable with sock lookup functions
>   RDMA/rxe: add the support of net namespace
>   RDMA/rxe: Add the support of net namespace notifier
>   RDMA/rxe: Replace l_sk6 with sk6 in net namespace
> 
>  drivers/infiniband/core/nldev.c     |   6 ++
>  drivers/infiniband/sw/rxe/Makefile  |   3 +-
>  drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>  drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>  drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>  drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>  drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>  include/rdma/rdma_netlink.h         |   2 +
>  8 files changed, 279 insertions(+), 40 deletions(-)
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
> 
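
Step 4 of the quoted test steps checks by eye that the socks on udp port 4791
disappear after "rdma link del". That check can be scripted. The following is
a minimal sketch and not part of the patch set: the helper name
count_roce_listeners is hypothetical, and it assumes numeric "ss -lu" style
output with the local address and port in the fourth column.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch set): count UDP listeners on the
# port used by rxe in "ss -lu" style output read from stdin.
ROCE_PORT=4791

count_roce_listeners() {
    # Field 4 is "Local Address:Port"; match an exact ":<port>" suffix so
    # that e.g. port 14791 is not counted. The header row never matches.
    awk -v port="$ROCE_PORT" '
        $4 ~ (":" port "$") { n++ }
        END { print n + 0 }
    '
}

# Sample input copied from the test steps (state before "rdma link del").
sample_before='State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
UNCONN    0         0         [::]:4791             [::]:*'

printf '%s\n' "$sample_before" | count_roce_listeners   # prints 2
```

On a live system the input would come from a command such as
"ip netns exec net0 ss -lun | count_roce_listeners", run as root. Going by the
test steps, the expected count is 2 while a link exists in the namespace and 0
after the last link is deleted.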

Zhu,

I did some simple experiments on netns functionality.

With your patch set applied, and with rxe0 created on enp6s0 and rxe1 created on lo in the default namespace:

	# sudo ip netns add test
	# ip netns
	test
	# sudo ip netns exec test ip link
	1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	# sudo ip netns exec test ip link set dev lo up
	# sudo ip netns exec test ip link
	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	# sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
		[rxe doesn't work unless this IPv6 address is set]
	# sudo ip netns exec test ip addr
	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	    inet 127.0.0.1/8 scope host lo
	       valid_lft forever preferred_lft forever
	    inet6 fe80::200:ff:fe00:0/64 scope link 
	       valid_lft forever preferred_lft forever
	    inet6 ::1/128 scope host 
	       valid_lft forever preferred_lft forever
	# sudo ip netns exec test ls /sys/class/infiniband
	rxe0  rxe1
		[These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
	# ls /sys/class/infiniband
	rxe0  rxe1  rxe2
		[The new rxe device shows up in the default namespace. At least we're consistent.]
	# ib_send_bw -d rxe0 ... 192.168.0.27
		[Works. Didn't break the existing rxe devices. Expected]
	# ib_send_bw -d rxe1 ... 127.0.0.1
		[Works. Expected]
	# ib_send_bw -d rxe2 ... 127.0.0.1
	IB device rxe2 not found
 	 Unable to find the Infiniband/RoCE device
		[Does not work. Expected.]
	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
	IB device rxe2 not found
	 Unable to find the Infiniband/RoCE device
		[Also does not work. It turns out the rxe2 device is gone after the failure. Not expected.]
	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
	# ls /sys/class/infiniband
	rxe0  rxe1  rxe2
		[Good. It's back]
	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
		[Works in test namespace! Expected.]
	# sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
		[Also works. Definitely not expected.]

My take: it sort of works, but there are some serious issues. You shouldn't be able to use the
rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
other's namespaces (the way ip link or ip addr hide other namespaces' devices).
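
The visibility part of these experiments can be expressed as a comparison of
two device listings. The following is a minimal sketch under stated
assumptions: the helper name devices_in_both is hypothetical, and each
argument is a whitespace-separated listing such as the output of
"ls /sys/class/infiniband" taken in one namespace.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch set): print the device names that
# appear in both listings, one per line.
devices_in_both() {
    # $1 and $2 are left unquoted on purpose so the shell splits them
    # into individual device names.
    for dev in $1; do
        for other in $2; do
            if [ "$dev" = "$other" ]; then
                printf '%s\n' "$dev"
            fi
        done
    done
}

# Listings copied from the experiment above: the test namespace before rxe2
# was added, and the default namespace after it was added.
test_ns_listing="rxe0 rxe1"
default_ns_listing="rxe0 rxe1 rxe2"

# Prints rxe0 and rxe1: both are visible from both namespaces.
devices_in_both "$test_ns_listing" "$default_ns_listing"
```

If the devices were hidden per namespace in the way described above, the
helper would print nothing for two namespaces that each own their own rxe
devices.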

Bob


* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-21 21:09 ` [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work " Bob Pearson
@ 2023-06-21 21:27   ` Bob Pearson
  2023-06-23  7:15     ` Zhu Yanjun
  2023-06-22  3:46   ` Zhu Yanjun
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Bob Pearson @ 2023-06-21 21:27 UTC (permalink / raw)
  To: Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav, lehrer; +Cc: Zhu Yanjun

On 6/21/23 16:09, Bob Pearson wrote:
> My take: it sort of works, but there are some serious issues. You shouldn't be able to use the
> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
> other's namespaces (the way ip link or ip addr hide other namespaces' devices).
> 
> Bob
I forgot to mention: it is also definitely not good that a process in the default namespace can
destroy an rxe device in the test namespace by trying to use it.

Bob


* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-21 21:09 ` [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work " Bob Pearson
  2023-06-21 21:27   ` Bob Pearson
@ 2023-06-22  3:46   ` Zhu Yanjun
  2023-06-23  7:09   ` Zhu Yanjun
  2024-11-12  9:33   ` Cyclinder Kuo
  3 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-22  3:46 UTC (permalink / raw)
  To: Bob Pearson, Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav,
	lehrer


On 2023/6/22 5:09, Bob Pearson wrote:
> My take: it sort of works, but there are some serious issues. You shouldn't be able to use the
> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
> other's namespaces (the way ip link or ip addr hide other namespaces' devices).

Thanks, Bob. I will delve into your tests and reply to you tomorrow.

Zhu Yanjun

>
> Bob


* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-21 21:09 ` [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work " Bob Pearson
  2023-06-21 21:27   ` Bob Pearson
  2023-06-22  3:46   ` Zhu Yanjun
@ 2023-06-23  7:09   ` Zhu Yanjun
  2024-11-12  9:33   ` Cyclinder Kuo
  3 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-23  7:09 UTC (permalink / raw)
  To: Bob Pearson, Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav,
	lehrer


On 2023/6/22 5:09, Bob Pearson wrote:
> On 5/8/23 02:56, Zhu Yanjun wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> When the "ip link add" command is run to add an rxe rdma link in a net
>> namespace, this rxe rdma link normally does not work in that net
>> namespace.
>>
>> The root cause is that the sock listening on udp port 4791 is created
>> in init_net when the rdma_rxe module is loaded into the kernel. It is
>> difficult for other net namespaces to use this sock.
>>
>> The following commits will solve this problem.
>>
>> The first commit moves the creation of the sock listening on udp port
>> 4791 from the module_init function to the rdma link creation function.
>> That is, the sock is no longer created when the rdma_rxe module is
>> loaded; it is created when the "rdma link add ..." command is run. So
>> when an rdma link is created in a net namespace, the sock is created
>> in that net namespace.
>>
>> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
>> check whether the sock already exists in the net namespace. If it does,
>> the rdma link increases the reference count of this sock and continues
>> with its other work instead of creating a new sock to listen on udp port
>> 4791. Since the network notifier is global, it is registered when the
>> rdma_rxe module is loaded.
>>
>> After the rdma link is created, the command "rdma link del" deletes the
>> rdma link, and the sock is checked at the same time. If the reference
>> count of this sock is greater than the reference count needed by the
>> udp tunnel, the reference count is decreased by one. If it is equal,
>> this rdma link is the last one, so the udp tunnel is shut down and the
>> sock is closed. This work should be implemented in a dellink function,
>> but rxe currently has none. So the 3rd commit adds the dellink function
>> pointer, and the 4th commit implements the dellink function in rxe.
>>
>> To now, it is not necessary to keep a global variable to store the sock
>> listening udp port 4791. This global variable can be replaced by the
>> functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
>> function udp6_lib_lookup is in the fast path, a member variable l_sk6
>> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
>> to lookup the sock, then the sock is stored in l_sk6, in the future,it
>> can be used directly.
>>
>> All the above work has been done in init_net. And it can also work in
>> the net namespace. So the init_net is replaced by the individual net
>> namespace. This is what the 6th commit does. Because rxe device is
>> dependent on the net device and the sock listening on udp port 4791,
>> every rxe device is in exclusive mode in the individual net namespace.
>> Other rdma netns operations will be considerred in the future.
>>
>> In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
>> functions are added. When a new net namespace is created, the init
>> function will initialize the sk4 and sk6 socks. Then the 2 socks will
>> be released when the net namespace is destroyed. The functions
>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net
>> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will
>> handle sk6. Then sk4 and sk6 are used in the previous commits.
>>
>> As the sk4 and sk6 in pernet namespace can be accessed, it is not
>> necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
>> replaced with the sk6 in pernet namespace.
>>
>> Test steps:
>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>
>>    # ip netns exec net0 ip link
>>    3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>       link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>       altname enp5s0
>>
>>    # ip netns exec net1 ip link
>>    4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>       link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>
>> 2) Add rdma link in the different net namespace
>>      net0:
>>      # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>
>>      net1:
>>      # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>
>> 3) Run rping test.
>>      net0
>>      # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>      [1] 1737
>>      # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>      verbose
>>      count 1
>>      ...
>>      ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>      ...
>>
>> 4) Remove the rdma links from the net namespaces.
>>      net0:
>>      # ip netns exec net0 ss -lu
>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>      UNCONN    0         0         [::]:4791             [::]:*
>>
>>      # ip netns exec net0 rdma link del rxe0
>>
>>      # ip netns exec net0 ss -lu
>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>
>>      net1:
>>      # ip netns exec net0 ss -lu
>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>      UNCONN    0         0         [::]:4791             [::]:*
>>
>>      # ip netns exec net1 rdma link del rxe1
>>
>>      # ip netns exec net0 ss -lu
>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>
>> V4->V5: Rebase the commits to V6.4-rc1
>>
>> V3->V4: Rebase the commits to rdma-next;
>>
>> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
>>             verify rdma link is removed.
>>          2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
>>          3) Replace l_sk6 with sk6 of pernet_name_space
>>
>> V1->V2: Add the explicit initialization of sk6.
>>
>> Zhu Yanjun (8):
>>    RDMA/rxe: Creating listening sock in newlink function
>>    RDMA/rxe: Support more rdma links in init_net
>>    RDMA/nldev: Add dellink function pointer
>>    RDMA/rxe: Implement dellink in rxe
>>    RDMA/rxe: Replace global variable with sock lookup functions
>>    RDMA/rxe: add the support of net namespace
>>    RDMA/rxe: Add the support of net namespace notifier
>>    RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>
>>   drivers/infiniband/core/nldev.c     |   6 ++
>>   drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>   drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>   drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>   drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>   drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>   drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>   include/rdma/rdma_netlink.h         |   2 +
>>   8 files changed, 279 insertions(+), 40 deletions(-)
>>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>
> Zhu,
>
> I did some simple experiments on netns functionality.
>
> With your patch set applied and rxe0 created on enp6s0 and rxe1 created on lo in the default namespace
>
> 	# sudo ip netns add test
> 	# ip netns
> 	test
> 	# sudo ip netns exec test ip link
> 	1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 	# sudo ip netns exec test ip link set dev lo up
> 	# sudo ip netns exec test ip link
> 	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 	# sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
> 		[rxe doesn't work unless this IPV6 address is set]
> 	# sudo ip netns exec test ip addr
> 	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 	    inet 127.0.0.1/8 scope host lo
> 	       valid_lft forever preferred_lft forever
> 	    inet6 fe80::200:ff:fe00:0/64 scope link
> 	       valid_lft forever preferred_lft forever
> 	    inet6 ::1/128 scope host
> 	       valid_lft forever preferred_lft forever
> 	# sudo ip netns exec test ls /sys/class/infiniband
> 	rxe0  rxe1
> 		[These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
> 	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
> 	# ls /sys/class/infiniband
> 	rxe0  rxe1  rxe2
> 		[The new rxe device shows up in the default namespace. At least we're consistent.]
> 	# ib_send_bw -d rxe0 ... 192.168.0.27
> 		[Works. Didn't break the existing rxe devices. Expected]
> 	# ib_send_bw -d rxe1 ... 127.0.0.1
> 		[Works. Expected]
> 	# ib_send_bw -d rxe2 ... 127.0.0.1
> 	IB device rxe2 not found
>   	 Unable to find the Infiniband/RoCE device
> 		[Not work. Expected.]
> 	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
> 	IB device rxe2 not found
> 	 Unable to find the Infiniband/RoCE device
> 		[Also not work. Turns out rxe2 device is gone after failure. Not expected.]
> 	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
> 	# ls /sys/class/infiniband
> 	rxe0  rxe1  rxe2
> 		[Good. It's back]
> 	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
> 		[Works in test namespace! Expected.]
> 	# sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
> 		[Also works. Definitely not expected.]
>
> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
> other's namespaces (Like ip link or ip addr hide other namespace's devices.)

Thanks a lot for your testing, Bob.

The rdma net namespace support has 2 modes: shared mode and exclusive
mode.

Shared mode is the default mode, so what you have found is normal.

The following explains both shared and exclusive modes.

     rdma system set netns shared|exclusive, specifies the RDMA subsystem
     mode. Either exclusive or shared. When user wants to assign dedicated
     RDMA device to a particular network namespace, exclusive mode should
     be set before creating any network namespace. If there are active
     network namespaces and if one or more RDMA devices exist, changing
     mode from shared to exclusive returns error code EBUSY.

     When RDMA subsystem is in shared mode, RDMA device is accessible in
     all network namespace. When RDMA device isolation among multiple
     network namespaces is not needed, shared mode can be used.

     It is preferred to not change the subsystem mode when there is active
     RDMA traffic running, even though it is supported.
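
As a rough sketch of how the two modes could be inspected and switched from
the command line: the function names below are hypothetical, and the parser
assumes that "rdma system show" prints a line containing "netns <mode>".
The ordering follows the text above (set exclusive mode before creating the
namespace). Whether an rxe device can be moved between namespaces with this
patch set applied is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of iproute2 or rxe.

# Read `rdma system show` output on stdin and print the netns mode.
# Assumes the output contains the token "netns" followed by the mode.
parse_netns_mode() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "netns") { print $(i + 1); exit } }'
}

# Print the current RDMA subsystem netns mode ("shared" or "exclusive").
current_netns_mode() {
    rdma system show | parse_netns_mode
}

# Switch to exclusive mode first, then create the namespace and move one
# RDMA device into it. Needs root. May fail with EBUSY as described above.
# usage: setup_exclusive_netns NETNS DEV
setup_exclusive_netns() {
    ns=$1
    dev=$2
    rdma system set netns exclusive || return 1
    ip netns add "$ns" || return 1
    rdma dev set "$dev" netns "$ns"
}
```

For example, "current_netns_mode" would be expected to print "shared" on a
system left in the default mode, which matches the behavior seen in the
test above.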

Zhu Yanjun

>
> Bob

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-21 21:27   ` Bob Pearson
@ 2023-06-23  7:15     ` Zhu Yanjun
  2023-06-23 12:59       ` Bob Pearson
  0 siblings, 1 reply; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-23  7:15 UTC (permalink / raw)
  To: Bob Pearson, Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav,
	lehrer


On 2023/6/22 5:27, Bob Pearson wrote:
> On 6/21/23 16:09, Bob Pearson wrote:
>> On 5/8/23 02:56, Zhu Yanjun wrote:
>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>
>>> When run "ip link add" command to add a rxe rdma link in a net
>>> namespace, normally this rxe rdma link can not work in a net
>>> name space.
>>>
>>> The root cause is that a sock listening on udp port 4791 is created
>>> in init_net when the rdma_rxe module is loaded into kernel. That is,
>>> the sock listening on udp port 4791 is created in init_net. Other net
>>> namespace is difficult to use this sock.
>>>
>>> The following commits will solve this problem.
>>>
>>> In the first commit, move the creating sock listening on udp port 4791
>>> from module_init function to rdma link creating functions. That is,
>>> after the module rdma_rxe is loaded, the sock will not be created.
>>> When run "rdma link add ..." command, the sock will be created. So
>>> when creating a rdma link in the net namespace, the sock will be
>>> created in this net namespace.
>>>
>>> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
>>> will check the sock exists in the net namespace or not. If yes, rdma
>>> link will increase the reference count of this sock, then continue other
>>> jobs instead of creating a new sock to listen on udp port 4791. Since the
>>> network notifier is global, when the module rdma_rxe is loaded, this
>>> notifier will be registered.
>>>
>>> After the rdma link is created, the command "rdma link del" is to
>>> delete rdma link at the same time the sock is checked. If the reference
>>> count of this sock is greater than the sock reference count needed by
>>> udp tunnel, the sock reference count is decreased by one. If equal, it
>>> indicates that this rdma link is the last one. As such, the udp tunnel
>>> is shut down and the sock is closed. The above work should be
>>> implemented in linkdel function. But currently no dellink function in
>>> rxe. So the 3rd commit addes dellink function pointer. And the 4th
>>> commit implements the dellink function in rxe.
>>>
>>> To now, it is not necessary to keep a global variable to store the sock
>>> listening udp port 4791. This global variable can be replaced by the
>>> functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
>>> function udp6_lib_lookup is in the fast path, a member variable l_sk6
>>> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
>>> to lookup the sock, then the sock is stored in l_sk6, in the future,it
>>> can be used directly.
>>>
>>> All the above work has been done in init_net. And it can also work in
>>> the net namespace. So the init_net is replaced by the individual net
>>> namespace. This is what the 6th commit does. Because rxe device is
>>> dependent on the net device and the sock listening on udp port 4791,
>>> every rxe device is in exclusive mode in the individual net namespace.
>>> Other rdma netns operations will be considerred in the future.
>>>
>>> In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
>>> functions are added. When a new net namespace is created, the init
>>> function will initialize the sk4 and sk6 socks. Then the 2 socks will
>>> be released when the net namespace is destroyed. The functions
>>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net
>>> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will
>>> handle sk6. Then sk4 and sk6 are used in the previous commits.
>>>
>>> As the sk4 and sk6 in pernet namespace can be accessed, it is not
>>> necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
>>> replaced with the sk6 in pernet namespace.
>>>
>>> Test steps:
>>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>>
>>>    # ip netns exec net0 ip link
>>>    3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>>       link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>>       altname enp5s0
>>>
>>>    # ip netns exec net1 ip link
>>>    4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>>       link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>>
>>> 2) Add rdma link in the different net namespace
>>>      net0:
>>>      # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>>
>>>      net1:
>>>      # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>>
>>> 3) Run rping test.
>>>      net0
>>>      # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>>      [1] 1737
>>>      # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>>      verbose
>>>      count 1
>>>      ...
>>>      ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>>      ...
>>>
>>> 4) Remove the rdma links from the net namespaces.
>>>      net0:
>>>      # ip netns exec net0 ss -lu
>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>      UNCONN    0         0         [::]:4791             [::]:*
>>>
>>>      # ip netns exec net0 rdma link del rxe0
>>>
>>>      # ip netns exec net0 ss -lu
>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>
>>>      net1:
>>>      # ip netns exec net0 ss -lu
>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>      UNCONN    0         0         [::]:4791             [::]:*
>>>
>>>      # ip netns exec net1 rdma link del rxe1
>>>
>>>      # ip netns exec net0 ss -lu
>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>
>>> V4->V5: Rebase the commits to V6.4-rc1
>>>
>>> V3->V4: Rebase the commits to rdma-next;
>>>
>>> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
>>>             verify rdma link is removed.
>>>          2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
>>>          3) Replace l_sk6 with sk6 of pernet_name_space
>>>
>>> V1->V2: Add the explicit initialization of sk6.
>>>
>>> Zhu Yanjun (8):
>>>    RDMA/rxe: Creating listening sock in newlink function
>>>    RDMA/rxe: Support more rdma links in init_net
>>>    RDMA/nldev: Add dellink function pointer
>>>    RDMA/rxe: Implement dellink in rxe
>>>    RDMA/rxe: Replace global variable with sock lookup functions
>>>    RDMA/rxe: add the support of net namespace
>>>    RDMA/rxe: Add the support of net namespace notifier
>>>    RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>>
>>>   drivers/infiniband/core/nldev.c     |   6 ++
>>>   drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>>   drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>>   drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>>   drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>>   drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>>   drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>>   include/rdma/rdma_netlink.h         |   2 +
>>>   8 files changed, 279 insertions(+), 40 deletions(-)
>>>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>>
>> Zhu,
>>
>> I did some simple experiments on netns functionality.
>>
>> With your patch set applied and rxe0 created on enp6s0 and rxe1 created on lo in the default namespace
>>
>> 	# sudo ip netns add test
>> 	# ip netns
>> 	test
>> 	# sudo ip netns exec test ip link
>> 	1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 	# sudo ip netns exec test ip link set dev lo up
>> 	# sudo ip netns exec test ip link
>> 	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 	# sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
>> 		[rxe doesn't work unless this IPV6 address is set]
>> 	# sudo ip netns exec test ip addr
>> 	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 	    inet 127.0.0.1/8 scope host lo
>> 	       valid_lft forever preferred_lft forever
>> 	    inet6 fe80::200:ff:fe00:0/64 scope link
>> 	       valid_lft forever preferred_lft forever
>> 	    inet6 ::1/128 scope host
>> 	       valid_lft forever preferred_lft forever
>> 	# sudo ip netns exec test ls /sys/class/infiniband
>> 	rxe0  rxe1
>> 		[These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
>> 	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>> 	# ls /sys/class/infiniband
>> 	rxe0  rxe1  rxe2
>> 		[The new rxe device shows up in the default namespace. At least we're consistent.]
>> 	# ib_send_bw -d rxe0 ... 192.168.0.27
>> 		[Works. Didn't break the existing rxe devices. Expected]
>> 	# ib_send_bw -d rxe1 ... 127.0.0.1
>> 		[Works. Expected]
>> 	# ib_send_bw -d rxe2 ... 127.0.0.1
>> 	IB device rxe2 not found
>>   	 Unable to find the Infiniband/RoCE device
>> 		[Not work. Expected.]
>> 	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>> 	IB device rxe2 not found
>> 	 Unable to find the Infiniband/RoCE device
>> 		[Also not work. Turns out rxe2 device is gone after failure. Not expected.]
>> 	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>> 	# ls /sys/class/infiniband
>> 	rxe0  rxe1  rxe2
>> 		[Good. It's back]
>> 	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>> 		[Works in test namespace! Expected.]
>> 	# sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
>> 		[Also works. Definitely not expected.]
>>
>> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
>> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
>> other's namespaces (Like ip link or ip addr hide other namespace's devices.)
>>
>> Bob
> Forgot to mention. It also is definitely not good that a process in the default namespace can destroy
> a rxe device in the test namespace by trying to use it.

Thanks a lot.

I am not sure whether it is correct to destroy an rxe device from
outside its net namespace.

With irdma/mlx5 rdma devices, we can also destroy them from outside of
the net namespace with the command "modprobe -v irdma/mlx5...".

So I am not sure whether this behavior is wrong or not.

Zhu Yanjun

>
> Bob

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-23  7:15     ` Zhu Yanjun
@ 2023-06-23 12:59       ` Bob Pearson
  2023-06-23 23:50         ` Zhu Yanjun
  0 siblings, 1 reply; 25+ messages in thread
From: Bob Pearson @ 2023-06-23 12:59 UTC (permalink / raw)
  To: Zhu Yanjun, Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav,
	lehrer

On 6/23/23 02:15, Zhu Yanjun wrote:
> 
> On 2023/6/22 5:27, Bob Pearson wrote:
>> On 6/21/23 16:09, Bob Pearson wrote:
>>> On 5/8/23 02:56, Zhu Yanjun wrote:
>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>
>>>> When run "ip link add" command to add a rxe rdma link in a net
>>>> namespace, normally this rxe rdma link can not work in a net
>>>> name space.
>>>>
>>>> The root cause is that a sock listening on udp port 4791 is created
>>>> in init_net when the rdma_rxe module is loaded into kernel. That is,
>>>> the sock listening on udp port 4791 is created in init_net. Other net
>>>> namespace is difficult to use this sock.
>>>>
>>>> The following commits will solve this problem.
>>>>
>>>> In the first commit, move the creating sock listening on udp port 4791
>>>> from module_init function to rdma link creating functions. That is,
>>>> after the module rdma_rxe is loaded, the sock will not be created.
>>>> When run "rdma link add ..." command, the sock will be created. So
>>>> when creating a rdma link in the net namespace, the sock will be
>>>> created in this net namespace.
>>>>
>>>> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
>>>> will check the sock exists in the net namespace or not. If yes, rdma
>>>> link will increase the reference count of this sock, then continue other
>>>> jobs instead of creating a new sock to listen on udp port 4791. Since the
>>>> network notifier is global, when the module rdma_rxe is loaded, this
>>>> notifier will be registered.
>>>>
>>>> After the rdma link is created, the command "rdma link del" is to
>>>> delete rdma link at the same time the sock is checked. If the reference
>>>> count of this sock is greater than the sock reference count needed by
>>>> udp tunnel, the sock reference count is decreased by one. If equal, it
>>>> indicates that this rdma link is the last one. As such, the udp tunnel
>>>> is shut down and the sock is closed. The above work should be
>>>> implemented in linkdel function. But currently no dellink function in
>>>> rxe. So the 3rd commit addes dellink function pointer. And the 4th
>>>> commit implements the dellink function in rxe.
>>>>
>>>> To now, it is not necessary to keep a global variable to store the sock
>>>> listening udp port 4791. This global variable can be replaced by the
>>>> functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
>>>> function udp6_lib_lookup is in the fast path, a member variable l_sk6
>>>> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
>>>> to lookup the sock, then the sock is stored in l_sk6, in the future,it
>>>> can be used directly.
>>>>
>>>> All the above work has been done in init_net. And it can also work in
>>>> the net namespace. So the init_net is replaced by the individual net
>>>> namespace. This is what the 6th commit does. Because rxe device is
>>>> dependent on the net device and the sock listening on udp port 4791,
>>>> every rxe device is in exclusive mode in the individual net namespace.
>>>> Other rdma netns operations will be considerred in the future.
>>>>
>>>> In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
>>>> functions are added. When a new net namespace is created, the init
>>>> function will initialize the sk4 and sk6 socks. Then the 2 socks will
>>>> be released when the net namespace is destroyed. The functions
>>>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net
>>>> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will
>>>> handle sk6. Then sk4 and sk6 are used in the previous commits.
>>>>
>>>> As the sk4 and sk6 in pernet namespace can be accessed, it is not
>>>> necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
>>>> replaced with the sk6 in pernet namespace.
>>>>
>>>> Test steps:
>>>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>>>
>>>>    # ip netns exec net0 ip link
>>>>    3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>>>       link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>>>       altname enp5s0
>>>>
>>>>    # ip netns exec net1 ip link
>>>>    4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>>>       link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>>>
>>>> 2) Add rdma link in the different net namespace
>>>>      net0:
>>>>      # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>>>
>>>>      net1:
>>>>      # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>>>
>>>> 3) Run rping test.
>>>>      net0
>>>>      # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>>>      [1] 1737
>>>>      # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>>>      verbose
>>>>      count 1
>>>>      ...
>>>>      ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>>>      ...
>>>>
>>>> 4) Remove the rdma links from the net namespaces.
>>>>      net0:
>>>>      # ip netns exec net0 ss -lu
>>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>      UNCONN    0         0         [::]:4791             [::]:*
>>>>
>>>>      # ip netns exec net0 rdma link del rxe0
>>>>
>>>>      # ip netns exec net0 ss -lu
>>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>
>>>>      net1:
>>>>      # ip netns exec net0 ss -lu
>>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>      UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>      UNCONN    0         0         [::]:4791             [::]:*
>>>>
>>>>      # ip netns exec net1 rdma link del rxe1
>>>>
>>>>      # ip netns exec net0 ss -lu
>>>>      State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>
>>>> V4->V5: Rebase the commits to V6.4-rc1
>>>>
>>>> V3->V4: Rebase the commits to rdma-next;
>>>>
>>>> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
>>>>             verify rdma link is removed.
>>>>          2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
>>>>          3) Replace l_sk6 with sk6 of pernet_name_space
>>>>
>>>> V1->V2: Add the explicit initialization of sk6.
>>>>
>>>> Zhu Yanjun (8):
>>>>    RDMA/rxe: Creating listening sock in newlink function
>>>>    RDMA/rxe: Support more rdma links in init_net
>>>>    RDMA/nldev: Add dellink function pointer
>>>>    RDMA/rxe: Implement dellink in rxe
>>>>    RDMA/rxe: Replace global variable with sock lookup functions
>>>>    RDMA/rxe: add the support of net namespace
>>>>    RDMA/rxe: Add the support of net namespace notifier
>>>>    RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>>>
>>>>   drivers/infiniband/core/nldev.c     |   6 ++
>>>>   drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>>>   drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>>>   drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>>>   drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>>>   drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>>>   drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>>>   include/rdma/rdma_netlink.h         |   2 +
>>>>   8 files changed, 279 insertions(+), 40 deletions(-)
>>>>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>>>   create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>>>
>>> Zhu,
>>>
>>> I did some simple experiments on netns functionality.
>>>
>>> With your patch set applied and rxe0 created on enp6s0 and rxe1 created on lo in the default namespace
>>>
>>>     # sudo ip netns add test
>>>     # ip netns
>>>     test
>>>     # sudo ip netns exec test ip link
>>>     1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>         link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>     # sudo ip netns exec test ip link set dev lo up
>>>     # sudo ip netns exec test ip link
>>>     1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>>         link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>     # sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
>>>         [rxe doesn't work unless this IPV6 address is set]
>>>     # sudo ip netns exec test ip addr
>>>     1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>>         link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>         inet 127.0.0.1/8 scope host lo
>>>            valid_lft forever preferred_lft forever
>>>         inet6 fe80::200:ff:fe00:0/64 scope link
>>>            valid_lft forever preferred_lft forever
>>>         inet6 ::1/128 scope host
>>>            valid_lft forever preferred_lft forever
>>>     # sudo ip netns exec test ls /sys/class/infiniband
>>>     rxe0  rxe1
>>>         [These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
>>>     # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>     # ls /sys/class/infiniband
>>>     rxe0  rxe1  rxe2
>>>         [The new rxe device shows up in the default namespace. At least we're consistent.]
>>>     # ib_send_bw -d rxe0 ... 192.168.0.27
>>>         [Works. Didn't break the existing rxe devices. Expected]
>>>     # ib_send_bw -d rxe1 ... 127.0.0.1
>>>         [Works. Expected]
>>>     # ib_send_bw -d rxe2 ... 127.0.0.1
>>>     IB device rxe2 not found
>>>        Unable to find the Infiniband/RoCE device
>>>         [Not work. Expected.]
>>>     # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>     IB device rxe2 not found
>>>      Unable to find the Infiniband/RoCE device
>>>         [Also not work. Turns out rxe2 device is gone after failure. Not expected.]
>>>     # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>     # ls /sys/class/infiniband
>>>     rxe0  rxe1  rxe2
>>>         [Good. It's back]
>>>     # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>         [Works in test namespace! Expected.]
>>>     # sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
>>>         [Also works. Definitely not expected.]
>>>
>>> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
>>> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
>>> other's namespaces (Like ip link or ip addr hide other namespace's devices.)
>>>
>>> Bob
>> Forgot to mention. It also is definitely not good that a process in the default namespace can destroy
>> a rxe device in the test namespace by trying to use it.
> 
> Thanks a lot.
> 
> I am not sure if it is correct or not to destroy a rxe device outside this net namespace.
> 
> Because to irdma/mlx5 rdma devices, we can also destroy them with the command "modprobe -v irdma/mlx5..." outside of the net namespace.
> 
> I am not sure if this is correct or not.
> 
> Zhu Yanjun
> 
>>
>> Bob

I didn't intentionally destroy rxe2. I just tried to access the rxe device but it failed.
The rxe device was destroyed as a side effect of failing to open it.
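
One way to check for this side effect is to compare the sysfs listing
before and after the failing command. The helper names below are
hypothetical and the IB_SYSFS override exists only so the sketch can be
tried without real hardware; it is not part of the patch set.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.
# IB_SYSFS may be overridden for testing; it defaults to the real sysfs path.

# Succeeds if the named RDMA device is listed under the sysfs class directory.
ib_dev_present() {
    [ -e "${IB_SYSFS:-/sys/class/infiniband}/$1" ]
}

# Run a command and report whether the device is still listed afterwards.
# usage: check_dev_survives DEV COMMAND [ARGS...]
check_dev_survives() {
    dev=$1
    shift
    if ! ib_dev_present "$dev"; then
        echo "$dev: missing before test"
        return 2
    fi
    # The command itself is allowed to fail; only the device state matters.
    "$@" >/dev/null 2>&1 || true
    if ib_dev_present "$dev"; then
        echo "$dev: still present"
    else
        echo "$dev: gone after command"
        return 1
    fi
}
```

Running "check_dev_survives rxe2 ib_send_bw -d rxe2 ... 127.0.0.1" from the
default namespace (with the elided options filled in as in the test run)
would then show whether the failed open removed the device.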

Bob

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-23 12:59       ` Bob Pearson
@ 2023-06-23 23:50         ` Zhu Yanjun
  2023-06-24 17:03           ` Pearson, Robert B
  0 siblings, 1 reply; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-23 23:50 UTC (permalink / raw)
  To: Bob Pearson, Zhu Yanjun, zyjzyj2000, jgg, leon, linux-rdma, parav,
	lehrer


On 2023/6/23 20:59, Bob Pearson wrote:
> On 6/23/23 02:15, Zhu Yanjun wrote:
>> On 2023/6/22 5:27, Bob Pearson wrote:
>>> On 6/21/23 16:09, Bob Pearson wrote:
>>>> On 5/8/23 02:56, Zhu Yanjun wrote:
>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>
>>>>> When run "ip link add" command to add a rxe rdma link in a net
>>>>> namespace, normally this rxe rdma link can not work in a net
>>>>> name space.
>>>>>
>>>>> The root cause is that a sock listening on udp port 4791 is created
>>>>> in init_net when the rdma_rxe module is loaded into kernel. That is,
>>>>> the sock listening on udp port 4791 is created in init_net. Other net
>>>>> namespace is difficult to use this sock.
>>>>>
>>>>> The following commits will solve this problem.
>>>>>
>>>>> In the first commit, move the creating sock listening on udp port 4791
>>>>> from module_init function to rdma link creating functions. That is,
>>>>> after the module rdma_rxe is loaded, the sock will not be created.
>>>>> When run "rdma link add ..." command, the sock will be created. So
>>>>> when creating a rdma link in the net namespace, the sock will be
>>>>> created in this net namespace.
>>>>>
>>>>> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
>>>>> will check the sock exists in the net namespace or not. If yes, rdma
>>>>> link will increase the reference count of this sock, then continue other
>>>>> jobs instead of creating a new sock to listen on udp port 4791. Since the
>>>>> network notifier is global, when the module rdma_rxe is loaded, this
>>>>> notifier will be registered.
>>>>>
>>>>> After the rdma link is created, the command "rdma link del" is to
>>>>> delete rdma link at the same time the sock is checked. If the reference
>>>>> count of this sock is greater than the sock reference count needed by
>>>>> udp tunnel, the sock reference count is decreased by one. If equal, it
>>>>> indicates that this rdma link is the last one. As such, the udp tunnel
>>>>> is shut down and the sock is closed. The above work should be
>>>>> implemented in a dellink function, but currently there is no dellink
>>>>> function in rxe. So the 3rd commit adds a dellink function pointer, and
>>>>> the 4th commit implements the dellink function in rxe.
>>>>>
>>>>> To now, it is not necessary to keep a global variable to store the sock
>>>>> listening udp port 4791. This global variable can be replaced by the
>>>>> functions udp4_lib_lookup and udp6_lib_lookup totally. Because the
>>>>> function udp6_lib_lookup is in the fast path, a member variable l_sk6
>>>>> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
>>>>> to look up the sock, then the sock is stored in l_sk6 so that in the
>>>>> future it can be used directly.
>>>>>
>>>>> All the above work has been done in init_net. And it can also work in
>>>>> the net namespace. So the init_net is replaced by the individual net
>>>>> namespace. This is what the 6th commit does. Because rxe device is
>>>>> dependent on the net device and the sock listening on udp port 4791,
>>>>> every rxe device is in exclusive mode in the individual net namespace.
>>>>> Other rdma netns operations will be considered in the future.
>>>>>
>>>>> In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
>>>>> functions are added. When a new net namespace is created, the init
>>>>> function will initialize the sk4 and sk6 socks. Then the 2 socks will
>>>>> be released when the net namespace is destroyed. The functions
>>>>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net
>>>>> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will
>>>>> handle sk6. Then sk4 and sk6 are used in the previous commits.
>>>>>
>>>>> As the sk4 and sk6 in pernet namespace can be accessed, it is not
>>>>> necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
>>>>> replaced with the sk6 in pernet namespace.
>>>>>
>>>>> Test steps:
>>>>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>>>>
>>>>>     # ip netns exec net0 ip link
>>>>>     3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>>>>        link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>>>>        altname enp5s0
>>>>>
>>>>>     # ip netns exec net1 ip link
>>>>>     4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>>>>        link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>>>>
>>>>> 2) Add rdma link in the different net namespace
>>>>>       net0:
>>>>>       # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>>>>
>>>>>       net1:
>>>>>       # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>>>>
>>>>> 3) Run rping test.
>>>>>       net0
>>>>>       # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>>>>       [1] 1737
>>>>>       # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>>>>       verbose
>>>>>       count 1
>>>>>       ...
>>>>>       ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>>>>       ...
>>>>>
>>>>> 4) Remove the rdma links from the net namespaces.
>>>>>       net0:
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>       UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>       UNCONN    0         0         [::]:4791             [::]:*
>>>>>
>>>>>       # ip netns exec net0 rdma link del rxe0
>>>>>
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>
>>>>>       net1:
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>       UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>       UNCONN    0         0         [::]:4791             [::]:*
>>>>>
>>>>>       # ip netns exec net1 rdma link del rxe1
>>>>>
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>
>>>>> V4->V5: Rebase the commits to V6.4-rc1
>>>>>
>>>>> V3->V4: Rebase the commits to rdma-next;
>>>>>
>>>>> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
>>>>>              verify rdma link is removed.
>>>>>           2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
>>>>>           3) Replace l_sk6 with sk6 of pernet_name_space
>>>>>
>>>>> V1->V2: Add the explicit initialization of sk6.
>>>>>
>>>>> Zhu Yanjun (8):
>>>>>     RDMA/rxe: Creating listening sock in newlink function
>>>>>     RDMA/rxe: Support more rdma links in init_net
>>>>>     RDMA/nldev: Add dellink function pointer
>>>>>     RDMA/rxe: Implement dellink in rxe
>>>>>     RDMA/rxe: Replace global variable with sock lookup functions
>>>>>     RDMA/rxe: add the support of net namespace
>>>>>     RDMA/rxe: Add the support of net namespace notifier
>>>>>     RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>>>>
>>>>>    drivers/infiniband/core/nldev.c     |   6 ++
>>>>>    drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>>>>    drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>>>>    drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>>>>    drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>>>>    drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>>>>    drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>>>>    include/rdma/rdma_netlink.h         |   2 +
>>>>>    8 files changed, 279 insertions(+), 40 deletions(-)
>>>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>>>>
>>>> Zhu,
>>>>
>>>> I did some simple experiments on netns functionality.
>>>>
>>>> With your patch set applied and rxe0 created on enp6s0 and rxe1 created on lo in the default namespace
>>>>
>>>>      # sudo ip netns add test
>>>>      # ip netns
>>>>      test
>>>>      # sudo ip netns exec test ip link
>>>>      1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>      # sudo ip netns exec test ip link set dev lo up
>>>>      # sudo ip netns exec test ip link
>>>>      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>      # sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
>>>>          [rxe doesn't work unless this IPV6 address is set]
>>>>      # sudo ip netns exec test ip addr
>>>>      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>          inet 127.0.0.1/8 scope host lo
>>>>             valid_lft forever preferred_lft forever
>>>>          inet6 fe80::200:ff:fe00:0/64 scope link
>>>>             valid_lft forever preferred_lft forever
>>>>          inet6 ::1/128 scope host
>>>>             valid_lft forever preferred_lft forever
>>>>      # sudo ip netns exec test ls /sys/class/infiniband
>>>>      rxe0  rxe1
>>>>          [These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
>>>>      # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>      # ls /sys/class/infiniband
>>>>      rxe0  rxe1  rxe2
>>>>          [The new rxe device shows up in the default namespace. At least we're consistent.]
>>>>      # ib_send_bw -d rxe0 ... 192.168.0.27
>>>>          [Works. Didn't break the existing rxe devices. Expected]
>>>>      # ib_send_bw -d rxe1 ... 127.0.0.1
>>>>          [Works. Expected]
>>>>      # ib_send_bw -d rxe2 ... 127.0.0.1
>>>>      IB device rxe2 not found
>>>>         Unable to find the Infiniband/RoCE device
>>>>          [Not work. Expected.]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>      IB device rxe2 not found
>>>>       Unable to find the Infiniband/RoCE device
>>>>          [Also not work. Turns out rxe2 device is gone after failure. Not expected.]
>>>>      # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>      # ls /sys/class/infiniband
>>>>      rxe0  rxe1  rxe2
>>>>          [Good. It's back]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>          [Works in test namespace! Expected.]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
>>>>          [Also works. Definitely not expected.]
>>>>
>>>> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
>>>> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
>>>> other's namespaces (Like ip link or ip addr hide other namespace's devices.)
>>>>
>>>> Bob
>>> Forgot to mention. It also is definitely not good that a process in the default namespace can destroy
>>> a rxe device in the test namespace by trying to use it.
>> Thanks a lot.
>>
>> I am not sure if it is correct or not to destroy a rxe device outside this net namespace.
>>
>> For irdma/mlx5 rdma devices, we can also destroy them with the command "modprobe -v irdma/mlx5..." outside of the net namespace.
>>
>> I am not sure if this is correct or not.
>>
>> Zhu Yanjun
>>
>>> Bob
> I didn't intentionally destroy lo2. I just tried to access the rxe device but it failed.
> The rxe device was destroyed as a side effect of failing to open it.

The GID of rxe can not be generated with lo. This is a problem. Now 
Chuck Lever <cel@kernel.org> will fix it.

I am not sure whether the problem you ran into is related to this. Please
run the tests again with a physical NIC.

Thanks a lot.

Zhu Yanjun

>
> Bob

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-23 23:50         ` Zhu Yanjun
@ 2023-06-24 17:03           ` Pearson, Robert B
  2023-06-24 17:32             ` Chuck Lever III
  2023-06-25  0:53             ` Zhu Yanjun
  0 siblings, 2 replies; 25+ messages in thread
From: Pearson, Robert B @ 2023-06-24 17:03 UTC (permalink / raw)
  To: Zhu Yanjun, Bob Pearson, Zhu Yanjun, zyjzyj2000@gmail.com,
	jgg@ziepe.ca, leon@kernel.org, linux-rdma@vger.kernel.org,
	parav@nvidia.com, lehrer@gmail.com

-----Original Message-----
From: Zhu Yanjun <yanjun.zhu@linux.dev> 
Sent: Friday, June 23, 2023 6:51 PM
To: Bob Pearson <rpearsonhpe@gmail.com>; Zhu Yanjun <yanjun.zhu@intel.com>; zyjzyj2000@gmail.com; jgg@ziepe.ca; leon@kernel.org; linux-rdma@vger.kernel.org; parav@nvidia.com; lehrer@gmail.com
Subject: Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace


On 2023/6/23 20:59, Bob Pearson wrote:
> On 6/23/23 02:15, Zhu Yanjun wrote:
>> On 2023/6/22 5:27, Bob Pearson wrote:
>>> On 6/21/23 16:09, Bob Pearson wrote:
>>>> On 5/8/23 02:56, Zhu Yanjun wrote:
>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>
>>>>> When run "ip link add" command to add a rxe rdma link in a net 
>>>>> namespace, normally this rxe rdma link can not work in a net name 
>>>>> space.
>>>>>
>>>>> The root cause is that a sock listening on udp port 4791 is 
>>>>> created in init_net when the rdma_rxe module is loaded into 
>>>>> kernel. That is, the sock listening on udp port 4791 is created in 
>>>>> init_net. Other net namespace is difficult to use this sock.
>>>>>
>>>>> The following commits will solve this problem.
>>>>>
>>>>> In the first commit, move the creating sock listening on udp port 
>>>>> 4791 from module_init function to rdma link creating functions. 
>>>>> That is, after the module rdma_rxe is loaded, the sock will not be created.
>>>>> When run "rdma link add ..." command, the sock will be created. So 
>>>>> when creating a rdma link in the net namespace, the sock will be 
>>>>> created in this net namespace.
>>>>>
>>>>> In the second commit, the functions udp4_lib_lookup and 
>>>>> udp6_lib_lookup will check the sock exists in the net namespace or 
>>>>> not. If yes, rdma link will increase the reference count of this 
>>>>> sock, then continue other jobs instead of creating a new sock to 
>>>>> listen on udp port 4791. Since the network notifier is global, 
>>>>> when the module rdma_rxe is loaded, this notifier will be registered.
>>>>>
>>>>> After the rdma link is created, the command "rdma link del" is to 
>>>>> delete rdma link at the same time the sock is checked. If the 
>>>>> reference count of this sock is greater than the sock reference 
>>>>> count needed by udp tunnel, the sock reference count is decreased 
>>>>> by one. If equal, it indicates that this rdma link is the last 
>>>>> one. As such, the udp tunnel is shut down and the sock is closed. 
>>>>> The above work should be implemented in a dellink function, but
>>>>> currently there is no dellink function in rxe. So the 3rd commit adds a
>>>>> dellink function pointer, and the 4th commit implements the dellink function in rxe.
>>>>>
>>>>> To now, it is not necessary to keep a global variable to store the 
>>>>> sock listening udp port 4791. This global variable can be replaced 
>>>>> by the functions udp4_lib_lookup and udp6_lib_lookup totally. 
>>>>> Because the function udp6_lib_lookup is in the fast path, a member 
>>>>> variable l_sk6 is added to store the sock. If l_sk6 is NULL, 
>>>>> udp6_lib_lookup is called to look up the sock, then the sock is
>>>>> stored in l_sk6 so that in the future it can be used directly.
>>>>>
>>>>> All the above work has been done in init_net. And it can also work 
>>>>> in the net namespace. So the init_net is replaced by the 
>>>>> individual net namespace. This is what the 6th commit does. 
>>>>> Because rxe device is dependent on the net device and the sock 
>>>>> listening on udp port 4791, every rxe device is in exclusive mode in the individual net namespace.
>>>>> Other rdma netns operations will be considered in the future.
>>>>>
>>>>> In the 7th commit, the 
>>>>> register_pernet_subsys/unregister_pernet_subsys
>>>>> functions are added. When a new net namespace is created, the init 
>>>>> function will initialize the sk4 and sk6 socks. Then the 2 socks 
>>>>> will be released when the net namespace is destroyed. The 
>>>>> functions
>>>>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in 
>>>>> the net namespace. The functions 
>>>>> rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will handle sk6. Then sk4 and sk6 are used in the previous commits.
>>>>>
>>>>> As the sk4 and sk6 in pernet namespace can be accessed, it is not 
>>>>> necessary to add a new l_sk6. As such, in the 8th commit, the 
>>>>> l_sk6 is replaced with the sk6 in pernet namespace.
>>>>>
>>>>> Test steps:
>>>>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>>>>
>>>>>     # ip netns exec net0 ip link
>>>>>     3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>>>>        link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>>>>        altname enp5s0
>>>>>
>>>>>     # ip netns exec net1 ip link
>>>>>     4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>>>>        link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>>>>
>>>>> 2) Add rdma link in the different net namespace
>>>>>       net0:
>>>>>       # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>>>>
>>>>>       net1:
>>>>>       # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>>>>
>>>>> 3) Run rping test.
>>>>>       net0
>>>>>       # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>>>>       [1] 1737
>>>>>       # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>>>>       verbose
>>>>>       count 1
>>>>>       ...
>>>>>       ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>>>>       ...
>>>>>
>>>>> 4) Remove the rdma links from the net namespaces.
>>>>>       net0:
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>       UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>       UNCONN    0         0         [::]:4791             [::]:*
>>>>>
>>>>>       # ip netns exec net0 rdma link del rxe0
>>>>>
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>
>>>>>       net1:
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>       UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>       UNCONN    0         0         [::]:4791             [::]:*
>>>>>
>>>>>       # ip netns exec net1 rdma link del rxe1
>>>>>
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>
>>>>> V4->V5: Rebase the commits to V6.4-rc1
>>>>>
>>>>> V3->V4: Rebase the commits to rdma-next;
>>>>>
>>>>> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
>>>>>              verify rdma link is removed.
>>>>>           2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
>>>>>           3) Replace l_sk6 with sk6 of pernet_name_space
>>>>>
>>>>> V1->V2: Add the explicit initialization of sk6.
>>>>>
>>>>> Zhu Yanjun (8):
>>>>>     RDMA/rxe: Creating listening sock in newlink function
>>>>>     RDMA/rxe: Support more rdma links in init_net
>>>>>     RDMA/nldev: Add dellink function pointer
>>>>>     RDMA/rxe: Implement dellink in rxe
>>>>>     RDMA/rxe: Replace global variable with sock lookup functions
>>>>>     RDMA/rxe: add the support of net namespace
>>>>>     RDMA/rxe: Add the support of net namespace notifier
>>>>>     RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>>>>
>>>>>    drivers/infiniband/core/nldev.c     |   6 ++
>>>>>    drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>>>>    drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>>>>    drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>>>>    drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>>>>    drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>>>>    drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>>>>    include/rdma/rdma_netlink.h         |   2 +
>>>>>    8 files changed, 279 insertions(+), 40 deletions(-)
>>>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>>>>
>>>> Zhu,
>>>>
>>>> I did some simple experiments on netns functionality.
>>>>
>>>> With your patch set applied and rxe0 created on enp6s0 and rxe1 
>>>> created on lo in the default namespace
>>>>
>>>>      # sudo ip netns add test
>>>>      # ip netns
>>>>      test
>>>>      # sudo ip netns exec test ip link
>>>>      1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>      # sudo ip netns exec test ip link set dev lo up
>>>>      # sudo ip netns exec test ip link
>>>>      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>      # sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
>>>>          [rxe doesn't work unless this IPV6 address is set]
>>>>      # sudo ip netns exec test ip addr
>>>>      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>          inet 127.0.0.1/8 scope host lo
>>>>             valid_lft forever preferred_lft forever
>>>>          inet6 fe80::200:ff:fe00:0/64 scope link
>>>>             valid_lft forever preferred_lft forever
>>>>          inet6 ::1/128 scope host
>>>>             valid_lft forever preferred_lft forever
>>>>      # sudo ip netns exec test ls /sys/class/infiniband
>>>>      rxe0  rxe1
>>>>          [These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
>>>>      # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>      # ls /sys/class/infiniband
>>>>      rxe0  rxe1  rxe2
>>>>          [The new rxe device shows up in the default namespace. At least we're consistent.]
>>>>      # ib_send_bw -d rxe0 ... 192.168.0.27
>>>>          [Works. Didn't break the existing rxe devices. Expected]
>>>>      # ib_send_bw -d rxe1 ... 127.0.0.1
>>>>          [Works. Expected]
>>>>      # ib_send_bw -d rxe2 ... 127.0.0.1
>>>>      IB device rxe2 not found
>>>>         Unable to find the Infiniband/RoCE device
>>>>          [Not work. Expected.]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>      IB device rxe2 not found
>>>>       Unable to find the Infiniband/RoCE device
>>>>          [Also not work. Turns out rxe2 device is gone after failure. Not expected.]
>>>>      # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>      # ls /sys/class/infiniband
>>>>      rxe0  rxe1  rxe2
>>>>          [Good. It's back]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>          [Works in test namespace! Expected.]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
>>>>          [Also works. Definitely not expected.]
>>>>
>>>> My take, it sort of works. But there are some serious issues. You 
>>>> shouldn't be able to use the
>>>> rxe2 device in the default namespace. It would be nice if you 
>>>> couldn't see the rxe devices in each other's namespaces (Like ip 
>>>> link or ip addr hide other namespace's devices.)
>>>>
>>>> Bob
>>> Forgot to mention. It also is definitely not good that a process in 
>>> the default namespace can destroy a rxe device in the test namespace by trying to use it.
>> Thanks a lot.
>>
>> I am not sure if it is correct or not to destroy a rxe device outside this net namespace.
>>
>> For irdma/mlx5 rdma devices, we can also destroy them with the command "modprobe -v irdma/mlx5..." outside of the net namespace.
>>
>> I am not sure if this is correct or not.
>>
>> Zhu Yanjun
>>
>>> Bob
> I didn't intentionally destroy lo2. I just tried to access the rxe device but it failed.
> The rxe device was destroyed as a side effect of failing to open it.

The GID of rxe can not be generated with lo. This is a problem. Now Chuck Lever <cel@kernel.org> will fix it.

Not sure if the problem that you confronted is related with this. Please use physical NIC to make tests again.

Thanks a lot.

Zhu Yanjun

>
> Bob

That is why I added the IPv6 address by hand; doing so created the GID table entry. This is
also a problem for every Ethernet device on distros that mangle the MAC address when creating the
IPv6 address as a security measure. These include Ubuntu, which I use. So I always have to add
an IPv6 address based on the MAC address for any Ethernet device.
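
For reference, the hand-added address fe80::0200:00ff:fe00:0000 used on lo
earlier in the thread matches the usual modified EUI-64 derivation from the
all-zero loopback MAC. The following is only a rough sketch of that
derivation, not code from the patch set; the helper name eui64_link_local
is made up for illustration.

```python
import ipaddress


def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Hypothetical helper: MAC-based (modified EUI-64) link-local address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC such as 00:1e:67:a0:22:3f")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC to get 64 bits.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80::/64 prefix: fe80 followed by 48 zero bits.
    prefix = bytes.fromhex("fe80") + bytes(6)
    return ipaddress.IPv6Address(prefix + interface_id)


if __name__ == "__main__":
    # The loopback MAC is all zeros; ip(8) prints the result in compressed form.
    print(eui64_link_local("00:00:00:00:00:00"))
    # MAC of eno3 from the cover letter's test setup.
    print(eui64_link_local("f8:e4:3b:3b:e4:10"))
```

The printed address, with a /64 suffix appended, is what would be passed to
"ip addr add dev <netdev> ..." in the affected namespace.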

Bob

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-24 17:03           ` Pearson, Robert B
@ 2023-06-24 17:32             ` Chuck Lever III
  2023-06-25  0:54               ` Zhu Yanjun
  2023-06-25  0:53             ` Zhu Yanjun
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever III @ 2023-06-24 17:32 UTC (permalink / raw)
  To: Zhu Yanjun, Bob Pearson
  Cc: Zhu Yanjun, zyjzyj2000@gmail.com, jgg@ziepe.ca, leon@kernel.org,
	linux-rdma@vger.kernel.org, parav@nvidia.com, lehrer@gmail.com


> The GID of rxe can not be generated with lo. This is a problem.

I agree, and would like to see a fix. It's obviously going to be a very
useful use case for CI environments for upper layer storage protocols
such as NFS and SMB, for instance.


> Now Chuck Lever <cel@kernel.org> will fix it.

My understanding is that, because RoCE allows more than one port per egress
device, the mechanism for enabling rxe-on-lo is going to be different than
it is for iWARP -- or it might not be possible at all. That's why my siw
patches do not implement a fix for rxe.

Jason needs to outline a mechanism for it so we can see what needs to be
done. At which point, any interested party should be able to fix it.


--
Chuck Lever



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-24 17:03           ` Pearson, Robert B
  2023-06-24 17:32             ` Chuck Lever III
@ 2023-06-25  0:53             ` Zhu Yanjun
  1 sibling, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-25  0:53 UTC (permalink / raw)
  To: Pearson, Robert B, Bob Pearson, Zhu Yanjun, zyjzyj2000@gmail.com,
	jgg@ziepe.ca, leon@kernel.org, linux-rdma@vger.kernel.org,
	parav@nvidia.com, lehrer@gmail.com


On 2023/6/25 1:03, Pearson, Robert B wrote:
> -----Original Message-----
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> Sent: Friday, June 23, 2023 6:51 PM
> To: Bob Pearson <rpearsonhpe@gmail.com>; Zhu Yanjun <yanjun.zhu@intel.com>; zyjzyj2000@gmail.com; jgg@ziepe.ca; leon@kernel.org; linux-rdma@vger.kernel.org; parav@nvidia.com; lehrer@gmail.com
> Subject: Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
>
>
> On 2023/6/23 20:59, Bob Pearson wrote:
>> On 6/23/23 02:15, Zhu Yanjun wrote:
>>> On 2023/6/22 5:27, Bob Pearson wrote:
>>>> On 6/21/23 16:09, Bob Pearson wrote:
>>>>> On 5/8/23 02:56, Zhu Yanjun wrote:
>>>>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>>>>
>>>>>> When run "ip link add" command to add a rxe rdma link in a net
>>>>>> namespace, normally this rxe rdma link can not work in a net name
>>>>>> space.
>>>>>>
>>>>>> The root cause is that a sock listening on udp port 4791 is
>>>>>> created in init_net when the rdma_rxe module is loaded into
>>>>>> kernel. That is, the sock listening on udp port 4791 is created in
>>>>>> init_net. Other net namespace is difficult to use this sock.
>>>>>>
>>>>>> The following commits will solve this problem.
>>>>>>
>>>>>> In the first commit, move the creating sock listening on udp port
>>>>>> 4791 from module_init function to rdma link creating functions.
>>>>>> That is, after the module rdma_rxe is loaded, the sock will not be created.
>>>>>> When run "rdma link add ..." command, the sock will be created. So
>>>>>> when creating a rdma link in the net namespace, the sock will be
>>>>>> created in this net namespace.
>>>>>>
>>>>>> In the second commit, the functions udp4_lib_lookup and
>>>>>> udp6_lib_lookup will check the sock exists in the net namespace or
>>>>>> not. If yes, rdma link will increase the reference count of this
>>>>>> sock, then continue other jobs instead of creating a new sock to
>>>>>> listen on udp port 4791. Since the network notifier is global,
>>>>>> when the module rdma_rxe is loaded, this notifier will be registered.
>>>>>>
>>>>>> After the rdma link is created, the command "rdma link del" is to
>>>>>> delete rdma link at the same time the sock is checked. If the
>>>>>> reference count of this sock is greater than the sock reference
>>>>>> count needed by udp tunnel, the sock reference count is decreased
>>>>>> by one. If equal, it indicates that this rdma link is the last
>>>>>> one. As such, the udp tunnel is shut down and the sock is closed.
>>>>>> The above work should be implemented in a dellink function, but
>>>>>> currently there is no dellink function in rxe. So the 3rd commit adds a
>>>>>> dellink function pointer, and the 4th commit implements the dellink function in rxe.
>>>>>>
>>>>>> To now, it is not necessary to keep a global variable to store the
>>>>>> sock listening udp port 4791. This global variable can be replaced
>>>>>> by the functions udp4_lib_lookup and udp6_lib_lookup totally.
>>>>>> Because the function udp6_lib_lookup is in the fast path, a member
>>>>>> variable l_sk6 is added to store the sock. If l_sk6 is NULL,
>>>>>> udp6_lib_lookup is called to look up the sock, then the sock is
>>>>>> stored in l_sk6 so that in the future it can be used directly.
>>>>>>
>>>>>> All the above work has been done in init_net. And it can also work
>>>>>> in the net namespace. So the init_net is replaced by the
>>>>>> individual net namespace. This is what the 6th commit does.
>>>>>> Because rxe device is dependent on the net device and the sock
>>>>>> listening on udp port 4791, every rxe device is in exclusive mode in the individual net namespace.
>>>>>> Other rdma netns operations will be considered in the future.
>>>>>>
>>>>>> In the 7th commit, the
>>>>>> register_pernet_subsys/unregister_pernet_subsys
>>>>>> functions are added. When a new net namespace is created, the init
>>>>>> function will initialize the sk4 and sk6 socks. Then the 2 socks
>>>>>> will be released when the net namespace is destroyed. The
>>>>>> functions
>>>>>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in
>>>>>> the net namespace. The functions
>>>>>> rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will handle sk6. Then sk4 and sk6 are used in the previous commits.
>>>>>>
>>>>>> As the sk4 and sk6 in pernet namespace can be accessed, it is not
>>>>>> necessary to add a new l_sk6. As such, in the 8th commit, the
>>>>>> l_sk6 is replaced with the sk6 in pernet namespace.
>>>>>>
>>>>>> Test steps:
>>>>>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>>>>>
>>>>>>      # ip netns exec net0 ip link
>>>>>>      3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>>>>>         link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>>>>>         altname enp5s0
>>>>>>
>>>>>>      # ip netns exec net1 ip link
>>>>>>      4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>>>>>         link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>>>>>
>>>>>> 2) Add rdma link in the different net namespace
>>>>>>        net0:
>>>>>>        # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>>>>>
>>>>>>        net1:
>>>>>>        # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>>>>>
>>>>>> 3) Run rping test.
>>>>>>        net0
>>>>>>        # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>>>>>        [1] 1737
>>>>>>        # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>>>>>        verbose
>>>>>>        count 1
>>>>>>        ...
>>>>>>        ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>>>>>        ...
>>>>>>
>>>>>> 4) Remove the rdma links from the net namespaces.
>>>>>>        net0:
>>>>>>        # ip netns exec net0 ss -lu
>>>>>>        State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>>        UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>>        UNCONN    0         0         [::]:4791             [::]:*
>>>>>>
>>>>>>        # ip netns exec net0 rdma link del rxe0
>>>>>>
>>>>>>        # ip netns exec net0 ss -lu
>>>>>>        State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>>
>>>>>>        net1:
>>>>>>        # ip netns exec net1 ss -lu
>>>>>>        State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>>        UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>>        UNCONN    0         0         [::]:4791             [::]:*
>>>>>>
>>>>>>        # ip netns exec net1 rdma link del rxe1
>>>>>>
>>>>>>        # ip netns exec net1 ss -lu
>>>>>>        State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>>
>>>>>> V4->V5: Rebase the commits to V6.4-rc1
>>>>>>
>>>>>> V3->V4: Rebase the commits to rdma-next;
>>>>>>
>>>>>> V2->V3: 1) Add "rdma link del" example in the cover letter, and
>>>>>>            use "ss -lu" to verify rdma link is removed.
>>>>>>         2) Add register_pernet_subsys/unregister_pernet_subsys net
>>>>>>            namespace
>>>>>>         3) Replace l_sk6 with sk6 of pernet_name_space
>>>>>>
>>>>>> V1->V2: Add the explicit initialization of sk6.
>>>>>>
>>>>>> Zhu Yanjun (8):
>>>>>>      RDMA/rxe: Creating listening sock in newlink function
>>>>>>      RDMA/rxe: Support more rdma links in init_net
>>>>>>      RDMA/nldev: Add dellink function pointer
>>>>>>      RDMA/rxe: Implement dellink in rxe
>>>>>>      RDMA/rxe: Replace global variable with sock lookup functions
>>>>>>      RDMA/rxe: add the support of net namespace
>>>>>>      RDMA/rxe: Add the support of net namespace notifier
>>>>>>      RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>>>>>
>>>>>>     drivers/infiniband/core/nldev.c     |   6 ++
>>>>>>     drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>>>>>     drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>>>>>     drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>>>>>     drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>>>>>     drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>>>>>     drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>>>>>     include/rdma/rdma_netlink.h         |   2 +
>>>>>>     8 files changed, 279 insertions(+), 40 deletions(-)
>>>>>>     create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>>>>>     create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>>>>>
>>>>> Zhu,
>>>>>
>>>>> I did some simple experiments on netns functionality.
>>>>>
>>>>> With your patch set applied and rxe0 created on enp6s0 and rxe1
>>>>> created on lo in the default namespace
>>>>>
>>>>>       # sudo ip netns add test
>>>>>       # ip netns
>>>>>       test
>>>>>       # sudo ip netns exec test ip link
>>>>>       1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>>>           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>       # sudo ip netns exec test ip link set dev lo up
>>>>>       # sudo ip netns exec test ip link
>>>>>       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>>>>           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>       # sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
>>>>>           [rxe doesn't work unless this IPV6 address is set]
>>>>>       # sudo ip netns exec test ip addr
>>>>>       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>>>>           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>           inet 127.0.0.1/8 scope host lo
>>>>>              valid_lft forever preferred_lft forever
>>>>>           inet6 fe80::200:ff:fe00:0/64 scope link
>>>>>              valid_lft forever preferred_lft forever
>>>>>           inet6 ::1/128 scope host
>>>>>              valid_lft forever preferred_lft forever
>>>>>       # sudo ip netns exec test ls /sys/class/infiniband
>>>>>       rxe0  rxe1
>>>>>           [These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
>>>>>       # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>>       # ls /sys/class/infiniband
>>>>>       rxe0  rxe1  rxe2
>>>>>           [The new rxe device shows up in the default namespace. At least we're consistent.]
>>>>>       # ib_send_bw -d rxe0 ... 192.168.0.27
>>>>>           [Works. Didn't break the existing rxe devices. Expected]
>>>>>       # ib_send_bw -d rxe1 ... 127.0.0.1
>>>>>           [Works. Expected]
>>>>>       # ib_send_bw -d rxe2 ... 127.0.0.1
>>>>>       IB device rxe2 not found
>>>>>          Unable to find the Infiniband/RoCE device
>>>>>           [Not work. Expected.]
>>>>>       # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>>       IB device rxe2 not found
>>>>>        Unable to find the Infiniband/RoCE device
>>>>>           [Also not work. Turns out rxe2 device is gone after failure. Not expected.]
>>>>>       # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>>       # ls /sys/class/infiniband
>>>>>       rxe0  rxe1  rxe2
>>>>>           [Good. It's back]
>>>>>       # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>>           [Works in test namespace! Expected.]
>>>>>       # sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
>>>>>           [Also works. Definitely not expected.]
>>>>>
>>>>> My take, it sort of works. But there are some serious issues. You
>>>>> shouldn't be able to use the
>>>>> rxe2 device in the default namespace. It would be nice if you
>>>>> couldn't see the rxe devices in each other's namespaces (Like ip
>>>>> link or ip addr hide other namespace's devices.)
>>>>>
>>>>> Bob
>>>> Forgot to mention. It also is definitely not good that a process in
>>>> the default namespace can destroy a rxe device in the test namespace by trying to use it.
>>> Thanks a lot.
>>>
>>> I am not sure whether it is correct to destroy a rxe device from outside its net namespace.
>>>
>>> With irdma/mlx5 rdma devices, we can also destroy them from outside the net namespace with the command "modprobe -v irdma/mlx5...".
>>>
>>> I am not sure if this is correct or not.
>>>
>>> Zhu Yanjun
>>>
>>>> Bob
>> I didn't intentionally destroy lo2. I just tried to access the rxe device but it failed.
>> The rxe device was destroyed as a side effect of failing to open it.
> The GID of rxe can not be generated with lo. This is a problem. Now Chuck Lever <cel@kernel.org> will fix it.
>
> Not sure if the problem that you confronted is related with this. Please use physical NIC to make tests again.
>
> Thanks a lot.
>
> Zhu Yanjun
>
>> Bob
> That was why I added the IPV6 address by hand. That created the gid table entry. This is
> also a problem for all ethernet devices for distros that mangle the MAC address when creating the
> IPV6 address as a security measure. These include Ubuntu which I use. So I have to always add
> an IPV6 address based on the MAC address for any ethernet device.
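
The hand-added address used earlier in this thread (fe80::0200:00ff:fe00:0000 for the all-zero MAC of lo) has the shape of a MAC-derived, modified EUI-64 link-local address. As a rough illustration only, the derivation can be sketched in POSIX shell as below. The function name mac_to_ll is hypothetical and not part of iproute2 or rdma-core, and the sketch assumes a well-formed aa:bb:cc:dd:ee:ff input.

```shell
# Sketch: derive the modified EUI-64 link-local address from a MAC address.
# mac_to_ll is an illustrative helper, not an existing tool.
mac_to_ll() {
    # Split the MAC on ':' into six positional parameters.
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    if [ "$#" -ne 6 ]; then
        echo "expected a MAC of the form aa:bb:cc:dd:ee:ff" >&2
        return 1
    fi
    # Flip the universal/local bit (0x02) in the first octet.
    first=$(printf '%02x' $(( 0x$1 ^ 0x02 )))
    # Insert ff:fe between the third and fourth octets.
    printf 'fe80::%s%s:%sff:fe%s:%s%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}
```

With such a helper, the manual step from the experiment could be written as `ip netns exec test ip addr add dev lo "$(mac_to_ll 00:00:00:00:00:00)/64"`, which yields the same address that was added by hand.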


Chuck Lever <cel@kernel.org> will fix this problem. Until the fix for
this problem is merged, it is not a good idea to test with rxe-on-lo.

It is difficult for us to tell whether a failure comes from lo or from the net namespace.

So please use a physical NIC instead of lo for now.

Zhu Yanjun


>
> Bob


* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-24 17:32             ` Chuck Lever III
@ 2023-06-25  0:54               ` Zhu Yanjun
  0 siblings, 0 replies; 25+ messages in thread
From: Zhu Yanjun @ 2023-06-25  0:54 UTC (permalink / raw)
  To: Chuck Lever III, Bob Pearson
  Cc: Zhu Yanjun, zyjzyj2000@gmail.com, jgg@ziepe.ca, leon@kernel.org,
	linux-rdma@vger.kernel.org, parav@nvidia.com, lehrer@gmail.com


On 2023/6/25 1:32, Chuck Lever III wrote:
>> The GID of rxe can not be generated with lo. This is a problem.
> I agree, and would like to see a fix. It's obviously going to be a very
> useful use case for CI environments for upper layer storage protocols
> such as NFS and SMB, for instance.
>
>
>> Now Chuck Lever <cel@kernel.org> will fix it.
> My understanding is that, because RoCE allows more than one port per egress
> device, the mechanism for enabling rxe-on-lo is going to be different than
> it is for iWARP -- or it might not be possible at all. That's why my siw
> patches do not implement a fix for rxe.
>
> Jason needs to outline a mechanism for it so we can see what needs to be
> done. At which point, any interested party should be able to fix it.

Look forward to the fix.

Zhu Yanjun

>
>
> --
> Chuck Lever
>
>


* Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace
  2023-06-21 21:09 ` [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work " Bob Pearson
                     ` (2 preceding siblings ...)
  2023-06-23  7:09   ` Zhu Yanjun
@ 2024-11-12  9:33   ` Cyclinder Kuo
  3 siblings, 0 replies; 25+ messages in thread
From: Cyclinder Kuo @ 2024-11-12  9:33 UTC (permalink / raw)
  To: rpearsonhpe
  Cc: jgg, lehrer, leon, linux-rdma, parav, yanjun.zhu, yanjun.zhu,
	zyjzyj2000

> > From: Zhu Yanjun <yanjun.zhu@linux.dev>
> > 
> > When run "ip link add" command to add a rxe rdma link in a net
> > namespace, normally this rxe rdma link can not work in a net
> > name space.
> > 
> > The root cause is that a sock listening on udp port 4791 is created
> > in init_net when the rdma_rxe module is loaded into kernel. That is,
> > the sock listening on udp port 4791 is created in init_net. Other net
> > namespace is difficult to use this sock.
> > 
> > The following commits will solve this problem.
> > 
> > In the first commit, move the creating sock listening on udp port 4791
> > from module_init function to rdma link creating functions. That is,
> > after the module rdma_rxe is loaded, the sock will not be created.
> > When run "rdma link add ..." command, the sock will be created. So
> > when creating a rdma link in the net namespace, the sock will be
> > created in this net namespace.
> > 
> > In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
> > will check the sock exists in the net namespace or not. If yes, rdma
> > link will increase the reference count of this sock, then continue other
> > jobs instead of creating a new sock to listen on udp port 4791. Since the
> > network notifier is global, when the module rdma_rxe is loaded, this
> > notifier will be registered.
> > 
> > After the rdma link is created, the command "rdma link del" is to
> > delete rdma link at the same time the sock is checked. If the reference
> > count of this sock is greater than the sock reference count needed by
> > udp tunnel, the sock reference count is decreased by one. If equal, it
> > indicates that this rdma link is the last one. As such, the udp tunnel
> > is shut down and the sock is closed. The above work should be
> > implemented in a dellink function, but rxe currently has none.
> > So the 3rd commit adds a dellink function pointer, and the 4th
> > commit implements the dellink function in rxe.
> > 
> > By now, it is no longer necessary to keep a global variable to store
> > the sock listening on udp port 4791. This global variable can be
> > replaced entirely by the functions udp4_lib_lookup and
> > udp6_lib_lookup. Because the function udp6_lib_lookup is in the fast
> > path, a member variable l_sk6 is added to store the sock. If l_sk6 is
> > NULL, udp6_lib_lookup is called to look up the sock, and the sock is
> > then stored in l_sk6 so that it can be used directly in the future.
> > 
> > All the above work has been done in init_net. And it can also work in
> > the net namespace. So the init_net is replaced by the individual net
> > namespace. This is what the 6th commit does. Because rxe device is
> > dependent on the net device and the sock listening on udp port 4791,
> > every rxe device is in exclusive mode in the individual net namespace.
> > Other rdma netns operations will be considered in the future.
> > 
> > In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
> > functions are added. When a new net namespace is created, the init
> > function will initialize the sk4 and sk6 socks. Then the 2 socks will
> > be released when the net namespace is destroyed. The functions
> > rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net
> > namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will
> > handle sk6. Then sk4 and sk6 are used in the previous commits.
> > 
> > As the sk4 and sk6 in pernet namespace can be accessed, it is not
> > necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
> > replaced with the sk6 in pernet namespace.
> > 
> > Test steps:
> > 1) Suppose that 2 NICs are in 2 different net namespaces.
> > 
> >   # ip netns exec net0 ip link
> >   3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
> >      link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
> >      altname enp5s0
> > 
> >   # ip netns exec net1 ip link
> >   4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
> >      link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
> > 
> > 2) Add rdma link in the different net namespace
> >     net0:
> >     # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
> > 
> >     net1:
> >     # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
> > 
> > 3) Run rping test.
> >     net0
> >     # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
> >     [1] 1737
> >     # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
> >     verbose
> >     count 1
> >     ...
> >     ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
> >     ...
> > 
> > 4) Remove the rdma links from the net namespaces.
> >     net0:
> >     # ip netns exec net0 ss -lu
> >     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> >     UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
> >     UNCONN    0         0         [::]:4791             [::]:*
> > 
> >     # ip netns exec net0 rdma link del rxe0
> > 
> >     # ip netns exec net0 ss -lu
> >     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> > 
> >     net1:
> >     # ip netns exec net1 ss -lu
> >     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> >     UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
> >     UNCONN    0         0         [::]:4791             [::]:*
> > 
> >     # ip netns exec net1 rdma link del rxe1
> > 
> >     # ip netns exec net1 ss -lu
> >     State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
> > 
> > V4->V5: Rebase the commits to V6.4-rc1
> > 
> > V3->V4: Rebase the commits to rdma-next;
> > 
> > V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
> >            verify rdma link is removed.
> >         2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
> >         3) Replace l_sk6 with sk6 of pernet_name_space
> > 
> > V1->V2: Add the explicit initialization of sk6.
> > 
> > Zhu Yanjun (8):
> >   RDMA/rxe: Creating listening sock in newlink function
> >   RDMA/rxe: Support more rdma links in init_net
> >   RDMA/nldev: Add dellink function pointer
> >   RDMA/rxe: Implement dellink in rxe
> >   RDMA/rxe: Replace global variable with sock lookup functions
> >   RDMA/rxe: add the support of net namespace
> >   RDMA/rxe: Add the support of net namespace notifier
> >   RDMA/rxe: Replace l_sk6 with sk6 in net namespace
> > 
> >  drivers/infiniband/core/nldev.c     |   6 ++
> >  drivers/infiniband/sw/rxe/Makefile  |   3 +-
> >  drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
> >  drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
> >  drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
> >  drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
> >  drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
> >  include/rdma/rdma_netlink.h         |   2 +
> >  8 files changed, 279 insertions(+), 40 deletions(-)
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
> > 
> 
> Zhu,
> 
> I did some simple experiments on netns functionality.
> 
> With your patch set applied and rxe0 created on enp6s0 and rxe1 created on lo in the default namespace
> 
> 	# sudo ip netns add test
> 	# ip netns
> 	test
> 	# sudo ip netns exec test ip link
> 	1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 	# sudo ip netns exec test ip link set dev lo up
> 	# sudo ip netns exec test ip link
> 	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 	# sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
> 		[rxe doesn't work unless this IPV6 address is set]
> 	# sudo ip netns exec test ip addr
> 	1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
> 	    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 	    inet 127.0.0.1/8 scope host lo
> 	       valid_lft forever preferred_lft forever
> 	    inet6 fe80::200:ff:fe00:0/64 scope link 
> 	       valid_lft forever preferred_lft forever
> 	    inet6 ::1/128 scope host 
> 	       valid_lft forever preferred_lft forever
> 	# sudo ip netns exec test ls /sys/class/infiniband
> 	rxe0  rxe1
> 		[These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
> 	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
> 	# ls /sys/class/infiniband
> 	rxe0  rxe1  rxe2
> 		[The new rxe device shows up in the default namespace. At least we're consistent.]
> 	# ib_send_bw -d rxe0 ... 192.168.0.27
> 		[Works. Didn't break the existing rxe devices. Expected]
> 	# ib_send_bw -d rxe1 ... 127.0.0.1
> 		[Works. Expected]
> 	# ib_send_bw -d rxe2 ... 127.0.0.1
> 	IB device rxe2 not found
>  	 Unable to find the Infiniband/RoCE device
> 		[Not work. Expected.]
> 	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
> 	IB device rxe2 not found
> 	 Unable to find the Infiniband/RoCE device
> 		[Also not work. Turns out rxe2 device is gone after failure. Not expected.]
> 	# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
> 	# ls /sys/class/infiniband
> 	rxe0  rxe1  rxe2
> 		[Good. It's back]
> 	# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
> 		[Works in test namespace! Expected.]
> 	# sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
> 		[Also works. Definitely not expected.]
> 
> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
> other's namespaces (Like ip link or ip addr hide other namespace's devices.)

Hi mates:

If the rdma system netns mode is shared, should every net namespace see all rdma devices?

Thanks for your work. I'm working in the cloud native area and I really need this patch :).

Nowadays AI + Cloud Native is becoming more and more popular, and more and more AI clusters run on Kubernetes. Most AI clusters use RDMA to communicate across nodes between training tasks, which speeds up training and reduces network latency. However, RDMA hardware is very expensive for those who want to try out or test the entire AI training process (scheduling, using RDMA networks, using NCCL-like communication libraries, etc.). So we plan to develop a Kubernetes CNI plugin that can run RDMA locally: it will mount the rdma device on the host inside the container so that the AI tasks can also use the RDMA network.

I understand that Soft-RoCE can provide virtual rdma devices without RDMA hardware, and I have verified this locally with tools like ib_send_bw. However, I then moved the NIC on the host (or a macvlan interface created on top of it) into the network namespace of the container and manually added the rdma device inside the container with the rdma link add command. In the end, ibv_devices could not see the virtual rdma device inside the container. I also tried both rdma system modes (shared and exclusive), and neither worked. Other developers have reported similar issues; see [nccl-tests container can't use Soft-RoCE interfaces](https://github.com/NVIDIA/deepops/issues/772).
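
For reference, the steps described above can be sketched roughly as follows. This is only an outline under assumptions: the namespace, netdev, and link names (cont0, eth1, rxe0) are placeholders, the function name rxe_netns_steps is hypothetical, and the function only prints the commands so they can be reviewed before being run as root. Switching the netns mode may be refused by the kernel depending on which namespaces and devices already exist.

```shell
# Sketch: print (do not execute) the container-style reproduction steps.
# Arguments, all placeholders: namespace, netdev, rxe link name, netns mode.
rxe_netns_steps() {
    ns=$1
    dev=$2
    link=$3
    mode=$4    # "shared" or "exclusive"
    cat <<EOF
rdma system set netns $mode
rdma system show
ip netns add $ns
ip link set dev $dev netns $ns
ip netns exec $ns ip link set dev $dev up
ip netns exec $ns rdma link add $link type rxe netdev $dev
ip netns exec $ns rdma link show
ip netns exec $ns ibv_devices
EOF
}
```

For example, `rxe_netns_steps cont0 eth1 rxe0 exclusive` prints the sequence for review; the last two commands are where the rxe device would be expected to show up inside the namespace.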

I guess the upstream Linux kernel doesn't support Soft-RoCE containerization yet. I read the upstream kernel code and the related patches carefully and confirmed my guess, so I contacted Yanjun. I really hope this patch can be merged upstream. Thanks again for your work; I look forward to working on this further.

Best Regards
Cyclinder

> Bob


end of thread, other threads:[~2024-11-12  9:34 UTC | newest]

Thread overview: 25+ messages
2023-05-08  7:56 [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 1/8] RDMA/rxe: Creating listening sock in newlink function Zhu Yanjun
2023-06-20 17:16   ` Bob Pearson
2023-06-20 23:40     ` Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 2/8] RDMA/rxe: Support more rdma links in init_net Zhu Yanjun
2023-06-20 17:54   ` Bob Pearson
2023-06-20 23:51     ` Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 3/8] RDMA/nldev: Add dellink function pointer Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 4/8] RDMA/rxe: Implement dellink in rxe Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 5/8] RDMA/rxe: Replace global variable with sock lookup functions Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 6/8] RDMA/rxe: add the support of net namespace Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 7/8] RDMA/rxe: Add the support of net namespace notifier Zhu Yanjun
2023-05-08  7:56 ` [PATCH v6.4-rc1 v5 8/8] RDMA/rxe: Replace l_sk6 with sk6 in net namespace Zhu Yanjun
2023-06-21 21:09 ` [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work " Bob Pearson
2023-06-21 21:27   ` Bob Pearson
2023-06-23  7:15     ` Zhu Yanjun
2023-06-23 12:59       ` Bob Pearson
2023-06-23 23:50         ` Zhu Yanjun
2023-06-24 17:03           ` Pearson, Robert B
2023-06-24 17:32             ` Chuck Lever III
2023-06-25  0:54               ` Zhu Yanjun
2023-06-25  0:53             ` Zhu Yanjun
2023-06-22  3:46   ` Zhu Yanjun
2023-06-23  7:09   ` Zhu Yanjun
2024-11-12  9:33   ` Cyclinder Kuo
